diff --git a/observability/recipes-observability/index.html b/observability/recipes-observability/index.html index b35ed637e..8ee14d14f 100644 --- a/observability/recipes-observability/index.html +++ b/observability/recipes-observability/index.html @@ -6553,8 +6553,7 @@

Azure DevOps Pipelines Report

The Azure DevOps Pipelines Report contains a Power BI template for monitoring project, pipeline, and pipeline run data from an Azure DevOps (AzDO) organization.

This dashboard recipe provides observability for AzDO pipelines by displaying various metrics (e.g., average runtime, run outcome statistics) in a table. Additionally, the second page of the template visualizes pipeline success and failure trends using Power BI charts. Documentation and setup information can be found in the project README.

Python Logger Class for Application Insights using OpenCensus

-

This repository contains the "AppLogger" class, a Python logger class for Application Insights that uses OpenCensus. It also contains sample code that shows how to use "AppLogger".

-

GitHub Repo
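The snippet below is a minimal sketch of what an AppLogger-style wrapper might look like, not the repository's actual implementation: it assumes the opencensus-ext-azure package and uses its AzureLogHandler to ship standard-library log records to Application Insights. The class name, method, and connection string value are illustrative placeholders.

```python
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler  # pip install opencensus-ext-azure


class AppLogger:
    """Hypothetical wrapper that attaches an AzureLogHandler to a named logger."""

    def __init__(self, name: str, connection_string: str, level: int = logging.INFO):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(level)
        # Export each log record to Application Insights.
        self.logger.addHandler(AzureLogHandler(connection_string=connection_string))

    def info(self, message: str, **properties):
        # Values passed as custom_dimensions become queryable properties in Application Insights.
        self.logger.info(message, extra={"custom_dimensions": properties})


# Illustrative usage (replace the connection string with your own):
# logger = AppLogger("my-app", "InstrumentationKey=<key>")
# logger.info("Order processed", order_id="1234")
```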

+

The Azure SDK for Python contains an Azure Monitor OpenTelemetry Distro client library for Python. You can view samples of how to use the library in this GitHub Repo. With this library, you can easily collect traces, metrics, and logs.
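As a rough sketch of the distro's one-call setup (assuming the azure-monitor-opentelemetry package is installed and APPLICATIONINSIGHTS_CONNECTION_STRING is set in the environment; the span and logger names are arbitrary):

```python
import logging

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# A single call configures exporters for traces, metrics, and logs to Azure Monitor.
configure_azure_monitor()

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order"):
    # Standard-library log records emitted inside the span are collected and
    # correlated with the trace in Application Insights.
    logging.getLogger(__name__).info("Order processed")
```

See the linked samples for the library's full configuration options.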

Java OpenTelemetry Examples

This GitHub Repo contains a set of fully functional examples of using the OpenTelemetry Java APIs and SDK.

@@ -6563,7 +6562,7 @@

Java OpenTelemetry Examples

Last update: - August 22, 2024 + September 27, 2024 diff --git a/search/search_index.json b/search/search_index.json index 69640fc87..7341fe771 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"ISE Engineering Fundamentals Playbook An engineer working for a ISE project... Has responsibilities to their team \u2013 mentor, coach, and lead. Knows their playbook . Follows their playbook. Fixes their playbook if it is broken. If they find a better playbook, they copy it. If somebody could use their playbook, they share it. Leads by example. Models the behaviors we desire both interpersonally and technically. Strives to understand how their work fits into a broader context and ensures the outcome. This is our playbook. All contributions are welcome! Please feel free to submit a pull request to get involved. Why Have a Playbook To increase overall efficiency for team members and the whole team in general. To reduce the number of mistakes and avoid common pitfalls. To strive to be better engineers and learn from other people's shared experience. If you do nothing else follow the Engineering Fundamentals Checklist ! The first week of an ISE project is a breakdown of the sections of the playbook according to the structure of an Agile sprint. General Guidance Keep the code quality bar high. Value quality and precision over \u2018getting things done\u2019. Work diligently on the one important thing. As a distributed team take time to share context via wiki, teams and backlog items. Make the simple thing work now. Build fewer features today, but ensure they work amazingly. Then add more features tomorrow. Avoid adding scope to a backlog item, instead add a new backlog item. Our goal is to ship incremental customer value. Keep backlog item details up to date to communicate the state of things with the rest of your team. Report product issues found and provide clear and repeatable engineering feedback! We all own our code and each one of us has an obligation to make all parts of the solution great. Contributing See CONTRIBUTING.md for contribution guidelines.","title":"ISE Engineering Fundamentals Playbook"},{"location":"#ise-engineering-fundamentals-playbook","text":"An engineer working for a ISE project... Has responsibilities to their team \u2013 mentor, coach, and lead. Knows their playbook . Follows their playbook. Fixes their playbook if it is broken. If they find a better playbook, they copy it. If somebody could use their playbook, they share it. Leads by example. Models the behaviors we desire both interpersonally and technically. Strives to understand how their work fits into a broader context and ensures the outcome. This is our playbook. All contributions are welcome! Please feel free to submit a pull request to get involved.","title":"ISE Engineering Fundamentals Playbook"},{"location":"#why-have-a-playbook","text":"To increase overall efficiency for team members and the whole team in general. To reduce the number of mistakes and avoid common pitfalls. To strive to be better engineers and learn from other people's shared experience. If you do nothing else follow the Engineering Fundamentals Checklist ! 
The first week of an ISE project is a breakdown of the sections of the playbook according to the structure of an Agile sprint.","title":"Why Have a Playbook"},{"location":"#general-guidance","text":"Keep the code quality bar high. Value quality and precision over \u2018getting things done\u2019. Work diligently on the one important thing. As a distributed team take time to share context via wiki, teams and backlog items. Make the simple thing work now. Build fewer features today, but ensure they work amazingly. Then add more features tomorrow. Avoid adding scope to a backlog item, instead add a new backlog item. Our goal is to ship incremental customer value. Keep backlog item details up to date to communicate the state of things with the rest of your team. Report product issues found and provide clear and repeatable engineering feedback! We all own our code and each one of us has an obligation to make all parts of the solution great.","title":"General Guidance"},{"location":"#contributing","text":"See CONTRIBUTING.md for contribution guidelines.","title":"Contributing"},{"location":"ISE/","text":"Who is ISE (Industry Solutions Engineering) Our team, ISE (Industry Solutions Engineering), works side-by-side with customers to help them tackle their toughest technical problems both in the cloud and on the edge. We meet customers where they are, work in the languages they use, with the open source frameworks they use, and on the operating systems they use. We work with enterprises and start-ups across many industries from financial services to manufacturing. Our work covers a broad spectrum of domains including IoT, machine learning, and high scale compute. Our \"superpower\" is that we work closely with both our customers\u2019 engineering teams and Microsoft\u2019s product engineering teams, developing real-world expertise that we can use to help our customers grow their business and help Microsoft improve our products and services. We are very community focused in our work, with one foot in Microsoft and one foot in the open source communities that we help. We make pull requests on open source projects to add support for Microsoft platforms and/or improve existing implementations. We build frameworks and other tools to make it easier for developers to use Microsoft platforms. We source all the ideas for this work by maintaining very deep connections with these communities and the customers and partners that use them. If you like variety, coding in many languages, using any available tech across our industry, digging in with our customers, hack fests, occasional travel, and telling the story of what you\u2019ve done in blog posts and at conferences, then come talk to us. You can check out some of our work on our Developer Blog","title":"Who is ISE?"},{"location":"ISE/#who-is-ise-industry-solutions-engineering","text":"Our team, ISE (Industry Solutions Engineering), works side-by-side with customers to help them tackle their toughest technical problems both in the cloud and on the edge. We meet customers where they are, work in the languages they use, with the open source frameworks they use, and on the operating systems they use. We work with enterprises and start-ups across many industries from financial services to manufacturing. Our work covers a broad spectrum of domains including IoT, machine learning, and high scale compute. 
Our \"superpower\" is that we work closely with both our customers\u2019 engineering teams and Microsoft\u2019s product engineering teams, developing real-world expertise that we can use to help our customers grow their business and help Microsoft improve our products and services. We are very community focused in our work, with one foot in Microsoft and one foot in the open source communities that we help. We make pull requests on open source projects to add support for Microsoft platforms and/or improve existing implementations. We build frameworks and other tools to make it easier for developers to use Microsoft platforms. We source all the ideas for this work by maintaining very deep connections with these communities and the customers and partners that use them. If you like variety, coding in many languages, using any available tech across our industry, digging in with our customers, hack fests, occasional travel, and telling the story of what you\u2019ve done in blog posts and at conferences, then come talk to us. You can check out some of our work on our Developer Blog","title":"Who is ISE (Industry Solutions Engineering)"},{"location":"engineering-fundamentals-checklist/","text":"Engineering Fundamentals Checklist This checklist helps to ensure that our projects meet our Engineering Fundamentals. Source Control The default target branch is locked. Merges are done through PRs. PRs reference related work items. Commit history is consistent and commit messages are informative (what, why). Consistent branch naming conventions. Clear documentation of repository structure. Secrets are not part of the commit history or made public. (see Credential scanning ) Public repositories follow the OSS guidelines , see Required files in default branch for public repositories . More details on source control Work Item Tracking All items are tracked in AzDevOps (or similar). The board is organized (swim lanes, feature tags, technology tags). More details on backlog management Testing Unit tests cover the majority of all components (>90% if possible). Integration tests run to test the solution e2e. More details on automated testing CI/CD Project runs CI with automated build and test on each PR. Project uses CD to manage deployments to a replica environment before PRs are merged. Main branch is always shippable. More details on continuous integration and continuous delivery Security Access is only granted on an as-needed basis Secrets are stored in secured locations and not checked in to code Data is encrypted in transit (and if necessary at rest) and passwords are hashed Is the system split into logical segments with separation of concerns? This helps limiting security vulnerabilities. More details on security Observability Significant business and functional events are tracked and related metrics collected. Application faults and errors are logged. Health of the system is monitored. The client and server side observability data can be differentiated. Logging configuration can be modified without code changes (eg: verbose mode). Incoming tracing context is propagated to allow for production issue debugging purposes. GDPR compliance is ensured regarding PII (Personally Identifiable Information). More details on observability Agile/Scrum Process Lead (fixed/rotating) runs the daily standup The agile process is clearly defined within team. The Dev Lead (+ PO/Others) are responsible for backlog management and refinement. A working agreement is established between team members and customer. 
More details on agile development Design Reviews Process for conducting design reviews is included in the Working Agreement . Design reviews for each major component of the solution are carried out and documented, including alternatives. Stories and/or PRs link to the design document. Each user story includes a task for design review by default, which is assigned or removed during sprint planning. Project advisors are invited to design reviews or asked to give feedback to the design decisions captured in documentation. Discover all the reviews that the customer's processes require and plan for them. Clear non-functional requirements captured (see Non-Functional Requirements Guidance ) Risks and opportunities captured (see Risk/Opportunity Management ) More details on design reviews Code Reviews There is a clear agreement in the team as to function of code reviews. The team has a code review checklist or established process. A minimum number of reviewers (usually 2) for a PR merge is enforced by policy. Linters/Code Analyzers, unit tests and successful builds for PR merges are set up. There is a process to enforce a quick review turnaround. More details on code reviews Retrospectives Retrospectives are conducted each week/at the end of each sprint. The team identifies 1-3 proposed experiments to try each week/sprint to improve the process. Experiments have owners and are added to project backlog. The team conducts longer retrospective for Milestones and project completion. More details on retrospectives Engineering Feedback The team submits feedback on business and technical blockers that prevent project success Suggestions for improvements are incorporated in the solution Feedback is detailed and repeatable More details on engineering feedback Developer Experience (DevEx) Developers on the team can: Build/Compile source to verify it is free of syntax errors and compiles. Execute all automated tests (unit, e2e, etc). Start/Launch end-to-end to simulate execution in a deployed environment. Attach a debugger to started solution or running automated tests, set breakpoints, step through code, and inspect variables. Automatically install dependencies by pressing F5 (or equivalent) in their IDE. Use local dev configuration values (i.e. .env, appsettings.development.json). More details on developer experience","title":"Engineering Fundamentals Checklist"},{"location":"engineering-fundamentals-checklist/#engineering-fundamentals-checklist","text":"This checklist helps to ensure that our projects meet our Engineering Fundamentals.","title":"Engineering Fundamentals Checklist"},{"location":"engineering-fundamentals-checklist/#source-control","text":"The default target branch is locked. Merges are done through PRs. PRs reference related work items. Commit history is consistent and commit messages are informative (what, why). Consistent branch naming conventions. Clear documentation of repository structure. Secrets are not part of the commit history or made public. (see Credential scanning ) Public repositories follow the OSS guidelines , see Required files in default branch for public repositories . More details on source control","title":"Source Control"},{"location":"engineering-fundamentals-checklist/#work-item-tracking","text":"All items are tracked in AzDevOps (or similar). The board is organized (swim lanes, feature tags, technology tags). 
More details on backlog management","title":"Work Item Tracking"},{"location":"engineering-fundamentals-checklist/#testing","text":"Unit tests cover the majority of all components (>90% if possible). Integration tests run to test the solution e2e. More details on automated testing","title":"Testing"},{"location":"engineering-fundamentals-checklist/#cicd","text":"Project runs CI with automated build and test on each PR. Project uses CD to manage deployments to a replica environment before PRs are merged. Main branch is always shippable. More details on continuous integration and continuous delivery","title":"CI/CD"},{"location":"engineering-fundamentals-checklist/#security","text":"Access is only granted on an as-needed basis Secrets are stored in secured locations and not checked in to code Data is encrypted in transit (and if necessary at rest) and passwords are hashed Is the system split into logical segments with separation of concerns? This helps limiting security vulnerabilities. More details on security","title":"Security"},{"location":"engineering-fundamentals-checklist/#observability","text":"Significant business and functional events are tracked and related metrics collected. Application faults and errors are logged. Health of the system is monitored. The client and server side observability data can be differentiated. Logging configuration can be modified without code changes (eg: verbose mode). Incoming tracing context is propagated to allow for production issue debugging purposes. GDPR compliance is ensured regarding PII (Personally Identifiable Information). More details on observability","title":"Observability"},{"location":"engineering-fundamentals-checklist/#agilescrum","text":"Process Lead (fixed/rotating) runs the daily standup The agile process is clearly defined within team. The Dev Lead (+ PO/Others) are responsible for backlog management and refinement. A working agreement is established between team members and customer. More details on agile development","title":"Agile/Scrum"},{"location":"engineering-fundamentals-checklist/#design-reviews","text":"Process for conducting design reviews is included in the Working Agreement . Design reviews for each major component of the solution are carried out and documented, including alternatives. Stories and/or PRs link to the design document. Each user story includes a task for design review by default, which is assigned or removed during sprint planning. Project advisors are invited to design reviews or asked to give feedback to the design decisions captured in documentation. Discover all the reviews that the customer's processes require and plan for them. Clear non-functional requirements captured (see Non-Functional Requirements Guidance ) Risks and opportunities captured (see Risk/Opportunity Management ) More details on design reviews","title":"Design Reviews"},{"location":"engineering-fundamentals-checklist/#code-reviews","text":"There is a clear agreement in the team as to function of code reviews. The team has a code review checklist or established process. A minimum number of reviewers (usually 2) for a PR merge is enforced by policy. Linters/Code Analyzers, unit tests and successful builds for PR merges are set up. There is a process to enforce a quick review turnaround. More details on code reviews","title":"Code Reviews"},{"location":"engineering-fundamentals-checklist/#retrospectives","text":"Retrospectives are conducted each week/at the end of each sprint. 
The team identifies 1-3 proposed experiments to try each week/sprint to improve the process. Experiments have owners and are added to project backlog. The team conducts longer retrospective for Milestones and project completion. More details on retrospectives","title":"Retrospectives"},{"location":"engineering-fundamentals-checklist/#engineering-feedback","text":"The team submits feedback on business and technical blockers that prevent project success Suggestions for improvements are incorporated in the solution Feedback is detailed and repeatable More details on engineering feedback","title":"Engineering Feedback"},{"location":"engineering-fundamentals-checklist/#developer-experience-devex","text":"Developers on the team can: Build/Compile source to verify it is free of syntax errors and compiles. Execute all automated tests (unit, e2e, etc). Start/Launch end-to-end to simulate execution in a deployed environment. Attach a debugger to started solution or running automated tests, set breakpoints, step through code, and inspect variables. Automatically install dependencies by pressing F5 (or equivalent) in their IDE. Use local dev configuration values (i.e. .env, appsettings.development.json). More details on developer experience","title":"Developer Experience (DevEx)"},{"location":"the-first-week-of-an-ise-project/","text":"The First Week of an ISE Project The purpose of this document is to: Organize content in the playbook for quick reference and discoverability Provide content in a logical structure which reflects the engineering process Extensible hierarchy to allow teams to share deep subject-matter expertise Before Starting the Project Discuss and start writing the Team Agreements. Update these documents with any process decisions made throughout the project Working Agreement Definition of Ready Definition of Done Estimation Set up the repository/repositories Decide on repository structure/s Add README.md, LICENSE, CONTRIBUTING.md, .gitignore, etc Build a Product Backlog Set up a project in your chosen project management tool (ex. 
Azure DevOps) INVEST in good User Stories and Acceptance Criteria Non-Functional Requirements Guidance Day 1 Plan the first sprint Agree on a sprint goal, and how to measure the sprint progress Determine team capacity Assign user stories to the sprint and split user stories into tasks Set up Work in Progress (WIP) limits Decide on test frameworks and discuss test strategies Discuss the purpose and goals of tests and how to measure test coverage Agree on how to separate unit tests from integration, load and smoke tests Design the first test cases Decide on branch naming Discuss security needs and verify that secrets are kept out of source control Day 2 Set up Source Control Agree on best practices for commits Set up basic Continuous Integration with linters and automated tests Set up meetings for Daily Stand-ups and decide on a Process Lead Discuss purpose, goals, participants and facilitation guidance Discuss timing, and how to run an efficient stand-up If the project has sub-teams, set up a Scrum of Scrums Day 3 Agree on code style and on how to assign Pull Requests Set up Build Validation for Pull Requests (2 reviewers, linters, automated tests) and agree on Definition of Done Agree on a Code Merging strategy and update the CONTRIBUTING.md Agree on logging and observability frameworks and strategies Day 4 Set up Continuous Deployment Determine what environments are appropriate for this solution For each environment discuss purpose, when deployment should trigger, pre-deployment approvers, sing-off for promotion. Decide on a versioning strategy Agree on how to Design a feature and conduct a Design Review Day 5 Conduct a Sprint Demo Conduct a Retrospective Determine required participants, how to capture input (tools) and outcome Set a timeline, and discuss facilitation, meeting structure etc. Refine the Backlog Determine required participants Update the Definition of Ready Update estimates, and the Estimation document Submit Engineering Feedback for issues encountered","title":"The First Week of an ISE Project"},{"location":"the-first-week-of-an-ise-project/#the-first-week-of-an-ise-project","text":"The purpose of this document is to: Organize content in the playbook for quick reference and discoverability Provide content in a logical structure which reflects the engineering process Extensible hierarchy to allow teams to share deep subject-matter expertise","title":"The First Week of an ISE Project"},{"location":"the-first-week-of-an-ise-project/#before-starting-the-project","text":"Discuss and start writing the Team Agreements. Update these documents with any process decisions made throughout the project Working Agreement Definition of Ready Definition of Done Estimation Set up the repository/repositories Decide on repository structure/s Add README.md, LICENSE, CONTRIBUTING.md, .gitignore, etc Build a Product Backlog Set up a project in your chosen project management tool (ex. 
Azure DevOps) INVEST in good User Stories and Acceptance Criteria Non-Functional Requirements Guidance","title":"Before Starting the Project"},{"location":"the-first-week-of-an-ise-project/#day-1","text":"Plan the first sprint Agree on a sprint goal, and how to measure the sprint progress Determine team capacity Assign user stories to the sprint and split user stories into tasks Set up Work in Progress (WIP) limits Decide on test frameworks and discuss test strategies Discuss the purpose and goals of tests and how to measure test coverage Agree on how to separate unit tests from integration, load and smoke tests Design the first test cases Decide on branch naming Discuss security needs and verify that secrets are kept out of source control","title":"Day 1"},{"location":"the-first-week-of-an-ise-project/#day-2","text":"Set up Source Control Agree on best practices for commits Set up basic Continuous Integration with linters and automated tests Set up meetings for Daily Stand-ups and decide on a Process Lead Discuss purpose, goals, participants and facilitation guidance Discuss timing, and how to run an efficient stand-up If the project has sub-teams, set up a Scrum of Scrums","title":"Day 2"},{"location":"the-first-week-of-an-ise-project/#day-3","text":"Agree on code style and on how to assign Pull Requests Set up Build Validation for Pull Requests (2 reviewers, linters, automated tests) and agree on Definition of Done Agree on a Code Merging strategy and update the CONTRIBUTING.md Agree on logging and observability frameworks and strategies","title":"Day 3"},{"location":"the-first-week-of-an-ise-project/#day-4","text":"Set up Continuous Deployment Determine what environments are appropriate for this solution For each environment discuss purpose, when deployment should trigger, pre-deployment approvers, sing-off for promotion. Decide on a versioning strategy Agree on how to Design a feature and conduct a Design Review","title":"Day 4"},{"location":"the-first-week-of-an-ise-project/#day-5","text":"Conduct a Sprint Demo Conduct a Retrospective Determine required participants, how to capture input (tools) and outcome Set a timeline, and discuss facilitation, meeting structure etc. Refine the Backlog Determine required participants Update the Definition of Ready Update estimates, and the Estimation document Submit Engineering Feedback for issues encountered","title":"Day 5"},{"location":"CI-CD/","text":"Continuous Integration and Continuous Delivery Continuous Integration (CI) is the engineering practice of frequently committing code in a shared repository, ideally several times a day, and performing an automated build on it. These changes are built with other simultaneous changes to the system, which enables early detection of integration issues between multiple developers working on a project. Build breaks due to integration failures are treated as the highest priority issue for all the developers on a team and generally work stops until they are fixed. Paired with an automated testing approach, continuous integration also allows us to also test the integrated build such that we can verify that not only does the code base still build correctly, but also is still functionally correct. This is also a best practice for building robust and flexible software systems. Continuous Delivery (CD) takes the Continuous Integration (CI) concept further to also test deployments of the integrated code base on a replica of the environment it will be ultimately deployed on. 
This enables us to learn early about any unforeseen operational issues that arise from our changes as quickly as possible and also learn about gaps in our test coverage. The goal of all of this is to ensure that the main branch is always shippable, meaning that we could, if we needed to, take a build from the main branch of our code base and ship it on production. If these concepts are unfamiliar to you, take a few minutes and read through Continuous Integration and Continuous Delivery . Our expectation is that CI/CD should be used in all the engineering projects that we do with our customers and that we are building, testing, and deploying each change we make to any software system that we are building. For a much deeper understanding of all of these concepts, the books Continuous Integration and Continuous Delivery provide a comprehensive background. Why CI/CD We want to have an automated build and deployment of our software We want automated configuration of all components We want to be able to quickly re-build the environment from scratch in case of disaster We want the latest version of the code to always be deployed to our dev/test environments We want a reliable release strategy, where the policies for release are well understood by all The Fundamentals We run a quality pipeline (with linting, unit tests etc.) on each PR/update of the main branch All cloud resources (including secrets and permissions) are provisioned through infrastructure as code templates \u2013 ex. Terraform, Bicep (ARM), Pulumi etc. All release candidates are deployed to a non-production environment through an automated process (ex Azure DevOps or Github pipelines) Releases are deployed to the production environment through an automated process Release rollbacks are carried out through a repeatable process Our release pipeline runs automated tests, validating all release candidate artifact(s) end-to-end against a non-production environment Tools Azure Pipelines Our tooling at Microsoft has made setting up integration and delivery systems like this easy. If you are unfamiliar with it, take a few moments now to read through Azure Pipelines (Previously VSTS) and for a practical walkthrough of how this works in practice, one example you can read through is CI/CD on Kubernetes with VSTS . Jenkins Jenkins is one of the most commonly used tools across the open source community. It is well-known with hundreds of plugins for every build requirement. Jenkins is free but requires a dedicated server. You can easily create a Jenkins VM using this template TravisCI Travis CI can be used for open source projects at no cost but developers must purchase an enterprise plan for private projects. This service is ideal for validation of PR's on GitHub because it is lightweight and easy to set up with no need for dedicated server setup. It also supports a Build matrix feature which allows accelerating the build and testing process by breaking them into parts. CircleCI CircleCI is a free service for open source projects with no dedicated server required. It is also ideal for validation of PR's on GitHub. CircleCI also allows workflows, parallelism and splitting your tests across any number of containers with a wide array of packages pre-installed on the build containers. 
AppVeyor AppVeyor is another free CI service for open source projects which also supports Windows-based builds.","title":"Continuous Integration and Continuous Delivery"},{"location":"CI-CD/#continuous-integration-and-continuous-delivery","text":"Continuous Integration (CI) is the engineering practice of frequently committing code in a shared repository, ideally several times a day, and performing an automated build on it. These changes are built with other simultaneous changes to the system, which enables early detection of integration issues between multiple developers working on a project. Build breaks due to integration failures are treated as the highest priority issue for all the developers on a team and generally work stops until they are fixed. Paired with an automated testing approach, continuous integration also allows us to also test the integrated build such that we can verify that not only does the code base still build correctly, but also is still functionally correct. This is also a best practice for building robust and flexible software systems. Continuous Delivery (CD) takes the Continuous Integration (CI) concept further to also test deployments of the integrated code base on a replica of the environment it will be ultimately deployed on. This enables us to learn early about any unforeseen operational issues that arise from our changes as quickly as possible and also learn about gaps in our test coverage. The goal of all of this is to ensure that the main branch is always shippable, meaning that we could, if we needed to, take a build from the main branch of our code base and ship it on production. If these concepts are unfamiliar to you, take a few minutes and read through Continuous Integration and Continuous Delivery . Our expectation is that CI/CD should be used in all the engineering projects that we do with our customers and that we are building, testing, and deploying each change we make to any software system that we are building. For a much deeper understanding of all of these concepts, the books Continuous Integration and Continuous Delivery provide a comprehensive background.","title":"Continuous Integration and Continuous Delivery"},{"location":"CI-CD/#why-cicd","text":"We want to have an automated build and deployment of our software We want automated configuration of all components We want to be able to quickly re-build the environment from scratch in case of disaster We want the latest version of the code to always be deployed to our dev/test environments We want a reliable release strategy, where the policies for release are well understood by all","title":"Why CI/CD"},{"location":"CI-CD/#the-fundamentals","text":"We run a quality pipeline (with linting, unit tests etc.) on each PR/update of the main branch All cloud resources (including secrets and permissions) are provisioned through infrastructure as code templates \u2013 ex. Terraform, Bicep (ARM), Pulumi etc. 
All release candidates are deployed to a non-production environment through an automated process (ex Azure DevOps or Github pipelines) Releases are deployed to the production environment through an automated process Release rollbacks are carried out through a repeatable process Our release pipeline runs automated tests, validating all release candidate artifact(s) end-to-end against a non-production environment","title":"The Fundamentals"},{"location":"CI-CD/#tools","text":"","title":"Tools"},{"location":"CI-CD/#azure-pipelines","text":"Our tooling at Microsoft has made setting up integration and delivery systems like this easy. If you are unfamiliar with it, take a few moments now to read through Azure Pipelines (Previously VSTS) and for a practical walkthrough of how this works in practice, one example you can read through is CI/CD on Kubernetes with VSTS .","title":"Azure Pipelines"},{"location":"CI-CD/#jenkins","text":"Jenkins is one of the most commonly used tools across the open source community. It is well-known with hundreds of plugins for every build requirement. Jenkins is free but requires a dedicated server. You can easily create a Jenkins VM using this template","title":"Jenkins"},{"location":"CI-CD/#travisci","text":"Travis CI can be used for open source projects at no cost but developers must purchase an enterprise plan for private projects. This service is ideal for validation of PR's on GitHub because it is lightweight and easy to set up with no need for dedicated server setup. It also supports a Build matrix feature which allows accelerating the build and testing process by breaking them into parts.","title":"TravisCI"},{"location":"CI-CD/#circleci","text":"CircleCI is a free service for open source projects with no dedicated server required. It is also ideal for validation of PR's on GitHub. CircleCI also allows workflows, parallelism and splitting your tests across any number of containers with a wide array of packages pre-installed on the build containers.","title":"CircleCI"},{"location":"CI-CD/#appveyor","text":"AppVeyor is another free CI service for open source projects which also supports Windows-based builds.","title":"AppVeyor"},{"location":"CI-CD/continuous-delivery/","text":"Continuous Delivery The inspiration behind continuous delivery is constantly delivering valuable software to users and developers more frequently. Applying the principles and practices laid out in this readme will help you reduce risk, eliminate manual operations and increase quality and confidence. Deploying software involves the following principles: Provision and manage the cloud environment runtime for your application (cloud resources, infrastructure, hardware, services, etc). Install the target application version across your cloud environments. Configure your application, including any required data. A continuous delivery pipeline is an automated manifestation of your process to streamline these very principles in a consistent and repeatable manner. Goal Follow industry best practices for delivering software changes to customers and developers. Establish consistency for the guiding principles and best practices when assembling continuous delivery workflows. General Guidance Define a Release Strategy It's important to establish a common understanding between the Dev Lead and application stakeholder(s) around the release strategy / design during the planning phase of a project. This common understanding includes the deployment and maintenance of the application throughout its SDLC. 
Release Strategy Principles Continuous Delivery by Jez Humble, David Farley cover the key considerations to follow when creating a release strategy: Parties in charge of deployments to each environment, as well as in charge of the release. An asset and configuration management strategy. An enumeration of the environments available for acceptance, capacity, integration, and user acceptance testing, and the process by which builds will be moved through these environments. A description of the processes to be followed for deployment into testing and production environments, such as change requests to be opened and approvals that need to be granted. A discussion of the method by which the application\u2019s deploy-time and runtime configuration will be managed, and how this relates to the automated deployment process. _Description of the integration with any external systems. At what stage and how are they tested as part of a release? How does the technical operator communicate with the provider in the event of a problem? _A disaster recovery plan so that the application\u2019s state can be recovered following a disaster. Which steps will need to be in place to restart or redeploy the application should it fail. _Production sizing and capacity planning: How much data will your live application create? How many log files or databases will you need? How much bandwidth and disk space will you need? What latency are clients expecting? How the initial deployment to production works. How fixing defects and applying patches to the production environment will be handled. How upgrades to the production environment will be handled, including data migration. How will upgrades be carried out to the application without destroying its state. Application Release and Environment Promotion Your release manifestation process should take the deployable build artifact created from your commit stage and deploy them across all cloud environments, starting with your test environment. The test environment ( often called Integration ) acts as a gate to validate if your test suite completes successfully for all release candidates. This validation should always begin in a test environment while inspecting the deployed release integrated from the feature / release branch containing your code changes. Code changes released into the test environment typically targets the main branch (when doing trunk ) or release branch (when doing gitflow ). The First Deployment The very first deployment of any application should be showcased to the customer in a production-like environment ( UAT ) to solicit feedback early. The UAT environment is used to obtain product owner sign-off acceptance to ultimately promote the release to production. Criteria for a Production-Like Environment Runs the same operating system as production. Has the same software installed as production. Is sized and configured the same way as production. Mirrors production's networking topology. Simulated production-like load tests are executed following a release to surface any latency or throughput degradation. Modeling Your Release Pipeline It's critical to model your test and release process to establish a common understanding between the application engineers and customer stakeholders. Specifically aligning expectations for how many cloud environments need to be pre-provisioned as well as defining sign-off gate roles and responsibilities. 
Release Pipeline Modeling Considerations Depict all stages an application change would have to go through before it is released to production. Define all release gate controls. Determine customer-specific Cloud RBAC groups which have the authority to approve release candidates per environment. Release Pipeline Stages The stages within your release workflow are ultimately testing a version of your application to validate it can be released in accordance to your acceptance criteria. The release pipeline should account for the following conditions: Release Selection: The developer carrying out application testing should have the capability to select which release version to deploy to the testing environment. Deployment - Release the application deployable build artifact ( created from the CI stage ) to the target cloud environment. Configuration - Applications should be configured consistently across all your environments. This configuration is applied at the time of deployment. Sensitive data like app secrets and certificates should be mastered in a fully managed PaaS key and secret store (eg Key Vault , KMS ). Any secrets used by the application should be sourced internally within the application itself. Application Secrets should not be exposed within the runtime environment. We encourage 12 Factor principles, especially when it comes to configuration management . Data Migration - Pre populate application state and/or data records which is needed for your runtime environment. This may also include test data required for your end-to-end integration test suite. Deployment smoke test. Your smoke test should also verify that your application is pointing to the correct configuration (e.g. production pointing to a UAT Database). Perform any manual or automated acceptance test scenarios. Approve the release gate to promote the application version to the target cloud environment. This promotion should also include the environment's configuration state (e.g. new env settings, feature flags, etc). Live Release Warm Up A release should be running for a period of time before it's considered live and allowed to accept user traffic. These warm up activities may include application server(s) and database(s) pre-fill any dependent cache(s) as well as establish all service connections (eg connection pool allocations, etc ). Pre-production Releases Application release candidates should be deployed to a staging environment similar to production for carrying out final manual/automated tests ( including capacity testing ). Your production and staging / pre-prod cloud environments should be setup at the beginning of your project. Application warm up should be a quantified measurement that's validated as part of your pre-prod smoke tests. Rolling-Back Releases Your release strategy should account for rollback scenarios in the event of unexpected failures following a deployment. Rolling back releases can get tricky, especially when database record/object changes occur in result of your deployment ( either inadvertently or intentionally ). If there are no data changes which need to be backed out, then you can simply trigger a new release candidate for the last known production version and promote that release along your CD pipeline. For rollback scenarios involving data changes, there are several approaches to mitigating this which fall outside the scope of this guide. Some involve database record versioning, time machining database records / objects, etc. 
All data files and databases should be backed up prior to each release so they could be restored. The mitigation strategy for this scenario will vary across our projects. The expectation is that this mitigation strategy should be covered as part of your release strategy. Another approach to consider when designing your release strategy is deployment rings . This approach simplifies rollback scenarios while limiting the impact of your release to end-users by gradually deploying and validating your changes in production. Zero Downtime Releases A hot deployment follows a process of switching users from one release to another with no impact to the user experience. As an example, Azure managed app services allows developers to validate app changes in a staging deployment slot before swapping it with the production slot. App Service slot swapping can also be fully automated once the source slot is fully warmed up (and auto swap is enabled). Slot swapping also simplifies release rollbacks once a technical operator restores the slots to their pre-swap states. Kubernetes natively supports rolling updates . Blue-Green Deployments Blue / Green is a deployment technique which reduces downtime by running two identical instances of a production environment called Blue and Green . Only one of these environments accepts live production traffic at a given time. In the above example, live production traffic is routed to the Green environment. During application releases, the new version is deployed to the blue environment which occurs independently from the Green environment. Live traffic is unaffected from Blue environment releases. You can point your end-to-end test suite against the Blue environment as one of your test checkouts. Migrating users to the new application version is as simple as changing the router configuration to direct all traffic to the Blue environment. This technique simplifies rollback scenarios as we can simply switch the router back to Green. Database providers like Cosmos and Azure SQL natively support data replication to help enable fully synchronized Blue Green database environments. Canary Releasing Canary releasing enables development teams to gather faster feedback when deploying new features to production. These releases are rolled out to a subset of production nodes ( where no users are routed to ) to collect early insights around capacity testing and functional completeness and impact. Once smoke and capacity tests are completed, you can route a small subset of users to the production nodes hosting the release candidate. Canary releases simplify rollbacks as you can avoid routing users to bad application versions. Try to limit the number of versions of your application running parallel in production, as it can complicate maintenance and monitoring controls. Low Code Solutions Low code solutions have increased their participation in the applications and processes and because of that it is required that a proper conjunction of disciplines improve their development. Here is a guide for continuous deployment for Low Code Solutions . Resources Continuous Delivery by Jez Humble, David Farley. Continuous integration vs. continuous delivery vs. continuous deployment Deployment Rings Tools Check out the below tools to help with some CD best practices listed above: Flux for gitops CI/CD workflow using GitOps Tekton for Kubernetes native pipelines Note Jenkins-X uses Tekton under the hood. 
Argo Workflows Flagger for powerful, Kubernetes native releases including blue/green, canary, and A/B testing. Not quite CD related, but checkout jsonnet , a templating language to reduce boilerplate and increase sharing between your yaml/json manifests.","title":"Continuous Delivery"},{"location":"CI-CD/continuous-delivery/#continuous-delivery","text":"The inspiration behind continuous delivery is constantly delivering valuable software to users and developers more frequently. Applying the principles and practices laid out in this readme will help you reduce risk, eliminate manual operations and increase quality and confidence. Deploying software involves the following principles: Provision and manage the cloud environment runtime for your application (cloud resources, infrastructure, hardware, services, etc). Install the target application version across your cloud environments. Configure your application, including any required data. A continuous delivery pipeline is an automated manifestation of your process to streamline these very principles in a consistent and repeatable manner.","title":"Continuous Delivery"},{"location":"CI-CD/continuous-delivery/#goal","text":"Follow industry best practices for delivering software changes to customers and developers. Establish consistency for the guiding principles and best practices when assembling continuous delivery workflows.","title":"Goal"},{"location":"CI-CD/continuous-delivery/#general-guidance","text":"","title":"General Guidance"},{"location":"CI-CD/continuous-delivery/#define-a-release-strategy","text":"It's important to establish a common understanding between the Dev Lead and application stakeholder(s) around the release strategy / design during the planning phase of a project. This common understanding includes the deployment and maintenance of the application throughout its SDLC.","title":"Define a Release Strategy"},{"location":"CI-CD/continuous-delivery/#release-strategy-principles","text":"Continuous Delivery by Jez Humble, David Farley cover the key considerations to follow when creating a release strategy: Parties in charge of deployments to each environment, as well as in charge of the release. An asset and configuration management strategy. An enumeration of the environments available for acceptance, capacity, integration, and user acceptance testing, and the process by which builds will be moved through these environments. A description of the processes to be followed for deployment into testing and production environments, such as change requests to be opened and approvals that need to be granted. A discussion of the method by which the application\u2019s deploy-time and runtime configuration will be managed, and how this relates to the automated deployment process. _Description of the integration with any external systems. At what stage and how are they tested as part of a release? How does the technical operator communicate with the provider in the event of a problem? _A disaster recovery plan so that the application\u2019s state can be recovered following a disaster. Which steps will need to be in place to restart or redeploy the application should it fail. _Production sizing and capacity planning: How much data will your live application create? How many log files or databases will you need? How much bandwidth and disk space will you need? What latency are clients expecting? How the initial deployment to production works. How fixing defects and applying patches to the production environment will be handled. 
How upgrades to the production environment will be handled, including data migration. How will upgrades be carried out to the application without destroying its state.","title":"Release Strategy Principles"},{"location":"CI-CD/continuous-delivery/#application-release-and-environment-promotion","text":"Your release manifestation process should take the deployable build artifact created from your commit stage and deploy them across all cloud environments, starting with your test environment. The test environment ( often called Integration ) acts as a gate to validate if your test suite completes successfully for all release candidates. This validation should always begin in a test environment while inspecting the deployed release integrated from the feature / release branch containing your code changes. Code changes released into the test environment typically targets the main branch (when doing trunk ) or release branch (when doing gitflow ).","title":"Application Release and Environment Promotion"},{"location":"CI-CD/continuous-delivery/#the-first-deployment","text":"The very first deployment of any application should be showcased to the customer in a production-like environment ( UAT ) to solicit feedback early. The UAT environment is used to obtain product owner sign-off acceptance to ultimately promote the release to production.","title":"The First Deployment"},{"location":"CI-CD/continuous-delivery/#criteria-for-a-production-like-environment","text":"Runs the same operating system as production. Has the same software installed as production. Is sized and configured the same way as production. Mirrors production's networking topology. Simulated production-like load tests are executed following a release to surface any latency or throughput degradation.","title":"Criteria for a Production-Like Environment"},{"location":"CI-CD/continuous-delivery/#modeling-your-release-pipeline","text":"It's critical to model your test and release process to establish a common understanding between the application engineers and customer stakeholders. Specifically aligning expectations for how many cloud environments need to be pre-provisioned as well as defining sign-off gate roles and responsibilities.","title":"Modeling Your Release Pipeline"},{"location":"CI-CD/continuous-delivery/#release-pipeline-modeling-considerations","text":"Depict all stages an application change would have to go through before it is released to production. Define all release gate controls. Determine customer-specific Cloud RBAC groups which have the authority to approve release candidates per environment.","title":"Release Pipeline Modeling Considerations"},{"location":"CI-CD/continuous-delivery/#release-pipeline-stages","text":"The stages within your release workflow are ultimately testing a version of your application to validate it can be released in accordance to your acceptance criteria. The release pipeline should account for the following conditions: Release Selection: The developer carrying out application testing should have the capability to select which release version to deploy to the testing environment. Deployment - Release the application deployable build artifact ( created from the CI stage ) to the target cloud environment. Configuration - Applications should be configured consistently across all your environments. This configuration is applied at the time of deployment. Sensitive data like app secrets and certificates should be mastered in a fully managed PaaS key and secret store (eg Key Vault , KMS ). 
Any secrets used by the application should be sourced internally within the application itself. Application Secrets should not be exposed within the runtime environment. We encourage 12 Factor principles, especially when it comes to configuration management . Data Migration - Pre populate application state and/or data records which is needed for your runtime environment. This may also include test data required for your end-to-end integration test suite. Deployment smoke test. Your smoke test should also verify that your application is pointing to the correct configuration (e.g. production pointing to a UAT Database). Perform any manual or automated acceptance test scenarios. Approve the release gate to promote the application version to the target cloud environment. This promotion should also include the environment's configuration state (e.g. new env settings, feature flags, etc).","title":"Release Pipeline Stages"},{"location":"CI-CD/continuous-delivery/#live-release-warm-up","text":"A release should be running for a period of time before it's considered live and allowed to accept user traffic. These warm up activities may include application server(s) and database(s) pre-fill any dependent cache(s) as well as establish all service connections (eg connection pool allocations, etc ).","title":"Live Release Warm Up"},{"location":"CI-CD/continuous-delivery/#pre-production-releases","text":"Application release candidates should be deployed to a staging environment similar to production for carrying out final manual/automated tests ( including capacity testing ). Your production and staging / pre-prod cloud environments should be setup at the beginning of your project. Application warm up should be a quantified measurement that's validated as part of your pre-prod smoke tests.","title":"Pre-production Releases"},{"location":"CI-CD/continuous-delivery/#rolling-back-releases","text":"Your release strategy should account for rollback scenarios in the event of unexpected failures following a deployment. Rolling back releases can get tricky, especially when database record/object changes occur in result of your deployment ( either inadvertently or intentionally ). If there are no data changes which need to be backed out, then you can simply trigger a new release candidate for the last known production version and promote that release along your CD pipeline. For rollback scenarios involving data changes, there are several approaches to mitigating this which fall outside the scope of this guide. Some involve database record versioning, time machining database records / objects, etc. All data files and databases should be backed up prior to each release so they could be restored. The mitigation strategy for this scenario will vary across our projects. The expectation is that this mitigation strategy should be covered as part of your release strategy. Another approach to consider when designing your release strategy is deployment rings . This approach simplifies rollback scenarios while limiting the impact of your release to end-users by gradually deploying and validating your changes in production.","title":"Rolling-Back Releases"},{"location":"CI-CD/continuous-delivery/#zero-downtime-releases","text":"A hot deployment follows a process of switching users from one release to another with no impact to the user experience. As an example, Azure managed app services allows developers to validate app changes in a staging deployment slot before swapping it with the production slot. 
App Service slot swapping can also be fully automated once the source slot is fully warmed up (and auto swap is enabled). Slot swapping also simplifies release rollbacks once a technical operator restores the slots to their pre-swap states. Kubernetes natively supports rolling updates .","title":"Zero Downtime Releases"},{"location":"CI-CD/continuous-delivery/#blue-green-deployments","text":"Blue / Green is a deployment technique which reduces downtime by running two identical instances of a production environment called Blue and Green . Only one of these environments accepts live production traffic at a given time. In the above example, live production traffic is routed to the Green environment. During application releases, the new version is deployed to the blue environment which occurs independently from the Green environment. Live traffic is unaffected from Blue environment releases. You can point your end-to-end test suite against the Blue environment as one of your test checkouts. Migrating users to the new application version is as simple as changing the router configuration to direct all traffic to the Blue environment. This technique simplifies rollback scenarios as we can simply switch the router back to Green. Database providers like Cosmos and Azure SQL natively support data replication to help enable fully synchronized Blue Green database environments.","title":"Blue-Green Deployments"},{"location":"CI-CD/continuous-delivery/#canary-releasing","text":"Canary releasing enables development teams to gather faster feedback when deploying new features to production. These releases are rolled out to a subset of production nodes ( where no users are routed to ) to collect early insights around capacity testing and functional completeness and impact. Once smoke and capacity tests are completed, you can route a small subset of users to the production nodes hosting the release candidate. Canary releases simplify rollbacks as you can avoid routing users to bad application versions. Try to limit the number of versions of your application running parallel in production, as it can complicate maintenance and monitoring controls.","title":"Canary Releasing"},{"location":"CI-CD/continuous-delivery/#low-code-solutions","text":"Low code solutions have increased their participation in the applications and processes and because of that it is required that a proper conjunction of disciplines improve their development. Here is a guide for continuous deployment for Low Code Solutions .","title":"Low Code Solutions"},{"location":"CI-CD/continuous-delivery/#resources","text":"Continuous Delivery by Jez Humble, David Farley. Continuous integration vs. continuous delivery vs. continuous deployment Deployment Rings","title":"Resources"},{"location":"CI-CD/continuous-delivery/#tools","text":"Check out the below tools to help with some CD best practices listed above: Flux for gitops CI/CD workflow using GitOps Tekton for Kubernetes native pipelines Note Jenkins-X uses Tekton under the hood. Argo Workflows Flagger for powerful, Kubernetes native releases including blue/green, canary, and A/B testing. 
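To make the blue/green "router switch" described above concrete, here is a minimal sketch of one common way to express it on Kubernetes, where a Service acts as the router and its selector decides which release receives live traffic (labels and names are illustrative):

```yaml
# Sketch: a Kubernetes Service acting as the "router" in a blue/green setup.
# Cutting traffic over to the new release is a single change to the selector.
apiVersion: v1
kind: Service
metadata:
  name: my-app               # placeholder service name
spec:
  type: ClusterIP
  selector:
    app: my-app
    slot: green              # change to "blue" to route live traffic to the new release
  ports:
    - port: 80
      targetPort: 8080
```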
Not quite CD related, but checkout jsonnet , a templating language to reduce boilerplate and increase sharing between your yaml/json manifests.","title":"Tools"},{"location":"CI-CD/continuous-integration/","text":"Continuous Integration We encourage engineering teams to make an upfront investment during Sprint 0 of a project to establish an automated and repeatable pipeline which continuously integrates code and releases system executable(s) to target cloud environments. Each integration should be verified by an automated build process that asserts a suite of validation tests pass and surface any errors across the developer team. We encourage teams to implement the CI/CD pipelines before any service code is written for customers, which usually happens in Sprint 0(N). This way, the engineering team can develop and test their work in isolation without impacting other developers and promote a consistent devops workflow throughout the engagement. These principles map directly agile software development lifecycle practices . Goals Continuous integration automation is an integral part of the software development lifecycle intended to reduce build integration errors and maximize velocity across a dev crew. A robust build automation pipeline will: Accelerate team velocity Prevent integration problems Avoid last minute chaos during release dates Provide a quick feedback cycle for system-wide impact of local changes Separate build and deployment stages Measure and report metrics around build failures / success(s) Increase visibility across the team enabling tighter communication Reduce human errors, which is probably the most important part of automating the builds Build Definition Managed in Git Code / Manifest Artifacts Required to Build Your Project Should be Maintained Within Your Projects Git Repository CI provider-specific build pipeline definition(s) should reside within your project(s) git repository(s). Build Automation An automated build should encompass the following principles: Build Task A single step within your build pipeline that compiles your code project into a single build artifact. Unit Testing Your build definition includes validation steps to execute a suite of automated unit tests to ensure that application components meets its design and behaves as intended. Code Style Checks Code across an engineering team must be formatted to agreed coding standards. Such standards keep code consistent, and most importantly easy for the team and customer(s) to read and refactor. Code styling consistency encourages collective ownership for project scrum teams and our partners. There are several open source code style validation tools available to choose from ( code style checks , StyleCop ). The Code Review recipes section of the playbook has suggestions for linters and preferred styles for a number of languages. Your code and documentation should avoid the use of non-inclusive language wherever possible. Follow the Inclusive Linting section to ensure your project promotes an inclusive work environment for both the team and for customers. We recommend incorporating security analysis tools within the build stage of your pipeline such as: code credential scanner, security risk detection, static analysis, etc. For Azure DevOps, you can add a security scan task to your pipeline by installing the Microsoft Security Code Analysis Extension . GitHub Actions supports a similar extension with the RIPS security scan solution . Code standards are maintained within a single configuration file. 
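As a sketch of what such an automated style gate can look like, assuming a Python project whose flake8 rules live in one committed configuration file (the tool choice is illustrative, not prescriptive):

```yaml
# Sketch: fail the build when code does not conform to the agreed style definition.
# Assumes a Python project whose flake8 rules live in a single committed config file.
steps:
  - script: |
      pip install flake8
      flake8 .          # reads .flake8 / setup.cfg from the repository root
    displayName: Enforce code style
```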
There should be a step in your build pipeline that asserts code in the latest commit conforms to the known style definition. Build Script Target A single command should have the capability of building the system. This is also true for builds running on a CI server or on a developers local machine. No IDE Dependencies It's essential to have a build that's runnable through standalone scripts and not dependent on a particular IDE. Build pipeline targets can be triggered locally on their desktops through their IDE of choice. The build process should maintain enough flexibility to run within a CI server as well. As an example, dockerizing your build process offers this level of flexibility as VSCode and IntelliJ supports docker plugin extensions. DevOps Security Checks Introduce security to your project at early stages. Follow the DevSecOps section to introduce security practices, automation, tools and frameworks as part of the CI. Build Environment Dependencies Automated Local Environment Setup We encourage maintaining a consistent developer experience for all team members. There should be a central automated manifest / process that streamlines the installation and setup of any software dependencies. This way developers can replicate the same build environment locally as the one running on a CI server. Build automation scripts often require specific software packages and version pre-installed within the runtime environment of the OS. This presents some challenges as build processes typically version lock these dependencies. All developers on the team should be able to emulate the build environment from their local desktop regardless of their OS. For projects using VS Code, leveraging Dev Containers can really help standardize the local developer experience across the team. Well established software packaging tools like Docker, Maven, npm, etc should be considered when designing your build automation tool chain. Document Local Setup The setup process for setting up a local build environment should be well documented and easy for developers to follow. Infrastructure as Code Manage as much of the following as possible, as code: Configuration Files Configuration Management(ie environment variable automation via terraform ) Secret Management(ie creating Azure secrets via terraform ) Cloud Resource Provisioning Role Assignments Load Test Scenarios Availability Alerting / Monitoring Rules and Conditions Decoupling infrastructure from the application codebase simplifies engineering teams move to cloud native applications. Terraform resource providers like Azure DevOps is making it easier for developers to manage build pipeline variables, service connections and CI/CD pipeline definitions. Sample DevOps Workflow using Terraform and Cobalt Why Repeatable and auditable changes to infrastructure make it easier to roll back to known good configurations and to rapidly expand to new stages and regions without having to hand-wire cloud resources Battle tested and templated IAC reference projects like Cobalt and Bedrock enable more engineering teams deploy secure and scalable solutions at a much more rapid pace Simplify \u201clift and shift\u201d scenarios by abstracting the complexities of cloud-native computing away from application developer teams. IAC DevOPS: Operations by Pull Request The Infrastructure deployment process built around a repo that holds the current expected state of the system / Azure environment. Operational changes are made to the running system by making commits on this repo. 
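A minimal sketch of such an operations-by-pull-request pipeline, assuming Terraform templates checked into the repository (Terraform is assumed to be available on the agent; cloud credentials and approval gates are omitted for brevity):

```yaml
# Sketch: reconcile the environment with the state held in the repo on merges to main.
# Provider authentication is supplied via environment variables and is not shown here.
trigger:
  branches:
    include:
      - main

steps:
  - script: terraform init
    displayName: Initialize Terraform backend
  - script: terraform plan -out=tfplan
    displayName: Compute execution plan from the committed templates
  - script: terraform apply tfplan
    displayName: Apply the reviewed plan to the environment
```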
Git also provides a simple model for auditing deployments and rolling back to a previous state. Infrastructure Advocated Patterns You define infrastructure as code in Terraform / ARM / Ansible templates Templates are repeatable cloud resource stacks with a focus on configuration sets aligned with app scaling and throughput needs. IAC Principles Automate the Azure Environment All cloud resources are provisioned through a set of infrastructure as code templates. This also includes secrets, service configuration settings, role assignments and monitoring conditions. Azure Portal should provide a read-only view on environment resources. Any change applied to the environment should be made through the IAC CI tool-chain only. Provisioning cloud environments should be a repeatable process that's driven off the infrastructure code artifacts checked into our git repository. IAC CI Workflow When the IAC template files change through a git-based workflow, A CI build pipeline builds, validates and reconciles the target infrastructure environment's current state with the expected state. The infrastructure execution plan candidate for these fixed environments are reviewed by a cloud administrator as a gate check prior to the deployment stage of the pipeline applying the execution plan. Developer Read-Only Access to Cloud Resources Developer accounts in the Azure portal should have read-only access to IAC environment resources in Azure. Secret Automation IAC templates are deployed via a CI/CD system that has secrets automation integrated. Avoid applying changes to secrets and/or certificates directly in the Azure Portal. Infrastructure Integration Test Automation End-to-end integration tests are run as part of your IAC CI process to inspect and validate that an azure environment is ready for use. Infrastructure Documentation The deployment and cloud resource template topology should be documented and well understood within the README of the IAC git repo. Local environment and CI workflow setup steps should be documented. Configuration Validation Applications use configuration to allow different runtime behaviors and it\u2019s quite common to use files to store these settings. As developers, we might introduce errors while editing these files which would cause issues for the application to start and/or run correctly. By applying validation techniques on both syntax and semantics of our configuration, we can detect errors before the application is deployed and execute, improving the developer (user) experience. Application Configuration Files Examples JSON, with support for complex data types and data structures YAML, a super set of JSON with support for complex data types and structures TOML, a super set of JSON and a formally specified configuration file format Why Validate Application Configuration as a Separate Step? Easier Debugging & Time saving - With a configuration validation step in our pipeline, we can avoid running the application just to find it fails. It saves time on having to deploy & run, wait and then realize something is wrong in configuration. In addition, it also saves time on going through the logs to figure out what failed and why. Better user/developer experience - A simple reminder to the user that something in the configuration isn't in the right format can make all the difference between the joy of a successful deployment process and the intense frustration of having to guess what went wrong. 
For example, when there is a Boolean value expected, it can either be a string value like \"True\" or \"False\" or an integer value such as \"0\" or \"1\" . With configuration validation we make sure the meaning is correct for our application. Avoid data corruption and security breaches - Since the data arrives from an untrusted source, such as a user or an external webservice, it\u2019s particularly important to validate the input . Otherwise, it will run at the risk of performing errors, corrupting data, or, worse, be vulnerable to a whole array of injection attacks. What is Json Schema? JSON-Schema is the standard of JSON documents that describes the structure and the requirements of your JSON data. Although it is called JSON-Schema, it also common to use this method for YAMLs, as it is a super set of JSON. The schema is very simple; point out which fields might exist, which are required or optional, what data format they use. Other validation rules can be added on top of that basic premise, along with human-readable information. The metadata lives in schemas which are .json files as well. In addition, schema has the widest adoption among all standards for JSON validation as it covers a big part of validation scenarios. It uses easy-to-parse JSON documents for schemas and is easily extensible. How to Implement Schema Validation? Implementing schema validation is divided in two - the generation of the schemas and the validation of yaml/json files with those schemas. Generation There are two options to generate a schema: From code - we can leverage the existing models and objects in the code and generate a customized schema. From data - we can take yaml/json samples which reflect the configuration in general and use the various online tools to generate a schema. Validation The schema has 30+ validators for different languages, including 10+ for JavaScript, so no need to code it yourself. Integration Validation An effective way to identify bugs in your build at a rapid pace is to invest early into a reliable suite of automated tests that validate the baseline functionality of the system: End-to-End Integration Tests Include tests in your pipeline to validate the build candidate conforms to automated business functionality assertions. Any bugs or broken code should be reported in the test results including the failed test and relevant stack trace. All tests should be invoked through a single command. Keep the build fast. Consider automated test runtime when deciding to pull in dependencies like databases, external services and mock data loading into your test harness. Slow builds often become a bottleneck for dev teams when parallel builds on a CI server are not an option. Consider adding max timeout limits for lengthy validations to fail fast and maintain high velocity across the team. Avoid Checking in Broken Builds Automated build checks, tests, lint runs, etc should be validated locally before committing your changes to the scm repo. Test Driven Development is a practice dev crews should consider to help identify bugs and failures as early as possible within the development lifecycle. Reporting Build Failures If the build step happens to fail then the build pipeline run status should be reported as failed including relevant logs and stack traces. Test Automation Data Dependencies Any mocked dataset(s) used for unit and end-to-end integration tests should be checked into the mainline repository. Minimize any external data dependencies with your build process. 
Code Coverage Checks We recommend integrating code coverage tools within your build stage. Most coverage tools fail builds when the test coverage falls below a minimum threshold(80% coverage). The coverage report should be published to your CI system to track a time series of variations. Git Driven Workflow Build on Commit Every commit to the baseline repository should trigger the CI pipeline to create a new build candidate. Build artifact(s) are built, packaged, validated and deployed continuously into a non-production environment per commit. Each commit against the repository results into a CI run which checks out the sources onto the integration machine, initiates a build, and notifies the committer of the result of the build. Avoid Commenting Out Failing Tests Avoid commenting out tests in the mainline branch. By commenting out tests, we get an incorrect indication of the status of the build. Branch Policy Enforcement Protected branch policies should be setup on the main branch to ensure that CI stage(s) have passed prior to starting a code review. Code review approvers will only start reviewing a pull request once the CI pipeline run passes for the latest pushed git commit. Broken builds should block pull request reviews. Prevent commits directly into main branch. Branch Strategy Release branches should auto trigger the deployment of a build artifact to its target cloud environment. You can find additional guidance on the Azure DevOps documentation site under the Manage deployments section Deliver Quickly and Daily \"By committing regularly, every committer can reduce the number of conflicting changes. Checking in a week's worth of work runs the risk of conflicting with other features and can be very difficult to resolve. Early, small conflicts in an area of the system cause team members to communicate about the change they are making.\" In the spirit of transparency and embracing frequent communication across a dev crew, we encourage developers to commit code on a daily cadence. This approach provides visibility to feature progress and accelerates pair programming across the team. Here are some principles to consider: Everyone Commits to the Git Repository Each Day End of day checked-in code should contain unit tests at the minimum. Run the build locally before checking in to avoid CI pipeline failure saturation. You should verify what caused the error, and try to solve it as soon as possible instead of committing your code. We encourage developers to follow a lean SDLC principles . Isolate work into small chunks which ties directly to business value and refactor incrementally. Isolated Environments One of the key goals of build validation is to isolate and identify failures in staging environment(s) and minimize any disruption to live production traffic. Our E2E automated tests should run in an environment which mimics our production environment(as much as possible). This includes consistent software versions, OS, test data volume simulations, network traffic parity with production, etc. Test in a Clone of Production The production environment should be duplicated into a staging environment(QA and/or Pre-Prod) at a minimum. Pull Request Updates Trigger Staged Releases New commits related to a pull request should trigger a build / release into an integration environment. The production environment should be fully isolated from this process. 
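A minimal sketch of the trigger configuration that supports this flow in an Azure Pipelines YAML definition (branch names are illustrative; note that Azure Repos projects typically enforce pull request builds through branch policies rather than the pr keyword):

```yaml
# Sketch: build every commit to main and every pull request update.
# Pull request builds should deploy only to an isolated integration environment.
trigger:
  branches:
    include:
      - main

pr:
  branches:
    include:
      - main
```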
Promote Infrastructure Changes Across Fixed Environments Infrastructure as code changes should be tested in an integration environment and promoted to all staging environment(s) then migrated to production with zero downtime for system users. Testing in Production There are various approaches with safely carrying out automated tests for production deployments. Some of these may include: Feature flagging A/B testing Traffic shifting Developer Access to the Latest Release Artifacts Our devops workflow should enable developers to get, install and run the latest system executable. Release executable(s) should be auto generated as part of our CI/CD pipeline(s). Developers can Access the Latest Executable The latest system executable is available for all developers on the team. There should be a well-known place where developers can reference the release artifact. Release Artifacts are Published for Each Pull Request or Merges into the Main Branch Integration Observability Applied state changes to the mainline build should be made available and communicated across the team. Centralizing logs and status(s) from build and release pipeline failures are essential for developers investigating broken builds. We recommend integrating Teams or Slack with CI/CD pipeline runs which helps keep the team continuously plugged into failures and build candidate status(s). Continuous Integration Top Level Dashboard Modern CI providers have the capability to consolidate and report build status(s) within a given dashboard. Your CI dashboard should be able to correlate a build failure with a git commit. Build Status Badge in the Project Readme There should be a build status badge included in the root README of the project. Build Notifications Your CI process should be configured to send notifications to messaging platforms like Teams / Slack once the build completes. We recommend creating a separate channel to help consolidate and isolate these notifications. Resources Martin Fowler's Continuous Integration Best Practices Bedrock Getting Started Quick Guide Cobalt Quick Start Guide Terraform Azure DevOps Provider Azure DevOps multi stage pipelines Azure Pipeline Key Concepts Azure Pipeline Environments Artifacts in Azure Pipelines Azure Pipeline permission and security roles Azure Environment approvals and checks Terraform Getting Started Guide with Azure Terraform Remote State Azure Setup Terratest - Unit and Integration Infrastructure Framework","title":"Continuous Integration"},{"location":"CI-CD/continuous-integration/#continuous-integration","text":"We encourage engineering teams to make an upfront investment during Sprint 0 of a project to establish an automated and repeatable pipeline which continuously integrates code and releases system executable(s) to target cloud environments. Each integration should be verified by an automated build process that asserts a suite of validation tests pass and surface any errors across the developer team. We encourage teams to implement the CI/CD pipelines before any service code is written for customers, which usually happens in Sprint 0(N). This way, the engineering team can develop and test their work in isolation without impacting other developers and promote a consistent devops workflow throughout the engagement. 
These principles map directly agile software development lifecycle practices .","title":"Continuous Integration"},{"location":"CI-CD/continuous-integration/#goals","text":"Continuous integration automation is an integral part of the software development lifecycle intended to reduce build integration errors and maximize velocity across a dev crew. A robust build automation pipeline will: Accelerate team velocity Prevent integration problems Avoid last minute chaos during release dates Provide a quick feedback cycle for system-wide impact of local changes Separate build and deployment stages Measure and report metrics around build failures / success(s) Increase visibility across the team enabling tighter communication Reduce human errors, which is probably the most important part of automating the builds","title":"Goals"},{"location":"CI-CD/continuous-integration/#build-definition-managed-in-git","text":"","title":"Build Definition Managed in Git"},{"location":"CI-CD/continuous-integration/#code-manifest-artifacts-required-to-build-your-project-should-be-maintained-within-your-projects-git-repository","text":"CI provider-specific build pipeline definition(s) should reside within your project(s) git repository(s).","title":"Code / Manifest Artifacts Required to Build Your Project Should be Maintained Within Your Projects Git Repository"},{"location":"CI-CD/continuous-integration/#build-automation","text":"An automated build should encompass the following principles:","title":"Build Automation"},{"location":"CI-CD/continuous-integration/#build-task","text":"A single step within your build pipeline that compiles your code project into a single build artifact.","title":"Build Task"},{"location":"CI-CD/continuous-integration/#unit-testing","text":"Your build definition includes validation steps to execute a suite of automated unit tests to ensure that application components meets its design and behaves as intended.","title":"Unit Testing"},{"location":"CI-CD/continuous-integration/#code-style-checks","text":"Code across an engineering team must be formatted to agreed coding standards. Such standards keep code consistent, and most importantly easy for the team and customer(s) to read and refactor. Code styling consistency encourages collective ownership for project scrum teams and our partners. There are several open source code style validation tools available to choose from ( code style checks , StyleCop ). The Code Review recipes section of the playbook has suggestions for linters and preferred styles for a number of languages. Your code and documentation should avoid the use of non-inclusive language wherever possible. Follow the Inclusive Linting section to ensure your project promotes an inclusive work environment for both the team and for customers. We recommend incorporating security analysis tools within the build stage of your pipeline such as: code credential scanner, security risk detection, static analysis, etc. For Azure DevOps, you can add a security scan task to your pipeline by installing the Microsoft Security Code Analysis Extension . GitHub Actions supports a similar extension with the RIPS security scan solution . Code standards are maintained within a single configuration file. There should be a step in your build pipeline that asserts code in the latest commit conforms to the known style definition.","title":"Code Style Checks"},{"location":"CI-CD/continuous-integration/#build-script-target","text":"A single command should have the capability of building the system. 
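For example, the pipeline can call the same single entry point a developer runs locally (./build.sh is a placeholder for whatever wraps your build, such as make, gradle, or dotnet):

```yaml
# Sketch: the CI pipeline invokes the same single build command developers run locally.
# "./build.sh" is a placeholder for the project's own build entry point.
steps:
  - script: ./build.sh
    displayName: Build the system with one command
```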
This is also true for builds running on a CI server or on a developers local machine.","title":"Build Script Target"},{"location":"CI-CD/continuous-integration/#no-ide-dependencies","text":"It's essential to have a build that's runnable through standalone scripts and not dependent on a particular IDE. Build pipeline targets can be triggered locally on their desktops through their IDE of choice. The build process should maintain enough flexibility to run within a CI server as well. As an example, dockerizing your build process offers this level of flexibility as VSCode and IntelliJ supports docker plugin extensions.","title":"No IDE Dependencies"},{"location":"CI-CD/continuous-integration/#devops-security-checks","text":"Introduce security to your project at early stages. Follow the DevSecOps section to introduce security practices, automation, tools and frameworks as part of the CI.","title":"DevOps Security Checks"},{"location":"CI-CD/continuous-integration/#build-environment-dependencies","text":"","title":"Build Environment Dependencies"},{"location":"CI-CD/continuous-integration/#automated-local-environment-setup","text":"We encourage maintaining a consistent developer experience for all team members. There should be a central automated manifest / process that streamlines the installation and setup of any software dependencies. This way developers can replicate the same build environment locally as the one running on a CI server. Build automation scripts often require specific software packages and version pre-installed within the runtime environment of the OS. This presents some challenges as build processes typically version lock these dependencies. All developers on the team should be able to emulate the build environment from their local desktop regardless of their OS. For projects using VS Code, leveraging Dev Containers can really help standardize the local developer experience across the team. Well established software packaging tools like Docker, Maven, npm, etc should be considered when designing your build automation tool chain.","title":"Automated Local Environment Setup"},{"location":"CI-CD/continuous-integration/#document-local-setup","text":"The setup process for setting up a local build environment should be well documented and easy for developers to follow.","title":"Document Local Setup"},{"location":"CI-CD/continuous-integration/#infrastructure-as-code","text":"Manage as much of the following as possible, as code: Configuration Files Configuration Management(ie environment variable automation via terraform ) Secret Management(ie creating Azure secrets via terraform ) Cloud Resource Provisioning Role Assignments Load Test Scenarios Availability Alerting / Monitoring Rules and Conditions Decoupling infrastructure from the application codebase simplifies engineering teams move to cloud native applications. 
Terraform resource providers like Azure DevOps is making it easier for developers to manage build pipeline variables, service connections and CI/CD pipeline definitions.","title":"Infrastructure as Code"},{"location":"CI-CD/continuous-integration/#sample-devops-workflow-using-terraform-and-cobalt","text":"","title":"Sample DevOps Workflow using Terraform and Cobalt"},{"location":"CI-CD/continuous-integration/#why","text":"Repeatable and auditable changes to infrastructure make it easier to roll back to known good configurations and to rapidly expand to new stages and regions without having to hand-wire cloud resources Battle tested and templated IAC reference projects like Cobalt and Bedrock enable more engineering teams deploy secure and scalable solutions at a much more rapid pace Simplify \u201clift and shift\u201d scenarios by abstracting the complexities of cloud-native computing away from application developer teams.","title":"Why"},{"location":"CI-CD/continuous-integration/#iac-devops-operations-by-pull-request","text":"The Infrastructure deployment process built around a repo that holds the current expected state of the system / Azure environment. Operational changes are made to the running system by making commits on this repo. Git also provides a simple model for auditing deployments and rolling back to a previous state.","title":"IAC DevOPS: Operations by Pull Request"},{"location":"CI-CD/continuous-integration/#infrastructure-advocated-patterns","text":"You define infrastructure as code in Terraform / ARM / Ansible templates Templates are repeatable cloud resource stacks with a focus on configuration sets aligned with app scaling and throughput needs.","title":"Infrastructure Advocated Patterns"},{"location":"CI-CD/continuous-integration/#iac-principles","text":"","title":"IAC Principles"},{"location":"CI-CD/continuous-integration/#automate-the-azure-environment","text":"All cloud resources are provisioned through a set of infrastructure as code templates. This also includes secrets, service configuration settings, role assignments and monitoring conditions. Azure Portal should provide a read-only view on environment resources. Any change applied to the environment should be made through the IAC CI tool-chain only. Provisioning cloud environments should be a repeatable process that's driven off the infrastructure code artifacts checked into our git repository.","title":"Automate the Azure Environment"},{"location":"CI-CD/continuous-integration/#iac-ci-workflow","text":"When the IAC template files change through a git-based workflow, A CI build pipeline builds, validates and reconciles the target infrastructure environment's current state with the expected state. The infrastructure execution plan candidate for these fixed environments are reviewed by a cloud administrator as a gate check prior to the deployment stage of the pipeline applying the execution plan.","title":"IAC CI Workflow"},{"location":"CI-CD/continuous-integration/#developer-read-only-access-to-cloud-resources","text":"Developer accounts in the Azure portal should have read-only access to IAC environment resources in Azure.","title":"Developer Read-Only Access to Cloud Resources"},{"location":"CI-CD/continuous-integration/#secret-automation","text":"IAC templates are deployed via a CI/CD system that has secrets automation integrated. 
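As one example of what that integration can look like, a sketch of a pipeline step that loads secrets from a managed key vault at deployment time (the service connection and vault names are placeholders):

```yaml
# Sketch: pull secrets from a managed secret store at deployment time
# instead of editing them by hand in the portal. Names are placeholders.
steps:
  - task: AzureKeyVault@2
    displayName: Load deployment secrets from Key Vault
    inputs:
      azureSubscription: 'iac-service-connection'
      KeyVaultName: 'my-project-kv'
      SecretsFilter: '*'        # or a comma-separated list of specific secret names
      RunAsPreJob: false
```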
Avoid applying changes to secrets and/or certificates directly in the Azure Portal.","title":"Secret Automation"},{"location":"CI-CD/continuous-integration/#infrastructure-integration-test-automation","text":"End-to-end integration tests are run as part of your IAC CI process to inspect and validate that an azure environment is ready for use.","title":"Infrastructure Integration Test Automation"},{"location":"CI-CD/continuous-integration/#infrastructure-documentation","text":"The deployment and cloud resource template topology should be documented and well understood within the README of the IAC git repo. Local environment and CI workflow setup steps should be documented.","title":"Infrastructure Documentation"},{"location":"CI-CD/continuous-integration/#configuration-validation","text":"Applications use configuration to allow different runtime behaviors and it\u2019s quite common to use files to store these settings. As developers, we might introduce errors while editing these files which would cause issues for the application to start and/or run correctly. By applying validation techniques on both syntax and semantics of our configuration, we can detect errors before the application is deployed and execute, improving the developer (user) experience.","title":"Configuration Validation"},{"location":"CI-CD/continuous-integration/#application-configuration-files-examples","text":"JSON, with support for complex data types and data structures YAML, a super set of JSON with support for complex data types and structures TOML, a super set of JSON and a formally specified configuration file format","title":"Application Configuration Files Examples"},{"location":"CI-CD/continuous-integration/#why-validate-application-configuration-as-a-separate-step","text":"Easier Debugging & Time saving - With a configuration validation step in our pipeline, we can avoid running the application just to find it fails. It saves time on having to deploy & run, wait and then realize something is wrong in configuration. In addition, it also saves time on going through the logs to figure out what failed and why. Better user/developer experience - A simple reminder to the user that something in the configuration isn't in the right format can make all the difference between the joy of a successful deployment process and the intense frustration of having to guess what went wrong. For example, when there is a Boolean value expected, it can either be a string value like \"True\" or \"False\" or an integer value such as \"0\" or \"1\" . With configuration validation we make sure the meaning is correct for our application. Avoid data corruption and security breaches - Since the data arrives from an untrusted source, such as a user or an external webservice, it\u2019s particularly important to validate the input . Otherwise, it will run at the risk of performing errors, corrupting data, or, worse, be vulnerable to a whole array of injection attacks.","title":"Why Validate Application Configuration as a Separate Step?"},{"location":"CI-CD/continuous-integration/#what-is-json-schema","text":"JSON-Schema is the standard of JSON documents that describes the structure and the requirements of your JSON data. Although it is called JSON-Schema, it also common to use this method for YAMLs, as it is a super set of JSON. The schema is very simple; point out which fields might exist, which are required or optional, what data format they use. 
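For instance, a minimal sketch of a schema for a small application configuration (schemas are normally stored as .json files; the equivalent content is shown here in YAML form for readability, and the field names are illustrative):

```yaml
# Sketch of a schema for a small application config.
# It declares which fields exist, which are required, and what format they must use.
$schema: "http://json-schema.org/draft-07/schema#"
type: object
required:
  - connectionString
  - retryCount
properties:
  connectionString:
    type: string
    description: Database connection string, injected at deployment time.
  retryCount:
    type: integer
    minimum: 0
    maximum: 10
  verboseLogging:
    type: boolean
additionalProperties: false
```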
Other validation rules can be added on top of that basic premise, along with human-readable information. The metadata lives in schemas which are .json files as well. In addition, schema has the widest adoption among all standards for JSON validation as it covers a big part of validation scenarios. It uses easy-to-parse JSON documents for schemas and is easily extensible.","title":"What is Json Schema?"},{"location":"CI-CD/continuous-integration/#how-to-implement-schema-validation","text":"Implementing schema validation is divided in two - the generation of the schemas and the validation of yaml/json files with those schemas.","title":"How to Implement Schema Validation?"},{"location":"CI-CD/continuous-integration/#generation","text":"There are two options to generate a schema: From code - we can leverage the existing models and objects in the code and generate a customized schema. From data - we can take yaml/json samples which reflect the configuration in general and use the various online tools to generate a schema.","title":"Generation"},{"location":"CI-CD/continuous-integration/#validation","text":"The schema has 30+ validators for different languages, including 10+ for JavaScript, so no need to code it yourself.","title":"Validation"},{"location":"CI-CD/continuous-integration/#integration-validation","text":"An effective way to identify bugs in your build at a rapid pace is to invest early into a reliable suite of automated tests that validate the baseline functionality of the system:","title":"Integration Validation"},{"location":"CI-CD/continuous-integration/#end-to-end-integration-tests","text":"Include tests in your pipeline to validate the build candidate conforms to automated business functionality assertions. Any bugs or broken code should be reported in the test results including the failed test and relevant stack trace. All tests should be invoked through a single command. Keep the build fast. Consider automated test runtime when deciding to pull in dependencies like databases, external services and mock data loading into your test harness. Slow builds often become a bottleneck for dev teams when parallel builds on a CI server are not an option. Consider adding max timeout limits for lengthy validations to fail fast and maintain high velocity across the team.","title":"End-to-End Integration Tests"},{"location":"CI-CD/continuous-integration/#avoid-checking-in-broken-builds","text":"Automated build checks, tests, lint runs, etc should be validated locally before committing your changes to the scm repo. Test Driven Development is a practice dev crews should consider to help identify bugs and failures as early as possible within the development lifecycle.","title":"Avoid Checking in Broken Builds"},{"location":"CI-CD/continuous-integration/#reporting-build-failures","text":"If the build step happens to fail then the build pipeline run status should be reported as failed including relevant logs and stack traces.","title":"Reporting Build Failures"},{"location":"CI-CD/continuous-integration/#test-automation-data-dependencies","text":"Any mocked dataset(s) used for unit and end-to-end integration tests should be checked into the mainline repository. Minimize any external data dependencies with your build process.","title":"Test Automation Data Dependencies"},{"location":"CI-CD/continuous-integration/#code-coverage-checks","text":"We recommend integrating code coverage tools within your build stage. 
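A sketch of such a coverage gate, assuming a Python project using pytest with the pytest-cov plugin (the source folder and threshold are illustrative):

```yaml
# Sketch: fail the build when test coverage drops below the agreed threshold.
# Assumes a Python project with pytest and the pytest-cov plugin.
steps:
  - script: |
      pip install pytest pytest-cov
      pytest --cov=src --cov-fail-under=80
    displayName: Run tests with a minimum coverage gate
```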
Most coverage tools fail builds when the test coverage falls below a minimum threshold(80% coverage). The coverage report should be published to your CI system to track a time series of variations.","title":"Code Coverage Checks"},{"location":"CI-CD/continuous-integration/#git-driven-workflow","text":"","title":"Git Driven Workflow"},{"location":"CI-CD/continuous-integration/#build-on-commit","text":"Every commit to the baseline repository should trigger the CI pipeline to create a new build candidate. Build artifact(s) are built, packaged, validated and deployed continuously into a non-production environment per commit. Each commit against the repository results into a CI run which checks out the sources onto the integration machine, initiates a build, and notifies the committer of the result of the build.","title":"Build on Commit"},{"location":"CI-CD/continuous-integration/#avoid-commenting-out-failing-tests","text":"Avoid commenting out tests in the mainline branch. By commenting out tests, we get an incorrect indication of the status of the build.","title":"Avoid Commenting Out Failing Tests"},{"location":"CI-CD/continuous-integration/#branch-policy-enforcement","text":"Protected branch policies should be setup on the main branch to ensure that CI stage(s) have passed prior to starting a code review. Code review approvers will only start reviewing a pull request once the CI pipeline run passes for the latest pushed git commit. Broken builds should block pull request reviews. Prevent commits directly into main branch.","title":"Branch Policy Enforcement"},{"location":"CI-CD/continuous-integration/#branch-strategy","text":"Release branches should auto trigger the deployment of a build artifact to its target cloud environment. You can find additional guidance on the Azure DevOps documentation site under the Manage deployments section","title":"Branch Strategy"},{"location":"CI-CD/continuous-integration/#deliver-quickly-and-daily","text":"\"By committing regularly, every committer can reduce the number of conflicting changes. Checking in a week's worth of work runs the risk of conflicting with other features and can be very difficult to resolve. Early, small conflicts in an area of the system cause team members to communicate about the change they are making.\" In the spirit of transparency and embracing frequent communication across a dev crew, we encourage developers to commit code on a daily cadence. This approach provides visibility to feature progress and accelerates pair programming across the team. Here are some principles to consider:","title":"Deliver Quickly and Daily"},{"location":"CI-CD/continuous-integration/#everyone-commits-to-the-git-repository-each-day","text":"End of day checked-in code should contain unit tests at the minimum. Run the build locally before checking in to avoid CI pipeline failure saturation. You should verify what caused the error, and try to solve it as soon as possible instead of committing your code. We encourage developers to follow a lean SDLC principles . Isolate work into small chunks which ties directly to business value and refactor incrementally.","title":"Everyone Commits to the Git Repository Each Day"},{"location":"CI-CD/continuous-integration/#isolated-environments","text":"One of the key goals of build validation is to isolate and identify failures in staging environment(s) and minimize any disruption to live production traffic. 
Our E2E automated tests should run in an environment which mimics our production environment(as much as possible). This includes consistent software versions, OS, test data volume simulations, network traffic parity with production, etc.","title":"Isolated Environments"},{"location":"CI-CD/continuous-integration/#test-in-a-clone-of-production","text":"The production environment should be duplicated into a staging environment(QA and/or Pre-Prod) at a minimum.","title":"Test in a Clone of Production"},{"location":"CI-CD/continuous-integration/#pull-request-updates-trigger-staged-releases","text":"New commits related to a pull request should trigger a build / release into an integration environment. The production environment should be fully isolated from this process.","title":"Pull Request Updates Trigger Staged Releases"},{"location":"CI-CD/continuous-integration/#promote-infrastructure-changes-across-fixed-environments","text":"Infrastructure as code changes should be tested in an integration environment and promoted to all staging environment(s) then migrated to production with zero downtime for system users.","title":"Promote Infrastructure Changes Across Fixed Environments"},{"location":"CI-CD/continuous-integration/#testing-in-production","text":"There are various approaches with safely carrying out automated tests for production deployments. Some of these may include: Feature flagging A/B testing Traffic shifting","title":"Testing in Production"},{"location":"CI-CD/continuous-integration/#developer-access-to-the-latest-release-artifacts","text":"Our devops workflow should enable developers to get, install and run the latest system executable. Release executable(s) should be auto generated as part of our CI/CD pipeline(s).","title":"Developer Access to the Latest Release Artifacts"},{"location":"CI-CD/continuous-integration/#developers-can-access-the-latest-executable","text":"The latest system executable is available for all developers on the team. There should be a well-known place where developers can reference the release artifact.","title":"Developers can Access the Latest Executable"},{"location":"CI-CD/continuous-integration/#release-artifacts-are-published-for-each-pull-request-or-merges-into-the-main-branch","text":"","title":"Release Artifacts are Published for Each Pull Request or Merges into the Main Branch"},{"location":"CI-CD/continuous-integration/#integration-observability","text":"Applied state changes to the mainline build should be made available and communicated across the team. Centralizing logs and status(s) from build and release pipeline failures are essential for developers investigating broken builds. We recommend integrating Teams or Slack with CI/CD pipeline runs which helps keep the team continuously plugged into failures and build candidate status(s).","title":"Integration Observability"},{"location":"CI-CD/continuous-integration/#continuous-integration-top-level-dashboard","text":"Modern CI providers have the capability to consolidate and report build status(s) within a given dashboard. 
Your CI dashboard should be able to correlate a build failure with a git commit.","title":"Continuous Integration Top Level Dashboard"},{"location":"CI-CD/continuous-integration/#build-status-badge-in-the-project-readme","text":"There should be a build status badge included in the root README of the project.","title":"Build Status Badge in the Project Readme"},{"location":"CI-CD/continuous-integration/#build-notifications","text":"Your CI process should be configured to send notifications to messaging platforms like Teams / Slack once the build completes. We recommend creating a separate channel to help consolidate and isolate these notifications.","title":"Build Notifications"},{"location":"CI-CD/continuous-integration/#resources","text":"Martin Fowler's Continuous Integration Best Practices Bedrock Getting Started Quick Guide Cobalt Quick Start Guide Terraform Azure DevOps Provider Azure DevOps multi stage pipelines Azure Pipeline Key Concepts Azure Pipeline Environments Artifacts in Azure Pipelines Azure Pipeline permission and security roles Azure Environment approvals and checks Terraform Getting Started Guide with Azure Terraform Remote State Azure Setup Terratest - Unit and Integration Infrastructure Framework","title":"Resources"},{"location":"CI-CD/dev-sec-ops/","text":"DevSecOps The Concept of DevSecOps DevSecOps or DevOps security is about introducing security earlier in the life cycle of application development (a.k.a. shift-left), thus minimizing the impact of vulnerabilities and bringing security closer to the development team. Why By embracing a shift-left mentality, DevSecOps encourages organizations to bridge the gap that often exists between development and security teams, to the point where many of the security processes are automated and effectively handled by the development team. DevSecOps Practices This section covers different tools, frameworks and resources that allow introducing DevSecOps best practices into your project at the early stages of development. Topics covered: Credential Scanning - automatically inspecting a project to ensure that no secrets are included in the project's source code. Secrets Rotation - an automated process by which the secret used by the application is refreshed and replaced by a new secret. Static Code Analysis - analyze source code or compiled versions of code to help find security flaws. Penetration Testing - a simulated attack against your application to check for exploitable vulnerabilities. Container Dependencies Scanning - search for vulnerabilities in container operating systems, language packages and application dependencies.
Evaluation of Open Source Libraries - make it harder to apply open source supply chain attacks by evaluating the libraries you use.","title":"DevSecOps"},{"location":"CI-CD/dev-sec-ops/#devsecops","text":"","title":"DevSecOps"},{"location":"CI-CD/dev-sec-ops/#the-concept-of-devsecops","text":"DevSecOps or DevOps security is about introducing security earlier in the life cycle of application development (a.k.a shift-left), thus minimizing the impact of vulnerabilities and bringing security closer to development team.","title":"The Concept of DevSecOps"},{"location":"CI-CD/dev-sec-ops/#why","text":"By embracing shift-left mentality, DevSecOps encourages organizations to bridge the gap that often exists between development and security teams to the point where many of the security processes are automated and are effectively handled by the development team.","title":"Why"},{"location":"CI-CD/dev-sec-ops/#devsecops-practices","text":"This section covers different tools, frameworks and resources allowing introduction of DevSecOps best practices to your project at early stages of development. Topics covered: Credential Scanning - automatically inspecting a project to ensure that no secrets are included in the project's source code. Secrets Rotation - automated process by which the secret, used by the application, is refreshed and replaced by a new secret. Static Code Analysis - analyze source code or compiled versions of code to help find security flaws. Penetration Testing - a simulated attack against your application to check for exploitable vulnerabilities. Container Dependencies Scanning - search for vulnerabilities in container operating systems, language packages and application dependencies. Evaluation of Open Source Libraries - make it harder to apply open source supply chain attacks by evaluating the libraries you use.","title":"DevSecOps Practices"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/","text":"Azure DevOps Service Connection Security Service Connections are used in Azure DevOps Pipelines to connect to external services, like Azure, GitHub, Docker, Kubernetes, and many other services. Service Connections can be used to authenticate to these external services and to invoke diverse types of commands, like create and update resources in Azure, upload container images to Docker, or deploy applications to Kubernetes. To be able to invoke these commands, Service Connections need to have the right permissions to do so, for most types of Service Connections the permissions can be scoped to a subset of resources to limit the access they have. To improve the principle of least privilege, it's often very common to have separate Service Connections for different environments like Dev/Test/QA/Prod. Secure Service Connection Securing Service Connections can be achieved by using several methods. User permissions can be configured to ensure only the correct users can create, view, use, and manage the Service Connection. Pipeline-level permissions can be configured to ensure only approved YAML pipelines are able to use the Service Connection. Project permissions can be configured to ensure only certain Azure DevOps projects are able to use the Service Connection. After using the above methods, what is secured is who can use the Service Connections. What still isn't secured however, is what can be done with the Service Connections. 
Because Service Connections have all the necessary permissions in the external services, it is crucial to secure Service Connections so they cannot be misused by accident or by malicious users. An example of this is a Azure DevOps Pipeline that uses a Service Connection to an Azure Resource Group (or entire subscription) to list all resources and then delete those resources. Without the correct security in place, it could be possible to execute this Pipeline, without any validation or reviews being done. pool : vmImage : ubuntu-latest steps : - task : AzureCLI@2 inputs : azureSubscription : 'Production Service Connection' scriptType : 'pscore' scriptLocation : 'inlineScript' inlineScript : | $resources = az resource list foreach ($resource in $resources) { az resource delete --ids $resource.id } Pipeline Security Caveat YAML pipelines can be triggered without the need for a pull request, this introduces a security risk. In good practice, Pull Requests and Code Reviews should be used to ensure the code that is being deployed, is being reviewed by a second person and potentially automatically being checked for vulnerabilities and other security issues. However, YAML Pipelines can be executed without the need for a Pull Request and Code Reviews. This allows the (malicious) user to make changes using the Service Connection which would normally require a reviewer. The configuration of when a pipeline should be triggered is specified in the YAML Pipeline itself and therefore a pipeline can be configured to execute on changes in a temporary branch. In this temporary branch, any changes made to the pipeline itself will be executed without being reviewed. If the given pipeline has been granted Pipeline-level permissions to use a specific Service Connection, any command can be executed using that Service Connection, without anyone reviewing the command. Since Service Connections can have a lot of permissions in the external service, executing any pipeline without review could potentially have big consequences. Service Connection Checks To prevent accidental mis-use of Service Connections there are several checks that can be configured. These checks are configured on the Service Connection itself and therefore can only be configured by the owner or administrator of that Service Connection. A user of a certain YAML Pipeline cannot modify these checks since the checks are not defined in the YAML file itself. Configuration can be done in the Approvals and Checks menu on the Service Connection. Branch Control By configuring Branch Control on a Service Connection, you can control that the Service Connection can only be used in a YAML Pipeline if the pipeline is running from a specific branch. By configuring Branch Control to only allow the main branch (and potentially release branches) you can ensure a YAML Pipeline can only use the Service Connection after any changes to that pipeline have been merged into the main branch, and therefore has passed any Pull Requests checks and Code Reviews. As an additional check, Branch Control can verify if Branch Protections (like required Pull Requests and Code Reviews) are actually configured on the allowed branches. With Branch Control in place, in combination with Branch Protections, it is not possible anymore to run any commands against a Service Connection without having multiple persons review the commands. Therefore accidental, or malicious, mis-use of the permissions a Service Connection has is not possible anymore. 
Note: When setting a wildcard for the Allowed Branches, anyone could still create a branch matching that wildcard and would be able to use the Service Connection. Using git permissions it can be configured so only administrators are allowed to create certain branches, like release branches.*","title":"Azure DevOps Service Connection Security"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#azure-devops-service-connection-security","text":"Service Connections are used in Azure DevOps Pipelines to connect to external services, like Azure, GitHub, Docker, Kubernetes, and many other services. Service Connections can be used to authenticate to these external services and to invoke diverse types of commands, like create and update resources in Azure, upload container images to Docker, or deploy applications to Kubernetes. To be able to invoke these commands, Service Connections need to have the right permissions to do so, for most types of Service Connections the permissions can be scoped to a subset of resources to limit the access they have. To improve the principle of least privilege, it's often very common to have separate Service Connections for different environments like Dev/Test/QA/Prod.","title":"Azure DevOps Service Connection Security"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#secure-service-connection","text":"Securing Service Connections can be achieved by using several methods. User permissions can be configured to ensure only the correct users can create, view, use, and manage the Service Connection. Pipeline-level permissions can be configured to ensure only approved YAML pipelines are able to use the Service Connection. Project permissions can be configured to ensure only certain Azure DevOps projects are able to use the Service Connection. After using the above methods, what is secured is who can use the Service Connections. What still isn't secured however, is what can be done with the Service Connections. Because Service Connections have all the necessary permissions in the external services, it is crucial to secure Service Connections so they cannot be misused by accident or by malicious users. An example of this is a Azure DevOps Pipeline that uses a Service Connection to an Azure Resource Group (or entire subscription) to list all resources and then delete those resources. Without the correct security in place, it could be possible to execute this Pipeline, without any validation or reviews being done. pool : vmImage : ubuntu-latest steps : - task : AzureCLI@2 inputs : azureSubscription : 'Production Service Connection' scriptType : 'pscore' scriptLocation : 'inlineScript' inlineScript : | $resources = az resource list foreach ($resource in $resources) { az resource delete --ids $resource.id }","title":"Secure Service Connection"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#pipeline-security-caveat","text":"YAML pipelines can be triggered without the need for a pull request, this introduces a security risk. In good practice, Pull Requests and Code Reviews should be used to ensure the code that is being deployed, is being reviewed by a second person and potentially automatically being checked for vulnerabilities and other security issues. However, YAML Pipelines can be executed without the need for a Pull Request and Code Reviews. This allows the (malicious) user to make changes using the Service Connection which would normally require a reviewer. 
The configuration of when a pipeline should be triggered is specified in the YAML Pipeline itself and therefore a pipeline can be configured to execute on changes in a temporary branch. In this temporary branch, any changes made to the pipeline itself will be executed without being reviewed. If the given pipeline has been granted Pipeline-level permissions to use a specific Service Connection, any command can be executed using that Service Connection, without anyone reviewing the command. Since Service Connections can have a lot of permissions in the external service, executing any pipeline without review could potentially have big consequences.","title":"Pipeline Security Caveat"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#service-connection-checks","text":"To prevent accidental mis-use of Service Connections there are several checks that can be configured. These checks are configured on the Service Connection itself and therefore can only be configured by the owner or administrator of that Service Connection. A user of a certain YAML Pipeline cannot modify these checks since the checks are not defined in the YAML file itself. Configuration can be done in the Approvals and Checks menu on the Service Connection.","title":"Service Connection Checks"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#branch-control","text":"By configuring Branch Control on a Service Connection, you can control that the Service Connection can only be used in a YAML Pipeline if the pipeline is running from a specific branch. By configuring Branch Control to only allow the main branch (and potentially release branches) you can ensure a YAML Pipeline can only use the Service Connection after any changes to that pipeline have been merged into the main branch, and therefore has passed any Pull Requests checks and Code Reviews. As an additional check, Branch Control can verify if Branch Protections (like required Pull Requests and Code Reviews) are actually configured on the allowed branches. With Branch Control in place, in combination with Branch Protections, it is not possible anymore to run any commands against a Service Connection without having multiple persons review the commands. Therefore accidental, or malicious, mis-use of the permissions a Service Connection has is not possible anymore. Note: When setting a wildcard for the Allowed Branches, anyone could still create a branch matching that wildcard and would be able to use the Service Connection. Using git permissions it can be configured so only administrators are allowed to create certain branches, like release branches.*","title":"Branch Control"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/","text":"Dependency and Container Scanning Dependency and Container scanning is performed in order to search for vulnerabilities in operating systems, language and application packages. Why Dependency and Container Scanning Container images are standard application delivery format in cloud-native environments. Having a broad selection of images from the community, we often choose a community base image, and then add packages that we need to it, which might also come from community sources. Those arbitrary dependencies might introduce vulnerabilities to our image and application. Applying Dependency and Container Scanning Images that contain software with security vulnerabilities become exploitable at runtime. 
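Because such images become exploitable once deployed, the scan should gate the build. A minimal sketch of that gate follows, assuming the Trivy CLI (listed under the tools below) is installed on the build agent; the image reference is a placeholder.

```python
"""Minimal CI gate: fail the build if the image has HIGH/CRITICAL findings.
Assumes the Trivy CLI is installed on the agent; the image name is a placeholder."""
import subprocess
import sys

IMAGE = "myregistry.azurecr.io/myapp:latest"  # placeholder image reference

# --exit-code 1 makes trivy return a non-zero code when findings match the severity filter.
result = subprocess.run(
    ["trivy", "image", "--exit-code", "1", "--severity", "HIGH,CRITICAL", IMAGE],
    check=False,
)

if result.returncode != 0:
    print(f"Vulnerabilities found in {IMAGE}; failing the build.", file=sys.stderr)
    sys.exit(1)
print("No HIGH/CRITICAL vulnerabilities detected.")
```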
When building an image in your CI pipeline, image scanning must be a requirement for a build to pass. Images that did not pass scanning should never be pushed to your production-accessible container registry. Dependency and Container scanning best practices: Base Image - if your image is built on top of a third-party base image, validate the following: The image comes from a well-known company or open-source group. It is hosted on a reputable registry. The Dockerfile is available, and check for dependencies installed in it. The image is frequently updated - old images might not contain the latest security updates. Remove Non-Essential Software - Start with a minimal base image and install only the tools, libraries and configuration files that are required by your application. Avoid installing the following tools or remove them if present: - Network tools and clients: e.g., wget, curl, netcat, ssh. - Shells: e.g. sh, bash. Note that removing shells also prevents the use of shell scripts at runtime. Instead, use an executable when possible. - Compilers and debuggers. These should be used only in build and development containers, but never in production containers. Container images should be immutable - download and include all the required dependencies during the image build. Scan for vulnerabilities in software dependencies - today there is likely no software project without some form of external libraries, dependencies or open source. While it allows the development team to focus on their application code, the dependency brings forth an expected downside where the security posture of the real application is now resting on it. To detect vulnerabilities contained within a project\u2019s dependencies use container scanning tools which as part of their analysis scan the software dependencies (see \"Dependency and Container Scanning Frameworks and Tools\"). Dependency and Container Scanning Frameworks and Tools Trivy - a simple and comprehensive vulnerability scanner for containers (doesn't support Windows containers) Aqua - dependency and container scanning for applications running on AKS, ACI and Windows Containers. Has an integration with AzDO pipelines. Dependency-Check Plugin for SonarQube - OnPrem dependency scanning Mend (previously WhiteSource) - Open Source Scanning Software Conclusion A powerful technology such as containers should be used carefully. Install the minimal requirements needed for your application, be aware of the software dependencies your application is using and make sure to maintain it over time by using container and dependencies scanning tools.","title":"Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#dependency-and-container-scanning","text":"Dependency and Container scanning is performed in order to search for vulnerabilities in operating systems, language and application packages.","title":"Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#why-dependency-and-container-scanning","text":"Container images are standard application delivery format in cloud-native environments. Having a broad selection of images from the community, we often choose a community base image, and then add packages that we need to it, which might also come from community sources. 
Those arbitrary dependencies might introduce vulnerabilities to our image and application.","title":"Why Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#applying-dependency-and-container-scanning","text":"Images that contain software with security vulnerabilities become exploitable at runtime. When building an image in your CI pipeline, image scanning must be a requirement for a build to pass. Images that did not pass scanning should never be pushed to your production-accessible container registry. Dependency and Container scanning best practices: Base Image - if your image is built on top of a third-party base image, validate the following: The image comes from a well-known company or open-source group. It is hosted on a reputable registry. The Dockerfile is available, and check for dependencies installed in it. The image is frequently updated - old images might not contain the latest security updates. Remove Non-Essential Software - Start with a minimal base image and install only the tools, libraries and configuration files that are required by your application. Avoid installing the following tools or remove them if present: - Network tools and clients: e.g., wget, curl, netcat, ssh. - Shells: e.g. sh, bash. Note that removing shells also prevents the use of shell scripts at runtime. Instead, use an executable when possible. - Compilers and debuggers. These should be used only in build and development containers, but never in production containers. Container images should be immutable - download and include all the required dependencies during the image build. Scan for vulnerabilities in software dependencies - today there is likely no software project without some form of external libraries, dependencies or open source. While it allows the development team to focus on their application code, the dependency brings forth an expected downside where the security posture of the real application is now resting on it. To detect vulnerabilities contained within a project\u2019s dependencies use container scanning tools which as part of their analysis scan the software dependencies (see \"Dependency and Container Scanning Frameworks and Tools\").","title":"Applying Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#dependency-and-container-scanning-frameworks-and-tools","text":"Trivy - a simple and comprehensive vulnerability scanner for containers (doesn't support Windows containers) Aqua - dependency and container scanning for applications running on AKS, ACI and Windows Containers. Has an integration with AzDO pipelines. Dependency-Check Plugin for SonarQube - OnPrem dependency scanning Mend (previously WhiteSource) - Open Source Scanning Software","title":"Dependency and Container Scanning Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#conclusion","text":"A powerful technology such as containers should be used carefully. 
Install the minimal requirements needed for your application, be aware of the software dependencies your application is using and make sure to maintain it over time by using container and dependencies scanning tools.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/","text":"Evaluate Open Source Software Given the rise in threat of open source software supply chain attacks , developers should identify potential candidates for open-source dependencies and evaluate them against your needs and the required security posture. Why Evaluate Open Source Software Open source software is a critical part of modern software development. It is important to evaluate the open source software uses to ensure it meets the needs and is secure. Security is not a given with open source software, and furthermore, what is secure today may not be secure tomorrow so scanning dependencies for known vulnerabilities doesn't always cover all bases. This is why we need to look for evidence of a strong security posture and a commitment to security from the maintainers of the open source software we use. When to Evaluate Open Source Software You should evaluate open source software before you use it in your project. This is especially important if the software is a dependency of your project, as it can introduce security vulnerabilities and other issues into your project. Code reviewers should also be aware of the open source software used in the project and be able to use the tools and resources mentioned below to evaluate the security of the open source software that is being added to the project. Applying Open Source Software Evaluation When evaluating open source software, consider the following: Can you avoid adding it as a dependency? The best dependency is the one you don't have. Is it maintained? How often and at what engineering rigor (i.e. code reviews, branch protection, tests) Is there evidence that effort is taken to make it secure? Can you find a reference that it is used significantly downstream by other projects or is referenced by known and trusted documentation? How many stars and forks does it have on GitHub? Is it easy to use securely? Does the license allow you to use it in your project? Are there instructions on how to report vulnerabilities? Does it have any known vulnerabilities or security issues? Are its dependencies secure, or at least up to date and actively maintained? Has it been audited by a third party such as the OpenSSF Security Reviews ? Tools for Evaluating Open Source Software OpenSSF Scorecards - This tool actually automates some of the checks in the list above and can be used to evaluate the security posture of open source projects. This can run as a GitHub action or in the Command Line Interface (CLI) to provide a security scorecard for open source projects. Note which metrics are important to you, your organization and the customer's. This tool is used by known open source program offices (OSPO) for measuring open source contributions by their employees. OWASP Dependency-Check - a software composition analysis utility that identifies project dependencies and checks if there are any known, publicly disclosed, vulnerabilities. 
Concise Guide for Evaluating Open Source Software - a guide to help you expand upon the knowledge in this page to evaluate open source software.","title":"Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#evaluate-open-source-software","text":"Given the rise in threat of open source software supply chain attacks , developers should identify potential candidates for open-source dependencies and evaluate them against your needs and the required security posture.","title":"Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#why-evaluate-open-source-software","text":"Open source software is a critical part of modern software development. It is important to evaluate the open source software uses to ensure it meets the needs and is secure. Security is not a given with open source software, and furthermore, what is secure today may not be secure tomorrow so scanning dependencies for known vulnerabilities doesn't always cover all bases. This is why we need to look for evidence of a strong security posture and a commitment to security from the maintainers of the open source software we use.","title":"Why Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#when-to-evaluate-open-source-software","text":"You should evaluate open source software before you use it in your project. This is especially important if the software is a dependency of your project, as it can introduce security vulnerabilities and other issues into your project. Code reviewers should also be aware of the open source software used in the project and be able to use the tools and resources mentioned below to evaluate the security of the open source software that is being added to the project.","title":"When to Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#applying-open-source-software-evaluation","text":"When evaluating open source software, consider the following: Can you avoid adding it as a dependency? The best dependency is the one you don't have. Is it maintained? How often and at what engineering rigor (i.e. code reviews, branch protection, tests) Is there evidence that effort is taken to make it secure? Can you find a reference that it is used significantly downstream by other projects or is referenced by known and trusted documentation? How many stars and forks does it have on GitHub? Is it easy to use securely? Does the license allow you to use it in your project? Are there instructions on how to report vulnerabilities? Does it have any known vulnerabilities or security issues? Are its dependencies secure, or at least up to date and actively maintained? Has it been audited by a third party such as the OpenSSF Security Reviews ?","title":"Applying Open Source Software Evaluation"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#tools-for-evaluating-open-source-software","text":"OpenSSF Scorecards - This tool actually automates some of the checks in the list above and can be used to evaluate the security posture of open source projects. This can run as a GitHub action or in the Command Line Interface (CLI) to provide a security scorecard for open source projects. Note which metrics are important to you, your organization and the customer's. This tool is used by known open source program offices (OSPO) for measuring open source contributions by their employees. 
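As a sketch of automating such an evaluation, the Scorecard project also exposes a public REST API that returns the aggregate score and per-check results for a repository. The endpoint shape and response fields below are assumptions and should be verified against the current Scorecard documentation.

```python
"""Fetch an OpenSSF Scorecard result for a GitHub repository.
Endpoint and response fields are assumptions; verify against the Scorecard docs."""
import requests  # pip install requests

OWNER, REPO = "ossf", "scorecard"  # example repository
URL = f"https://api.securityscorecards.dev/projects/github.com/{OWNER}/{REPO}"

response = requests.get(URL, timeout=30)
response.raise_for_status()
data = response.json()

print(f"{OWNER}/{REPO} aggregate score: {data.get('score')}")
for check in data.get("checks", []):
    print(f"  {check.get('name')}: {check.get('score')}")
```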
OWASP Dependency-Check - a software composition analysis utility that identifies project dependencies and checks if there are any known, publicly disclosed, vulnerabilities. Concise Guide for Evaluating Open Source Software - a guide to help you expand upon the knowledge in this page to evaluate open source software.","title":"Tools for Evaluating Open Source Software"},{"location":"CI-CD/dev-sec-ops/penetration-testing/","text":"Penetration Testing A penetration test is a simulated attack against your application to check for exploitable security issues. Why Penetration Testing Penetration testing performed on a running application. As such, it tests the application E2E with all of its layers. It's output is a real simulated attack on the application that succeeded, therefore it is a critical issue in your application and should be addressed as soon as possible. Applying Penetration Testing Many organizations perform manual penetration testing. But new vulnerabilities found every day. Therefore, it is a good practice to have an automated penetration testing performed. To achieve this automation use penetration testing tools to uncover vulnerabilities, such as unsanitized inputs that are susceptible to code injection attacks. Insights provided by the penetration test can then be used to fine-tune your WAF security policies and patch detected vulnerabilities. Penetration Testing Frameworks and Tools OWASP Zed Attack Proxy (ZAP) - OWASP penetration testing tool for web applications. Conclusion Penetration testing is essential to check for vulnerabilities in your application and protect it from simulated attacks. Insights provided by Penetration testing can identify weak spots in an organization's security posture, as well as measure the compliance of its security policy, test the staff's awareness of security issues and determine whether -- and how -- the organization would be subject to security disasters.","title":"Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#penetration-testing","text":"A penetration test is a simulated attack against your application to check for exploitable security issues.","title":"Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#why-penetration-testing","text":"Penetration testing performed on a running application. As such, it tests the application E2E with all of its layers. It's output is a real simulated attack on the application that succeeded, therefore it is a critical issue in your application and should be addressed as soon as possible.","title":"Why Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#applying-penetration-testing","text":"Many organizations perform manual penetration testing. But new vulnerabilities found every day. Therefore, it is a good practice to have an automated penetration testing performed. To achieve this automation use penetration testing tools to uncover vulnerabilities, such as unsanitized inputs that are susceptible to code injection attacks. 
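One way to automate this with ZAP (listed in the tools above) is its packaged baseline scan, driven here from Python. The container image name, script name, and target URL are assumptions to verify against the current ZAP documentation, and you should only scan systems you own or are authorized to test.

```python
"""Run an automated OWASP ZAP baseline scan against a target URL.
Image name, script name, and flags are assumptions; check the ZAP docs for your version."""
import subprocess
import sys

TARGET = "https://staging.example.com"  # placeholder target; scan only systems you own

result = subprocess.run(
    [
        "docker", "run", "--rm", "-t",
        "ghcr.io/zaproxy/zaproxy:stable",  # assumed image name
        "zap-baseline.py", "-t", TARGET,   # passive baseline scan
    ],
    check=False,
)

# zap-baseline.py conventionally returns a non-zero code when warnings or failures are raised,
# which is what lets a CI job fail on findings.
sys.exit(result.returncode)
```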
Insights provided by the penetration test can then be used to fine-tune your WAF security policies and patch detected vulnerabilities.","title":"Applying Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#penetration-testing-frameworks-and-tools","text":"OWASP Zed Attack Proxy (ZAP) - OWASP penetration testing tool for web applications.","title":"Penetration Testing Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#conclusion","text":"Penetration testing is essential to check for vulnerabilities in your application and protect it from simulated attacks. Insights provided by Penetration testing can identify weak spots in an organization's security posture, as well as measure the compliance of its security policy, test the staff's awareness of security issues and determine whether -- and how -- the organization would be subject to security disasters.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/","text":"Secrets Management Secret management refers to the tools and practices used to manage digital authentication credentials (like API keys, tokens, passwords, and certificates). These secrets are used to protect access to sensitive data and services, making their management critical for security. We should assume any repo we work on may go public at any time and protect our secrets, even if the repo is initially private. Importance of Secrets Management In modern software development, applications often need to interact with other software components, APIs, and services. These interactions often require authentication, which is typically handled using secrets. If these secrets are not managed properly, they can be exposed, leading to potential security breaches. Best Practices for Secrets Management Centralized Secret Storage: Store all secrets in a centralized, encrypted location. This reduces the risk of secrets being lost or exposed. Access Control: Implement strict access control policies. Only authorized entities should have access to secrets. Rotation of Secrets: Regularly change secrets to reduce the risk if a secret is compromised. Audit Trails: Keep a record of when and who accessed which secret. This can help in identifying suspicious activities. Automated Secret Management: Automate the processes of secret creation, rotation, and deletion. This reduces the risk of human error. Remember, the goal of secret management is to protect sensitive information from unauthorized access and potential security threats. General Approach The general approach is to keep secrets in separate configuration files that are not checked in to the repo. Add the files to the .gitignore to prevent that they're checked in. Each developer maintains their own local version of the file or, if required, circulate them via private channels e.g. a Teams chat. In a production system, assuming Azure, create the secrets in the environment of the running process. We can do this by manually editing the 'Applications Settings' section of the resource, but a script using the Azure CLI to do the same is a useful time-saving utility. See az webapp config appsettings for more details. It's best practice to maintain separate secrets configurations for each environment that you run. e.g. dev, test, prod, local etc The secrets-per-branch recipe describes a simple way to manage separate secrets configurations for each environment. Note: even if the secret was only pushed to a feature branch and never merged, it's still a part of the git history. 
Follow these instructions to remove any sensitive data and/or regenerate any keys and other sensitive information added to the repo. If a key or secret made it into the code base, rotate the key/secret so that it's no longer active Keeping Secrets Secret The care taken to protect our secrets applies both to how we get and store them, but also to how we use them. Don't log secrets Don't put them in reporting Don't send them to other applications, as part of URLs, forms, or in any other way other than to make a request to the service that requires that secret Enhanced-Security Applications The techniques outlined below provide good security and a common pattern for a wide range of languages. They rely on the fact that Azure keeps application settings (the environment) encrypted until your app runs. They do not prevent secrets from existing in plaintext in memory at runtime. In particular, for garbage collected languages those values may exist for longer than the lifetime of the variable, and may be visible when debugging a memory dump of the process. If you are working on an application with enhanced security requirements you should consider using additional techniques to maintain encryption on secrets throughout the application lifetime. Always rotate encryption keys on a regular basis. Techniques for Secrets Management These techniques make the loading of secrets transparent to the developer. C#/.NET Modern .NET Solution For .NET SDK (version 2.0 or higher) we have dotnet secrets , a tool provided by the .NET SDK that allows you to manage and protect sensitive information, such as API keys, connection strings, and other secrets, during development. The secrets are stored securely on your machine and can be accessed by your .NET applications. # Initialize dotnet secret dotnet user-secrets init # Adding secret # dotnet user-secrets set dotnet user-secrets set ExternalServiceApiKey my-api-key-12345 # Update Secret dotnet user-secrets set ExternalServiceApiKey updated-api-key-67890 To access the secrets; using Microsoft.Extensions.Configuration ; var builder = new ConfigurationBuilder () . AddUserSecrets < Startup > (); var configuration = builder . Build (); var externalServiceApiKey = configuration [ \"ExternalServiceApiKey\" ]; Deployment Considerations When deploying your application to production, it's essential to ensure that your secrets are securely managed. Here are some deployment-related implications: Remove Development Secrets: Before deploying to production, remove any development secrets from your application configuration. You can use environment variables or a more secure secret management solution like Azure Key Vault or AWS Secrets Manager in production. Secure Deployment: Ensure that your production server is secure, and access to secrets is controlled. Never store secrets directly in source code or configuration files. Key Rotation: Consider implementing a secret rotation policy to regularly update your secrets in production. .NET Framework Solution Use the file attribute of the appSettings element to load secrets from a local file. \u2026 \u2026 Access secrets: static void Main ( string [] args ) { String mySecret = System . Configuration . ConfigurationManager . AppSettings [ \"mySecret\" ]; } When running in Azure, ConfigurationManager will load these settings from the process environment. We don't need to upload secrets files to the server or change any code. 
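The Azure CLI route mentioned in the general approach (az webapp config appsettings) can also be scripted so runtime settings are populated without ever committing the values. A minimal sketch follows; the app name, resource group, and setting names are placeholders, and the script assumes you are already logged in with `az login`.

```python
"""Push secrets into an Azure Web App's application settings via the Azure CLI.
App name, resource group, and setting names are placeholders; requires `az login`."""
import os
import subprocess

APP_NAME = "my-web-app"          # placeholder
RESOURCE_GROUP = "my-rg"         # placeholder
SETTING_NAMES = ["MY_SECRET"]    # settings to copy from the local environment

# Read the values from the local environment so they never appear in source control.
settings = [f"{name}={os.environ[name]}" for name in SETTING_NAMES]

subprocess.run(
    [
        "az", "webapp", "config", "appsettings", "set",
        "--name", APP_NAME,
        "--resource-group", RESOURCE_GROUP,
        "--settings", *settings,
    ],
    check=True,
)
```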
Node Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables require('dotenv').config() let mySecret = process.env(\"MY_SECRET\") Python Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables import os from dotenv import load_dotenv load_dotenv () my_secret = os . getenv ( 'MY_SECRET' ) Another good library for reading environment variables is environs from environs import Env env = Env () env . read_env () my_secret = os . environ [ \"MY_SECRET\" ] Databricks Databricks has the option of using dbutils as a secure way to retrieve credentials and not reveal them within the notebooks running on Databricks The following steps lay out a clear pathway to creating new secrets and then utilizing them within a notebook on Databricks: Install and configure the Databricks CLI on your local machine Get the Databricks personal access token Create a scope for the secrets Create secrets Validation Automated credential scanning can be performed on the code regardless of the programming language.","title":"Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#secrets-management","text":"Secret management refers to the tools and practices used to manage digital authentication credentials (like API keys, tokens, passwords, and certificates). These secrets are used to protect access to sensitive data and services, making their management critical for security. We should assume any repo we work on may go public at any time and protect our secrets, even if the repo is initially private.","title":"Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#importance-of-secrets-management","text":"In modern software development, applications often need to interact with other software components, APIs, and services. These interactions often require authentication, which is typically handled using secrets. If these secrets are not managed properly, they can be exposed, leading to potential security breaches.","title":"Importance of Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#best-practices-for-secrets-management","text":"Centralized Secret Storage: Store all secrets in a centralized, encrypted location. This reduces the risk of secrets being lost or exposed. Access Control: Implement strict access control policies. Only authorized entities should have access to secrets. Rotation of Secrets: Regularly change secrets to reduce the risk if a secret is compromised. Audit Trails: Keep a record of when and who accessed which secret. This can help in identifying suspicious activities. Automated Secret Management: Automate the processes of secret creation, rotation, and deletion. This reduces the risk of human error. Remember, the goal of secret management is to protect sensitive information from unauthorized access and potential security threats.","title":"Best Practices for Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#general-approach","text":"The general approach is to keep secrets in separate configuration files that are not checked in to the repo. Add the files to the .gitignore to prevent that they're checked in. Each developer maintains their own local version of the file or, if required, circulate them via private channels e.g. a Teams chat. 
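For production workloads, a managed secret store such as Azure Key Vault (mentioned under the deployment considerations above) is usually preferable to local files. A minimal sketch using the azure-identity and azure-keyvault-secrets packages follows; the vault URL and secret name are placeholders.

```python
"""Read a secret from Azure Key Vault instead of a local file.
Vault URL and secret name are placeholders; requires azure-identity and azure-keyvault-secrets."""
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://my-key-vault.vault.azure.net"  # placeholder vault URL

# DefaultAzureCredential picks up a managed identity in Azure and developer logins locally.
credential = DefaultAzureCredential()
client = SecretClient(vault_url=VAULT_URL, credential=credential)

my_secret = client.get_secret("MY-SECRET").value  # placeholder secret name
```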
In a production system, assuming Azure, create the secrets in the environment of the running process. We can do this by manually editing the 'Applications Settings' section of the resource, but a script using the Azure CLI to do the same is a useful time-saving utility. See az webapp config appsettings for more details. It's best practice to maintain separate secrets configurations for each environment that you run. e.g. dev, test, prod, local etc The secrets-per-branch recipe describes a simple way to manage separate secrets configurations for each environment. Note: even if the secret was only pushed to a feature branch and never merged, it's still a part of the git history. Follow these instructions to remove any sensitive data and/or regenerate any keys and other sensitive information added to the repo. If a key or secret made it into the code base, rotate the key/secret so that it's no longer active","title":"General Approach"},{"location":"CI-CD/dev-sec-ops/secrets-management/#keeping-secrets-secret","text":"The care taken to protect our secrets applies both to how we get and store them, but also to how we use them. Don't log secrets Don't put them in reporting Don't send them to other applications, as part of URLs, forms, or in any other way other than to make a request to the service that requires that secret","title":"Keeping Secrets Secret"},{"location":"CI-CD/dev-sec-ops/secrets-management/#enhanced-security-applications","text":"The techniques outlined below provide good security and a common pattern for a wide range of languages. They rely on the fact that Azure keeps application settings (the environment) encrypted until your app runs. They do not prevent secrets from existing in plaintext in memory at runtime. In particular, for garbage collected languages those values may exist for longer than the lifetime of the variable, and may be visible when debugging a memory dump of the process. If you are working on an application with enhanced security requirements you should consider using additional techniques to maintain encryption on secrets throughout the application lifetime. Always rotate encryption keys on a regular basis.","title":"Enhanced-Security Applications"},{"location":"CI-CD/dev-sec-ops/secrets-management/#techniques-for-secrets-management","text":"These techniques make the loading of secrets transparent to the developer.","title":"Techniques for Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#cnet","text":"","title":"C#/.NET"},{"location":"CI-CD/dev-sec-ops/secrets-management/#modern-net-solution","text":"For .NET SDK (version 2.0 or higher) we have dotnet secrets , a tool provided by the .NET SDK that allows you to manage and protect sensitive information, such as API keys, connection strings, and other secrets, during development. The secrets are stored securely on your machine and can be accessed by your .NET applications. # Initialize dotnet secret dotnet user-secrets init # Adding secret # dotnet user-secrets set dotnet user-secrets set ExternalServiceApiKey my-api-key-12345 # Update Secret dotnet user-secrets set ExternalServiceApiKey updated-api-key-67890 To access the secrets; using Microsoft.Extensions.Configuration ; var builder = new ConfigurationBuilder () . AddUserSecrets < Startup > (); var configuration = builder . 
Build (); var externalServiceApiKey = configuration [ \"ExternalServiceApiKey\" ];","title":"Modern .NET Solution"},{"location":"CI-CD/dev-sec-ops/secrets-management/#deployment-considerations","text":"When deploying your application to production, it's essential to ensure that your secrets are securely managed. Here are some deployment-related implications: Remove Development Secrets: Before deploying to production, remove any development secrets from your application configuration. You can use environment variables or a more secure secret management solution like Azure Key Vault or AWS Secrets Manager in production. Secure Deployment: Ensure that your production server is secure, and access to secrets is controlled. Never store secrets directly in source code or configuration files. Key Rotation: Consider implementing a secret rotation policy to regularly update your secrets in production.","title":"Deployment Considerations"},{"location":"CI-CD/dev-sec-ops/secrets-management/#net-framework-solution","text":"Use the file attribute of the appSettings element to load secrets from a local file. \u2026 \u2026 Access secrets: static void Main ( string [] args ) { String mySecret = System . Configuration . ConfigurationManager . AppSettings [ \"mySecret\" ]; } When running in Azure, ConfigurationManager will load these settings from the process environment. We don't need to upload secrets files to the server or change any code.","title":".NET Framework Solution"},{"location":"CI-CD/dev-sec-ops/secrets-management/#node","text":"Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables require('dotenv').config() let mySecret = process.env(\"MY_SECRET\")","title":"Node"},{"location":"CI-CD/dev-sec-ops/secrets-management/#python","text":"Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables import os from dotenv import load_dotenv load_dotenv () my_secret = os . getenv ( 'MY_SECRET' ) Another good library for reading environment variables is environs from environs import Env env = Env () env . read_env () my_secret = os . environ [ \"MY_SECRET\" ]","title":"Python"},{"location":"CI-CD/dev-sec-ops/secrets-management/#databricks","text":"Databricks has the option of using dbutils as a secure way to retrieve credentials and not reveal them within the notebooks running on Databricks The following steps lay out a clear pathway to creating new secrets and then utilizing them within a notebook on Databricks: Install and configure the Databricks CLI on your local machine Get the Databricks personal access token Create a scope for the secrets Create secrets","title":"Databricks"},{"location":"CI-CD/dev-sec-ops/secrets-management/#validation","text":"Automated credential scanning can be performed on the code regardless of the programming language.","title":"Validation"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/","text":"Credential Scanning Credential scanning is the practice of automatically inspecting a project to ensure that no secrets are included in the project's source code. Secrets include database passwords, storage connection strings, admin logins, service principals, etc. Why Credential Scanning Including secrets in a project's source code is a significant risk, as it might make those secrets available to unwanted parties. 
Even if it seems that the source code is accessible to the same people who are privy to the secrets, this situation is likely to change as the project grows. Spreading secrets in different places makes them harder to manage, access control, and revoke efficiently. Secrets that are committed to source control are also harder to discard of, since they will persist in the source's history. Another consideration is that coupling the project's code to its infrastructure and deployment specifics is limiting and considered a bad practice. From a software design perspective, the code should be independent of the runtime configuration that will be used to run it, and that runtime configuration includes secrets. As such, there should be a clear boundary between code and secrets: secrets should be managed outside of the source code and credential scanning should be employed to ensure that this boundary is never violated. Applying Credential Scanning Ideally, credential scanning should be run as part of a developer's workflow (e.g. via a git pre-commit hook ), however, to protect against developer error, credential scanning must also be enforced as part of the continuous integration process to ensure that no credentials ever get merged to a project's main branch. To implement credential scanning for a project, consider the following: Store secrets in an external secure store that is meant to store sensitive information Use secrets scanning tools to asses your repositories current state by scanning it's full history for secrets Incorporate an automated secrets scanning tool into your CI pipeline to detect unintentional committing of secrets Avoid git add . commands on git Add sensitive files to .gitignore Credential Scanning Frameworks and Tools Recipes and Scenarios - detect-secrets is an aptly named module for detecting secrets within a code base. Use detect-secrets inside Azure DevOps Pipeline Microsoft Security Code Analysis extension Additional Tools - CodeQL \u2013 GitHub security. CodeQL lets you query code as if it was data. Write a query to find all variants of a vulnerability Git-secrets - Prevents you from committing passwords and other sensitive information to a git repository. Conclusion Secret management is essential to every project. Storing secrets in external secrets store and incorporating this mindset into your workflow will improve your security posture and will result in cleaner code.","title":"Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#credential-scanning","text":"Credential scanning is the practice of automatically inspecting a project to ensure that no secrets are included in the project's source code. Secrets include database passwords, storage connection strings, admin logins, service principals, etc.","title":"Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#why-credential-scanning","text":"Including secrets in a project's source code is a significant risk, as it might make those secrets available to unwanted parties. Even if it seems that the source code is accessible to the same people who are privy to the secrets, this situation is likely to change as the project grows. Spreading secrets in different places makes them harder to manage, access control, and revoke efficiently. Secrets that are committed to source control are also harder to discard of, since they will persist in the source's history. 
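As a concrete illustration of authenticating without any stored secret at all, the sketch below uses DefaultAzureCredential (which resolves to a managed identity when running in Azure) to access Blob Storage. The account URL and container name are placeholders, and the azure-identity and azure-storage-blob packages are assumed.

```python
"""Authenticate to Azure Blob Storage with a managed identity - no access key involved.
Account URL and container name are placeholders; requires azure-identity and azure-storage-blob."""
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://mystorageaccount.blob.core.windows.net"  # placeholder account URL

service = BlobServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
container = service.get_container_client("my-container")         # placeholder container

for blob in container.list_blobs():
    print(blob.name)
```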
Another consideration is that coupling the project's code to its infrastructure and deployment specifics is limiting and considered a bad practice. From a software design perspective, the code should be independent of the runtime configuration that will be used to run it, and that runtime configuration includes secrets. As such, there should be a clear boundary between code and secrets: secrets should be managed outside of the source code and credential scanning should be employed to ensure that this boundary is never violated.","title":"Why Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#applying-credential-scanning","text":"Ideally, credential scanning should be run as part of a developer's workflow (e.g. via a git pre-commit hook ), however, to protect against developer error, credential scanning must also be enforced as part of the continuous integration process to ensure that no credentials ever get merged to a project's main branch. To implement credential scanning for a project, consider the following: Store secrets in an external secure store that is meant to store sensitive information Use secrets scanning tools to asses your repositories current state by scanning it's full history for secrets Incorporate an automated secrets scanning tool into your CI pipeline to detect unintentional committing of secrets Avoid git add . commands on git Add sensitive files to .gitignore","title":"Applying Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#credential-scanning-frameworks-and-tools","text":"Recipes and Scenarios - detect-secrets is an aptly named module for detecting secrets within a code base. Use detect-secrets inside Azure DevOps Pipeline Microsoft Security Code Analysis extension Additional Tools - CodeQL \u2013 GitHub security. CodeQL lets you query code as if it was data. Write a query to find all variants of a vulnerability Git-secrets - Prevents you from committing passwords and other sensitive information to a git repository.","title":"Credential Scanning Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#conclusion","text":"Secret management is essential to every project. Storing secrets in external secrets store and incorporating this mindset into your workflow will improve your security posture and will result in cleaner code.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/","text":"Secrets Rotation Secret rotation is the process of refreshing the secrets that are used by the application. The best way to authenticate to Azure services is by using a managed identity, but there are some scenarios where that isn't an option. In those cases, access keys or secrets are used. You should periodically rotate access keys or secrets. Why Secrets Rotation Secrets are an asset and as such have a potential to be leaked or stolen. By rotating the secrets, we are revoking any secrets that may have been compromised. Therefore, secrets should be rotated frequently. Managed Identity Azure Managed identities are automatically issues by Azure in order to identify individual resources, and can be used for authentication in place of secrets and passwords. The appeal in using Managed Identities is the elimination of management of secrets and credentials. They are not required on developers machines or checked into source control, and they don't need to be rotated. 
Managed identities are considered safer than the alternatives and is the recommended choice. Applying Secrets Rotation If Azure Managed Identity can't be used. This and the following sections will explain how rotation of secrets can be achieved: To promote frequent rotation of a secret - define an automated periodic secret rotation process. The secret rotation process might result in a downtime when the application is restarted to introduce the new secret. A common solution for that is to have two versions of secret available, also referred to as Blue/Green Secret rotation. By having a second secret at hand, we can start a second instance of the application with that secret before the previous secret is revoked, thus avoiding any downtime. Secrets Rotation Frameworks and Tools For rotation of a secret for resources that use one set of authentication credentials click here For rotation of a secret for resources that have two sets of authentication credentials click here Conclusion Refreshing secrets is important to ensure that your secret stays a secret without causing downtime to your application.","title":"Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#secrets-rotation","text":"Secret rotation is the process of refreshing the secrets that are used by the application. The best way to authenticate to Azure services is by using a managed identity, but there are some scenarios where that isn't an option. In those cases, access keys or secrets are used. You should periodically rotate access keys or secrets.","title":"Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#why-secrets-rotation","text":"Secrets are an asset and as such have a potential to be leaked or stolen. By rotating the secrets, we are revoking any secrets that may have been compromised. Therefore, secrets should be rotated frequently.","title":"Why Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#managed-identity","text":"Azure Managed identities are automatically issues by Azure in order to identify individual resources, and can be used for authentication in place of secrets and passwords. The appeal in using Managed Identities is the elimination of management of secrets and credentials. They are not required on developers machines or checked into source control, and they don't need to be rotated. Managed identities are considered safer than the alternatives and is the recommended choice.","title":"Managed Identity"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#applying-secrets-rotation","text":"If Azure Managed Identity can't be used. This and the following sections will explain how rotation of secrets can be achieved: To promote frequent rotation of a secret - define an automated periodic secret rotation process. The secret rotation process might result in a downtime when the application is restarted to introduce the new secret. A common solution for that is to have two versions of secret available, also referred to as Blue/Green Secret rotation. 
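From the application's point of view, Blue/Green rotation means being able to fall back to the second secret while the first is being replaced. A minimal sketch follows; the environment variable names, the exception type, and the service call are placeholders for illustration.

```python
"""Blue/Green secret fallback: try the primary secret, fall back to the secondary.
Environment variable names and the service call are placeholders for illustration."""
import os


class SecretRejectedError(Exception):
    """Raised by the (placeholder) service client when a secret is no longer valid."""


def call_service(secret: str) -> str:
    """Placeholder for a real call that authenticates with the given secret."""
    raise NotImplementedError


def call_with_rotation_fallback() -> str:
    primary = os.environ["SERVICE_SECRET_PRIMARY"]
    secondary = os.environ.get("SERVICE_SECRET_SECONDARY")
    try:
        return call_service(primary)
    except SecretRejectedError:
        if secondary is None:
            raise
        # The primary has been rotated away; the secondary keeps the app running
        # until its configuration is refreshed with the new primary.
        return call_service(secondary)
```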
By having a second secret at hand, we can start a second instance of the application with that secret before the previous secret is revoked, thus avoiding any downtime.","title":"Applying Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#secrets-rotation-frameworks-and-tools","text":"For rotation of a secret for resources that use one set of authentication credentials click here For rotation of a secret for resources that have two sets of authentication credentials click here","title":"Secrets Rotation Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#conclusion","text":"Refreshing secrets is important to ensure that your secret stays a secret without causing downtime to your application.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/","text":"Static Code Analysis Static code analysis is a method of detecting security issues by examining the source code of the application. Why Static Code Analysis Compared to code reviews, Static code analysis tools are more fast, accurate and through. As it operates on the source code itself, it is a very early indicator for issues, and coding errors found earlier are less costly to fix. Applying Static Code Analysis Static Code Analysis should be integrated in your build process. There are many tools available for Static Code Analysis, choose the ones that meet your programming language and development techniques. Static Code Analysis Frameworks and Tools SonarCloud - static code analysis with cloud-based software as a service product. OWASP Source code Analysis - OWASP recommendations for source code analysis tools Conclusion Static code analysis is essential to identify potential problems and security issues in the code. It allows you to detect bugs and security issues at an early stage.","title":"Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#static-code-analysis","text":"Static code analysis is a method of detecting security issues by examining the source code of the application.","title":"Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#why-static-code-analysis","text":"Compared to code reviews, Static code analysis tools are more fast, accurate and through. As it operates on the source code itself, it is a very early indicator for issues, and coding errors found earlier are less costly to fix.","title":"Why Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#applying-static-code-analysis","text":"Static Code Analysis should be integrated in your build process. There are many tools available for Static Code Analysis, choose the ones that meet your programming language and development techniques.","title":"Applying Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#static-code-analysis-frameworks-and-tools","text":"SonarCloud - static code analysis with cloud-based software as a service product. OWASP Source code Analysis - OWASP recommendations for source code analysis tools","title":"Static Code Analysis Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#conclusion","text":"Static code analysis is essential to identify potential problems and security issues in the code. 
It allows you to detect bugs and security issues at an early stage.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/","text":"Running detect-secrets in Azure DevOps Pipelines Overview In this article, you can find information on how to integrate YELP detect-secrets into your Azure DevOps Pipeline. The proposed code can be part of the classic CI process or (preferred way) build validation for PRs before merging to the main branch. Azure DevOps Pipeline Proposed Azure DevOps Pipeline contains multiple steps described below: Set Python 3 as default Install detect-secrets using pip Run detect-secrets tool Publish results in the Pipeline Artifact Note: It's an optional step, but for future investigation .json file with results may be helpful. Analyzing detect-secrets results Note: This step does a simple analysis of the .json file. If any secret has been detected, then break the build with exit code 1. Note: The below example has 2 jobs: for Linux and Windows agents. You do not have to use both jobs - just adjust the pipeline to your needs. Note: Windows example does not use the latest version of detect-secrets. It is related to the bug in the detect-secret tool (see more in Issue#452 ). It is highly recommended to monitor the fix for the issue and use the latest version if possible by removing version tag ==1.0.3 in the pip install command. trigger : - none jobs : - job : ubuntu displayName : \"detect-secrets on Ubuntu Linux agent\" pool : vmImage : ubuntu-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - bash : pip install detect-secrets displayName : \"Install detect-secrets using pip\" - bash : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins --exclude-files FETCH_HEAD > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-ubuntu\" publishLocation : \"pipeline\" - bash : | dsjson=$(cat $(Pipeline.Workspace)/detect-secrets.json) echo \"${dsjson}\" count=$(echo \"${dsjson}\" | jq -c -r '.results | length') if [ $count -gt 0 ]; then msg=\"Secrets were detected in code. 
${count} file(s) affected.\" echo \"##vso[task.logissue type=error]${msg}\" echo \"##vso[task.complete result=Failed;]${msg}.\" else echo \"##vso[task.complete result=Succeeded;]No secrets detected.\" fi displayName : \"Analyzing detect-secrets results\" - job : windows displayName : \"detect-secrets on Windows agent\" pool : vmImage : windows-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - script : pip install detect-secrets==1.0.3 displayName : \"Install detect-secrets using pip\" - script : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-windows\" publishLocation : \"pipeline\" - pwsh : | $dsjson = Get-Content $(Pipeline.Workspace)/detect-secrets.json Write-Output $dsjson $dsObj = $dsjson | ConvertFrom-Json $count = ($dsObj.results | Get-Member -MemberType NoteProperty).Count if ($count -gt 0) { $msg = \"Secrets were detected in code. $count file(s) affected. \" Write-Host \"##vso[task.logissue type=error]$msg\" Write-Host \"##vso[task.complete result=Failed;]$msg\" } else { Write-Host \"##vso[task.complete result=Succeeded;]No secrets detected.\" } displayName : \"Analyzing detect-secrets results\"","title":"Running `detect-secrets` in Azure DevOps Pipelines"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/#running-detect-secrets-in-azure-devops-pipelines","text":"","title":"Running detect-secrets in Azure DevOps Pipelines"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/#overview","text":"In this article, you can find information on how to integrate YELP detect-secrets into your Azure DevOps Pipeline. The proposed code can be part of the classic CI process or (preferred way) build validation for PRs before merging to the main branch.","title":"Overview"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/#azure-devops-pipeline","text":"Proposed Azure DevOps Pipeline contains multiple steps described below: Set Python 3 as default Install detect-secrets using pip Run detect-secrets tool Publish results in the Pipeline Artifact Note: It's an optional step, but for future investigation .json file with results may be helpful. Analyzing detect-secrets results Note: This step does a simple analysis of the .json file. If any secret has been detected, then break the build with exit code 1. Note: The below example has 2 jobs: for Linux and Windows agents. You do not have to use both jobs - just adjust the pipeline to your needs. Note: Windows example does not use the latest version of detect-secrets. It is related to the bug in the detect-secret tool (see more in Issue#452 ). It is highly recommended to monitor the fix for the issue and use the latest version if possible by removing version tag ==1.0.3 in the pip install command. 
trigger : - none jobs : - job : ubuntu displayName : \"detect-secrets on Ubuntu Linux agent\" pool : vmImage : ubuntu-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - bash : pip install detect-secrets displayName : \"Install detect-secrets using pip\" - bash : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins --exclude-files FETCH_HEAD > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-ubuntu\" publishLocation : \"pipeline\" - bash : | dsjson=$(cat $(Pipeline.Workspace)/detect-secrets.json) echo \"${dsjson}\" count=$(echo \"${dsjson}\" | jq -c -r '.results | length') if [ $count -gt 0 ]; then msg=\"Secrets were detected in code. ${count} file(s) affected.\" echo \"##vso[task.logissue type=error]${msg}\" echo \"##vso[task.complete result=Failed;]${msg}.\" else echo \"##vso[task.complete result=Succeeded;]No secrets detected.\" fi displayName : \"Analyzing detect-secrets results\" - job : windows displayName : \"detect-secrets on Windows agent\" pool : vmImage : windows-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - script : pip install detect-secrets==1.0.3 displayName : \"Install detect-secrets using pip\" - script : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-windows\" publishLocation : \"pipeline\" - pwsh : | $dsjson = Get-Content $(Pipeline.Workspace)/detect-secrets.json Write-Output $dsjson $dsObj = $dsjson | ConvertFrom-Json $count = ($dsObj.results | Get-Member -MemberType NoteProperty).Count if ($count -gt 0) { $msg = \"Secrets were detected in code. $count file(s) affected. \" Write-Host \"##vso[task.logissue type=error]$msg\" Write-Host \"##vso[task.complete result=Failed;]$msg\" } else { Write-Host \"##vso[task.complete result=Succeeded;]No secrets detected.\" } displayName : \"Analyzing detect-secrets results\"","title":"Azure DevOps Pipeline"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/","text":"Credential Scanning Tool: detect-secrets Background The detect-secrets tool is an open source project that uses heuristics and rules to scan for a wide range of secrets. We can extend the tool with custom rules and heuristics via a simple Python plugin API . Unlike other credential scanning tools, detect-secrets does not attempt to check a project's entire git history when invoked, but instead scans the project's current state. This means that the tool runs quickly which makes it ideal for use in continuous integration pipelines. detect-secrets employs the concept of a \"baseline file\", i.e. a list of known secrets already present in the repository, and we can configure it to ignore any of these pre-existing secrets when running. This makes it easy to gradually introduce the tool into a pre-existing project. 
The baseline file also provides a simple and convenient way of handling false positives. We can white-list the false positive in the baseline file to ignore it on future invocations of the tool. Setup # install system dependencies: diff, jq, python3 (if on Linux-based OS) apt-get install -y diffutils jq python3 python3-pip # install system dependencies: diff, jq, python3 (if on Windows) winget install Python.Python.3 choco install diffutils jq -y # install the detect-secrets tool python3 -m pip install detect-secrets # run the tool to establish a list of known secrets # review this file thoroughly and check it into the repository detect-secrets scan > .secrets.baseline Pre-Commit Hook It is recommended to use detect-secrets in your development environment as a Git pre-commit hook. First, follow the pre-commit installation instructions to install the tool in your development environment. Then, add the following to your .pre-commit-config.yaml : repos : - repo : https://github.com/Yelp/detect-secrets rev : v1.4.0 hooks : - id : detect-secrets args : [ '--baseline' , '.secrets.baseline' ] Usage in CI Pipelines # backup the list of known secrets cp .secrets.baseline .secrets.new # find all the secrets in the repository detect-secrets scan --baseline .secrets.new $( find . -type f ! -name '.secrets.*' ! -path '*/.git*' ) # if there is any difference between the known and newly detected secrets, break the build list_secrets () { jq -r '.results | keys[] as $key | \"\\($key),\\(.[$key] | .[] | .hashed_secret)\"' \" $1 \" | sort ; } if ! diff < ( list_secrets .secrets.baseline ) < ( list_secrets .secrets.new ) > & 2 ; then echo \"Detected new secrets in the repo\" > & 2 exit 1 fi","title":"Credential Scanning Tool: `detect-secrets`"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#credential-scanning-tool-detect-secrets","text":"","title":"Credential Scanning Tool: detect-secrets"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#background","text":"The detect-secrets tool is an open source project that uses heuristics and rules to scan for a wide range of secrets. We can extend the tool with custom rules and heuristics via a simple Python plugin API . Unlike other credential scanning tools, detect-secrets does not attempt to check a project's entire git history when invoked, but instead scans the project's current state. This means that the tool runs quickly which makes it ideal for use in continuous integration pipelines. detect-secrets employs the concept of a \"baseline file\", i.e. a list of known secrets already present in the repository, and we can configure it to ignore any of these pre-existing secrets when running. This makes it easy to gradually introduce the tool into a pre-existing project. The baseline file also provides a simple and convenient way of handling false positives. 
We can white-list the false positive in the baseline file to ignore it on future invocations of the tool.","title":"Background"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#setup","text":"# install system dependencies: diff, jq, python3 (if on Linux-based OS) apt-get install -y diffutils jq python3 python3-pip # install system dependencies: diff, jq, python3 (if on Windows) winget install Python.Python.3 choco install diffutils jq -y # install the detect-secrets tool python3 -m pip install detect-secrets # run the tool to establish a list of known secrets # review this file thoroughly and check it into the repository detect-secrets scan > .secrets.baseline","title":"Setup"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#pre-commit-hook","text":"It is recommended to use detect-secrets in your development environment as a Git pre-commit hook. First, follow the pre-commit installation instructions to install the tool in your development environment. Then, add the following to your .pre-commit-config.yaml : repos : - repo : https://github.com/Yelp/detect-secrets rev : v1.4.0 hooks : - id : detect-secrets args : [ '--baseline' , '.secrets.baseline' ]","title":"Pre-Commit Hook"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#usage-in-ci-pipelines","text":"# backup the list of known secrets cp .secrets.baseline .secrets.new # find all the secrets in the repository detect-secrets scan --baseline .secrets.new $( find . -type f ! -name '.secrets.*' ! -path '*/.git*' ) # if there is any difference between the known and newly detected secrets, break the build list_secrets () { jq -r '.results | keys[] as $key | \"\\($key),\\(.[$key] | .[] | .hashed_secret)\"' \" $1 \" | sort ; } if ! diff < ( list_secrets .secrets.baseline ) < ( list_secrets .secrets.new ) > & 2 ; then echo \"Detected new secrets in the repo\" > & 2 exit 1 fi","title":"Usage in CI Pipelines"},{"location":"CI-CD/gitops/deploying-with-gitops/","text":"Deploying with GitOps What is GitOps? \"GitOps is an operational framework that takes DevOps best practices used for application development such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation.\" See GitLab: What is GitOps? . Why should I use GitOps? GitOps simply allows faster deployments by having git repositories in the center offering a clear audit trail via git commits and no direct environment access. Read more on Why should I use GitOps? The below diagram compares traditional CI/CD vs GitOps workflow: Tools for GitOps Some popular GitOps frameworks for Kubernetes backed by CNCF community: Flux V2 Argo CD Rancher Fleet Deploying Using GitOps GitOps with Flux v2 can be enabled in Azure Kubernetes Service (AKS) managed clusters or Azure Arc-enabled Kubernetes connected clusters as a cluster extension. After the microsoft.flux cluster extension is installed, you can create one or more fluxConfigurations resources that sync your Git repository sources to the cluster and reconcile the cluster to the desired state. With GitOps, you can use your Git repository as the source of truth for cluster configuration and application deployment. 
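For illustration, such a configuration can be created with the Azure CLI along these lines (a sketch only; the resource group, cluster, repository URL and kustomization path are placeholders, and the k8s-configuration CLI extension is assumed to be installed):

az k8s-configuration flux create \
  --resource-group my-rg \
  --cluster-name my-aks-cluster \
  --cluster-type managedClusters \
  --name cluster-config \
  --namespace flux-system \
  --url https://github.com/example-org/gitops-repo \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true

The tutorials below walk through complete setups.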
Tutorial: Deploy configurations using GitOps on an Azure Arc-enabled Kubernetes cluster Tutorial: Implement CI/CD with GitOps Multi-cluster and multi-tenant environment with Flux v2","title":"Deploying with GitOps"},{"location":"CI-CD/gitops/deploying-with-gitops/#deploying-with-gitops","text":"","title":"Deploying with GitOps"},{"location":"CI-CD/gitops/deploying-with-gitops/#what-is-gitops","text":"\"GitOps is an operational framework that takes DevOps best practices used for application development such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation.\" See GitLab: What is GitOps? .","title":"What is GitOps?"},{"location":"CI-CD/gitops/deploying-with-gitops/#why-should-i-use-gitops","text":"GitOps simply allows faster deployments by having git repositories in the center offering a clear audit trail via git commits and no direct environment access. Read more on Why should I use GitOps? The below diagram compares traditional CI/CD vs GitOps workflow:","title":"Why should I use GitOps?"},{"location":"CI-CD/gitops/deploying-with-gitops/#tools-for-gitops","text":"Some popular GitOps frameworks for Kubernetes backed by CNCF community: Flux V2 Argo CD Rancher Fleet","title":"Tools for GitOps"},{"location":"CI-CD/gitops/deploying-with-gitops/#deploying-using-gitops","text":"GitOps with Flux v2 can be enabled in Azure Kubernetes Service (AKS) managed clusters or Azure Arc-enabled Kubernetes connected clusters as a cluster extension. After the microsoft.flux cluster extension is installed, you can create one or more fluxConfigurations resources that sync your Git repository sources to the cluster and reconcile the cluster to the desired state. With GitOps, you can use your Git repository as the source of truth for cluster configuration and application deployment. Tutorial: Deploy configurations using GitOps on an Azure Arc-enabled Kubernetes cluster Tutorial: Implement CI/CD with GitOps Multi-cluster and multi-tenant environment with Flux v2","title":"Deploying Using GitOps"},{"location":"CI-CD/gitops/github-workflows/","text":"GitHub Workflows A workflow is a configurable automated process made up of one or more jobs where each of these jobs can be an action in GitHub. Currently, a YAML file format is supported for defining a workflow in GitHub. Additional information on GitHub actions and GitHub Workflows in the links posted in the resources section below. Workflow per Environment The general approach is to have one pipeline, where the code is built, tested and deployed, and the artifact is then promoted to the next environment, eventually to be deployed into production. There are multiple ways in GitHub that an environment setup can be achieved. One way it can be done is to have one workflow for multiple environments, but the complexity increases as additional processes and jobs are added to a workflow, which does not mean it cannot be done for small pipelines. The plus point of having one workflow is that, when an artifact flows from one environment to another the state and environment values between the deployment environments can be passed easily. One way to get around the complexity of a single workflow is to have separate workflows for different environments, making sure that only the artifacts created and validated are promoted from one environment to another, as well as, the workflow is small enough, to debug any issues seen in any of the workflows. 
In this case, the state and environment values need to be passed from one deployment environment to another. Multiple workflows also help to keep the deployments to the environments independent, reducing the time to deploy and surfacing issues earlier in the process. Also, since the environments are independent of each other, any failure in deploying to one environment does not block deployments to other environments. One tradeoff of this method is that, with different workflows for each environment, the maintenance effort increases as the complexity of the workflows grows over time. Resources GitHub Actions GitHub Workflows","title":"GitHub Workflows"},{"location":"CI-CD/gitops/github-workflows/#github-workflows","text":"A workflow is a configurable automated process made up of one or more jobs, where each of these jobs can be an action in GitHub. Currently, a YAML file format is supported for defining a workflow in GitHub. Additional information on GitHub Actions and GitHub Workflows can be found in the links posted in the resources section below.","title":"GitHub Workflows"},{"location":"CI-CD/gitops/github-workflows/#workflow-per-environment","text":"The general approach is to have one pipeline, where the code is built, tested and deployed, and the artifact is then promoted to the next environment, eventually to be deployed into production. There are multiple ways to achieve an environment setup in GitHub. One way is to have one workflow for multiple environments, but the complexity increases as additional processes and jobs are added to the workflow, which does not mean it cannot be done for small pipelines. The plus point of having one workflow is that, when an artifact flows from one environment to another, the state and environment values can be passed easily between the deployment environments. One way to get around the complexity of a single workflow is to have separate workflows for different environments, making sure that only the artifacts that have been created and validated are promoted from one environment to another, and keeping each workflow small enough to debug any issues seen in it. In this case, the state and environment values need to be passed from one deployment environment to another. Multiple workflows also help to keep the deployments to the environments independent, reducing the time to deploy and surfacing issues earlier in the process. Also, since the environments are independent of each other, any failure in deploying to one environment does not block deployments to other environments. One tradeoff of this method is that, with different workflows for each environment, the maintenance effort increases as the complexity of the workflows grows over time.","title":"Workflow per Environment"},{"location":"CI-CD/gitops/github-workflows/#resources","text":"GitHub Actions GitHub Workflows","title":"Resources"},{"location":"CI-CD/gitops/secret-management/","text":"Secrets Management with GitOps GitOps projects have git repositories in the center that are considered a source of truth for managing both infrastructure and application. This infrastructure and application will require secured access to other resources of the system through secrets. Committing clear-text secrets into git repositories is unacceptable even if the repositories are private to your team and organization. Teams need a secure way to handle secrets when using GitOps.
There are many ways to manage secrets with GitOps, and at a high level they can be categorized into: Encrypted secrets in git repositories Reference to secrets stored in an external key vault TLDR : Referencing secrets in an external key vault is the recommended approach. It is easier to orchestrate secret rotation and more scalable with multiple clusters and/or teams. Encrypted Secrets in Git Repositories In this approach, developers manually encrypt secrets using a public key, and the key can only be decrypted by the custom Kubernetes controller running in the target cluster. Some popular tools for this approach are Bitnami Sealed Secrets , Mozilla SOPS All the secret encryption tools share the following: Secret changes are managed by making changes within the GitOps repository, which provides great traceability All secrets can be rotated by making changes in GitOps, without accessing the cluster They support fully disconnected gitops scenarios Secrets are stored encrypted in the gitops repository; if the private encryption key is leaked and the attacker has access to the repo, all secrets can be decrypted Bitnami Sealed Secrets Sealed Secrets use asymmetric encryption to encrypt secrets. A Kubernetes controller generates a key-pair (private-public) and stores the private key in the cluster's etcd database as a Kubernetes secret. Developers use the Kubeseal CLI to seal secrets before committing to the git repo. Some of the key points of using Sealed Secrets are: Supports automatic key rotation for the private key, which can be used to enforce re-encryption of secrets Due to automatic renewal of the sealing key , the key needs to be prefetched from the cluster, or the cluster needs to be set up to store the sealing key in a secondary location on renewal Multi-tenancy support at the namespace level can be enforced by the controller When sealing secrets, developers need a connection to the cluster control plane to fetch the public key, or the public key has to be explicitly shared with the developer If the private key in the cluster is lost for some reason, a new key pair must be generated and all secrets re-encrypted Does not scale with multi-cluster, because every cluster will require a controller with its own key pair Can only encrypt the secret resource type The Flux documentation has inconsistencies in the Azure Key Vault examples Mozilla SOPS SOPS: Secrets OPerationS is an encryption tool that supports YAML, JSON, ENV, INI, and BINARY formats and encrypts with AWS KMS, GCP KMS, Azure Key Vault, age, and PGP, and is not just limited to Kubernetes. It supports integration with some common key management systems, including Azure Key Vault, where one or more key management systems are used to store the encryption key for encrypting secrets and not the actual secrets. Some of the key points of using SOPS are: Flux has native support for SOPS with cluster-side decryption Provides an added layer of security as the private key used for decryption is protected in an external key vault To use the Helm CLI for encryption, the ( Helm Secrets ) plugin is needed Needs the ( KSOPS ) or ( kustomize-sopssecretgenerator ) plugin to work with Kustomization Does not scale with larger teams as each developer has to encrypt the secrets The public key is sufficient for creating brand new files. The secret key is required for decrypting and editing existing files because SOPS computes a MAC on all values.
When using the public key solely to add or remove a field, the whole file should be deleted and recreated Supports several types of keys that can be used in both connected and disconnected state. A secret can have a list of keys and will try do decrypt with all of them. Reference to Secrets Stored in an External Key Vault (Recommended) This approach relies on a key management system like Azure Key Vault to hold the secrets and the git manifest in the repositories has reference to the key vault secrets. Developers do not perform any cryptographic operations with files in repositories. Kubernetes operators running in the target cluster are responsible for pulling the secrets from the key vault and making them available either as Kubernetes secrets or secrets volume mounted to the pod. All the below tools share the following: Secrets are not stored in the repository Supports Prometheus metrics for observability Supports sync with Kubernetes Secrets Supports Linux and Windows containers Provides enterprise-grade external secret management Easily scalable with multi-cluster and larger teams Both solutions support either Azure Active Directory (Azure AD) service principal or managed identity for authentication with the Key Vault . For secret rotation ideas, see Secrets Rotation on Environment Variables and Mounted Secrets For how to authenticate private container registries with a service principal see: Authenticated Private Container Registry Azure Key Vault Provider for Secrets Store CSI Driver Azure Key Vault Provider (AKVP) for Kubernetes secret store CSI Driver allows you to get secret contents stored in an Azure Key Vault instance and use the Secrets Store CSI driver interface to mount them into Kubernetes pods. Mounts secrets/keys/certs to pod using a CSI Inline volume. Azure Key Vault Provider for Secrets Store CSI Driver install guide . CSI driver will need access to Azure Key Vault either through a service principal or managed identity (recommended). To make this access secure you can leverage Azure AD Workload Identity (recommended) or AAD Pod Identity . Please note AAD pod identity will soon be replaced by workload identity. Product Group Links provided for AKVP with SSCSID: 1. Differences between ESO / SSCSID ( GitHub Issue ) 2. Secrets Management on K8S talk here (Native Secrets, Vault.io, and ESO vs. SSCSID) Advantages: Supports pod portability with the SecretProviderClass CRD Supports auto rotation of secrets with customizable sync intervals per cluster . Seems to be the MSFT choice (Secrets Store CSI driver is heavily contributed by MSFT and Kubernetes-SIG) Disadvantages: Missing disconnected scenario support : When the node is offline the SSCSID fails to fetch the secret and thus mounting the volume fails, making scaling and restarting pods not possible while being offline AKVP can only access Key Vault from a non-Azure environment using a service principal The Kubernetes Secret containing the service principal credentials need to be created as a secret in the same namespace as the application pod. If pods in multiple namespaces need to use the same SP to access Key Vault, this Kubernetes Secret needs to be created in each namespace. 
The GitOps repo must contain the name of the Key Vault within the SecretProviderClass Must mount secrets as volumes to allow syncing into Kubernetes Secrets Uses more resources (4 pods; CSI Storage driver and provider) and is a daemonset - not test on RPS / resource usage External Secrets Operator with Azure Key Vault The External Secrets Operator (ESO) is an open-sourced Kubernetes operator that can read secrets from external secret stores (e.g., Azure Key Vault) and sync those into Kubernetes Secrets. In contrast to the CSI Driver, the ESO controller creates the secrets on the cluster as K8s secrets, instead of mounting them as volumes to pods. Docs on using ESO Azure Key vault provider here . ESO will need access to Azure Key Vault either through the use of a service principal or managed identity (via Azure AD Workload Identity (recommended) or AAD Pod Identity ). Advantages: Supports auto rotation of secrets with customizable sync intervals per secret . Components are split into different CRDs for namespace (ExternalSecret, SecretStore) and cluster-wide (ClusterSecretStore, ClusterExternalSecret) making syncing more manageable i.r.t. different deployments/pods etc. Service Principal secret for the (Cluster)SecretStores could placed in a namespaced that only the ESO can access (see Shared ClusterSecretStore ). Resource efficient (single pod) - not test on RPS / resource usage. Open source and high contributions, ( GitHub ) Mounting Secrets as volumes is supported via K8S's APIs (see here ) Partial disconnected scenario support: As ESO is using native K8s secrets the cluster can be offline, and it does not have any implications towards restarting and scaling pods while being offline Disadvantages: The GitOps repo must contain the name of the Key Vault within the SecretStore / ClusterSecretStore or a ConfigMap linking to it Must create secrets as K8s secrets Resources Sealed Secrets with Flux v2 Mozilla SOPS with Flux v2 Secret Management with Argo CD Secret management Workflow Appendix Authenticated Private Container Registry An option on how to authenticate private container registries (e.g., ACR): Use a dockerconfigjson Kubernetes Secret on Pod-Level with ImagePullSecret (This can be also defined on namespace-level )","title":"Secrets Management with GitOps"},{"location":"CI-CD/gitops/secret-management/#secrets-management-with-gitops","text":"GitOps projects have git repositories in the center that are considered a source of truth for managing both infrastructure and application. This infrastructure and application will require secured access to other resources of the system through secrets. Committing clear-text secrets into git repositories is unacceptable even if the repositories are private to your team and organization. Teams need a secure way to handle secrets when using GitOps. There are many ways to manage secrets with GitOps and at high level can be categorized into: Encrypted secrets in git repositories Reference to secrets stored in the external key vault TLDR : Referencing secrets in an external key vault is the recommended approach. It is easier to orchestrate secret rotation and more scalable with multiple clusters and/or teams.","title":"Secrets Management with GitOps"},{"location":"CI-CD/gitops/secret-management/#encrypted-secrets-in-git-repositories","text":"In this approach, Developers manually encrypt secrets using a public key, and the key can only be decrypted by the custom Kubernetes controller running in the target cluster. 
Some popular tools for his approach are Bitnami Sealed Secrets , Mozilla SOPS All the secret encryption tools share the following: Secret changes are managed by making changes within the GitOps repository which provides great traceability All secrets can be rotated by making changes in GitOps, without accessing the cluster They support fully disconnected gitops scenarios Secrets are stored encrypted in the gitops repository, if the private encryption key is leaked and the attacker has access to the repo, all secrets can be decrypted","title":"Encrypted Secrets in Git Repositories"},{"location":"CI-CD/gitops/secret-management/#bitnami-sealed-secrets","text":"Sealed Secrets use asymmetric encryption to encrypt secrets. A Kubernetes controller generates a key-pair (private-public) and stores the private key in the cluster's etcd database as a Kubernetes secret. Developers use Kubeseal CLI to seal secrets before committing to the git repo. Some of the key points of using Sealed Secrets are: Support automatic key rotation for the private key and can be used to enforce re-encryption of secrets Due to automatic renewal of the sealing key , the key needs to be prefetched from the cluster or cluster set up to store the sealing key on renewal in a secondary location Multi-tenancy support at the namespace level can be enforced by the controller When sealing secrets developers need a connection to the cluster control plane to fetch the public key or the public key has to be explicitly shared with the developer If the private key in the cluster is lost for some reason all secrets need to be re-encrypted followed by a new key-pair generation Does not scale with multi-cluster, because every cluster will require a controller having its own key pair Can only encrypt secret resource type The Flux documentation has inconsistences in the Azure Key Vault examples","title":"Bitnami Sealed Secrets"},{"location":"CI-CD/gitops/secret-management/#mozilla-sops","text":"SOPS: Secrets OPerationS is an encryption tool that supports YAML, JSON, ENV, INI, and BINARY formats and encrypts with AWS KMS, GCP KMS, Azure Key Vault, age, and PGP and is not just limited to Kubernetes. It supports integration with some common key management systems including Azure Key Vault, where one or more key management system is used to store the encryption key for encrypting secrets and not the actual secrets. Some of the key points of using SOPS are: Flux has native support for SOPS with cluster-side decryption Provides an added layer of security as the private key used for decryption is protected in an external key vault To use the Helm CLI for encryption the ( Helm Secrets ) plugin is needed Needs ( KSOPS )( kustomize-sopssecretgenerator ) plugin to work with Kustomization Does not scale with larger teams as each developer has to encrypt the secrets The public key is sufficient for creating brand new files. The secret key is required for decrypting and editing existing files because SOPS computes a MAC on all values. When using the public key solely to add or remove a field, the whole file should be deleted and recreated Supports several types of keys that can be used in both connected and disconnected state. 
A secret can have a list of keys and will try do decrypt with all of them.","title":"Mozilla SOPS"},{"location":"CI-CD/gitops/secret-management/#reference-to-secrets-stored-in-an-external-key-vault-recommended","text":"This approach relies on a key management system like Azure Key Vault to hold the secrets and the git manifest in the repositories has reference to the key vault secrets. Developers do not perform any cryptographic operations with files in repositories. Kubernetes operators running in the target cluster are responsible for pulling the secrets from the key vault and making them available either as Kubernetes secrets or secrets volume mounted to the pod. All the below tools share the following: Secrets are not stored in the repository Supports Prometheus metrics for observability Supports sync with Kubernetes Secrets Supports Linux and Windows containers Provides enterprise-grade external secret management Easily scalable with multi-cluster and larger teams Both solutions support either Azure Active Directory (Azure AD) service principal or managed identity for authentication with the Key Vault . For secret rotation ideas, see Secrets Rotation on Environment Variables and Mounted Secrets For how to authenticate private container registries with a service principal see: Authenticated Private Container Registry","title":"Reference to Secrets Stored in an External Key Vault (Recommended)"},{"location":"CI-CD/gitops/secret-management/#azure-key-vault-provider-for-secrets-store-csi-driver","text":"Azure Key Vault Provider (AKVP) for Kubernetes secret store CSI Driver allows you to get secret contents stored in an Azure Key Vault instance and use the Secrets Store CSI driver interface to mount them into Kubernetes pods. Mounts secrets/keys/certs to pod using a CSI Inline volume. Azure Key Vault Provider for Secrets Store CSI Driver install guide . CSI driver will need access to Azure Key Vault either through a service principal or managed identity (recommended). To make this access secure you can leverage Azure AD Workload Identity (recommended) or AAD Pod Identity . Please note AAD pod identity will soon be replaced by workload identity. Product Group Links provided for AKVP with SSCSID: 1. Differences between ESO / SSCSID ( GitHub Issue ) 2. Secrets Management on K8S talk here (Native Secrets, Vault.io, and ESO vs. SSCSID) Advantages: Supports pod portability with the SecretProviderClass CRD Supports auto rotation of secrets with customizable sync intervals per cluster . Seems to be the MSFT choice (Secrets Store CSI driver is heavily contributed by MSFT and Kubernetes-SIG) Disadvantages: Missing disconnected scenario support : When the node is offline the SSCSID fails to fetch the secret and thus mounting the volume fails, making scaling and restarting pods not possible while being offline AKVP can only access Key Vault from a non-Azure environment using a service principal The Kubernetes Secret containing the service principal credentials need to be created as a secret in the same namespace as the application pod. If pods in multiple namespaces need to use the same SP to access Key Vault, this Kubernetes Secret needs to be created in each namespace. 
The GitOps repo must contain the name of the Key Vault within the SecretProviderClass Must mount secrets as volumes to allow syncing into Kubernetes Secrets Uses more resources (4 pods; CSI Storage driver and provider) and is a daemonset - not test on RPS / resource usage","title":"Azure Key Vault Provider for Secrets Store CSI Driver"},{"location":"CI-CD/gitops/secret-management/#external-secrets-operator-with-azure-key-vault","text":"The External Secrets Operator (ESO) is an open-sourced Kubernetes operator that can read secrets from external secret stores (e.g., Azure Key Vault) and sync those into Kubernetes Secrets. In contrast to the CSI Driver, the ESO controller creates the secrets on the cluster as K8s secrets, instead of mounting them as volumes to pods. Docs on using ESO Azure Key vault provider here . ESO will need access to Azure Key Vault either through the use of a service principal or managed identity (via Azure AD Workload Identity (recommended) or AAD Pod Identity ). Advantages: Supports auto rotation of secrets with customizable sync intervals per secret . Components are split into different CRDs for namespace (ExternalSecret, SecretStore) and cluster-wide (ClusterSecretStore, ClusterExternalSecret) making syncing more manageable i.r.t. different deployments/pods etc. Service Principal secret for the (Cluster)SecretStores could placed in a namespaced that only the ESO can access (see Shared ClusterSecretStore ). Resource efficient (single pod) - not test on RPS / resource usage. Open source and high contributions, ( GitHub ) Mounting Secrets as volumes is supported via K8S's APIs (see here ) Partial disconnected scenario support: As ESO is using native K8s secrets the cluster can be offline, and it does not have any implications towards restarting and scaling pods while being offline Disadvantages: The GitOps repo must contain the name of the Key Vault within the SecretStore / ClusterSecretStore or a ConfigMap linking to it Must create secrets as K8s secrets","title":"External Secrets Operator with Azure Key Vault"},{"location":"CI-CD/gitops/secret-management/#resources","text":"Sealed Secrets with Flux v2 Mozilla SOPS with Flux v2 Secret Management with Argo CD Secret management Workflow","title":"Resources"},{"location":"CI-CD/gitops/secret-management/#appendix","text":"","title":"Appendix"},{"location":"CI-CD/gitops/secret-management/#authenticated-private-container-registry","text":"An option on how to authenticate private container registries (e.g., ACR): Use a dockerconfigjson Kubernetes Secret on Pod-Level with ImagePullSecret (This can be also defined on namespace-level )","title":"Authenticated Private Container Registry"},{"location":"CI-CD/gitops/secret-management/azure-devops-secret-management-per-branch/","text":"Azure DevOps: Managing Settings on a Per-Branch Basis When using Azure DevOps Pipelines for CI/CD, it's convenient to leverage the built-in pipeline variables for secrets management , but using pipeline variables for secrets management has its disadvantages: Pipeline variables are managed outside the code that references them. This makes it easy to introduce drift between the source code and the secrets, e.g. adding a reference to a new secret in code but forgetting to add it to the pipeline variables (leads to confusing build breaks), or deleting a reference to a secret in code and forgetting to remote it from the pipeline variables (leads to confusing pipeline variables). Pipeline variables are global shared state. 
This can lead to confusing situations and hard to debug problems when developers make concurrent changes to the pipeline variables which may override each other. Having a single global set of pipeline variables also makes it impossible for secrets to vary per environment (e.g. when using a branch-based deployment model where 'master' deploys using the production secrets, 'development' deploys using the staging secrets, and so forth). A solution to these limitations is to manage secrets in the Git repository jointly with the project's source code. As described in secrets management , don't check secrets into the repository in plain text. Instead we can add an encrypted version of our secrets to the repository and enable our CI/CD agents and developers to decrypt the secrets for local usage with some pre-shared key. This gives us the best of both worlds: a secure storage for secrets as well as side-by-side management of secrets and code. # first, make sure that we never commit our plain text secrets and generate a strong encryption key echo \".env\" >> .gitignore ENCRYPTION_KEY = \" $( LC_ALL = C < /dev/urandom tr -dc '_A-Z-a-z-0-9' | head -c128 ) \" # now let's add some secret to our .env file echo \"MY_SECRET=...\" >> .env # also update our secrets documentation file cat >> .env.template <<< \" # enter description of your secret here MY_SECRET= \" # next, encrypt the plain text secrets; the resulting .env.enc file can safely be committed to the repository echo \" ${ ENCRYPTION_KEY } \" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env -out .env.enc git add .env.enc .env.template git commit -m \"Update secrets\" When running the CI/CD, the build server can now access the secrets by decrypting them. E.g. for Azure DevOps, configure ENCRYPTION_KEY as a secret pipeline variable and then add the following step to azure-pipelines.yml : steps : - script : echo \"$(ENCRYPTION_KEY)\" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env.enc -out .env -d displayName : Decrypt secrets You can also use variable groups linked directly to Azure key vault for your pipelines to manage all secrets in one location.","title":"Azure DevOps: Managing Settings on a Per-Branch Basis"},{"location":"CI-CD/gitops/secret-management/azure-devops-secret-management-per-branch/#azure-devops-managing-settings-on-a-per-branch-basis","text":"When using Azure DevOps Pipelines for CI/CD, it's convenient to leverage the built-in pipeline variables for secrets management , but using pipeline variables for secrets management has its disadvantages: Pipeline variables are managed outside the code that references them. This makes it easy to introduce drift between the source code and the secrets, e.g. adding a reference to a new secret in code but forgetting to add it to the pipeline variables (leads to confusing build breaks), or deleting a reference to a secret in code and forgetting to remote it from the pipeline variables (leads to confusing pipeline variables). Pipeline variables are global shared state. This can lead to confusing situations and hard to debug problems when developers make concurrent changes to the pipeline variables which may override each other. Having a single global set of pipeline variables also makes it impossible for secrets to vary per environment (e.g. when using a branch-based deployment model where 'master' deploys using the production secrets, 'development' deploys using the staging secrets, and so forth). 
A solution to these limitations is to manage secrets in the Git repository jointly with the project's source code. As described in secrets management , don't check secrets into the repository in plain text. Instead we can add an encrypted version of our secrets to the repository and enable our CI/CD agents and developers to decrypt the secrets for local usage with some pre-shared key. This gives us the best of both worlds: a secure storage for secrets as well as side-by-side management of secrets and code. # first, make sure that we never commit our plain text secrets and generate a strong encryption key echo \".env\" >> .gitignore ENCRYPTION_KEY = \" $( LC_ALL = C < /dev/urandom tr -dc '_A-Z-a-z-0-9' | head -c128 ) \" # now let's add some secret to our .env file echo \"MY_SECRET=...\" >> .env # also update our secrets documentation file cat >> .env.template <<< \" # enter description of your secret here MY_SECRET= \" # next, encrypt the plain text secrets; the resulting .env.enc file can safely be committed to the repository echo \" ${ ENCRYPTION_KEY } \" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env -out .env.enc git add .env.enc .env.template git commit -m \"Update secrets\" When running the CI/CD, the build server can now access the secrets by decrypting them. E.g. for Azure DevOps, configure ENCRYPTION_KEY as a secret pipeline variable and then add the following step to azure-pipelines.yml : steps : - script : echo \"$(ENCRYPTION_KEY)\" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env.enc -out .env -d displayName : Decrypt secrets You can also use variable groups linked directly to Azure key vault for your pipelines to manage all secrets in one location.","title":"Azure DevOps: Managing Settings on a Per-Branch Basis"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/","text":"Secrets Rotation of Environment Variables and Mounted Secrets in Pods This document covers some ways you can do secret rotation with environment variables and mounted secrets in Kubernetes pods Mapping Secrets via secretKeyRef with Environment Variables If we map a K8s native secret via a secretKeyRef into an environment variable and we rotate keys the environment variable is not updated even though the K8s native secret has been updated. We need to restart the Pod so changes get populated. Reloader solves this issue with a K8S controller. ... env : - name : EVENTHUB_CONNECTION_STRING valueFrom : secretKeyRef : name : poc-creds key : EventhubConnectionString ... Mapping Secrets via volumeMounts (ESO Way) If we map a K8s native secret via a volume mount and we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... volumeMounts : - name : mounted-secret mountPath : /mnt/secrets-store readOnly : true volumes : - name : mounted-secret secret : secretName : poc-creds ... Mapping Secrets via volumeMounts (AKVP SSCSID Way) SSCSID focuses on mounting external secrets into the CSI. Thus if we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... 
volumeMounts : - name : app-secrets-store-inline mountPath : \"/mnt/app-secrets-store\" readOnly : true volumes : - name : app-secrets-store-inline csi : driver : secrets-store.csi.k8s.io readOnly : true volumeAttributes : secretProviderClass : akvp-app nodePublishSecretRef : name : secrets-store-sp-creds ...","title":"Secrets Rotation of Environment Variables and Mounted Secrets in Pods"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#secrets-rotation-of-environment-variables-and-mounted-secrets-in-pods","text":"This document covers some ways you can do secret rotation with environment variables and mounted secrets in Kubernetes pods","title":"Secrets Rotation of Environment Variables and Mounted Secrets in Pods"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#mapping-secrets-via-secretkeyref-with-environment-variables","text":"If we map a K8s native secret via a secretKeyRef into an environment variable and we rotate keys the environment variable is not updated even though the K8s native secret has been updated. We need to restart the Pod so changes get populated. Reloader solves this issue with a K8S controller. ... env : - name : EVENTHUB_CONNECTION_STRING valueFrom : secretKeyRef : name : poc-creds key : EventhubConnectionString ...","title":"Mapping Secrets via secretKeyRef with Environment Variables"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#mapping-secrets-via-volumemounts-eso-way","text":"If we map a K8s native secret via a volume mount and we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... volumeMounts : - name : mounted-secret mountPath : /mnt/secrets-store readOnly : true volumes : - name : mounted-secret secret : secretName : poc-creds ...","title":"Mapping Secrets via volumeMounts (ESO Way)"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#mapping-secrets-via-volumemounts-akvp-sscsid-way","text":"SSCSID focuses on mounting external secrets into the CSI. Thus if we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... volumeMounts : - name : app-secrets-store-inline mountPath : \"/mnt/app-secrets-store\" readOnly : true volumes : - name : app-secrets-store-inline csi : driver : secrets-store.csi.k8s.io readOnly : true volumeAttributes : secretProviderClass : akvp-app nodePublishSecretRef : name : secrets-store-sp-creds ...","title":"Mapping Secrets via volumeMounts (AKVP SSCSID Way)"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/","text":"Continuous Delivery on Low-Code and No-Code Solutions Low-code and no-code platforms have taken a spot in a wide variety of Business Solutions involving process automation, AI models, Bots, Business Applications and Business Intelligence. The scenarios enabled by these platforms are constantly evolving and opening a spot for productive roles. This has been exactly the reason why bringing more professional tools to their development have become necessary such as controlled and automated delivery. 
In the case of Power Platform products, the adoption of a CI/CD process may seem to add development complexity to a solution oriented to Citizen Developers, but it is more important to make the development process more scalable and capable of dealing with new features and bug corrections in a faster way. Environments in Power Platform Solutions Environments are spaces where Power Platform Solutions exist. They store, manage and share everything related to the solution, like data, apps, chat bots, flows and models. They also serve as containers to separate apps that might have different roles, security requirements or just target audiences. They can be used to create different stages of the solution development process; the expected model of working with environments in a CI/CD process is shown in the following image. Environments Considerations Whenever an environment has been created, its resources can only be accessed by users within the same tenant, which is in fact an Azure Active Directory tenant. When you create an app in an environment, that app can only interact with data sources that are also deployed in that same environment; this includes connections, flows and Dataverse databases. This is an important consideration when dealing with a CD process. Deployment Strategy With three environments already created to represent the stages of the deployment, the goal now is to automate the deployment from one environment to another. Each environment will require the creation of its own solution: business logic and data. Step 1 The development team will be working in a Dev environment. Depending on the team, this could be one environment for the whole team or one for each developer. Once changes have been made, the first step will be packaging the solution and exporting it into source control. Step 2 The second step is about the solution: you need a managed solution to deploy to other environments such as Stage or Production, so now you should use a JIT environment where you import your unmanaged solution and export it as managed. These solution files won't be checked into source control but will be stored as a build artifact in the pipeline, making them available to be deployed in the release pipeline. This is where the second environment will be used. This second environment will be responsible for receiving the output managed solution coming from the artifact. Step 3 The third and final step imports the solution into the production environment; this stage takes the artifact from the last step and deploys it. When working in this environment you can also version your product in order to better trace it. Tools The most used tools to get this process completed are: Power Platform Build Tools There is also a non-graphical tool that could be used to work with this CD process: the Power CLI tool. Resources Application lifecycle management with Microsoft Power Platform","title":"CD on low code solutions"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#continuous-delivery-on-low-code-and-no-code-solutions","text":"Low-code and no-code platforms have taken a spot in a wide variety of Business Solutions involving process automation, AI models, Bots, Business Applications and Business Intelligence. The scenarios enabled by these platforms are constantly evolving and opening a spot for productive roles.
This has been exactly the reason why bringing more professional tools to their development have become necessary such as controlled and automated delivery. In the case of Power Platform products, the adoption of a CI/CD process may seem to increase the development complexity to a solution oriented to Citizen Developers it is more important to make the development process more scalable and capable of dealing with new features and bug corrections in a faster way.","title":"Continuous Delivery on Low-Code and No-Code Solutions"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#environments-in-power-platform-solutions","text":"Environments are spaces where Power Platform Solutions exists. They store, manage and share everything related to the solution like data, apps, chat bots, flows and models. They also serve as containers to separate apps that might have different roles, security requirements or just target audiences. They can be used to create different stages of the solution development process, the expected model of working with environments in a CI/CD process will be as the following image suggests.","title":"Environments in Power Platform Solutions"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#environments-considerations","text":"Whenever an environment has been created, its resources can be only accessed by users within the same tenant which is an Azure Active Directory tenant in fact. When you create an app in an environment that app can only interact with data sources that are also deployed in that same environment, this includes connections, flows and Dataverse databases. This is an important consideration when dealing with a CD process.","title":"Environments Considerations"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#deployment-strategy","text":"With three environments already created to represent the stages of the deployment, the goal now is to automate the deployment from one environment to another. Each environment will require the creation of its own solution: business logic and data.","title":"Deployment Strategy"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#step-1","text":"Development team will be working in a Dev environment. These environments according to the team could be one for the team or one for each developer. Once changes have been made, the first step will be packaging the solution and export it into source control.","title":"Step 1"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#step-2","text":"Second step is about the solution, you need to have a managed solution to deploy to other environments such as Stage or Production so now you should use a JIT environment where you would import your unmanaged solution and export them as managed. These solution files won't be checked into source control but will be stored as a build artifact in the pipeline making them available to be deployed in the release pipeline. This is where the second environment will be used. This second environment will be responsible of receiving the output managed solution coming from the artifact.","title":"Step 2"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#step-3","text":"Third and final step will import the solution into the production environment, this means that this stage will take the artifact from last step and will export it. 
When working in this environment you can also version your product in order to make a better trace of the product.","title":"Step 3"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#tools","text":"Most used tools to get this process completed are: Power Platform Build Tools There is also a non graphical tool that could be used to work with this CD process. The Power CLI tool.","title":"Tools"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#resources","text":"Application lifecycle management with Microsoft Power Platform","title":"Resources"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/","text":"CI Pipeline for Better Documentation Introduction Most projects start with spikes, where developers and analysts produce lots of documentation. Sometimes, these documents don't have a standard and each team member writes them accordingly with their preference. Add to that the time a reviewer will spend confirming grammar, searching for typos or non-inclusive language. This pipeline helps address that! The Pipeline The pipeline uses the following npm modules: markdownlint : add standardization using rules markdown-link-check : check the links in the documentation and report broken ones write-good : linter for English prose We have been using this pipeline for more than one year in different engagements and always received great feedback from the customers! How Does it Work To start using this pipeline: Download the files from this repository Unzip the folders and files to your repository root if the repository is empty - if it's not brand new, copy the files and make the required adjustments: - check .azdo so it matches your repository standard - check package.json so you don't overwrite one you already have in the process. Also update the file if you changed the name of the .azdo folder. Create the pipeline in Azure DevOps or GitHub Resources Markdown Code Reviews in the Engineering Fundamentals Playbook","title":"CI pipeline for better documentation"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#ci-pipeline-for-better-documentation","text":"","title":"CI Pipeline for Better Documentation"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#introduction","text":"Most projects start with spikes, where developers and analysts produce lots of documentation. Sometimes, these documents don't have a standard and each team member writes them accordingly with their preference. Add to that the time a reviewer will spend confirming grammar, searching for typos or non-inclusive language. This pipeline helps address that!","title":"Introduction"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#the-pipeline","text":"The pipeline uses the following npm modules: markdownlint : add standardization using rules markdown-link-check : check the links in the documentation and report broken ones write-good : linter for English prose We have been using this pipeline for more than one year in different engagements and always received great feedback from the customers!","title":"The Pipeline"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#how-does-it-work","text":"To start using this pipeline: Download the files from this repository Unzip the folders and files to your repository root if the repository is empty - if it's not brand new, copy the files and make the required adjustments: - check .azdo so it matches your repository standard - check package.json so you don't overwrite one you already have in the process. 
Also update the file if you changed the name of the .azdo folder. Create the pipeline in Azure DevOps or GitHub","title":"How Does it Work"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#resources","text":"Markdown Code Reviews in the Engineering Fundamentals Playbook","title":"Resources"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/","text":"CI with Jupyter Notebooks As Azure DevOps doesn't allow code reviewers to comment directly in Jupyter Notebooks, Data Scientists(DSs) have to convert the notebooks to scripts before they commit and push these files to the repository. This document aims to automate this process in Azure DevOps, so the DSs don't need to execute anything locally. Problem Statement A Data Science repository has this folder structure: . \u251c\u2500\u2500 notebooks \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 00 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 01 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 02 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 03 .ipynb \u2514\u2500\u2500 scripts \u251c\u2500\u2500 Machine Learning Experiments - 00 .py \u251c\u2500\u2500 Machine Learning Experiments - 01 .py \u251c\u2500\u2500 Machine Learning Experiments - 02 .py \u2514\u2500\u2500 Machine Learning Experiments - 03 .py The python files are needed to allow Pull Request reviewers to add comments to the notebooks, they can add comments to the Python scripts and we apply these comments to the notebooks. Since we have to run this process manually before we add files to a commit, this manual process is error prone, e.g. If we create a notebook, generate the script from it, but later make some changes and forget to generate a new script for the changes. Solution One way to avoid this is to create the scripts in the repository from the commit. This document will describe this process. We can add a pipeline with the following steps to the repository to run in ipynb files: Go to the Project Settings -> Repositories -> Security -> User Permissions Add the Build Service in Users the permission to Contribute Create a new pipeline. In the newly created pipeline we add: Trigger to run on ipynb files: trigger: paths: include: - '*.ipynb' - '**/*.ipynb' Select the pool as Linux: pool: vmImage: ubuntu-latest Set the directory where we want to store the scripts: variables: REPO_URL: # Azure DevOps URL in the format: dev.azure.com///_git/ Now we will start the core of the pipeline: 1. Upgrade pip - script: | python -m pip install --upgrade pip displayName: 'Upgrade pip' 1. Install nbconvert and ipython : - script: | pip install nbconvert ipython displayName: 'install nbconvert & ipython' 1. Install pandoc : - script: | sudo apt install -y pandoc displayName: \"Install pandoc\" 1. Find the notebook files ( ipynb ) in the last commit to the repo and convert it to scripts ( py ): - task: Bash@3 inputs: targetType: 'inline' script: | IPYNB_PATH=($(git diff-tree --no-commit-id --name-only -r $(Build.SourceVersion) | grep '[.]ipynb$')) echo $IPYNB_PATH [ -z \"$IPYNB_PATH\" ] && echo \"Nothing to convert\" || jupyter nbconvert --to script $IPYNB_PATH displayName: \"Convert Notebook to script\" 1. Commit these changes to the repository: - bash: | git config --global user.email \"build@dev.azure.com\" git config --global user.name \"build\" git add . 
git commit -m 'Convert Jupyter notebooks' || echo \"No changes to commit\" && NO_CHANGES=1 [ -z \"$NO_CHANGES\" ] || git push https://$(System.AccessToken)@$(REPO_URL) HEAD:$(Build.SourceBranchName) displayName: \"Commit notebook to repository\" Now we have a pipeline that will generate the scripts as we commit our notebooks.","title":"CI with jupyter notebooks"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/#ci-with-jupyter-notebooks","text":"As Azure DevOps doesn't allow code reviewers to comment directly in Jupyter Notebooks, Data Scientists(DSs) have to convert the notebooks to scripts before they commit and push these files to the repository. This document aims to automate this process in Azure DevOps, so the DSs don't need to execute anything locally.","title":"CI with Jupyter Notebooks"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/#problem-statement","text":"A Data Science repository has this folder structure: . \u251c\u2500\u2500 notebooks \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 00 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 01 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 02 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 03 .ipynb \u2514\u2500\u2500 scripts \u251c\u2500\u2500 Machine Learning Experiments - 00 .py \u251c\u2500\u2500 Machine Learning Experiments - 01 .py \u251c\u2500\u2500 Machine Learning Experiments - 02 .py \u2514\u2500\u2500 Machine Learning Experiments - 03 .py The python files are needed to allow Pull Request reviewers to add comments to the notebooks, they can add comments to the Python scripts and we apply these comments to the notebooks. Since we have to run this process manually before we add files to a commit, this manual process is error prone, e.g. If we create a notebook, generate the script from it, but later make some changes and forget to generate a new script for the changes.","title":"Problem Statement"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/#solution","text":"One way to avoid this is to create the scripts in the repository from the commit. This document will describe this process. We can add a pipeline with the following steps to the repository to run in ipynb files: Go to the Project Settings -> Repositories -> Security -> User Permissions Add the Build Service in Users the permission to Contribute Create a new pipeline. In the newly created pipeline we add: Trigger to run on ipynb files: trigger: paths: include: - '*.ipynb' - '**/*.ipynb' Select the pool as Linux: pool: vmImage: ubuntu-latest Set the directory where we want to store the scripts: variables: REPO_URL: # Azure DevOps URL in the format: dev.azure.com///_git/ Now we will start the core of the pipeline: 1. Upgrade pip - script: | python -m pip install --upgrade pip displayName: 'Upgrade pip' 1. Install nbconvert and ipython : - script: | pip install nbconvert ipython displayName: 'install nbconvert & ipython' 1. Install pandoc : - script: | sudo apt install -y pandoc displayName: \"Install pandoc\" 1. Find the notebook files ( ipynb ) in the last commit to the repo and convert it to scripts ( py ): - task: Bash@3 inputs: targetType: 'inline' script: | IPYNB_PATH=($(git diff-tree --no-commit-id --name-only -r $(Build.SourceVersion) | grep '[.]ipynb$')) echo $IPYNB_PATH [ -z \"$IPYNB_PATH\" ] && echo \"Nothing to convert\" || jupyter nbconvert --to script $IPYNB_PATH displayName: \"Convert Notebook to script\" 1. 
Commit these changes to the repository: - bash: | git config --global user.email \"build@dev.azure.com\" git config --global user.name \"build\" git add . git commit -m 'Convert Jupyter notebooks' || echo \"No changes to commit\" && NO_CHANGES=1 [ -z \"$NO_CHANGES\" ] || git push https://$(System.AccessToken)@$(REPO_URL) HEAD:$(Build.SourceBranchName) displayName: \"Commit notebook to repository\" Now we have a pipeline that will generate the scripts as we commit our notebooks.","title":"Solution"},{"location":"CI-CD/recipes/inclusive-linting/","text":"Inclusive Linting As software professionals we should strive to promote an inclusive work environment, which naturally extends to the code and documentation we write. It's important to keep the use of inclusive language consistent across an entire project or repository. To achieve this, we recommend using a text file analysis tool such as an inclusive linter and including this as a step in your CI pipelines. What to Lint for The primary goal of an inclusive linter is to flag any occurrences of non-inclusive language within source code (and optionally suggest some alternatives). Non-inclusive words or phrases in a project can be found anywhere from comments and documentation to variable names. An inclusive linter may include its own dictionary of \"default\" non-inclusive words and phrases to run against as a good starting point. These tools can also be customizable, oftentimes offering the ability to omit some terms and/or add your own. The ability to add additional terms to your linter has the added benefit of enabling linting of sensitive language on top of inclusive linting. This can prevent things such as customer names or other non-public information from making it into your git history, for instance. Getting Started with an Inclusive Linter woke One inclusive linter we recommend is woke . It is a language-agnostic CLI tool that detects non-inclusive language in your source code and recommends alternatives. While woke automatically applies a default ruleset with non-inclusive terms to lint for, you can also apply a custom rule config (via a yaml file) with additional terms to lint for. Running the tool locally on a file or directory is relatively straightforward: $ woke test.txt test.txt:2:2-6: ` guys ` may be insensitive, use ` folks ` , ` people ` instead ( warning ) * guys ^ woke can be run locally on your machine or CI/CD system via CLI and is also available as a two GitHub Actions: Run woke Run woke with Reviewdog To use the standard \"Run woke\" GitHub Action with the default ruleset in a CI pipeline: Add the woke action as a step in your project's CI pipeline yaml: name : ci on : - pull_request jobs : woke : name : woke runs-on : ubuntu-latest steps : - name : Checkout uses : actions/checkout@v2 - name : woke uses : get-woke/woke-action@v0 with : # Cause the check to fail on any broke rules fail-on-error : true Run your pipeline View the output in the \"Actions\" tab in the main repository view Resources woke default ruleset example.yaml Run woke Run woke with reviewdog docs","title":"Inclusive Linting"},{"location":"CI-CD/recipes/inclusive-linting/#inclusive-linting","text":"As software professionals we should strive to promote an inclusive work environment, which naturally extends to the code and documentation we write. It's important to keep the use of inclusive language consistent across an entire project or repository. 
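If you also want to keep specific sensitive terms (for example a customer name) out of the repository, a custom ruleset for a linter such as woke (introduced below) can be layered on top of its defaults. The file name, rule schema, and -c/--config flag below follow woke's public documentation and should be treated as assumptions; the term itself is made up:

```bash
# Create a minimal custom ruleset (hypothetical customer name used purely as an example).
cat > .woke.yaml <<'EOF'
rules:
  - name: contoso
    terms:
      - contoso
    alternatives:
      - the customer
EOF

# Lint the whole repository with the custom rules applied.
woke -c .woke.yaml .
```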
To achieve this, we recommend using a text file analysis tool such as an inclusive linter and including this as a step in your CI pipelines.","title":"Inclusive Linting"},{"location":"CI-CD/recipes/inclusive-linting/#what-to-lint-for","text":"The primary goal of an inclusive linter is to flag any occurrences of non-inclusive language within source code (and optionally suggest some alternatives). Non-inclusive words or phrases in a project can be found anywhere from comments and documentation to variable names. An inclusive linter may include its own dictionary of \"default\" non-inclusive words and phrases to run against as a good starting point. These tools can also be customizable, oftentimes offering the ability to omit some terms and/or add your own. The ability to add additional terms to your linter has the added benefit of enabling linting of sensitive language on top of inclusive linting. This can prevent things such as customer names or other non-public information from making it into your git history, for instance.","title":"What to Lint for"},{"location":"CI-CD/recipes/inclusive-linting/#getting-started-with-an-inclusive-linter","text":"","title":"Getting Started with an Inclusive Linter"},{"location":"CI-CD/recipes/inclusive-linting/#woke","text":"One inclusive linter we recommend is woke . It is a language-agnostic CLI tool that detects non-inclusive language in your source code and recommends alternatives. While woke automatically applies a default ruleset with non-inclusive terms to lint for, you can also apply a custom rule config (via a yaml file) with additional terms to lint for. Running the tool locally on a file or directory is relatively straightforward: $ woke test.txt test.txt:2:2-6: ` guys ` may be insensitive, use ` folks ` , ` people ` instead ( warning ) * guys ^ woke can be run locally on your machine or CI/CD system via CLI and is also available as a two GitHub Actions: Run woke Run woke with Reviewdog To use the standard \"Run woke\" GitHub Action with the default ruleset in a CI pipeline: Add the woke action as a step in your project's CI pipeline yaml: name : ci on : - pull_request jobs : woke : name : woke runs-on : ubuntu-latest steps : - name : Checkout uses : actions/checkout@v2 - name : woke uses : get-woke/woke-action@v0 with : # Cause the check to fail on any broke rules fail-on-error : true Run your pipeline View the output in the \"Actions\" tab in the main repository view","title":"woke"},{"location":"CI-CD/recipes/inclusive-linting/#resources","text":"woke default ruleset example.yaml Run woke Run woke with reviewdog docs","title":"Resources"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/","text":"Reusing Dev Containers Within a Pipeline Given a repository with a local development container a.k.a. dev container that contains all the tooling required for development, would it make sense to reuse that container for running the tooling in the Continuous Integration pipelines? Options for Building Dev Containers Within a Pipeline There are three ways to build devcontainers within pipeline: With GitHub - devcontainers/ci builds the container with the devcontainer.json . Example here: devcontainers/ci \u00b7 Getting Started . With GitHub - devcontainers/cli , which is the same as the above, but using the underlying CLI directly without tasks. Building the DockerFile with docker build . This option excludes all configuration/features specified within the devcontainer.json . 
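As a rough sketch of the second and third options above, the commands below show how the image could be produced in a pipeline step. The devcontainers CLI flags and the image names are assumptions based on its public documentation rather than part of the original text:

```bash
# Option 2: build via the devcontainers CLI, which honours devcontainer.json configuration/features.
npm install -g @devcontainers/cli
devcontainer build --workspace-folder . --image-name my-devcontainer:ci   # image name is illustrative

# Option 3: plain docker build, which ignores devcontainer.json configuration/features.
docker build -f .devcontainer/Dockerfile -t my-devcontainer:ci .
```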
Considered Options Run CI pipelines in the native environment Run CI pipelines in the dev container via building image locally Run CI pipelines in the dev container with a container registry Here are below pros and cons for both approaches: Run CI Pipelines in the Native Environment Pros Cons Can use any pipeline tasks available Need to keep two sets of tooling and their versions in sync No container registry Can take some time to start, based on tools/dependencies required Agent will always be up to date with security patches The dev container should always be built within each run of the CI pipeline, to verify the changes within the branch haven't broken anything Run CI Pipelines in the Dev Container Without Image Caching Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built Rules used (for linting or unit tests) will be the same on the CI Not everything in the container is needed for the CI pipeline\u00b9 No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Some pipeline tasks will not be available All tooling and their versions defined in a single place Building the image for each pipeline run is slow\u00b2 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken \u00b9: container size can be reduced by exporting the layer that contains only the tooling needed for the CI pipeline \u00b2: could be mitigated via adding image caching without using a container registry Run CI Pipelines in the Dev Container with Image Registry Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Not everything in the container is needed for the CI pipeline\u00b9 Rules used (for linting or unit tests) will be the same on the CI Some pipeline tasks will not be available\u00b2 All tooling and their versions defined in a single place Require access to a container registry to host the container within the pipeline\u00b3 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken Publishing the container built from devcontainer.json allows you to reference it in the cacheFrom in devcontainer.json (see docs ). By doing this, VS Code will use the published image as a layer cache when building \u00b9: container size can be reduces by exporting the layer that contains only the tooling needed for the CI pipeline. This would require building the image without tasks \u00b2: using container jobs in AzDO you can use all tasks (as far as I can tell). Reference: Dockerizing DevOps V2 - AzDO container jobs - DEV Community \u00b3: within GH actions, the default Github Actions token can be used for accessing GHCR without setting up separate registry, see the example below. 
Note: This does not build the Dockerfile together with the devcontainer.json - uses : whoan/docker-build-with-cache-action@v5 id : cache with : username : $GITHUB_ACTOR password : \"${{ secrets.GITHUB_TOKEN }}\" registry : docker.pkg.github.com image_name : devcontainer dockerfile : .devcontainer/Dockerfile","title":"Reusing Dev Containers Within a Pipeline"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#reusing-dev-containers-within-a-pipeline","text":"Given a repository with a local development container a.k.a. dev container that contains all the tooling required for development, would it make sense to reuse that container for running the tooling in the Continuous Integration pipelines?","title":"Reusing Dev Containers Within a Pipeline"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#options-for-building-dev-containers-within-a-pipeline","text":"There are three ways to build devcontainers within pipeline: With GitHub - devcontainers/ci builds the container with the devcontainer.json . Example here: devcontainers/ci \u00b7 Getting Started . With GitHub - devcontainers/cli , which is the same as the above, but using the underlying CLI directly without tasks. Building the DockerFile with docker build . This option excludes all configuration/features specified within the devcontainer.json .","title":"Options for Building Dev Containers Within a Pipeline"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#considered-options","text":"Run CI pipelines in the native environment Run CI pipelines in the dev container via building image locally Run CI pipelines in the dev container with a container registry Here are below pros and cons for both approaches:","title":"Considered Options"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#run-ci-pipelines-in-the-native-environment","text":"Pros Cons Can use any pipeline tasks available Need to keep two sets of tooling and their versions in sync No container registry Can take some time to start, based on tools/dependencies required Agent will always be up to date with security patches The dev container should always be built within each run of the CI pipeline, to verify the changes within the branch haven't broken anything","title":"Run CI Pipelines in the Native Environment"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#run-ci-pipelines-in-the-dev-container-without-image-caching","text":"Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built Rules used (for linting or unit tests) will be the same on the CI Not everything in the container is needed for the CI pipeline\u00b9 No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Some pipeline tasks will not be available All tooling and their versions defined in a single place Building the image for each pipeline run is slow\u00b2 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken \u00b9: container size can be reduced by exporting the layer that contains only the tooling needed for the CI pipeline \u00b2: could be mitigated via adding image caching without using a container registry","title":"Run CI Pipelines in the Dev Container Without Image 
Caching"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#run-ci-pipelines-in-the-dev-container-with-image-registry","text":"Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Not everything in the container is needed for the CI pipeline\u00b9 Rules used (for linting or unit tests) will be the same on the CI Some pipeline tasks will not be available\u00b2 All tooling and their versions defined in a single place Require access to a container registry to host the container within the pipeline\u00b3 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken Publishing the container built from devcontainer.json allows you to reference it in the cacheFrom in devcontainer.json (see docs ). By doing this, VS Code will use the published image as a layer cache when building \u00b9: container size can be reduces by exporting the layer that contains only the tooling needed for the CI pipeline. This would require building the image without tasks \u00b2: using container jobs in AzDO you can use all tasks (as far as I can tell). Reference: Dockerizing DevOps V2 - AzDO container jobs - DEV Community \u00b3: within GH actions, the default Github Actions token can be used for accessing GHCR without setting up separate registry, see the example below. Note: This does not build the Dockerfile together with the devcontainer.json - uses : whoan/docker-build-with-cache-action@v5 id : cache with : username : $GITHUB_ACTOR password : \"${{ secrets.GITHUB_TOKEN }}\" registry : docker.pkg.github.com image_name : devcontainer dockerfile : .devcontainer/Dockerfile","title":"Run CI Pipelines in the Dev Container with Image Registry"},{"location":"CI-CD/recipes/github-actions/runtime-variables/","text":"Runtime Variables in GitHub Actions Objective While GitHub Actions is a popular choice for writing and running CI/CD pipelines, especially for open source projects hosted on GitHub, it lacks specific quality of life features found in other CI/CD environments. One key feature that GitHub Actions has not yet implemented is the ability to mock and inject runtime variables into a workflow, in order to test the pipeline itself. This provides a bridge between a pre-existing feature in Azure DevOps, and one that has not yet released inside GitHub Actions. Target Audience This guide assumes that you are familiar with CI/CD, and understand the security implications of CI/CD pipelines. We also assume basic knowledge with GitHub Actions, including how to write and run a basic CI/CD pipeline, checkout repositories inside the action, use Marketplace Actions with version control, etc. We assume that you, as a CI/CD engineer, want to inject environment variables or environment flags into your pipelines and workflows in order to test them, and are using GitHub Actions to accomplish this. Usage Scenario Many integration or end-to-end workflows require specific environment variables that are only available at runtime. For example, a workflow might be doing the following: In this situation, testing the pipeline is extremely difficult without having to make external calls to the resource. In many cases, making external calls to the resource can be expensive or time-consuming, significantly slowing down inner loop development. 
Azure DevOps, as an example, offers a way to define pipeline variables on a manual trigger: GitHub Actions does not do so yet. Solution To workaround this, the easiest solution is to add runtime variables to either commit messages or the PR Body, and grep for the variable. GitHub Actions provides grep functionality natively using a contains function, which is what we shall be specifically using. In scope: We will scope this to injecting a single environment variable into a pipeline, with a previously known key and value. Out of Scope: While the solution is obviously extensible using shell scripting or any other means of creating variables, this solution serves well as the proof of the basic concept. No such scripting is provided in this guide. Additionally, teams may wish to formalize this process using a PR Template that has an additional section for the variables being provided. This is not however included in this guide. Security Warning: This is NOT for injecting secrets as the commit messages and PR body can be retrieved by a third party, are stored in git log , and can otherwise be read by a malicious individual using a variety of tools. Rather, this is for testing a workflow that needs simple variables to be injected into it, as above. If you need to retrieve secrets or sensitive information , use the GitHub Action for Azure Key Vault or some other similar secret storage and retrieval service. Commit Message Variables How to inject a single variable into the environment for use, with a specified key and value. In this example, the key is COMMIT_VAR and the value is [commit var] . Pre-requisites: Pipeline triggers are correctly set up to trigger on pushed commits (Here we will use actions-test-branch as the branch of choice) Code Snippet: on : push : branches : - actions-test-branch jobs : Echo-On-Commit : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the code is setting up Push triggers on the working branch and checking out the repository, so we will not explore that in detail. - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} This is a named step inside the only Job in our GitHub Actions pipeline. Here, we set an environment variable for the step: Any code or action that the step calls will now have the environment variable available. contains is a GitHub Actions function that is available by default in all workflows. It returns a Boolean true or false value. In this situation, it checks to see if the commit message on the last push, accessed using github.event.head_commit.message . The ${{...}} is necessary to use the GitHub Context and make the functions and github.event variables available for the command. run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi The run command here checks to see if the COMMIT_VAR variable has been set to true , and if it has, it sets a secondary flag to true, and echoes this behavior. 
It does the same if the variable is false . The specific reason to do this is to allow for the flag variable to be used in further steps instead of having to reuse the COMMIT_VAR in every step. Further, it allows for the flag to be used in the if step of an action, as in the next part of the snippet. - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" In this part of the snippet, the next step in the same job is now using the flag that was set in the previous step. This allows the user to: Reuse the flag instead of repeatedly accessing the GitHub Context Set the flag using multiple conditions, instead of just one. For example, a different step might ALSO set the flag to true or false for different reasons. Change the variable in exactly one place instead of having to change it in multiple places Shorter Alternative: The \"Set flag from commit\" step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | echo \"flag=${COMMIT_VAR}\" >> $GITHUB_ENV echo \"set flag to ${COMMIT_VAR}\" Usage: Including the Variable Push to branch master : > git add. > git commit -m \"Running GitHub Actions Test [commit var]\" > git push This triggers the workflow (as will any push). As the [commit var] is in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to true and result in the following: Not Including the Variable Push to branch master : > git add. > git commit -m \"Running GitHub Actions Test\" > git push This triggers the workflow (as will any push). As the [commit var] is not in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to false and result in the following: PR Body Variables When a PR is made, the PR Body can also be used to set up variables. These variables can be made available to all the workflow runs that stem from that PR, which can help ensure that commit messages are more informative and less cluttered, and reduces the work on the developer. Once again, this for an expected key and value. In this case, the key is PR_VAR and the value is [pr var] . Pre-requisites: Pipeline triggers are correctly set up to trigger on a pull request into a specific branch. (Here we will use master as the destination branch.) Code Snippet: on : pull_request : branches : - master jobs : Echo-On-PR : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | if ${PR_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the YAML file simply sets up the Pull Request Trigger. The majority of the following code is identical, so we will only explain the differences. - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} In this section, the PR_VAR environment variable is set to true or false depending on whether the [pr var] string is in the PR Body. 
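For a quick local approximation of that check (outside the workflow), the PR body can be inspected with the GitHub CLI, assuming gh is installed and authenticated; the PR number is a placeholder:

```bash
# Does the pull request body carry the expected marker? (123 is a placeholder PR number.)
if gh pr view 123 --json body --jq .body | grep -qF '[pr var]'; then
  echo "flag=true"
else
  echo "flag=false"
fi
```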
Shorter Alternative: Similarly to the above, the YAML step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | echo \"flag=${PR_VAR}\" >> $GITHUB_ENV echo \"set flag to ${PR_VAR}\" Usage: Create a Pull Request into master , and include the expected variable in the body somewhere: The GitHub Action will trigger automatically, and since [pr var] is present in the PR Body, it will set the flag to true, as shown below: Real World Scenarios There are many real world scenarios where controlling environment variables can be extremely useful. Some are outlined below: Avoiding Expensive External Calls Developer A is in the process of writing and testing an integration pipeline. The integration pipeline needs to make a call to an external service such as Azure Data Factory or Databricks, wait for a result, and then echo that result. The workflow could look like this: The workflow inherently takes time and is expensive to run, as it involves maintaining a Databricks cluster while also waiting for the response. This external dependency can be removed by essentially mocking the response for the duration of writing and testing other parts of the workflow, and mocking the response in situations where the actual response either does not matter, or is not being directly tested. Skipping Long CI processes Developer B is in the process of writing and testing a CI/CD pipeline. The pipeline has multiple CI stages, each of which runs sequentially. The workflow might look like this: In this case, each CI stage needs to run before the next one starts, and errors in the middle of the process can cause the entire pipeline to fail. While this might be intended behavior for the pipeline in some situations (Perhaps you don't want to run a more involved, longer build or run a time-consuming test coverage suite if the CI process is failing), it means that steps need to be commented out or deleted when testing the pipeline itself. Instead, an additional step could check for a [skip ci $N] tag in either the commit messages or PR Body, and skip a specific stage of the CI build. This ensures that the final pipeline does not have changes committed to it that render it broken, as sometimes happens when commenting out/deleting steps. It additionally allows for a mechanism to repeatedly test individual steps by skipping the others, making developing the pipeline significantly easier.","title":"Runtime Variables in GitHub Actions"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#runtime-variables-in-github-actions","text":"","title":"Runtime Variables in GitHub Actions"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#objective","text":"While GitHub Actions is a popular choice for writing and running CI/CD pipelines, especially for open source projects hosted on GitHub, it lacks specific quality of life features found in other CI/CD environments. One key feature that GitHub Actions has not yet implemented is the ability to mock and inject runtime variables into a workflow, in order to test the pipeline itself. 
This provides a bridge between a pre-existing feature in Azure DevOps, and one that has not yet released inside GitHub Actions.","title":"Objective"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#target-audience","text":"This guide assumes that you are familiar with CI/CD, and understand the security implications of CI/CD pipelines. We also assume basic knowledge with GitHub Actions, including how to write and run a basic CI/CD pipeline, checkout repositories inside the action, use Marketplace Actions with version control, etc. We assume that you, as a CI/CD engineer, want to inject environment variables or environment flags into your pipelines and workflows in order to test them, and are using GitHub Actions to accomplish this.","title":"Target Audience"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#usage-scenario","text":"Many integration or end-to-end workflows require specific environment variables that are only available at runtime. For example, a workflow might be doing the following: In this situation, testing the pipeline is extremely difficult without having to make external calls to the resource. In many cases, making external calls to the resource can be expensive or time-consuming, significantly slowing down inner loop development. Azure DevOps, as an example, offers a way to define pipeline variables on a manual trigger: GitHub Actions does not do so yet.","title":"Usage Scenario"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#solution","text":"To workaround this, the easiest solution is to add runtime variables to either commit messages or the PR Body, and grep for the variable. GitHub Actions provides grep functionality natively using a contains function, which is what we shall be specifically using. In scope: We will scope this to injecting a single environment variable into a pipeline, with a previously known key and value. Out of Scope: While the solution is obviously extensible using shell scripting or any other means of creating variables, this solution serves well as the proof of the basic concept. No such scripting is provided in this guide. Additionally, teams may wish to formalize this process using a PR Template that has an additional section for the variables being provided. This is not however included in this guide. Security Warning: This is NOT for injecting secrets as the commit messages and PR body can be retrieved by a third party, are stored in git log , and can otherwise be read by a malicious individual using a variety of tools. Rather, this is for testing a workflow that needs simple variables to be injected into it, as above. If you need to retrieve secrets or sensitive information , use the GitHub Action for Azure Key Vault or some other similar secret storage and retrieval service.","title":"Solution"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#commit-message-variables","text":"How to inject a single variable into the environment for use, with a specified key and value. In this example, the key is COMMIT_VAR and the value is [commit var] . 
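Before pushing, the same condition can be sanity-checked locally against the latest commit message. This is only a shell approximation of the contains() expression used by the workflow, not part of the workflow itself:

```bash
# Does the most recent commit message carry the expected marker?
if git log -1 --pretty=%B | grep -qF '[commit var]'; then
  echo "flag=true"
else
  echo "flag=false"
fi
```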
Pre-requisites: Pipeline triggers are correctly set up to trigger on pushed commits (Here we will use actions-test-branch as the branch of choice) Code Snippet: on : push : branches : - actions-test-branch jobs : Echo-On-Commit : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the code is setting up Push triggers on the working branch and checking out the repository, so we will not explore that in detail. - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} This is a named step inside the only Job in our GitHub Actions pipeline. Here, we set an environment variable for the step: Any code or action that the step calls will now have the environment variable available. contains is a GitHub Actions function that is available by default in all workflows. It returns a Boolean true or false value. In this situation, it checks to see if the commit message on the last push, accessed using github.event.head_commit.message . The ${{...}} is necessary to use the GitHub Context and make the functions and github.event variables available for the command. run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi The run command here checks to see if the COMMIT_VAR variable has been set to true , and if it has, it sets a secondary flag to true, and echoes this behavior. It does the same if the variable is false . The specific reason to do this is to allow for the flag variable to be used in further steps instead of having to reuse the COMMIT_VAR in every step. Further, it allows for the flag to be used in the if step of an action, as in the next part of the snippet. - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" In this part of the snippet, the next step in the same job is now using the flag that was set in the previous step. This allows the user to: Reuse the flag instead of repeatedly accessing the GitHub Context Set the flag using multiple conditions, instead of just one. For example, a different step might ALSO set the flag to true or false for different reasons. Change the variable in exactly one place instead of having to change it in multiple places Shorter Alternative: The \"Set flag from commit\" step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | echo \"flag=${COMMIT_VAR}\" >> $GITHUB_ENV echo \"set flag to ${COMMIT_VAR}\" Usage: Including the Variable Push to branch master : > git add. > git commit -m \"Running GitHub Actions Test [commit var]\" > git push This triggers the workflow (as will any push). As the [commit var] is in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to true and result in the following: Not Including the Variable Push to branch master : > git add. 
> git commit -m \"Running GitHub Actions Test\" > git push This triggers the workflow (as will any push). As the [commit var] is not in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to false and result in the following:","title":"Commit Message Variables"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#pr-body-variables","text":"When a PR is made, the PR Body can also be used to set up variables. These variables can be made available to all the workflow runs that stem from that PR, which can help ensure that commit messages are more informative and less cluttered, and reduces the work on the developer. Once again, this for an expected key and value. In this case, the key is PR_VAR and the value is [pr var] . Pre-requisites: Pipeline triggers are correctly set up to trigger on a pull request into a specific branch. (Here we will use master as the destination branch.) Code Snippet: on : pull_request : branches : - master jobs : Echo-On-PR : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | if ${PR_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the YAML file simply sets up the Pull Request Trigger. The majority of the following code is identical, so we will only explain the differences. - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} In this section, the PR_VAR environment variable is set to true or false depending on whether the [pr var] string is in the PR Body. Shorter Alternative: Similarly to the above, the YAML step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | echo \"flag=${PR_VAR}\" >> $GITHUB_ENV echo \"set flag to ${PR_VAR}\" Usage: Create a Pull Request into master , and include the expected variable in the body somewhere: The GitHub Action will trigger automatically, and since [pr var] is present in the PR Body, it will set the flag to true, as shown below:","title":"PR Body Variables"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#real-world-scenarios","text":"There are many real world scenarios where controlling environment variables can be extremely useful. Some are outlined below:","title":"Real World Scenarios"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#avoiding-expensive-external-calls","text":"Developer A is in the process of writing and testing an integration pipeline. The integration pipeline needs to make a call to an external service such as Azure Data Factory or Databricks, wait for a result, and then echo that result. The workflow could look like this: The workflow inherently takes time and is expensive to run, as it involves maintaining a Databricks cluster while also waiting for the response. 
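As the next paragraph explains, the expensive dependency can be mocked while the rest of the workflow is being developed. A hedged sketch of what a flag-controlled mock might look like; the service URL, endpoint, and payload are entirely hypothetical:

```bash
# Skip the real external call when the injected flag asks for a mocked run.
if [ "${flag:-false}" = "true" ]; then
  # Mocked response (illustrative payload only)
  echo '{"status": "Succeeded", "runId": "mock-123"}' > response.json
else
  # Real call to the external service (placeholder URL)
  curl -sf "https://example-service.invalid/api/runs" -o response.json
fi
cat response.json
```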
This external dependency can be removed by essentially mocking the response for the duration of writing and testing other parts of the workflow, and mocking the response in situations where the actual response either does not matter, or is not being directly tested.","title":"Avoiding Expensive External Calls"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#skipping-long-ci-processes","text":"Developer B is in the process of writing and testing a CI/CD pipeline. The pipeline has multiple CI stages, each of which runs sequentially. The workflow might look like this: In this case, each CI stage needs to run before the next one starts, and errors in the middle of the process can cause the entire pipeline to fail. While this might be intended behavior for the pipeline in some situations (Perhaps you don't want to run a more involved, longer build or run a time-consuming test coverage suite if the CI process is failing), it means that steps need to be commented out or deleted when testing the pipeline itself. Instead, an additional step could check for a [skip ci $N] tag in either the commit messages or PR Body, and skip a specific stage of the CI build. This ensures that the final pipeline does not have changes committed to it that render it broken, as sometimes happens when commenting out/deleting steps. It additionally allows for a mechanism to repeatedly test individual steps by skipping the others, making developing the pipeline significantly easier.","title":"Skipping Long CI processes"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/","text":"Save Terraform Output to a Variable Group (Azure DevOps) This recipe applies only to terraform usage with Azure DevOps. It assumes you're familiar with terraform commands and Azure Pipelines. Context When terraform is used to automate the provisioning of the infrastructure, an Azure Pipeline is generally dedicated to applying terraform configuration files. It will create, update, and delete Azure resources to provision your infrastructure changes. Once files are applied, some Output Values (for instance resource group name, app service name) can be referenced and outputted by terraform. These values generally must be retrieved afterwards and used as input variables for the deployment of services happening in separate pipelines. output \"core_resource_group_name\" { description = \"The resource group name\" value = module.core.resource_group_name } output \"core_key_vault_name\" { description = \"The key vault name.\" value = module.core.key_vault_name } output \"core_key_vault_url\" { description = \"The key vault url.\" value = module.core.key_vault_url } The purpose of this recipe is to answer the following statement: How to make terraform output values available across multiple pipelines ? Solution One suggested solution is to store outputted values in the Library with a Variable Group . Variable groups are a convenient way to store values you might want to pass into a YAML pipeline. In addition, all assets defined in the Library share a common security model. You can control who can define new items in a library, and who can use an existing item. For this purpose, we are using the following commands: terraform output to extract the value of an output variable from the state file (provided by Terraform CLI ) az pipelines variable-group to manage variable groups (provided by Azure DevOps CLI ) You can use the following script once terraform apply is completed to create/update the variable group.
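Before wiring the script into a pipeline, it can help to preview locally which key/value pairs terraform will expose. The one-liner below mirrors what the script does internally, assuming terraform and jq are installed and terraform apply has already run:

```bash
# Preview the outputs that will be pushed into the variable group.
terraform output -json | jq -r 'to_entries[] | "\(.key)=\(.value.value)"'
```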
Script (update-variablegroup.sh) Parameters Name Description DEVOPS_ORGANIZATION The URI of the Azure DevOps organization. DEVOPS_PROJECT The name or id of the Azure DevOps project. GROUP_NAME The name of the variable group targeted. Implementation choices: If a variable group already exists, a valid option could be to delete and rebuild the group from scratch. However, as authorization could have been updated at the group level, we prefer to avoid this option. The script remove instead all variables in the targeted group and add them back with latest values. Permissions are not impacted. A variable group cannot be empty. It must contains at least one variable. A temporary uuid value is created to mitigate this issue, and removed once variables are updated. #!/bin/bash set -e export DEVOPS_ORGANIZATION = $1 export DEVOPS_PROJECT = $2 export GROUP_NAME = $3 # configure the azure devops cli az devops configure --defaults organization = ${ DEVOPS_ORGANIZATION } project = ${ DEVOPS_PROJECT } --use-git-aliases true # get the variable group id (if already exists) group_id = $( az pipelines variable-group list --group-name ${ GROUP_NAME } --query '[0].id' -o json ) if [ -z \" ${ group_id } \" ] ; then # create a new variable group tf_output = $( terraform output -json | jq -r 'to_entries[] | \"\\(.key)=\\(.value.value)\"' ) az pipelines variable-group create --name ${ GROUP_NAME } --variables ${ tf_output } --authorize true else # get existing variables var_list = $( az pipelines variable-group variable list --group-id ${ group_id } ) # add temporary uuid variable (a variable group cannot be empty) uuid = $( cat /proc/sys/kernel/random/uuid ) az pipelines variable-group variable create --group-id ${ group_id } --name ${ uuid } # delete existing variables for row in $( echo ${ var_list } | jq -r 'to_entries[] | \"\\(.key)\"' ) ; do az pipelines variable-group variable delete --group-id ${ group_id } --name ${ row } --yes done # create variables with latest values (from terraform) for row in $( terraform output -json | jq -c 'to_entries[]' ) ; do _jq () { echo ${ row } | jq -r ${ 1 } } az pipelines variable-group variable create --group-id ${ group_id } --name $( _jq '.key' ) --value $( _jq '.value.value' ) --secret $( _jq '.value.sensitive' ) done # delete temporary uuid variable az pipelines variable-group variable delete --group-id ${ group_id } --name ${ uuid } --yes fi Authenticate with Azure DevOps Most commands used in previous script interact with Azure DevOps and do require authentication. You can authenticate using the System.AccessToken security token used by the running pipeline, by assigning it to an environment variable named AZURE_DEVOPS_EXT_PAT , as shown in the following example (see Azure DevOps CLI in Azure Pipeline YAML for additional information). In addition, you can notice we are also using predefined variables to target the Azure DevOps organization and project (respectively System.TeamFoundationCollectionUri and System.TeamProjectId ). - task : Bash@3 displayName : 'Update variable group using terraform outputs' inputs : targetType : filePath arguments : $(System.TeamFoundationCollectionUri) $(System.TeamProjectId) \"Platform-VG\" workingDirectory : $(terraformDirectory) filePath : $(scriptsDirectory)/update-variablegroup.sh env : AZURE_DEVOPS_EXT_PAT : $(System.AccessToken) System variables Description System.AccessToken Special variable that carries the security token used by the running build. System.TeamFoundationCollectionUri The URI of the Azure DevOps organization. 
System.TeamProjectId The ID of the project that this build belongs to. Library security Roles are defined for Library items, and membership of these roles governs the operations you can perform on those items. Role for library item Description Reader Can view the item. User Can use the item when authoring build or release pipelines. For example, you must be a 'User' for a variable group to use it in a release pipeline. Administrator Can also manage membership of all other roles for the item. The user who created an item gets automatically added to the Administrator role for that item. By default, the following groups get added to the Administrator role of the library: Build Administrators, Release Administrators, and Project Administrators. Creator Can create new items in the library, but this role doesn't include Reader or User permissions. The Creator role can't manage permissions for other users. When using System.AccessToken , service account Build Service identity will be used to access the Library. Please ensure in Pipelines > Library > Security section that this service account has Administrator role at the Library or Variable Group level to create/update/delete variables (see. Library of assets for additional information).","title":"Save Terraform Output to a Variable Group (Azure DevOps)"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#save-terraform-output-to-a-variable-group-azure-devops","text":"This recipe applies only to terraform usage with Azure DevOps. It assumes your familiar with terraform commands and Azure Pipelines.","title":"Save Terraform Output to a Variable Group (Azure DevOps)"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#context","text":"When terraform is used to automate the provisioning of the infrastructure, an Azure Pipeline is generally dedicated to apply terraform configuration files. It will create, update, delete Azure resources to provision your infrastructure changes. Once files are applied, some Output Values (for instance resource group name, app service name) can be referenced and outputted by terraform. These values must be generally retrieved afterwards, used as input variables for the deployment of services happening in separate pipelines. output \"core_resource_group_name\" { description = \"The resource group name\" value = module.core.resource_group_name } output \"core_key_vault_name\" { description = \"The key vault name.\" value = module.core.key_vault_name } output \"core_key_vault_url\" { description = \"The key vault url.\" value = module.core.key_vault_url } The purpose of this recipe is to answer the following statement: How to make terraform output values available across multiple pipelines ?","title":"Context"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#solution","text":"One suggested solution is to store outputted values in the Library with a Variable Group . Variable groups is a convenient way store values you might want to be passed into a YAML pipeline. In addition, all assets defined in the Library share a common security model. You can control who can define new items in a library, and who can use an existing item. 
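Once the group has been populated, other pipelines (or an engineer at a terminal) can read the values back with the same CLI. A hedged sketch: the group name Platform-VG and the variable name core_key_vault_url are taken from the examples in this recipe, and an authenticated Azure DevOps CLI session with defaults configured is assumed:

```bash
# Look up the variable group, then read one of the terraform-provided values back.
group_id=$(az pipelines variable-group list --group-name "Platform-VG" --query '[0].id' -o tsv)
az pipelines variable-group variable list --group-id "$group_id" --query 'core_key_vault_url.value' -o tsv
```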
For this purpose, we are using the following commands: terraform output to extract the value of an output variable from the state file (provided by Terraform CLI ) az pipelines variable-group to manage variable groups (provided by Azure DevOps CLI ) You can use the following script once terraform apply is completed to create/update the variable group.","title":"Solution"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#script-update-variablegroupsh","text":"","title":"Script (update-variablegroup.sh)"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#parameters","text":"Name Description DEVOPS_ORGANIZATION The URI of the Azure DevOps organization. DEVOPS_PROJECT The name or id of the Azure DevOps project. GROUP_NAME The name of the variable group targeted. Implementation choices: If a variable group already exists, a valid option could be to delete and rebuild the group from scratch. However, as authorization could have been updated at the group level, we prefer to avoid this option. The script remove instead all variables in the targeted group and add them back with latest values. Permissions are not impacted. A variable group cannot be empty. It must contains at least one variable. A temporary uuid value is created to mitigate this issue, and removed once variables are updated. #!/bin/bash set -e export DEVOPS_ORGANIZATION = $1 export DEVOPS_PROJECT = $2 export GROUP_NAME = $3 # configure the azure devops cli az devops configure --defaults organization = ${ DEVOPS_ORGANIZATION } project = ${ DEVOPS_PROJECT } --use-git-aliases true # get the variable group id (if already exists) group_id = $( az pipelines variable-group list --group-name ${ GROUP_NAME } --query '[0].id' -o json ) if [ -z \" ${ group_id } \" ] ; then # create a new variable group tf_output = $( terraform output -json | jq -r 'to_entries[] | \"\\(.key)=\\(.value.value)\"' ) az pipelines variable-group create --name ${ GROUP_NAME } --variables ${ tf_output } --authorize true else # get existing variables var_list = $( az pipelines variable-group variable list --group-id ${ group_id } ) # add temporary uuid variable (a variable group cannot be empty) uuid = $( cat /proc/sys/kernel/random/uuid ) az pipelines variable-group variable create --group-id ${ group_id } --name ${ uuid } # delete existing variables for row in $( echo ${ var_list } | jq -r 'to_entries[] | \"\\(.key)\"' ) ; do az pipelines variable-group variable delete --group-id ${ group_id } --name ${ row } --yes done # create variables with latest values (from terraform) for row in $( terraform output -json | jq -c 'to_entries[]' ) ; do _jq () { echo ${ row } | jq -r ${ 1 } } az pipelines variable-group variable create --group-id ${ group_id } --name $( _jq '.key' ) --value $( _jq '.value.value' ) --secret $( _jq '.value.sensitive' ) done # delete temporary uuid variable az pipelines variable-group variable delete --group-id ${ group_id } --name ${ uuid } --yes fi","title":"Parameters"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#authenticate-with-azure-devops","text":"Most commands used in previous script interact with Azure DevOps and do require authentication. You can authenticate using the System.AccessToken security token used by the running pipeline, by assigning it to an environment variable named AZURE_DEVOPS_EXT_PAT , as shown in the following example (see Azure DevOps CLI in Azure Pipeline YAML for additional information). 
In addition, you can notice we are also using predefined variables to target the Azure DevOps organization and project (respectively System.TeamFoundationCollectionUri and System.TeamProjectId ). - task : Bash@3 displayName : 'Update variable group using terraform outputs' inputs : targetType : filePath arguments : $(System.TeamFoundationCollectionUri) $(System.TeamProjectId) \"Platform-VG\" workingDirectory : $(terraformDirectory) filePath : $(scriptsDirectory)/update-variablegroup.sh env : AZURE_DEVOPS_EXT_PAT : $(System.AccessToken) System variables Description System.AccessToken Special variable that carries the security token used by the running build. System.TeamFoundationCollectionUri The URI of the Azure DevOps organization. System.TeamProjectId The ID of the project that this build belongs to.","title":"Authenticate with Azure DevOps"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#library-security","text":"Roles are defined for Library items, and membership of these roles governs the operations you can perform on those items. Role for library item Description Reader Can view the item. User Can use the item when authoring build or release pipelines. For example, you must be a 'User' for a variable group to use it in a release pipeline. Administrator Can also manage membership of all other roles for the item. The user who created an item gets automatically added to the Administrator role for that item. By default, the following groups get added to the Administrator role of the library: Build Administrators, Release Administrators, and Project Administrators. Creator Can create new items in the library, but this role doesn't include Reader or User permissions. The Creator role can't manage permissions for other users. When using System.AccessToken , service account Build Service identity will be used to access the Library. Please ensure in Pipelines > Library > Security section that this service account has Administrator role at the Library or Variable Group level to create/update/delete variables (see. Library of assets for additional information).","title":"Library security"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/","text":"Sharing Common Variables / Naming Conventions Between Terraform Modules What are we Trying to Solve? When deploying infrastructure using code, it's common practice to split the code into different modules that are responsible for the deployment of a part or a component of the infrastructure. In Terraform, this can be done by using modules . In this case, it is useful to be able to share some common variables as well as centralize naming conventions of the different resources, to ensure it will be easy to refactor when it has to change, despite the dependencies that exist between modules. For example, let's consider 2 modules: Network module, responsible for deploying Virtual Network, Subnets, NSGs and Private DNS Zones Azure Kubernetes Service module responsible for deploying AKS cluster There are dependencies between these modules, like the Kubernetes cluster that will be deployed into the virtual network from the Network module. To do that, it must reference the name of the virtual network, as well as the resource group it is deployed in. And ideally, we would like these dependencies to be loosely coupled, as much as possible, to keep agility in how the modules are deployed and keep independent lifecycle. This page explains a way to solve this with Terraform. How to Do It? 
Context Let's consider the following structure for our modules: modules \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf Now, assume that you deploy a virtual network for the development environment, with the following properties: name: vnet-dev resource group: rg-dev-network Then at some point, you need to inject these values into the Kubernetes module, to get a reference to it through a data source, for example: data \"azurerm_virtual_network\" \"vnet\" { name = var.vnet_name resource_group_name = var.vnet_rg_name } In the snippet above, the virtual network name and resource group are defined through variables. This is great, but if this changes in the future, then the values of these variables must change too, in every module where they are used. Being able to manage naming in a central place will make sure the code can easily be refactored in the future, without updating all modules. About Terraform Variables In Terraform, every input variable must be defined at the configuration (or module) level, using the variable block. By convention, this is often done in a variables.tf file, in the module. This file contains variable declarations and default values. Values can be set using variable definition files (.tfvars), environment variables, or CLI arguments when using the terraform plan or apply commands. One limitation of variable declarations is that it's not possible to compose variables; locals or Terraform built-in functions must be used for that. Common Terraform Module One way to bypass this limitation is to introduce a \"common\" module that will not deploy any resources, but just compute and output the resource names and shared variables, and be used by all other modules as a dependency. modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf variables.tf: variable \"environment_name\" { type = string description = \"The name of the environment.\" } variable \"location\" { type = string description = \"The Azure region where the resources will be created.
Default is westeurope.\" default = \"westeurope\" } output.tf: # Shared variables output \"location\" { value = var.location } output \"subscription\" { value = var.subscription } # Virtual Network Naming output \"vnet_rg_name\" { value = \"rg-network-${var.environment_name}\" } output \"vnet_name\" { value = \"vnet-${var.environment_name}\" } # AKS Naming output \"aks_rg_name\" { value = \"rg-aks-${var.environment_name}\" } output \"aks_name\" { value = \"aks-${var.environment_name}\" } Now, if you execute the Terraform apply for the common module, you get all the shared/common variables in outputs: $ terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" Changes to Outputs: + aks_name = \"aks-dev\" + aks_rg_name = \"rg-aks-dev\" + location = \"westeurope\" + subscription = \"01010101-1010-0101-1010-010101010101\" + vnet_name = \"vnet-dev\" + vnet_rg_name = \"rg-network-dev\" You can apply this plan to save these new output values to the Terraform state, without changing any real infrastructure. Use the Common Terraform Module Using the common Terraform module in any other module is super easy. For example, this is what you can do in the Azure Kubernetes module main.tf file: module \"common\" { source = \"../common\" environment_name = var.environment_name subscription = var.subscription } data \"azurerm_subnet\" \"aks_subnet\" { name = \"AksSubnet\" virtual_network_name = module.common.vnet_name resource_group_name = module.common.vnet_rg_name } resource \"azurerm_kubernetes_cluster\" \"aks\" { name = module.common.aks_name resource_group_name = module.common.aks_rg_name location = module.common.location dns_prefix = module.common.aks_name identity { type = \"SystemAssigned\" } default_node_pool { name = \"default\" vm_size = \"Standard_DS2_v2\" vnet_subnet_id = data.azurerm_subnet.aks_subnet.id } } Then, you can execute the terraform plan and terraform apply commands to deploy! terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" data.azurerm_subnet.aks_subnet: Reading... data.azurerm_subnet.aks_subnet: Read complete after 1s [ id = /subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet ] Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols: + create Terraform will perform the following actions: # azurerm_kubernetes_cluster.aks will be created + resource \"azurerm_kubernetes_cluster\" \"aks\" { + dns_prefix = \"aks-dev\" + fqdn = ( known after apply ) + id = ( known after apply ) + kube_admin_config = ( known after apply ) + kube_admin_config_raw = ( sensitive value ) + kube_config = ( known after apply ) + kube_config_raw = ( sensitive value ) + kubernetes_version = ( known after apply ) + location = \"westeurope\" + name = \"aks-dev\" + node_resource_group = ( known after apply ) + portal_fqdn = ( known after apply ) + private_cluster_enabled = ( known after apply ) + private_cluster_public_fqdn_enabled = false + private_dns_zone_id = ( known after apply ) + private_fqdn = ( known after apply ) + private_link_enabled = ( known after apply ) + public_network_access_enabled = true + resource_group_name = \"rg-aks-dev\" + sku_tier = \"Free\" [ ... 
] truncated + default_node_pool { + kubelet_disk_type = ( known after apply ) + max_pods = ( known after apply ) + name = \"default\" + node_count = ( known after apply ) + node_labels = ( known after apply ) + orchestrator_version = ( known after apply ) + os_disk_size_gb = ( known after apply ) + os_disk_type = \"Managed\" + os_sku = ( known after apply ) + type = \"VirtualMachineScaleSets\" + ultra_ssd_enabled = false + vm_size = \"Standard_DS2_v2\" + vnet_subnet_id = \"/subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet\" } + identity { + principal_id = ( known after apply ) + tenant_id = ( known after apply ) + type = \"SystemAssigned\" } [ ... ] truncated } Plan: 1 to add, 0 to change, 0 to destroy. Note: the usage of a common module is also valid if you decide to deploy all your modules in the same operation from a main Terraform configuration file, like: module \"common\" { source = \"./common\" environment_name = var.environment_name subscription = var.subscription } module \"network\" { source = \"./network\" vnet_name = module.common.vnet_name vnet_rg_name = module.common.vnet_rg_name } module \"kubernetes\" { source = \"./kubernetes\" aks_name = module.common.aks_name aks_rg = module.common.aks_rg_name } Centralize Input Variables Definitions In case you chose to define variables values directly in the source control (e.g. gitops scenario) using variables definitions files ( .tfvars ), having a common module will also help to not have to duplicate the common variables definitions in all modules. Indeed, it is possible to have a global file that is defined once, at the common module level, and merge it with a module-specific variables definitions files at Terraform plan or apply time. Let's consider the following structure: modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf The common module as well as all other modules contain variables files for dev and prod environment. tfvars files from the common module will define all the global variables that will be shared with other modules (like subscription, environment name, etc.) and .tfvars files of each module will define only the module-specific values. Then, it's possible to merge these files when running the terraform apply or terraform plan command, using the following syntax: terraform plan -var-file = < ( cat ../common/dev.tfvars ./dev.tfvars ) Note: using this, it is really important to ensure that you have not the same variable names in both files, otherwise that will generate an error. Conclusion By having a common module that owns shared variables as well as naming convention, it is now easier to refactor your Terraform configuration code base. 
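In practice, picking up such a change is just a matter of re-running apply for each module. The loop below is a rough, hypothetical sketch: it assumes each module lives in its own folder under modules/, has already been initialized with terraform init, and accepts the same environment_name and subscription variables as the Kubernetes example above.

```bash
# Hypothetical re-apply loop after updating a naming pattern in modules/common/output.tf
subscription="$(az account show --query id -o tsv)"

for module in network kubernetes; do
  # assumes terraform init has already been run in each module folder
  terraform -chdir="modules/${module}" apply \
    -var environment_name="dev" \
    -var subscription="${subscription}"
done
```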
Imagine that for some reason you need to change the pattern that is used for the virtual network name: you change it in the common module output files, and just have to re-apply all modules!","title":"Sharing Common Variables / Naming Conventions Between Terraform Modules"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#sharing-common-variables-naming-conventions-between-terraform-modules","text":"","title":"Sharing Common Variables / Naming Conventions Between Terraform Modules"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#what-are-we-trying-to-solve","text":"When deploying infrastructure using code, it's common practice to split the code into different modules that are responsible for the deployment of a part or a component of the infrastructure. In Terraform, this can be done by using modules . In this case, it is useful to be able to share some common variables as well as centralize naming conventions of the different resources, to ensure it will be easy to refactor when it has to change, despite the dependencies that exist between modules. For example, let's consider 2 modules: Network module, responsible for deploying Virtual Network, Subnets, NSGs and Private DNS Zones Azure Kubernetes Service module responsible for deploying AKS cluster There are dependencies between these modules, like the Kubernetes cluster that will be deployed into the virtual network from the Network module. To do that, it must reference the name of the virtual network, as well as the resource group it is deployed in. And ideally, we would like these dependencies to be loosely coupled, as much as possible, to keep agility in how the modules are deployed and keep independent lifecycle. This page explains a way to solve this with Terraform.","title":"What are we Trying to Solve?"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#how-to-do-it","text":"","title":"How to Do It?"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#context","text":"Let's consider the following structure for our modules: modules \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf Now, assume that you deploy a virtual network for the development environment, with the following properties: name: vnet-dev resource group: rg-dev-network Then at some point, you need to inject these values into the Kubernetes module, to get a reference to it through a data source, for example: data \"azurerm_virtual_network\" \"vnet\" { name = var.vnet_name resource_group_name = var.vnet_rg_name } In the snippet above, the virtual network name and resource group are defined through variables. This is great, but if this changes in the future, then the values of these variables must change too, in every module where they are used. Being able to manage naming in a central place will make sure the code can easily be refactored in the future, without updating all modules.","title":"Context"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#about-terraform-variables","text":"In Terraform, every input variable must be defined at the configuration (or module) level, using the variable block. By convention, this is often done in a variables.tf file, in the module.
This file contains variable declaration and default values. Values can be set using variables configuration files (.tfvars), environment variables or CLI arg when using the terraform plan or apply commands. One of the limitation of the variables declaration is that it's not possible to compose variables, locals or Terraform built-in functions are used for that.","title":"About Terraform Variables"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#common-terraform-module","text":"One way to bypass this limitations is to introduce a \"common\" module, that will not deploy any resources, but just compute / calculate and output the resource names and shared variables, and be used by all other modules, as a dependency. modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf variables.tf: variable \"environment_name\" { type = string description = \"The name of the environment.\" } variable \"location\" { type = string description = \"The Azure region where the resources will be created. Default is westeurope.\" default = \"westeurope\" } output.tf: # Shared variables output \"location\" { value = var.location } output \"subscription\" { value = var.subscription } # Virtual Network Naming output \"vnet_rg_name\" { value = \"rg-network-${var.environment_name}\" } output \"vnet_name\" { value = \"vnet-${var.environment_name}\" } # AKS Naming output \"aks_rg_name\" { value = \"rg-aks-${var.environment_name}\" } output \"aks_name\" { value = \"aks-${var.environment_name}\" } Now, if you execute the Terraform apply for the common module, you get all the shared/common variables in outputs: $ terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" Changes to Outputs: + aks_name = \"aks-dev\" + aks_rg_name = \"rg-aks-dev\" + location = \"westeurope\" + subscription = \"01010101-1010-0101-1010-010101010101\" + vnet_name = \"vnet-dev\" + vnet_rg_name = \"rg-network-dev\" You can apply this plan to save these new output values to the Terraform state, without changing any real infrastructure.","title":"Common Terraform Module"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#use-the-common-terraform-module","text":"Using the common Terraform module in any other module is super easy. For example, this is what you can do in the Azure Kubernetes module main.tf file: module \"common\" { source = \"../common\" environment_name = var.environment_name subscription = var.subscription } data \"azurerm_subnet\" \"aks_subnet\" { name = \"AksSubnet\" virtual_network_name = module.common.vnet_name resource_group_name = module.common.vnet_rg_name } resource \"azurerm_kubernetes_cluster\" \"aks\" { name = module.common.aks_name resource_group_name = module.common.aks_rg_name location = module.common.location dns_prefix = module.common.aks_name identity { type = \"SystemAssigned\" } default_node_pool { name = \"default\" vm_size = \"Standard_DS2_v2\" vnet_subnet_id = data.azurerm_subnet.aks_subnet.id } } Then, you can execute the terraform plan and terraform apply commands to deploy! 
terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" data.azurerm_subnet.aks_subnet: Reading... data.azurerm_subnet.aks_subnet: Read complete after 1s [ id = /subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet ] Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols: + create Terraform will perform the following actions: # azurerm_kubernetes_cluster.aks will be created + resource \"azurerm_kubernetes_cluster\" \"aks\" { + dns_prefix = \"aks-dev\" + fqdn = ( known after apply ) + id = ( known after apply ) + kube_admin_config = ( known after apply ) + kube_admin_config_raw = ( sensitive value ) + kube_config = ( known after apply ) + kube_config_raw = ( sensitive value ) + kubernetes_version = ( known after apply ) + location = \"westeurope\" + name = \"aks-dev\" + node_resource_group = ( known after apply ) + portal_fqdn = ( known after apply ) + private_cluster_enabled = ( known after apply ) + private_cluster_public_fqdn_enabled = false + private_dns_zone_id = ( known after apply ) + private_fqdn = ( known after apply ) + private_link_enabled = ( known after apply ) + public_network_access_enabled = true + resource_group_name = \"rg-aks-dev\" + sku_tier = \"Free\" [ ... ] truncated + default_node_pool { + kubelet_disk_type = ( known after apply ) + max_pods = ( known after apply ) + name = \"default\" + node_count = ( known after apply ) + node_labels = ( known after apply ) + orchestrator_version = ( known after apply ) + os_disk_size_gb = ( known after apply ) + os_disk_type = \"Managed\" + os_sku = ( known after apply ) + type = \"VirtualMachineScaleSets\" + ultra_ssd_enabled = false + vm_size = \"Standard_DS2_v2\" + vnet_subnet_id = \"/subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet\" } + identity { + principal_id = ( known after apply ) + tenant_id = ( known after apply ) + type = \"SystemAssigned\" } [ ... ] truncated } Plan: 1 to add, 0 to change, 0 to destroy. Note: the usage of a common module is also valid if you decide to deploy all your modules in the same operation from a main Terraform configuration file, like: module \"common\" { source = \"./common\" environment_name = var.environment_name subscription = var.subscription } module \"network\" { source = \"./network\" vnet_name = module.common.vnet_name vnet_rg_name = module.common.vnet_rg_name } module \"kubernetes\" { source = \"./kubernetes\" aks_name = module.common.aks_name aks_rg = module.common.aks_rg_name }","title":"Use the Common Terraform Module"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#centralize-input-variables-definitions","text":"In case you chose to define variables values directly in the source control (e.g. gitops scenario) using variables definitions files ( .tfvars ), having a common module will also help to not have to duplicate the common variables definitions in all modules. Indeed, it is possible to have a global file that is defined once, at the common module level, and merge it with a module-specific variables definitions files at Terraform plan or apply time. 
Let's consider the following structure: modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf The common module as well as all other modules contain variables files for dev and prod environment. tfvars files from the common module will define all the global variables that will be shared with other modules (like subscription, environment name, etc.) and .tfvars files of each module will define only the module-specific values. Then, it's possible to merge these files when running the terraform apply or terraform plan command, using the following syntax: terraform plan -var-file = < ( cat ../common/dev.tfvars ./dev.tfvars ) Note: using this, it is really important to ensure that you have not the same variable names in both files, otherwise that will generate an error.","title":"Centralize Input Variables Definitions"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#conclusion","text":"By having a common module that owns shared variables as well as naming convention, it is now easier to refactor your Terraform configuration code base. Imagine that for some reason you need change the pattern that is used for the virtual network name: you change it in the common module output files, and just have to re-apply all modules!","title":"Conclusion"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/","text":"Guidelines on Structuring and Testing the Terraform Configuration Context When creating an infrastructure configuration, it is important to follow a consistent and organized structure to ensure maintainability, scalability and reusability of the code. The goal of this section is to briefly describe how to structure your Terraform configuration in order to achieve this. Structuring the Terraform Configuration The recommended structure is as follows: Place each component you want to configure in its own module folder. Analyze your infrastructure code and identify the logical components that can be separated into reusable modules. This will give you a clear separation of concerns and will make it straight forward to include new resources, update existing ones or reuse them in the future. For more details on modules and when to use them, see the Terraform guidance . Place the .tf module files at the root of each folder and make sure to include a README file in a markdown format which can be automatically generated based on the module code. It's recommended to follow this approach as this file structure will be automatically picked up by the Terraform Registry . Use a consistent set of files to structure your modules. While this can vary depending on the specific needs of the project, one good example can be the following: provider.tf : defines the list of providers according to the plugins used data.tf : defines information read from different data sources main.tf : defines the infrastructure objects needed for your configuration (e.g. 
resource group, role assignment, container registry) backend.tf : backend configuration file outputs.tf : defines structured data that is exported variables.tf : defines static, reusable values Include in each module sub folders for documentation, examples and tests. The documentation includes basic information about the module: what is it installing, what are the options, an example use case and so on. You can also add here any other relevant details you might have. The example folder can include one or more examples of how to use the module, each example having the same set of configuration files decided on the previous step. It's recommended to also include a README providing a clear understanding of how it can be used in practice. The tests folder includes one or more files to test the example module together with a documentation file with instructions on how these tests can be executed . Place the root module in a separate folder called main : this is the primary entry point for the configuration. Like for the other modules, it will contain its corresponding configuration files. An example configuration structure obtained using the guidelines above is: modules \u251c\u2500\u2500 mlops \u2502 \u251c\u2500\u2500 doc \u2502 \u251c\u2500\u2500 example \u2502 \u251c\u2500\u2500 test \u2502 \u251c\u2500\u2500 backend.tf \u2502 \u251c\u2500\u2500 data.tf \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 outputs.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u251c\u2500\u2500 variables.tf \u2502 \u251c\u2500\u2500 README.md \u251c\u2500\u2500 common \u251c\u2500\u2500 main Testing the Configuration To test Terraform configurations, the Terratest library is utilized. A comprehensive guide to best practices with Terratest, including unit tests, integration tests, and end-to-end tests, is available for reference here . Types of tests Unit Test for Module / Resource : Write unit tests for individual modules / resources to ensure that each module behaves as expected in isolation. They are particularly valuable in larger, more complex Terraform configurations where individual modules can be reused and are generally quicker in terms of execution time. Integration Test : These tests verify that the different modules and resources work together as intended. For simple Terraform configurations, extensive unit testing might be overkill. Integration tests might be sufficient in such cases. However, as the complexity grows, unit tests become more valuable. Key aspects to consider Syntax and validation : Use terraform fmt and terraform validate to check the syntax and validate the Terraform configuration during development or in the deployment script / pipeline. This ensures that the configuration is correctly formatted and free of syntax errors. Deployment and existence : Terraform providers, like the Azure provider, perform certain checks during the execution of terraform apply. If Terraform successfully applies a configuration, it typically means that the specified resources were created or modified as expected. In your code you can skip this validation and focus on particular resource configurations that are more critical, described in the next points. Resource properties that can break the functionality : The expectation here is that we're not interested in testing each property of a resource, but to identify the ones that could cause an issue in the system if they are changed, such as access or network policies, service principal permissions and others. 
Validation of Key Vault contents : Ensuring the presence of necessary keys, certificates, or secrets in the Azure Key Vault that are stored as part of resource configuration. Properties that can influence the cost or location : This can be achieved by asserting the locations, service tiers, storage settings, depending on the properties available for the resources. Naming Convention When naming Terraform variables, it's essential to use clear and consistent naming conventions that are easy to understand and follow. The general convention is to use lowercase letters and numbers, with underscores instead of dashes, for example: \"azurerm_resource_group\". When naming resources, start with the provider's name, followed by the target resource, separated by underscores. For instance, \"azurerm_postgresql_server\" is an appropriate name for an Azure provider resource. When it comes to data sources, use a similar naming convention, but make sure to use plural names for lists of items. For example, \"azurerm_resource_groups\" is a good name for a data source that represents a list of resource groups. Variable and output names should be descriptive and reflect the purpose or use of the variable. It's also helpful to group related items together using a common prefix. For example, all variables related to storage accounts could start with \"storage_\". Keep in mind that outputs should be understandable outside of their scope. A useful naming pattern to follow is \"{name}_{attribute}\", where \"name\" represents a resource or data source name, and \"attribute\" is the attribute returned by the output. For example, \"storage_primary_connection_string\" could be a valid output name. Make sure you include a description for outputs and variables, as well as marking the values as 'default' or 'sensitive' when the case. This information will be captured in the generated documentation. Generating the Documentation The documentation can be automatically generated based on the configuration code in your modules with the help of terraform-docs . To generate the Terraform module documentation, go to the module folder and enter this command: terraform-docs markdown table --output-file README.md --output-mode inject . Then, the documentation will be generated inside the component root directory. Conclusion The approach presented in this section is designed to be flexible and easy to use, making it straight forward to add new resources or update existing ones. The separation of concerns also makes it easy to reuse existing components in other projects, with all the information (modules, examples, documentation and tests) located in one place. Resources Terraform-docs Terraform Registry Terraform Module Guidance Terratest Testing HashiCorp Terraform Build Infrastructure - Terraform Azure Example","title":"Guidelines on Structuring and Testing the Terraform Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#guidelines-on-structuring-and-testing-the-terraform-configuration","text":"","title":"Guidelines on Structuring and Testing the Terraform Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#context","text":"When creating an infrastructure configuration, it is important to follow a consistent and organized structure to ensure maintainability, scalability and reusability of the code. 
The goal of this section is to briefly describe how to structure your Terraform configuration in order to achieve this.","title":"Context"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#structuring-the-terraform-configuration","text":"The recommended structure is as follows: Place each component you want to configure in its own module folder. Analyze your infrastructure code and identify the logical components that can be separated into reusable modules. This will give you a clear separation of concerns and will make it straight forward to include new resources, update existing ones or reuse them in the future. For more details on modules and when to use them, see the Terraform guidance . Place the .tf module files at the root of each folder and make sure to include a README file in a markdown format which can be automatically generated based on the module code. It's recommended to follow this approach as this file structure will be automatically picked up by the Terraform Registry . Use a consistent set of files to structure your modules. While this can vary depending on the specific needs of the project, one good example can be the following: provider.tf : defines the list of providers according to the plugins used data.tf : defines information read from different data sources main.tf : defines the infrastructure objects needed for your configuration (e.g. resource group, role assignment, container registry) backend.tf : backend configuration file outputs.tf : defines structured data that is exported variables.tf : defines static, reusable values Include in each module sub folders for documentation, examples and tests. The documentation includes basic information about the module: what is it installing, what are the options, an example use case and so on. You can also add here any other relevant details you might have. The example folder can include one or more examples of how to use the module, each example having the same set of configuration files decided on the previous step. It's recommended to also include a README providing a clear understanding of how it can be used in practice. The tests folder includes one or more files to test the example module together with a documentation file with instructions on how these tests can be executed . Place the root module in a separate folder called main : this is the primary entry point for the configuration. Like for the other modules, it will contain its corresponding configuration files. An example configuration structure obtained using the guidelines above is: modules \u251c\u2500\u2500 mlops \u2502 \u251c\u2500\u2500 doc \u2502 \u251c\u2500\u2500 example \u2502 \u251c\u2500\u2500 test \u2502 \u251c\u2500\u2500 backend.tf \u2502 \u251c\u2500\u2500 data.tf \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 outputs.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u251c\u2500\u2500 variables.tf \u2502 \u251c\u2500\u2500 README.md \u251c\u2500\u2500 common \u251c\u2500\u2500 main","title":"Structuring the Terraform Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#testing-the-configuration","text":"To test Terraform configurations, the Terratest library is utilized. 
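As a rough sketch of how these tests are commonly executed (assuming the tests in a module's test folder are written in Go against Terratest, as in the example structure above; the module name is illustrative):

```bash
# Hypothetical run from one module's test folder
cd modules/mlops/test

# Fetch Terratest and the other Go dependencies declared in go.mod
go mod download

# Terraform provisioning is slow, so extend the default 10 minute test timeout
go test -v -timeout 30m
```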
A comprehensive guide to best practices with Terratest, including unit tests, integration tests, and end-to-end tests, is available for reference here .","title":"Testing the Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#types-of-tests","text":"Unit Test for Module / Resource : Write unit tests for individual modules / resources to ensure that each module behaves as expected in isolation. They are particularly valuable in larger, more complex Terraform configurations where individual modules can be reused and are generally quicker in terms of execution time. Integration Test : These tests verify that the different modules and resources work together as intended. For simple Terraform configurations, extensive unit testing might be overkill. Integration tests might be sufficient in such cases. However, as the complexity grows, unit tests become more valuable.","title":"Types of tests"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#key-aspects-to-consider","text":"Syntax and validation : Use terraform fmt and terraform validate to check the syntax and validate the Terraform configuration during development or in the deployment script / pipeline. This ensures that the configuration is correctly formatted and free of syntax errors. Deployment and existence : Terraform providers, like the Azure provider, perform certain checks during the execution of terraform apply. If Terraform successfully applies a configuration, it typically means that the specified resources were created or modified as expected. In your code you can skip this validation and focus on particular resource configurations that are more critical, described in the next points. Resource properties that can break the functionality : The expectation here is that we're not interested in testing each property of a resource, but to identify the ones that could cause an issue in the system if they are changed, such as access or network policies, service principal permissions and others. Validation of Key Vault contents : Ensuring the presence of necessary keys, certificates, or secrets in the Azure Key Vault that are stored as part of resource configuration. Properties that can influence the cost or location : This can be achieved by asserting the locations, service tiers, storage settings, depending on the properties available for the resources.","title":"Key aspects to consider"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#naming-convention","text":"When naming Terraform variables, it's essential to use clear and consistent naming conventions that are easy to understand and follow. The general convention is to use lowercase letters and numbers, with underscores instead of dashes, for example: \"azurerm_resource_group\". When naming resources, start with the provider's name, followed by the target resource, separated by underscores. For instance, \"azurerm_postgresql_server\" is an appropriate name for an Azure provider resource. When it comes to data sources, use a similar naming convention, but make sure to use plural names for lists of items. For example, \"azurerm_resource_groups\" is a good name for a data source that represents a list of resource groups. Variable and output names should be descriptive and reflect the purpose or use of the variable. It's also helpful to group related items together using a common prefix. For example, all variables related to storage accounts could start with \"storage_\". 
Keep in mind that outputs should be understandable outside of their scope. A useful naming pattern to follow is \"{name}_{attribute}\", where \"name\" represents a resource or data source name, and \"attribute\" is the attribute returned by the output. For example, \"storage_primary_connection_string\" could be a valid output name. Make sure you include a description for outputs and variables, as well as marking the values as 'default' or 'sensitive' when the case. This information will be captured in the generated documentation.","title":"Naming Convention"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#generating-the-documentation","text":"The documentation can be automatically generated based on the configuration code in your modules with the help of terraform-docs . To generate the Terraform module documentation, go to the module folder and enter this command: terraform-docs markdown table --output-file README.md --output-mode inject . Then, the documentation will be generated inside the component root directory.","title":"Generating the Documentation"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#conclusion","text":"The approach presented in this section is designed to be flexible and easy to use, making it straight forward to add new resources or update existing ones. The separation of concerns also makes it easy to reuse existing components in other projects, with all the information (modules, examples, documentation and tests) located in one place.","title":"Conclusion"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#resources","text":"Terraform-docs Terraform Registry Terraform Module Guidance Terratest Testing HashiCorp Terraform Build Infrastructure - Terraform Azure Example","title":"Resources"},{"location":"UI-UX/","text":"User Interface and User Experience Engineering Also known as UI/UX , Front End Development , or Web Development , user interface and user experience engineering is a broad topic and encompasses many different aspects of modern application development. When a user interface is required, ISE primarily develops a web application . Web apps can be built in a variety of ways with many different tools. Goal The goal of the User Interface section is to provide guidance on developing web applications. Everyone should begin by reading the General Guidance for a quick introduction to the four main aspects of every web application project. From there, readers are encouraged to dive deeper into each topic, or begin reviewing technical guidance that pertains to their engagement. All UI/UX projects should begin with a detailed design document. Review the Design Process section for more details, and a template to get started. Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience. General Guidance The state of web platform engineering is fast moving. There is no one-size-fits-all solution. For any team to be successful in building a UI, they need to have an understanding of the higher-level aspects of all UI project. Accessibility - ensuring your application is usable and enjoyed by as many people as possible is at the heart of accessibility and inclusive design. Usability - how effortless should it be for any given user to use the application? Do they need special training or a document to understand how to use it, or will it be intuitive? 
Maintainability - is the application just a proof of concept to showcase an idea for future work, or will it be an MVP and act as the starting point for a larger, production-ready application? Sometimes you don't need React or any other framework. Sometimes you need React, but not all the bells and whistles from create-react-app. Understanding project maintainability requirements can simplify an engagement\u2019s tooling needs significantly and let folks iterate without headaches. Stability - what is the cost of adding a dependency? Is it actively stable/updated/maintained? If not, can you afford the tech debt (sometimes the answer can be yes!)? Could you get 90% of the way there without adding another dependency? More information is available for each general guidance section in the corresponding pages. Design Process All user interface applications begin with the design process. The true definition for \"the design process\" is ever changing and highly opinion based as well. This sections aims to deliver a general overview of a design process any engineering team could conduct when starting an UI application engagement. When committing to a UI/UX project, be certain to not over-promise on the web application requirements. Delivering a production-ready application involves a large number of engineering complexities resulting in a very long timeline. Always start with a proof-of-concept or minimum-viable-product first. These projects can easily be achieved within a couple month timeline (and sometimes even less). The first step in the design process is to understand the problem at hand and outline what the solution should achieve. Commonly referred to as Desired Outcomes , the output of this first step should be a generalized list of outcomes that the solution will accomplish. Consider the following example: A public library has a set of data containing information about its collection. The data stores text, images, and the status of a book (borrowed, available, reserved). The library librarian wants to share this data with its users. As the librarian, I want to notify users before they receive late penalties for overdue books As the librarian, I want to notify users when a book they have reserved becomes available With the desired outcomes in mind, the next step in the design process is to define user personas. Regardless of the solution for a given problem, understanding the user needs leads to a better understanding of feature development and technological choices. Personas are written as prose-like paragraphs that describe different types of users. Considering the previous example, the various user personas could be: An individual with no disabilities, but is unfamiliar with using software interfaces An individual with no disabilities, and is familiar with using software interfaces An individual with disabilities, and is unfamiliar with using software interfaces (with or without the use of accessibility tooling) An individual with disabilities, but familiar with using software interfaces through the use of accessibility tooling After defining these personas it is clear that whatever the solution is, it requires a lot of accessibility and user experience design work. Sometimes personas can be simpler than this, but always include disabled users . Even when a user set is predefined as a group of individuals without disabilities, there is no guarantee that the user set will remain that way. 
After defining the desired outcomes as well as the personas , the next step in the design process is to begin conducting Trade Studies for potential solutions. The first trade study should be high-level and solution oriented. It will utilize the results of previous steps and propose multiple solutions for achieving the desired outcomes with the listed personas in mind. Continuing with the library example, this first trade study may compare various application solutions such as automated emails or text messages, an RSS feed, or an user interface application. There are pros and cons for each solution both from an user experience and a developer experience perspective, but at this stage it is important to focus on the users. After arriving on the best solution, the next trade study can dive into different implementation methods. It is in this subsequent trade studies that developer experience becomes more important. The benefit of building software applications is that there are truly infinite ways to build something. A team can use the latest shiny tools, or they can utilize the tried-and-tested ones. It is for this reason that focussing completely on the user until a solution is defined is better than obsessing over technology choices. Within ISE, we often reach for tools such as the React framework. React is a great tool when wielded by an experienced team. Otherwise, it can create more hurdles than it is worth. Keep in mind that even if you feel capable with React, the rest of your team and your customer's dev team needs to as well. Some other great options to consider when building a proof-of-concept or minimum-viable-product are: HTML/CSS/JavaScript Back to the basics! Start with a single index.html , include a popular CSS framework such as Bootstrap using their CDN link, and start prototyping! Rarely will you have to support legacy browsers; thus, you can rely on modern JavaScript language features! No need for build tools or even TypeScript (did you know you can type check JavaScript ). Web Component frameworks Web Components are now standardized in all modern browsers Microsoft has their own, stable & actively-maintained framework, Fast For more information of choosing the right implementation tool, read the Recommended Technologies document. Continue reading the Trade Study section of this site for more information on completing this step in the design process. After iterating through multiple trade study documents, this design process can be considered complete! With an agreed upon solution and implementation in mind, it is now time to begin development. A natural continuation of the design process is to get users (or stakeholders) involved as early as possible. Constantly look for design and usability feedback, and utilize this to improve the application as it is being developed.","title":"User Interface and User Experience Engineering"},{"location":"UI-UX/#user-interface-and-user-experience-engineering","text":"Also known as UI/UX , Front End Development , or Web Development , user interface and user experience engineering is a broad topic and encompasses many different aspects of modern application development. When a user interface is required, ISE primarily develops a web application . Web apps can be built in a variety of ways with many different tools.","title":"User Interface and User Experience Engineering"},{"location":"UI-UX/#goal","text":"The goal of the User Interface section is to provide guidance on developing web applications. 
Everyone should begin by reading the General Guidance for a quick introduction to the four main aspects of every web application project. From there, readers are encouraged to dive deeper into each topic, or begin reviewing technical guidance that pertains to their engagement. All UI/UX projects should begin with a detailed design document. Review the Design Process section for more details, and a template to get started. Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience.","title":"Goal"},{"location":"UI-UX/#general-guidance","text":"The state of web platform engineering is fast moving. There is no one-size-fits-all solution. For any team to be successful in building a UI, they need to have an understanding of the higher-level aspects of all UI project. Accessibility - ensuring your application is usable and enjoyed by as many people as possible is at the heart of accessibility and inclusive design. Usability - how effortless should it be for any given user to use the application? Do they need special training or a document to understand how to use it, or will it be intuitive? Maintainability - is the application just a proof of concept to showcase an idea for future work, or will it be an MVP and act as the starting point for a larger, production-ready application? Sometimes you don't need React or any other framework. Sometimes you need React, but not all the bells and whistles from create-react-app. Understanding project maintainability requirements can simplify an engagement\u2019s tooling needs significantly and let folks iterate without headaches. Stability - what is the cost of adding a dependency? Is it actively stable/updated/maintained? If not, can you afford the tech debt (sometimes the answer can be yes!)? Could you get 90% of the way there without adding another dependency? More information is available for each general guidance section in the corresponding pages.","title":"General Guidance"},{"location":"UI-UX/#design-process","text":"All user interface applications begin with the design process. The true definition for \"the design process\" is ever changing and highly opinion based as well. This sections aims to deliver a general overview of a design process any engineering team could conduct when starting an UI application engagement. When committing to a UI/UX project, be certain to not over-promise on the web application requirements. Delivering a production-ready application involves a large number of engineering complexities resulting in a very long timeline. Always start with a proof-of-concept or minimum-viable-product first. These projects can easily be achieved within a couple month timeline (and sometimes even less). The first step in the design process is to understand the problem at hand and outline what the solution should achieve. Commonly referred to as Desired Outcomes , the output of this first step should be a generalized list of outcomes that the solution will accomplish. Consider the following example: A public library has a set of data containing information about its collection. The data stores text, images, and the status of a book (borrowed, available, reserved). The library librarian wants to share this data with its users. 
As the librarian, I want to notify users before they receive late penalties for overdue books As the librarian, I want to notify users when a book they have reserved becomes available With the desired outcomes in mind, the next step in the design process is to define user personas. Regardless of the solution for a given problem, understanding the user needs leads to a better understanding of feature development and technological choices. Personas are written as prose-like paragraphs that describe different types of users. Considering the previous example, the various user personas could be: An individual with no disabilities, but is unfamiliar with using software interfaces An individual with no disabilities, and is familiar with using software interfaces An individual with disabilities, and is unfamiliar with using software interfaces (with or without the use of accessibility tooling) An individual with disabilities, but familiar with using software interfaces through the use of accessibility tooling After defining these personas it is clear that whatever the solution is, it requires a lot of accessibility and user experience design work. Sometimes personas can be simpler than this, but always include disabled users . Even when a user set is predefined as a group of individuals without disabilities, there is no guarantee that the user set will remain that way. After defining the desired outcomes as well as the personas , the next step in the design process is to begin conducting Trade Studies for potential solutions. The first trade study should be high-level and solution oriented. It will utilize the results of previous steps and propose multiple solutions for achieving the desired outcomes with the listed personas in mind. Continuing with the library example, this first trade study may compare various application solutions such as automated emails or text messages, an RSS feed, or an user interface application. There are pros and cons for each solution both from an user experience and a developer experience perspective, but at this stage it is important to focus on the users. After arriving on the best solution, the next trade study can dive into different implementation methods. It is in this subsequent trade studies that developer experience becomes more important. The benefit of building software applications is that there are truly infinite ways to build something. A team can use the latest shiny tools, or they can utilize the tried-and-tested ones. It is for this reason that focussing completely on the user until a solution is defined is better than obsessing over technology choices. Within ISE, we often reach for tools such as the React framework. React is a great tool when wielded by an experienced team. Otherwise, it can create more hurdles than it is worth. Keep in mind that even if you feel capable with React, the rest of your team and your customer's dev team needs to as well. Some other great options to consider when building a proof-of-concept or minimum-viable-product are: HTML/CSS/JavaScript Back to the basics! Start with a single index.html , include a popular CSS framework such as Bootstrap using their CDN link, and start prototyping! Rarely will you have to support legacy browsers; thus, you can rely on modern JavaScript language features! No need for build tools or even TypeScript (did you know you can type check JavaScript ). 
Web Component frameworks Web Components are now standardized in all modern browsers Microsoft has their own, stable & actively-maintained framework, Fast For more information of choosing the right implementation tool, read the Recommended Technologies document. Continue reading the Trade Study section of this site for more information on completing this step in the design process. After iterating through multiple trade study documents, this design process can be considered complete! With an agreed upon solution and implementation in mind, it is now time to begin development. A natural continuation of the design process is to get users (or stakeholders) involved as early as possible. Constantly look for design and usability feedback, and utilize this to improve the application as it is being developed.","title":"Design Process"},{"location":"UI-UX/recommended-technologies/","text":"Recommended Technologies The purpose of this page is to review the commonly selected technology options when developing user interface applications. To reiterate from the general guidance section: Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience. Additionally, while some of these technologies are presented as alternate options, many can be combined together. For example, you can use React in a basic HTML/CSS/JS workflow by inline-importing React along with Babel. See the Add React to a Website for more details. Similarly, any Fast web component can be integrated into any existing React application . And of course, every JavaScript technology can also be used with TypeScript! TypeScript TypeScript is JavaScript with syntax for types. TypeScript is a strongly typed programming language that builds on JavaScript, giving you better tooling at any scale. typescriptlang.org TypeScript is highly recommended for all new web application projects. The stability it provides for teams is unmatched, and can make it easier for folks with C# backgrounds to work with web technologies. There are many ways to integrate TypeScript into a web application. The easiest way to get started is by reviewing the TypeScript Tooling in 5 Minutes guide from the official TypeScript docs. The other sections on this page contain information regarding integration with TypeScript. React React is a framework developed and maintained by Facebook. React is used throughout Microsoft and has a vast open source community. Documentation & Recommended Resources One can expect to find a multitude of guides, answers, and posts on how to work with React; don't take everything at face value. The best place to review React concepts is the React documentation. From there, you can review articles from various sources such as React Community Articles , Kent C Dodd's Blog , CSS Tricks Articles , and Awesome React . The React API has changed dramatically over time. Older resources may contain solutions or patterns that have since been changed and improved upon. Modern React development uses the React Hooks pattern. Rarely will you have to implement something using React Class pattern. If you're reading an article/answer/docs that instruct you to use the class pattern you may be looking at an out-of-date resource. Bootstrapping There are many different ways to bootstrap a React application. Two great tool sets to use are create-react-app and vite . 
create-react-app From Adding TypeScript npx create-react-app my-app --template typescript Vite From Scaffolding your First Vite Project # npm 6.x npm init vite@latest my-app --template react-ts # npm 7.x npm init vite@latest my-app -- --template react-ts","title":"Recommended Technologies"},{"location":"UI-UX/recommended-technologies/#recommended-technologies","text":"The purpose of this page is to review the commonly selected technology options when developing user interface applications. To reiterate from the general guidance section: Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience. Additionally, while some of these technologies are presented as alternate options, many can be combined together. For example, you can use React in a basic HTML/CSS/JS workflow by inline-importing React along with Babel. See the Add React to a Website for more details. Similarly, any Fast web component can be integrated into any existing React application . And of course, every JavaScript technology can also be used with TypeScript!","title":"Recommended Technologies"},{"location":"UI-UX/recommended-technologies/#typescript","text":"TypeScript is JavaScript with syntax for types. TypeScript is a strongly typed programming language that builds on JavaScript, giving you better tooling at any scale. typescriptlang.org TypeScript is highly recommended for all new web application projects. The stability it provides for teams is unmatched, and can make it easier for folks with C# backgrounds to work with web technologies. There are many ways to integrate TypeScript into a web application. The easiest way to get started is by reviewing the TypeScript Tooling in 5 Minutes guide from the official TypeScript docs. The other sections on this page contain information regarding integration with TypeScript.","title":"TypeScript"},{"location":"UI-UX/recommended-technologies/#react","text":"React is a framework developed and maintained by Facebook. React is used throughout Microsoft and has a vast open source community.","title":"React"},{"location":"UI-UX/recommended-technologies/#documentation-recommended-resources","text":"One can expect to find a multitude of guides, answers, and posts on how to work with React; don't take everything at face value. The best place to review React concepts is the React documentation. From there, you can review articles from various sources such as React Community Articles , Kent C Dodd's Blog , CSS Tricks Articles , and Awesome React . The React API has changed dramatically over time. Older resources may contain solutions or patterns that have since been changed and improved upon. Modern React development uses the React Hooks pattern. Rarely will you have to implement something using React Class pattern. If you're reading an article/answer/docs that instruct you to use the class pattern you may be looking at an out-of-date resource.","title":"Documentation & Recommended Resources"},{"location":"UI-UX/recommended-technologies/#bootstrapping","text":"There are many different ways to bootstrap a React application. 
Two great tool sets to use are create-react-app and vite .","title":"Bootstrapping"},{"location":"UI-UX/recommended-technologies/#create-react-app","text":"From Adding TypeScript npx create-react-app my-app --template typescript","title":"create-react-app"},{"location":"UI-UX/recommended-technologies/#vite","text":"From Scaffolding your First Vite Project # npm 6.x npm init vite@latest my-app --template react-ts # npm 7.x npm init vite@latest my-app -- --template react-ts","title":"Vite"},{"location":"agile-development/","text":"Agile Development In this documentation we refer to the team working on an engagement a \"Crew\" . This includes the dev team, dev lead, PM, data scientists, etc. Why Agile We want to be quick to respond to change We want to get to a state of working software fast, and iterate on it to improve it We want to keep the customer/end users involved all the way through We care about individuals and interactions over documents and processes The Fundamentals We care about the goal for each activity, but not necessarily about how they are accomplished. The suggestions in parenthesis are common ways to accomplish the goals. We keep a shared backlog of work, that everyone in the team can always access (ex. Azure DevOps or GitHub) We plan our work in iterations with clear goals (ex. sprints) We have a clear idea of when work items are ready to implement (ex. definition of ready) We have a clear idea of when work items are completed (ex. definition of done) We communicate the progress in one place that everyone can access, and keep the progress up to date (ex. sprint board and daily standups) We reflect on our work regularly to make improvements (ex. retrospectives) The team has a clear idea of the roles and responsibilities in the project (ex. Dev lead, TPM, Process Lead etc.) The team has a joint idea of how we work together (ex. team agreement) We value and respect the opinions and work of all team members. References What Is Scrum? Essential Scrum: A Practical Guide to The Most Popular Agile Process","title":"Agile Development"},{"location":"agile-development/#agile-development","text":"In this documentation we refer to the team working on an engagement a \"Crew\" . This includes the dev team, dev lead, PM, data scientists, etc.","title":"Agile Development"},{"location":"agile-development/#why-agile","text":"We want to be quick to respond to change We want to get to a state of working software fast, and iterate on it to improve it We want to keep the customer/end users involved all the way through We care about individuals and interactions over documents and processes","title":"Why Agile"},{"location":"agile-development/#the-fundamentals","text":"We care about the goal for each activity, but not necessarily about how they are accomplished. The suggestions in parenthesis are common ways to accomplish the goals. We keep a shared backlog of work, that everyone in the team can always access (ex. Azure DevOps or GitHub) We plan our work in iterations with clear goals (ex. sprints) We have a clear idea of when work items are ready to implement (ex. definition of ready) We have a clear idea of when work items are completed (ex. definition of done) We communicate the progress in one place that everyone can access, and keep the progress up to date (ex. sprint board and daily standups) We reflect on our work regularly to make improvements (ex. retrospectives) The team has a clear idea of the roles and responsibilities in the project (ex. Dev lead, TPM, Process Lead etc.) 
The team has a joint idea of how we work together (ex. team agreement) We value and respect the opinions and work of all team members.","title":"The Fundamentals"},{"location":"agile-development/#references","text":"What Is Scrum? Essential Scrum: A Practical Guide to The Most Popular Agile Process","title":"References"},{"location":"agile-development/backlog-management/","text":"Backlog Management Backlog Goals User stories have a clear acceptance criteria and definition of done. Design activities are planned as part of the backlog (a design for a story that needs it should be done before it is added in a Sprint). Suggestions Consider the backlog refinement as an ongoing activity, that expands outside of the typical \"Refinement meeting\". The team should decide on and have a clear understanding of a definition of ready and a definition of done . The team should have a clear understanding of what constitutes good acceptance criteria for a story/task, and decide on how stories/tasks are handled. Eg. in some projects, stories are refined as a crew, but tasks are created by individual developers on an as needed bases. Technical debt is mostly due to shortcuts made in the implementation as well as the future maintenance cost as the natural result of continuous improvement. Shortcuts should generally be avoided. In some rare instances where they happen, prioritizing and planning improvement activities to reduce this debt at a later time is the recommended approach. Resources Product Backlog Sprint Backlog Acceptance Criteria Definition of Done Definition of Ready Estimation Basics in Agile","title":"Backlog Management"},{"location":"agile-development/backlog-management/#backlog-management","text":"","title":"Backlog Management"},{"location":"agile-development/backlog-management/#backlog","text":"Goals User stories have a clear acceptance criteria and definition of done. Design activities are planned as part of the backlog (a design for a story that needs it should be done before it is added in a Sprint). Suggestions Consider the backlog refinement as an ongoing activity, that expands outside of the typical \"Refinement meeting\". The team should decide on and have a clear understanding of a definition of ready and a definition of done . The team should have a clear understanding of what constitutes good acceptance criteria for a story/task, and decide on how stories/tasks are handled. Eg. in some projects, stories are refined as a crew, but tasks are created by individual developers on an as needed bases. Technical debt is mostly due to shortcuts made in the implementation as well as the future maintenance cost as the natural result of continuous improvement. Shortcuts should generally be avoided. In some rare instances where they happen, prioritizing and planning improvement activities to reduce this debt at a later time is the recommended approach.","title":"Backlog"},{"location":"agile-development/backlog-management/#resources","text":"Product Backlog Sprint Backlog Acceptance Criteria Definition of Done Definition of Ready Estimation Basics in Agile","title":"Resources"},{"location":"agile-development/ceremonies/","text":"Agile Ceremonies Sprint Planning Goals The planning supports Diversity and Inclusion principles and provides equal opportunities. The Planning defines how the work is going to be completed in the sprint. Stories fit in a sprint and are designed and ready before the planning. 
Note: Self-assignment by team members can give a feeling of fairness in how work is split in the team. Sometimes, this ends up not being the case as it can give an advantage to the loudest or most experienced voices in the team. Individuals also tend to stay in their comfort zone, which might not be the right approach for their own growth. Sprint Goal Consider defining a sprint goal, or list of goals, for each sprint. Effective sprint goals are a concise bullet point list of items. A Sprint goal can be created first and used as an input to choose the Stories for the sprint. A sprint goal could also be created from the list of stories that were picked for the Sprint. The sprint goal can be used: At the end of each stand up meeting, to remember the north star for the Sprint and help everyone take a step back During the sprint review (\"was the goal achieved?\", \"If not, why?\") Note: A simple way to define a sprint goal is to create a User Story in each sprint backlog and name it \"Sprint XX goal\". You can add the bullet points in the description. Stories Example 1: Preparing in advance The dev lead and product owner plan time to prepare the sprint backlog ahead of sprint planning. The dev lead uses their experience (past and on the current project) and the estimation made for these stories to gauge how many should be in the sprint. The dev lead asks the entire team to look at the tentative sprint backlog in advance of the sprint planning. The dev lead assigns stories to specific developers after confirming with them that it makes sense. During the sprint planning meeting, the team reviews the sprint goal and the stories. Everyone confirms they understand the plan and feels it's reasonable. Example 2: Building during the planning meeting The product owner ensures that the highest priority items of the product backlog are refined and estimated following the team estimation process. During the Sprint planning meeting, the product owner describes each story, one by one, starting with the highest priority. For each story, the dev lead and the team confirm they understand what needs to be done and add the story to the sprint backlog. The team keeps considering more stories up to a point where they agree the sprint backlog is full. This should be informed by the estimation, past developer experience and past experience in this specific project. Stories are assigned during the planning meeting: Option 1: The dev lead makes suggestions on who could work on each story. Each engineer agrees or discusses if required. Option 2: The team reviews each story and engineers volunteer to select the ones they want to be assigned to. Note : this option might cause issues with the first core expectation. Who gets to work on what? Ultimately, it is the dev lead's responsibility to ensure each engineer gets the opportunity to work on what makes sense for their growth. Tasks Examples of approaches for task creation and assignment: Stories are split into tasks ahead of time by the dev lead and assigned before/during sprint planning to engineers. Stories are assigned to more senior engineers who are responsible for splitting into tasks. Stories are split into tasks during the Sprint planning meeting by the entire team. Note : Depending on the seniority of the team, consider splitting into tasks before sprint planning. This can help the team get out of sprint planning with all work assigned. It also increases clarity for junior engineers.
Sprint Planning Resources Definition of Ready Sprint Goal Template Planning Refinement User Stories Applied: For Software Development Estimation Goals Estimation supports the predictability of the team work and delivery. Estimation re-enforces the value of accountability to the team. The estimation process is improved over time and discussed on a regular basis. Estimation is inclusive of the different individuals in the team. Rough estimation is usually done for a generic SE 2 dev. Example 1: T-shirt Sizes The team use t-shirt sizes (S, M, L, XL) and agrees in advance which size fits a sprint. In this example: S, M fits a sprint, L, XL too big for a sprint and need to be split / refined The dev lead with support of the team roughly estimates how much S and M stories can be done in the first sprints This rough estimation is refined over time and used to as an input for future sprint planning and to adjust project end date forecasting Example 2: Single Indicator The team uses a single indicator: \"does this story fits in one sprint?\", if not, the story needs to be split The dev lead with support of the team roughly estimates how many stories can be done in the first sprints How many stories are done in each sprint on average is used as an input for future sprint planning and as an indicator to adjust project end date forecasting Example 3: Planning Poker The team does planning poker and estimates in story points Story points are roughly used to estimate how much can be done in next sprint The dev lead and the TPM uses the past sprints and observed velocity to adjust project end date forecasting Other Considerations Estimating stories using story points in smaller project does not always provide the value it would in bigger ones. Avoid converting story points or t-shirt sizes to days. Measure Estimation Accuracy Collect data to monitor estimation accuracy and sprint completion over time to drive improvements. Use the sprint goal to understand if the estimation was correct. If the sprint goal is met: does anything else matter? Scrum Practices While Scrum does not prescribe how to size work, Professional Scrum is biased away from absolute estimation (hours, function points, ideal-days, etc.) and towards relative sizing. Planning Poker Planning Poker is a collaborative technique to assign relative size. Developers may choose whatever units they want - story points and t-shirt sizes are examples of units. 'Same-Size' Product Backlog Items (PBIs) 'Same-Size' PBIs is a relative estimation approach that involves breaking items down small enough that they are roughly the same size. Velocity can be understood as a count of PBIs; this is sometimes used by teams doing continuously delivery. 'Right-Size' Product Backlog Items (PBIs) 'Right-Size' PBIs is a relative estimation approach that involves breaking things down small enough to deliver value in a certain time period (i.e. get to Done by the end of a Sprint). This is sometimes associated with teams utilizing flow for forecasting. Teams use historical data to determine if they think they can get the PBI done within the confidence level that their historical data says they typically get a PBI done. Estimation Resources The Most Important Thing You Are Missing about Estimation Retrospectives Goals Retrospectives lead to actionable items that help grow the team's engineering practices. These items are in the backlog, assigned, and prioritized to be fixed by a date agreed upon (default being next retrospective). 
Retrospectives are used to ask the hard questions (\"we usually don't finish what we plan, let's talk about this\") when necessary. Suggestions Consider other retro formats available outside of Mad Sad Glad. Gather Data: Triple Nickels, Timeline, Mad Sad Glad, Team Radar Generate Insights: 5 Whys, Fishbone, Patterns and Shifts Consider setting a retro focus area. Schedule enough time to ensure that you can have the conversation you need to get the correct plan an action and improve how you work. Bring in a neutral facilitator for project retros or retros that introspect after a difficult period. Use the following retrospectives techniques to address specific trends that might be emerging on an engagement 5 Whys If a team is confronting a problem and is unsure of the exact root cause, the 5 whys exercise taken from the business analysis sector can help get to the bottom of it. For example, if a team cannot get to Done each Sprint, that would go at the top of the whiteboard. The team then asks why that problem exists, writing that answer in the box below. Next, the team asks why again, but this time in response to the why they just identified. Continue this process until the team identifies an actual root cause, which usually becomes apparent within five steps. Processes, Tools, Individuals, Interactions and the Definition of Done This approach encourages team members to think more broadly. Ask team members to identify what is going well and ideas for improvement within the categories of processes, tools, individuals/interactions, and the Definition of Done. Then, ask team members to vote on which improvement ideas to focus on during the upcoming Sprint. Focus This retrospective technique incorporates the concept of visioning. Using this technique, you ask team members where they would like to go? Decide what the team should look like in 4 weeks, and then ask what is holding them back from that and how they can resolve the impediment. If you are focusing on specific improvements, you can use this technique for one or two Retrospectives in a row so that the team can see progress over time. Retrospective Resources Agile Retrospective: Making Good Teams Great Retrospective Sprint Demo Goals Each sprint ends with demos that illustrate the sprint goal and how it fits in the engagement goal. Suggestions Consider not pre-recording sprint demos in advance. You can record the demo meeting and archive them. A demo does not have to be about running code. It can be showing documentation that was written. Sprint Demo Resources Sprint Review/Demo Stand-Up Goals The stand-up is run efficiently. The stand-up helps the team understand what was done, what will be done and what are the blockers. The stand-up helps the team understand if they will meet the sprint goal or not. Suggestions Keep stand up short and efficient. Table the longer conversations for a parking lot section, or for a conversation that will be planned later. Run daily stand ups: 15 minutes of stand up and 15 minutes of parking lot. If someone cannot make the stand-up exceptionally: Ask them to do a written stand up in advance. Stand ups should include everyone involved in the project, including the customer. Projects with widely divergent time zones should be avoided if possible, but if you are on one, you should adapt the standups to meet the needs and time constraints of all team members. 
Stand-Up Resources Stand-Up/Daily Scrum","title":"Agile Ceremonies"},{"location":"agile-development/ceremonies/#agile-ceremonies","text":"","title":"Agile Ceremonies"},{"location":"agile-development/ceremonies/#sprint-planning","text":"Goals The planning supports Diversity and Inclusion principles and provides equal opportunities. The Planning defines how the work is going to be completed in the sprint. Stories fit in a sprint and are designed and ready before the planning. Note: Self assignment by team members can give a feeling of fairness in how work is split in the team. Sometime, this ends up not being the case as it can give an advantage to the loudest or more experienced voices in the team. Individuals also tend to stay in their comfort zone, which might not be the right approach for their own growth.*","title":"Sprint Planning"},{"location":"agile-development/ceremonies/#sprint-goal","text":"Consider defining a sprint goal, or list of goals for each sprint. Effective sprint goals are a concise bullet point list of items. A Sprint goal can be created first and used as an input to choose the Stories for the sprint. A sprint goal could also be created from the list of stories that were picked for the Sprint. The sprint goal can be used: At the end of each stand up meeting, to remember the north star for the Sprint and help everyone taking a step back During the sprint review (\"was the goal achieved?\", \"If not, why?\") Note: A simple way to define a sprint goal, is to create a User Story in each sprint backlog and name it \"Sprint XX goal\". You can add the bullet points in the description.*","title":"Sprint Goal"},{"location":"agile-development/ceremonies/#stories","text":"Example 1: Preparing in advance The dev lead and product owner plan time to prepare the sprint backlog ahead of sprint planning. The dev lead uses their experience (past and on the current project) and the estimation made for these stories to gauge how many should be in the sprint. The dev lead asks the entire team to look at the tentative sprint backlog in advance of the sprint planning. The dev lead assigns stories to specific developers after confirming with them that it makes sense During the sprint planning meeting, the team reviews the sprint goal and the stories. Everyone confirm they understand the plan and feel it's reasonable. Example 2: Building during the planning meeting The product owner ensures that the highest priority items of the product backlog is refined and estimated following the team estimation process. During the Sprint planning meeting, the product owner describe each stories, one by one, starting by highest priority. For each story, the dev lead and the team confirm they understand what needs to be done and add the story to the sprint backlog. The team keeps considering more stories up to a point where they agree the sprint backlog is full. This should be informed by the estimation, past developer experience and past experience in this specific project. Stories are assigned during the planning meeting: Option 1: The dev lead makes suggestion on who could work on each stories. Each engineer agrees or discuss if required. Option 2: The team review each story and engineer volunteer select the one they want to be assigned to. Note : this option might cause issues with the first core expectations. Who gets to work on what? 
Ultimately, it is the dev lead's responsibility to ensure each engineer gets the opportunity to work on what makes sense for their growth.","title":"Stories"},{"location":"agile-development/ceremonies/#tasks","text":"Examples of approaches for task creation and assignment: Stories are split into tasks ahead of time by the dev lead and assigned before/during sprint planning to engineers. Stories are assigned to more senior engineers who are responsible for splitting into tasks. Stories are split into tasks during the Sprint planning meeting by the entire team. Note : Depending on the seniority of the team, consider splitting into tasks before sprint planning. This can help the team get out of sprint planning with all work assigned. It also increases clarity for junior engineers.","title":"Tasks"},{"location":"agile-development/ceremonies/#sprint-planning-resources","text":"Definition of Ready Sprint Goal Template Planning Refinement User Stories Applied: For Software Development","title":"Sprint Planning Resources"},{"location":"agile-development/ceremonies/#estimation","text":"Goals Estimation supports the predictability of the team's work and delivery. Estimation reinforces the value of accountability to the team. The estimation process is improved over time and discussed on a regular basis. Estimation is inclusive of the different individuals in the team. Rough estimation is usually done for a generic SE 2 dev.","title":"Estimation"},{"location":"agile-development/ceremonies/#example-1-t-shirt-sizes","text":"The team uses t-shirt sizes (S, M, L, XL) and agrees in advance which size fits a sprint. In this example: S, M fit in a sprint; L, XL are too big for a sprint and need to be split / refined The dev lead with support of the team roughly estimates how many S and M stories can be done in the first sprints This rough estimation is refined over time and used as an input for future sprint planning and to adjust project end date forecasting","title":"Example 1: T-shirt Sizes"},{"location":"agile-development/ceremonies/#example-2-single-indicator","text":"The team uses a single indicator: \"does this story fit in one sprint?\", if not, the story needs to be split The dev lead with support of the team roughly estimates how many stories can be done in the first sprints How many stories are done in each sprint on average is used as an input for future sprint planning and as an indicator to adjust project end date forecasting","title":"Example 2: Single Indicator"},{"location":"agile-development/ceremonies/#example-3-planning-poker","text":"The team does planning poker and estimates in story points Story points are roughly used to estimate how much can be done in the next sprint The dev lead and the TPM use the past sprints and observed velocity to adjust project end date forecasting","title":"Example 3: Planning Poker"},{"location":"agile-development/ceremonies/#other-considerations","text":"Estimating stories using story points in smaller projects does not always provide the value it would in bigger ones. Avoid converting story points or t-shirt sizes to days.","title":"Other Considerations"},{"location":"agile-development/ceremonies/#measure-estimation-accuracy","text":"Collect data to monitor estimation accuracy and sprint completion over time to drive improvements. Use the sprint goal to understand if the estimation was correct.
If the sprint goal is met: does anything else matter?","title":"Measure Estimation Accuracy"},{"location":"agile-development/ceremonies/#scrum-practices","text":"While Scrum does not prescribe how to size work, Professional Scrum is biased away from absolute estimation (hours, function points, ideal-days, etc.) and towards relative sizing. Planning Poker Planning Poker is a collaborative technique to assign relative size. Developers may choose whatever units they want - story points and t-shirt sizes are examples of units. 'Same-Size' Product Backlog Items (PBIs) 'Same-Size' PBIs is a relative estimation approach that involves breaking items down small enough that they are roughly the same size. Velocity can be understood as a count of PBIs; this is sometimes used by teams doing continuously delivery. 'Right-Size' Product Backlog Items (PBIs) 'Right-Size' PBIs is a relative estimation approach that involves breaking things down small enough to deliver value in a certain time period (i.e. get to Done by the end of a Sprint). This is sometimes associated with teams utilizing flow for forecasting. Teams use historical data to determine if they think they can get the PBI done within the confidence level that their historical data says they typically get a PBI done.","title":"Scrum Practices"},{"location":"agile-development/ceremonies/#estimation-resources","text":"The Most Important Thing You Are Missing about Estimation","title":"Estimation Resources"},{"location":"agile-development/ceremonies/#retrospectives","text":"Goals Retrospectives lead to actionable items that help grow the team's engineering practices. These items are in the backlog, assigned, and prioritized to be fixed by a date agreed upon (default being next retrospective). Retrospectives are used to ask the hard questions (\"we usually don't finish what we plan, let's talk about this\") when necessary. Suggestions Consider other retro formats available outside of Mad Sad Glad. Gather Data: Triple Nickels, Timeline, Mad Sad Glad, Team Radar Generate Insights: 5 Whys, Fishbone, Patterns and Shifts Consider setting a retro focus area. Schedule enough time to ensure that you can have the conversation you need to get the correct plan an action and improve how you work. Bring in a neutral facilitator for project retros or retros that introspect after a difficult period. Use the following retrospectives techniques to address specific trends that might be emerging on an engagement","title":"Retrospectives"},{"location":"agile-development/ceremonies/#5-whys","text":"If a team is confronting a problem and is unsure of the exact root cause, the 5 whys exercise taken from the business analysis sector can help get to the bottom of it. For example, if a team cannot get to Done each Sprint, that would go at the top of the whiteboard. The team then asks why that problem exists, writing that answer in the box below. Next, the team asks why again, but this time in response to the why they just identified. Continue this process until the team identifies an actual root cause, which usually becomes apparent within five steps.","title":"5 Whys"},{"location":"agile-development/ceremonies/#processes-tools-individuals-interactions-and-the-definition-of-done","text":"This approach encourages team members to think more broadly. Ask team members to identify what is going well and ideas for improvement within the categories of processes, tools, individuals/interactions, and the Definition of Done. 
Then, ask team members to vote on which improvement ideas to focus on during the upcoming Sprint.","title":"Processes, Tools, Individuals, Interactions and the Definition of Done"},{"location":"agile-development/ceremonies/#focus","text":"This retrospective technique incorporates the concept of visioning. Using this technique, you ask team members where they would like to go? Decide what the team should look like in 4 weeks, and then ask what is holding them back from that and how they can resolve the impediment. If you are focusing on specific improvements, you can use this technique for one or two Retrospectives in a row so that the team can see progress over time.","title":"Focus"},{"location":"agile-development/ceremonies/#retrospective-resources","text":"Agile Retrospective: Making Good Teams Great Retrospective","title":"Retrospective Resources"},{"location":"agile-development/ceremonies/#sprint-demo","text":"Goals Each sprint ends with demos that illustrate the sprint goal and how it fits in the engagement goal. Suggestions Consider not pre-recording sprint demos in advance. You can record the demo meeting and archive them. A demo does not have to be about running code. It can be showing documentation that was written.","title":"Sprint Demo"},{"location":"agile-development/ceremonies/#sprint-demo-resources","text":"Sprint Review/Demo","title":"Sprint Demo Resources"},{"location":"agile-development/ceremonies/#stand-up","text":"Goals The stand-up is run efficiently. The stand-up helps the team understand what was done, what will be done and what are the blockers. The stand-up helps the team understand if they will meet the sprint goal or not. Suggestions Keep stand up short and efficient. Table the longer conversations for a parking lot section, or for a conversation that will be planned later. Run daily stand ups: 15 minutes of stand up and 15 minutes of parking lot. If someone cannot make the stand-up exceptionally: Ask them to do a written stand up in advance. Stand ups should include everyone involved in the project, including the customer. Projects with widely divergent time zones should be avoided if possible, but if you are on one, you should adapt the standups to meet the needs and time constraints of all team members.","title":"Stand-Up"},{"location":"agile-development/ceremonies/#stand-up-resources","text":"Stand-Up/Daily Scrum","title":"Stand-Up Resources"},{"location":"agile-development/roles/","text":"Agile/Scrum Roles We prefer using \"process lead\" over \"scrum master\". It describes the same role. This section has links directing you to definitions for the traditional roles within Agile/Scrum. After reading through the best practices you should have a basic understanding of the key Agile roles in terms of what they are and the expectations for the role. Product Owner Scrum Master Development Team","title":"Agile/Scrum Roles"},{"location":"agile-development/roles/#agilescrum-roles","text":"We prefer using \"process lead\" over \"scrum master\". It describes the same role. This section has links directing you to definitions for the traditional roles within Agile/Scrum. After reading through the best practices you should have a basic understanding of the key Agile roles in terms of what they are and the expectations for the role. 
Product Owner Scrum Master Development Team","title":"Agile/Scrum Roles"},{"location":"agile-development/advanced-topics/backlog-management/external-feedback/","text":"External Feedback Various stakeholders can provide feedback to the working product during a project, beyond any formal review and feedback sessions required by the organization. The frequency and method of collecting feedback through reviews varies depending on the case, but a couple of good practices are: Capture each review in the backlog as a separate user story. Standardize the tasks that implement this user story. Plan for a review user story per Epic / Feature in your backlog proactively.","title":"External Feedback"},{"location":"agile-development/advanced-topics/backlog-management/external-feedback/#external-feedback","text":"Various stakeholders can provide feedback to the working product during a project, beyond any formal review and feedback sessions required by the organization. The frequency and method of collecting feedback through reviews varies depending on the case, but a couple of good practices are: Capture each review in the backlog as a separate user story. Standardize the tasks that implement this user story. Plan for a review user story per Epic / Feature in your backlog proactively.","title":"External Feedback"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/","text":"Minimal Slices Always Deliver Your Work Using Minimal Valuable Slices Split your work item into small chunks that are contributed in incremental commits. Contribute your chunks frequently. Follow an iterative approach by regularly providing updates and changes to the team. This allows for instant feedback and early issue discovery and ensures you are developing in the right direction, both technically and functionally. Do NOT work independently on your task without providing any updates to your team. Example Imagine you are working on adding UWP (Universal Windows Platform) application building functionality for existing continuous integration service which already has Android/iOS support. Bad Approach After six weeks of work you created PR with all required functionality, including portal UI (build settings), backend REST API (UWP build functionality), telemetry, unit and integration tests, etc. Good Approach You divided your feature into smaller user stories (which in turn were divided into multiple tasks) and started working on them one by one: As a user I can successfully build UWP apps using current service As a user I can see telemetry when building the apps As a user I have the ability to select build configuration (debug, release) As a user I have the ability to select target platform (arm, x86, x64) ... You also divided your stories into smaller tasks and sent PRs based on those tasks. E.g. you have the following tasks for the first user story above: Enable UWP platform on backend Add build button to the UI (build first solution file found) Add select solution file dropdown to the UI Implement unit tests Implement integration tests to verify build succeeded Update documentation ... Resources Minimalism Rules","title":"Minimal Slices"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#minimal-slices","text":"","title":"Minimal Slices"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#always-deliver-your-work-using-minimal-valuable-slices","text":"Split your work item into small chunks that are contributed in incremental commits. 
Contribute your chunks frequently. Follow an iterative approach by regularly providing updates and changes to the team. This allows for instant feedback and early issue discovery and ensures you are developing in the right direction, both technically and functionally. Do NOT work independently on your task without providing any updates to your team.","title":"Always Deliver Your Work Using Minimal Valuable Slices"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#example","text":"Imagine you are working on adding UWP (Universal Windows Platform) application building functionality for existing continuous integration service which already has Android/iOS support.","title":"Example"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#bad-approach","text":"After six weeks of work you created PR with all required functionality, including portal UI (build settings), backend REST API (UWP build functionality), telemetry, unit and integration tests, etc.","title":"Bad Approach"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#good-approach","text":"You divided your feature into smaller user stories (which in turn were divided into multiple tasks) and started working on them one by one: As a user I can successfully build UWP apps using current service As a user I can see telemetry when building the apps As a user I have the ability to select build configuration (debug, release) As a user I have the ability to select target platform (arm, x86, x64) ... You also divided your stories into smaller tasks and sent PRs based on those tasks. E.g. you have the following tasks for the first user story above: Enable UWP platform on backend Add build button to the UI (build first solution file found) Add select solution file dropdown to the UI Implement unit tests Implement integration tests to verify build succeeded Update documentation ...","title":"Good Approach"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#resources","text":"Minimalism Rules","title":"Resources"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/","text":"Risk Management Agile methodologies are conceived to be driven by risk management principles, but no methodology can eliminate all risks. Goal Anticipation is a key aspect of software project management, involving the proactive identification and assessment of potential risks and challenges to enable effective planning and mitigation strategies. The following guidance aims to provide decision-makers with the information needed to make informed choices, understanding trade-offs, costs, and project timelines throughout the project. General Guidance Identify risks in every activity such as a planning meetings, design and code reviews, or daily standups. All team members are responsible for identifying relevant risks. Assess risks in terms of their likelihood and potential impact on the project. Use the issues to report and track risks. Issues represent unplanned activities. Prioritize them based on their severity and likelihood, focusing on addressing the most critical ones first. Mitigate or reduce the impact and likelihood of the risks. Monitor continuously to ensure the effectiveness of the mitigation strategies. Prepare contingency plans for high-impact risks that may still materialize. Communicate and report risks to keep all stakeholders informed. 
Opportunity Management The same process can be applied to opportunities, but while risk management involves applying mitigation actions to decrease the likelihood of a risk, in opportunity management, you enhance actions to increase the likelihood of a positive outcome.","title":"Risk Management"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#risk-management","text":"Agile methodologies are conceived to be driven by risk management principles, but no methodology can eliminate all risks.","title":"Risk Management"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#goal","text":"Anticipation is a key aspect of software project management, involving the proactive identification and assessment of potential risks and challenges to enable effective planning and mitigation strategies. The following guidance aims to provide decision-makers with the information needed to make informed choices, understanding trade-offs, costs, and project timelines throughout the project.","title":"Goal"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#general-guidance","text":"Identify risks in every activity such as a planning meetings, design and code reviews, or daily standups. All team members are responsible for identifying relevant risks. Assess risks in terms of their likelihood and potential impact on the project. Use the issues to report and track risks. Issues represent unplanned activities. Prioritize them based on their severity and likelihood, focusing on addressing the most critical ones first. Mitigate or reduce the impact and likelihood of the risks. Monitor continuously to ensure the effectiveness of the mitigation strategies. Prepare contingency plans for high-impact risks that may still materialize. Communicate and report risks to keep all stakeholders informed.","title":"General Guidance"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#opportunity-management","text":"The same process can be applied to opportunities, but while risk management involves applying mitigation actions to decrease the likelihood of a risk, in opportunity management, you enhance actions to increase the likelihood of a positive outcome.","title":"Opportunity Management"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/","text":"How to Add a Pairing Custom Field in Azure DevOps User Stories This document outlines the benefits of adding a custom field of type Identity in Azure DevOps user stories, prerequisites, and a step-by-step guide. Benefits of Adding a Custom Field Having the names of both individuals pairing on a story visible on the Azure DevOps cards can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. For example, it is easier to keep track of the individuals assigned stories as part of a pair during sprint planning by using the \"pairing names\" field. During stand-up it can also help the Process Lead filter stories assigned to the individual (both as an owner or as a pairing assignee) and show these on the board. Furthermore, the pairing field can provide an additional data point for reports and burndown rates. Prerequisites Prior to customizing Azure DevOps, review Configure and customize Azure Boards . In order to add a custom field to user stories in Azure DevOps changes must be made as an Organization setting . 
This document therefore assumes use of an existing Organization in Azure DevOps and that the user account used to make these changes is a member of the Project Collection Administrators Group . Change the Organization Settings Duplicate the process currently in use. Navigate to the Organization Settings , within the Boards / Process tab. Select the Process type, click on the icon with three dots ... and click Create inherited process . Click on the newly created inherited process. As you can see in the example below, we called it 'Pairing'. Click on the work item type User Story . Click New Field . Give it a Name and select Identity in Type. Click on Add Field . This completes the change in Organization settings. The rest of the instructions must be completed under Project Settings. Change the Project Settings Go to the Project that is to be modified, select Project Settings . Select Project configuration . Click on process customization page . Click on Projects then click on Change process . Change the target process to Pairing then click Save. Go to Boards . Click on the Gear icon to open Settings. Add field to card. Click on the green + icon to add select the Pairing field. Check the box to display fields, even when they are empty. Save and close . View the modified the card. Notice the new Pairing field. The Story can now be assigned an Owner and a Pairing assignee!","title":"How to Add a Pairing Custom Field in Azure DevOps User Stories"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#how-to-add-a-pairing-custom-field-in-azure-devops-user-stories","text":"This document outlines the benefits of adding a custom field of type Identity in Azure DevOps user stories, prerequisites, and a step-by-step guide.","title":"How to Add a Pairing Custom Field in Azure DevOps User Stories"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#benefits-of-adding-a-custom-field","text":"Having the names of both individuals pairing on a story visible on the Azure DevOps cards can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. For example, it is easier to keep track of the individuals assigned stories as part of a pair during sprint planning by using the \"pairing names\" field. During stand-up it can also help the Process Lead filter stories assigned to the individual (both as an owner or as a pairing assignee) and show these on the board. Furthermore, the pairing field can provide an additional data point for reports and burndown rates.","title":"Benefits of Adding a Custom Field"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#prerequisites","text":"Prior to customizing Azure DevOps, review Configure and customize Azure Boards . In order to add a custom field to user stories in Azure DevOps changes must be made as an Organization setting . This document therefore assumes use of an existing Organization in Azure DevOps and that the user account used to make these changes is a member of the Project Collection Administrators Group .","title":"Prerequisites"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#change-the-organization-settings","text":"Duplicate the process currently in use. Navigate to the Organization Settings , within the Boards / Process tab. Select the Process type, click on the icon with three dots ... and click Create inherited process . 
Click on the newly created inherited process. As you can see in the example below, we called it 'Pairing'. Click on the work item type User Story . Click New Field . Give it a Name and select Identity in Type. Click on Add Field . This completes the change in Organization settings. The rest of the instructions must be completed under Project Settings.","title":"Change the Organization Settings"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#change-the-project-settings","text":"Go to the Project that is to be modified, select Project Settings . Select Project configuration . Click on process customization page . Click on Projects then click on Change process . Change the target process to Pairing then click Save. Go to Boards . Click on the Gear icon to open Settings. Add field to card. Click on the green + icon to add select the Pairing field. Check the box to display fields, even when they are empty. Save and close . View the modified the card. Notice the new Pairing field. The Story can now be assigned an Owner and a Pairing assignee!","title":"Change the Project Settings"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/","text":"Effortless Pair Programming with GitHub Codespaces and VSCode Pair programming used to be a software development technique in which two programmers work together on a single computer, sharing one keyboard and mouse, to jointly design, code, test, and debug software. It is one of the patterns explored in the section why collaboration? of this playbook, however with teams that work mostly remotely, sharing a physical computer became a challenge, but opened the door to a more efficient approach of pair programming. Through the effective utilization of a range of tools and techniques, we have successfully implemented both pair and swarm programming methodologies. As such, we are eager to share some of the valuable insights and knowledge gained from this experience. How to Make Pair Programming a Painless Experience? Working Sessions In order to enhance pair programming capabilities, you can create regular working sessions that are open to all team members. This facilitates smooth and efficient collaboration as everyone can simply join in and work together before branching off into smaller groups. This approach has proven particularly beneficial for new team members who may otherwise feel overwhelmed by a large codebase. It emulates the concept of the \" humble water cooler ,\" which fosters a sense of connectedness among team members through their shared work. Additionally, scheduling these working sessions in advance ensures intentional collaboration and provides clarity on user story responsibilities. To this end, assign a single person to each user story to ensure clear ownership and eliminate ambiguity. By doing so, this could eliminate the common problem of engineers being hesitant to modify code outside of their assigned tasks due to the sentiment of lack of ownership. These working sessions are instrumental in promoting a cohesive team dynamic, allowing for effective knowledge sharing and collective problem-solving. GitHub Codespaces GitHub Codespaces is a vital component in an efficient development environment, particularly in the context of pair programming. Prioritize setting up a Codespace as the initial step of the project, preceding tasks such as local machine project compilation or VSCode plugin installation. 
To this end, make sure to update the Codespace documentation before incorporating any quick start instructions for local environments. Additionally, consistently demonstrate demos in codespaces environment to ensure its prominent integration into our workflow. With its cloud-based infrastructure, GitHub Codespaces presents a highly efficient and simplified approach to real-time collaborative coding. As a result, new team members can easily access the GitHub project and begin coding within seconds, without requiring installation on their local machines. This seamless, integrated solution for pair programming offers a streamlined workflow, allowing you to direct your attention towards producing exemplary code, free from the distractions of cumbersome setup processes. VSCode Live Share VSCode Live Share is specifically designed for pair programming and enables you to work on the same codebase, in real-time, with your team members. The arduous process of configuring complex setups, grappling with confusing configurations, straining one's eyes to work on small screens, or physically switching keyboards is not a problem with LiveShare. This innovative solution enables seamless sharing of your development environment with your team members, facilitating smooth collaborative coding experiences. Fully integrated into Visual Studio Code and Visual Studio, LiveShare offers the added benefit of terminal sharing, debug session collaboration, and host machine control. When paired with GitHub Codespaces, it presents a potent tool set for effective pair programming. Tip: Share VSCode extensions (including Live Share) using a base devcontainer.json . This ensure all team members have available the same set of extensions, and allow them to focus in solving the business needs from day one. Resources GitHub Codespaces . VSCode Live Share . Create a Dev Container . How companies have optimized the humble office water cooler .","title":"Effortless Pair Programming with GitHub Codespaces and VSCode"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#effortless-pair-programming-with-github-codespaces-and-vscode","text":"Pair programming used to be a software development technique in which two programmers work together on a single computer, sharing one keyboard and mouse, to jointly design, code, test, and debug software. It is one of the patterns explored in the section why collaboration? of this playbook, however with teams that work mostly remotely, sharing a physical computer became a challenge, but opened the door to a more efficient approach of pair programming. Through the effective utilization of a range of tools and techniques, we have successfully implemented both pair and swarm programming methodologies. As such, we are eager to share some of the valuable insights and knowledge gained from this experience.","title":"Effortless Pair Programming with GitHub Codespaces and VSCode"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#how-to-make-pair-programming-a-painless-experience","text":"","title":"How to Make Pair Programming a Painless Experience?"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#working-sessions","text":"In order to enhance pair programming capabilities, you can create regular working sessions that are open to all team members. This facilitates smooth and efficient collaboration as everyone can simply join in and work together before branching off into smaller groups. 
This approach has proven particularly beneficial for new team members who may otherwise feel overwhelmed by a large codebase. It emulates the concept of the \" humble water cooler ,\" which fosters a sense of connectedness among team members through their shared work. Additionally, scheduling these working sessions in advance ensures intentional collaboration and provides clarity on user story responsibilities. To this end, assign a single person to each user story to ensure clear ownership and eliminate ambiguity. By doing so, this could eliminate the common problem of engineers being hesitant to modify code outside of their assigned tasks due to the sentiment of lack of ownership. These working sessions are instrumental in promoting a cohesive team dynamic, allowing for effective knowledge sharing and collective problem-solving.","title":"Working Sessions"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#github-codespaces","text":"GitHub Codespaces is a vital component in an efficient development environment, particularly in the context of pair programming. Prioritize setting up a Codespace as the initial step of the project, preceding tasks such as local machine project compilation or VSCode plugin installation. To this end, make sure to update the Codespace documentation before incorporating any quick start instructions for local environments. Additionally, consistently run demos in the Codespaces environment to ensure its prominent integration into our workflow. With its cloud-based infrastructure, GitHub Codespaces presents a highly efficient and simplified approach to real-time collaborative coding. As a result, new team members can easily access the GitHub project and begin coding within seconds, without requiring installation on their local machines. This seamless, integrated solution for pair programming offers a streamlined workflow, allowing you to direct your attention towards producing exemplary code, free from the distractions of cumbersome setup processes.","title":"GitHub Codespaces"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#vscode-live-share","text":"VSCode Live Share is specifically designed for pair programming and enables you to work on the same codebase, in real-time, with your team members. The arduous process of configuring complex setups, grappling with confusing configurations, straining one's eyes to work on small screens, or physically switching keyboards is not a problem with LiveShare. This innovative solution enables seamless sharing of your development environment with your team members, facilitating smooth collaborative coding experiences. Fully integrated into Visual Studio Code and Visual Studio, LiveShare offers the added benefit of terminal sharing, debug session collaboration, and host machine control. When paired with GitHub Codespaces, it presents a potent tool set for effective pair programming. Tip: Share VSCode extensions (including Live Share) using a base devcontainer.json . This ensures all team members have the same set of extensions available, and allows them to focus on solving the business needs from day one.","title":"VSCode Live Share"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#resources","text":"GitHub Codespaces . VSCode Live Share . Create a Dev Container .
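To make the devcontainer.json tip from the Live Share section concrete, here is a minimal sketch; the container name, base image, and ESLint entry are illustrative assumptions, while ms-vsliveshare.vsliveshare is the Live Share extension identifier:

```jsonc
{
  // Base dev container shared by the whole crew (Codespaces uses this too).
  "name": "team-base-devcontainer",
  "image": "mcr.microsoft.com/devcontainers/typescript-node",
  "customizations": {
    "vscode": {
      // Extensions installed for every team member, including Live Share.
      "extensions": [
        "ms-vsliveshare.vsliveshare",
        "dbaeumer.vscode-eslint"
      ]
    }
  }
}
```

Committing a file like this under .devcontainer/ means a new Codespace, or a locally opened dev container, starts with the same editor setup for everyone on the team.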
How companies have optimized the humble office water cooler .","title":"Resources"},{"location":"agile-development/advanced-topics/collaboration/social-question/","text":"Social Question of the Day The social question of the day is an optional short question to follow the three project questions in the daily stand-up. It develops team cohesion and interpersonal trust over the course of an engagement by facilitating the sharing of personal preferences, lifestyle, or other context. The social question should be chosen before the stand-up. The facilitator should select the question either independently or from the team's asynchronous suggestions. This minimizes delays at the start of the stand-up. Tip: having the stand-up facilitator role rotate each sprint lets the facilitator choose the social question independently without burdening any one team member. Properties of a Good Question A good question has a brief answer with small optional elaboration. A yes or no answer doesn't tell you very much about someone, while knowing that their favorite fruit is a durian is informative. Good questions are low in consequence but allow controversy. Watching someone strongly exclaim that salmon and lox on cinnamon-raisin is the best bagel order is endearing. As a corollary, a good question is one someone is likely to be passionate about. You know a little more about a team member's personality if their eyes light up when describing their favorite karaoke song. Starter List of Questions Potentially good questions include: What's your Starbucks order? What's your favorite operating system? What's your favorite version of Windows? What's your favorite plant, houseplant or otherwise? What's your favorite fruit? What's your favorite fast food? What's your favorite noodle? What's your favorite text editor? Mountains or beach? DC or Marvel? Coffee with one person from history: who? What's your silliest online purchase? What's your alternate career? What's the best bagel topping? What's your guilty TV pleasure? What's your go-to karaoke song? Would you rather see the past or the future? Would you rather be able to teleport or to fly? Would you rather live underwater or in space for a year? What's your favorite phone app? What's your favorite fish, to eat or otherwise? What was your best costume? Who is someone you admire (from history, from your personal life, etc.)? Give one reason why. What's the best compliment you've ever received? What's your favorite or most used emoji right now? What was your biggest DIY project? What's a spice that you use on everything? What's your top Spotify (or just your favorite) genre/artist for this year? What was your first computer? What's your favorite kind of taco? What's your favorite decade? What's the best way to eat potatoes? What was your best vacation (stay-cations acceptable)? Favorite cartoon? Pick someone in your family and tell us something awesome about them. What was your longest road trip? What thing do you remember learning when you were young that is taught differently now? What was your favorite toy as a child?","title":"Social Question of the Day"},{"location":"agile-development/advanced-topics/collaboration/social-question/#social-question-of-the-day","text":"The social question of the day is an optional short question to follow the three project questions in the daily stand-up. It develops team cohesion and interpersonal trust over the course of an engagement by facilitating the sharing of personal preferences, lifestyle, or other context. 
The social question should be chosen before the stand-up. The facilitator should select the question either independently or from the team's asynchronous suggestions. This minimizes delays at the start of the stand-up. Tip: having the stand-up facilitator role rotate each sprint lets the facilitator choose the social question independently without burdening any one team member.","title":"Social Question of the Day"},{"location":"agile-development/advanced-topics/collaboration/social-question/#properties-of-a-good-question","text":"A good question has a brief answer with small optional elaboration. A yes or no answer doesn't tell you very much about someone, while knowing that their favorite fruit is a durian is informative. Good questions are low in consequence but allow controversy. Watching someone strongly exclaim that salmon and lox on cinnamon-raisin is the best bagel order is endearing. As a corollary, a good question is one someone is likely to be passionate about. You know a little more about a team member's personality if their eyes light up when describing their favorite karaoke song.","title":"Properties of a Good Question"},{"location":"agile-development/advanced-topics/collaboration/social-question/#starter-list-of-questions","text":"Potentially good questions include: What's your Starbucks order? What's your favorite operating system? What's your favorite version of Windows? What's your favorite plant, houseplant or otherwise? What's your favorite fruit? What's your favorite fast food? What's your favorite noodle? What's your favorite text editor? Mountains or beach? DC or Marvel? Coffee with one person from history: who? What's your silliest online purchase? What's your alternate career? What's the best bagel topping? What's your guilty TV pleasure? What's your go-to karaoke song? Would you rather see the past or the future? Would you rather be able to teleport or to fly? Would you rather live underwater or in space for a year? What's your favorite phone app? What's your favorite fish, to eat or otherwise? What was your best costume? Who is someone you admire (from history, from your personal life, etc.)? Give one reason why. What's the best compliment you've ever received? What's your favorite or most used emoji right now? What was your biggest DIY project? What's a spice that you use on everything? What's your top Spotify (or just your favorite) genre/artist for this year? What was your first computer? What's your favorite kind of taco? What's your favorite decade? What's the best way to eat potatoes? What was your best vacation (stay-cations acceptable)? Favorite cartoon? Pick someone in your family and tell us something awesome about them. What was your longest road trip? What thing do you remember learning when you were young that is taught differently now? What was your favorite toy as a child?","title":"Starter List of Questions"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/","text":"Engagement Team Development In every ISE engagement, dynamics are different so are the team requirements. Based on transfer learning among teams, we aim to build right \"code-with\" environments in every team. This documentation gives a high-level template with some suggestions by aiming to accelerate team swarming phase to achieve a high speed agility however it has no intention to provide a list of \"must-do\" items. Identification As it's stated in Tuckman's team phases , traditional team development has several stages. 
However those phases can be extremely fast or sometimes mismatched in teams due to external factors, what applies to ISE engagements. In order to minimize the risk and set the expectations on the right way for all parties, an identification phase is important to understand each other. Some potential steps in this phase may be as following (not limited): Working agreement Identification of styles/preferences in communication, sharing, learning, decision making of each team member Talking about necessity of pair programming Decisions on backlog management & refinement meetings, weekly design sessions, social time sessions...etc. Sync/Async communication methods, work hours/flexible times Decisions and identifications of charts that will be helpful to provide transparent and true information to everyone Identification of \"Software Craftspersonship\" areas which means the tools and methods will be widely used during the engagement and taking the required actions on team upskilling side if necessary. GitHub, VSCode LiveShare, AzDevOps, necessary development tools & libraries ... more. If upskilling on certain topic(s) is needed, identifying the areas and arranging code spikes for increasing the team knowledge on the regarding topic(s). Identification of communication channels, feedback loops and recurrent team call slots out of regular sprint meetings Introduction to Technical Agility Team Manifesto and planning the technical delivery by aiming to keep technical debt risk minimum. Following the Plan and Agile Debugging Identification phase accelerates the process of building a safe environment for every individual in the team, later on team has the required assets to follow the plan. And it is team's itself responsibility (engineers,PO,Process Lead) to debug their Agility level. In every team stabilization takes time and pro-active agile debugging is the best accelerator to decrease the distraction away from sprint/engagement goal. Team is also responsible to keep the plan up-to-date based on team changes/needs and debugging results. Just as an example, agility debugging activities may include: Dashboards related with \"Goal\" such as burndown/burnout, Item/PR Aging, Mood Chart ..etc. are accessible to the team and team is always up-to-date Backlog Refinement meetings Size of stories (Too big? Too small?) Are \"User Stories\" and \"Tasks\" clear ? Are Acceptance Criteria enough and right? Is everyone ready-to-go after taking the User Story/Task? Running efficient retrospectives Is the Sprint Goal clear in every iteration ? Is the estimation process in the team improving over time or does it meet the delivery/workload prediction? Kindly check Scrum Values to have a better understanding to improve team commitment. Following that, above suggestions aim to remove agile/team disfunctionalities and provide a broader team understanding, potential time savings and full transparency. Resources Tuckman's Stages of Group Development Scrum Values","title":"Engagement Team Development"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#engagement-team-development","text":"In every ISE engagement, dynamics are different so are the team requirements. Based on transfer learning among teams, we aim to build right \"code-with\" environments in every team. 
This documentation gives a high-level template with some suggestions by aiming to accelerate team swarming phase to achieve a high speed agility however it has no intention to provide a list of \"must-do\" items.","title":"Engagement Team Development"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#identification","text":"As it's stated in Tuckman's team phases , traditional team development has several stages. However those phases can be extremely fast or sometimes mismatched in teams due to external factors, what applies to ISE engagements. In order to minimize the risk and set the expectations on the right way for all parties, an identification phase is important to understand each other. Some potential steps in this phase may be as following (not limited): Working agreement Identification of styles/preferences in communication, sharing, learning, decision making of each team member Talking about necessity of pair programming Decisions on backlog management & refinement meetings, weekly design sessions, social time sessions...etc. Sync/Async communication methods, work hours/flexible times Decisions and identifications of charts that will be helpful to provide transparent and true information to everyone Identification of \"Software Craftspersonship\" areas which means the tools and methods will be widely used during the engagement and taking the required actions on team upskilling side if necessary. GitHub, VSCode LiveShare, AzDevOps, necessary development tools & libraries ... more. If upskilling on certain topic(s) is needed, identifying the areas and arranging code spikes for increasing the team knowledge on the regarding topic(s). Identification of communication channels, feedback loops and recurrent team call slots out of regular sprint meetings Introduction to Technical Agility Team Manifesto and planning the technical delivery by aiming to keep technical debt risk minimum.","title":"Identification"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#following-the-plan-and-agile-debugging","text":"Identification phase accelerates the process of building a safe environment for every individual in the team, later on team has the required assets to follow the plan. And it is team's itself responsibility (engineers,PO,Process Lead) to debug their Agility level. In every team stabilization takes time and pro-active agile debugging is the best accelerator to decrease the distraction away from sprint/engagement goal. Team is also responsible to keep the plan up-to-date based on team changes/needs and debugging results. Just as an example, agility debugging activities may include: Dashboards related with \"Goal\" such as burndown/burnout, Item/PR Aging, Mood Chart ..etc. are accessible to the team and team is always up-to-date Backlog Refinement meetings Size of stories (Too big? Too small?) Are \"User Stories\" and \"Tasks\" clear ? Are Acceptance Criteria enough and right? Is everyone ready-to-go after taking the User Story/Task? Running efficient retrospectives Is the Sprint Goal clear in every iteration ? Is the estimation process in the team improving over time or does it meet the delivery/workload prediction? Kindly check Scrum Values to have a better understanding to improve team commitment. 
Following that, above suggestions aim to remove agile/team disfunctionalities and provide a broader team understanding, potential time savings and full transparency.","title":"Following the Plan and Agile Debugging"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#resources","text":"Tuckman's Stages of Group Development Scrum Values","title":"Resources"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/","text":"Virtual Collaboration and Pair Programming Pair programming is the de facto work method that most large engineering organizations use for \u201chands on keyboard\u201d coding. Two developers, working synchronously, looking at the same screen and attempting to code and design together, which often results in better and clearer code than either could produce individually. Pair programming works well under the correct circumstances, but it loses some of its charm when executed in a completely virtual setting. The virtual setup still involves two developers looking at the same screen and talking out their designs, but there are often logistical issues to deal with, including lag, microphone set up issues, workspace and personal considerations, and many other small, individually trivial problems that worsen the experience. Virtual work patterns are different from the in-person patterns we are accustomed to. Pair programming at its core is based on the following principles: Generating clarity through communication Producing higher quality through collaboration Creating ownership through equal contribution Pair programming is one way to achieve these results. Red Team Testing (RTT) is an alternate programming method that uses the same principles but with some of the advantages that virtual work methods provide. Red Team Testing (RTT) Red Team Testing borrows its name from the \u201cRed Team\u201d and \u201cBlue Team\u201d paradigm of penetration testing, and is a collaborative, parallel way of working virtually. In Red Team Testing, two developers jointly decide on the interface, architecture, and design of the program, and then separate for the implementation phase. One developer writes tests using the public interface, attempting to perform edge case testing, input validation, and otherwise stress testing the interface. The second developer is simultaneously writing the implementation which will eventually be tested. Red Team Testing has the same philosophy as any other Test-Driven Development lifecycle: All implementation is separated from the interface, and the interface can be tested with no knowledge of the implementation. Steps Design Phase: Both developers design the interface together. This includes: - Method signatures and names - Writing documentation or docstrings for what the methods are intended to do. - Architecture decisions that would influence testing (Factory patterns, etc.) Implementation Phase: The developers separate and parallelize work, while continuing to communicate. - Developer A will design the implementation of the methods, adhering to the previously decided design. - Developer B will concurrently write tests for the same method signatures, without knowing details of the implementation. Integration & Testing Phase: Both developers commit their code and run the tests. - Utopian Scenario: All tests run and pass correctly. - Realistic Scenario: The tests have either broken or failed due to flaws in testing. This leads to further clarification of the design and a discussion of why the tests failed. 
The developers will repeat the three phases until the code is functional and tested. When to Follow the RTT Strategy RTT works well under specific circumstances. If collaboration needs to happen virtually, and all communication is virtual, RTT reduces the need for constant communication while maintaining the benefits of a joint design session. This considers the human element: Virtual communication is more exhausting than in person communication. RTT also works well when there is complete consensus, or no consensus at all, on what purpose the code serves. Since creating the design jointly and agreeing to implement and test against it are part of the RTT method, RTT forcibly creates clarity through iteration and communication. Benefits RTT has many of the same benefits as Pair Programming and Test-Driven development but tries to update them for a virtual setting. Code implementation and testing can be done in parallel, over long distances or across time zones, which reduces the overall time taken to finish writing the code. RTT maintains the pair programming paradigm, while reducing the need for video communication or constant communication between developers. RTT allows detailed focus on design and engineering alignment before implementing any code, leading to cleaner and simpler interfaces. RTT encourages testing to be prioritized alongside implementation, instead of having testing follow or be influenced by the implementation of the code. Documentation is inherently a part of RTT, since both the implementer and the tester need correct, up to date documentation, in the implementation phase. What You Need for RTT to Work Well Demand for constant communication and good teamwork may pose a challenge; daily updates amongst team members are essential to maintain alignment on varying code requirements. Clarity of the code design and testing strategy must be established beforehand and documented as reference. Lack of an established design will cause misalignment between the two major pieces of work and a need for time-consuming refactoring. RTT does not work well if only one developer has knowledge of the overall design. Team communication is critical to ensuring that every developer involved in RTT is on the same page.","title":"Virtual Collaboration and Pair Programming"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#virtual-collaboration-and-pair-programming","text":"Pair programming is the de facto work method that most large engineering organizations use for \u201chands on keyboard\u201d coding. Two developers, working synchronously, looking at the same screen and attempting to code and design together, which often results in better and clearer code than either could produce individually. Pair programming works well under the correct circumstances, but it loses some of its charm when executed in a completely virtual setting. The virtual setup still involves two developers looking at the same screen and talking out their designs, but there are often logistical issues to deal with, including lag, microphone set up issues, workspace and personal considerations, and many other small, individually trivial problems that worsen the experience. Virtual work patterns are different from the in-person patterns we are accustomed to. 
Pair programming at its core is based on the following principles: Generating clarity through communication Producing higher quality through collaboration Creating ownership through equal contribution Pair programming is one way to achieve these results. Red Team Testing (RTT) is an alternate programming method that uses the same principles but with some of the advantages that virtual work methods provide.","title":"Virtual Collaboration and Pair Programming"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#red-team-testing-rtt","text":"Red Team Testing borrows its name from the \u201cRed Team\u201d and \u201cBlue Team\u201d paradigm of penetration testing, and is a collaborative, parallel way of working virtually. In Red Team Testing, two developers jointly decide on the interface, architecture, and design of the program, and then separate for the implementation phase. One developer writes tests using the public interface, attempting to perform edge case testing, input validation, and otherwise stress testing the interface. The second developer is simultaneously writing the implementation which will eventually be tested. Red Team Testing has the same philosophy as any other Test-Driven Development lifecycle: All implementation is separated from the interface, and the interface can be tested with no knowledge of the implementation.","title":"Red Team Testing (RTT)"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#steps","text":"Design Phase: Both developers design the interface together. This includes: - Method signatures and names - Writing documentation or docstrings for what the methods are intended to do. - Architecture decisions that would influence testing (Factory patterns, etc.) Implementation Phase: The developers separate and parallelize work, while continuing to communicate. - Developer A will design the implementation of the methods, adhering to the previously decided design. - Developer B will concurrently write tests for the same method signatures, without knowing details of the implementation. Integration & Testing Phase: Both developers commit their code and run the tests. - Utopian Scenario: All tests run and pass correctly. - Realistic Scenario: The tests have either broken or failed due to flaws in testing. This leads to further clarification of the design and a discussion of why the tests failed. The developers will repeat the three phases until the code is functional and tested.","title":"Steps"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#when-to-follow-the-rtt-strategy","text":"RTT works well under specific circumstances. If collaboration needs to happen virtually, and all communication is virtual, RTT reduces the need for constant communication while maintaining the benefits of a joint design session. This considers the human element: Virtual communication is more exhausting than in person communication. RTT also works well when there is complete consensus, or no consensus at all, on what purpose the code serves. Since creating the design jointly and agreeing to implement and test against it are part of the RTT method, RTT forcibly creates clarity through iteration and communication.","title":"When to Follow the RTT Strategy"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#benefits","text":"RTT has many of the same benefits as Pair Programming and Test-Driven development but tries to update them for a virtual setting. 
Code implementation and testing can be done in parallel, over long distances or across time zones, which reduces the overall time taken to finish writing the code. RTT maintains the pair programming paradigm, while reducing the need for video communication or constant communication between developers. RTT allows detailed focus on design and engineering alignment before implementing any code, leading to cleaner and simpler interfaces. RTT encourages testing to be prioritized alongside implementation, instead of having testing follow or be influenced by the implementation of the code. Documentation is inherently a part of RTT, since both the implementer and the tester need correct, up to date documentation, in the implementation phase.","title":"Benefits"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#what-you-need-for-rtt-to-work-well","text":"Demand for constant communication and good teamwork may pose a challenge; daily updates amongst team members are essential to maintain alignment on varying code requirements. Clarity of the code design and testing strategy must be established beforehand and documented as reference. Lack of an established design will cause misalignment between the two major pieces of work and a need for time-consuming refactoring. RTT does not work well if only one developer has knowledge of the overall design. Team communication is critical to ensuring that every developer involved in RTT is on the same page.","title":"What You Need for RTT to Work Well"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/","text":"Why Collaboration Why is Collaboration Important In engagements, we aim to be highly collaborative because when we code together, we perform better, have a higher sprint velocity, and have a greater degree of knowledge sharing across the team. There are two common patterns we use for collaboration: Pairing and swarming. Pair programming (\u201cpairing\u201d) - two software engineers assigned to, and working on, one shared story at a time during the sprint. The Dev Lead assigns a user story to two engineers -- one primary engineer (story owner) and one secondary engineer (pairing assignee). Swarm programming (\u201cswarming\u201d) - three or more software engineers collaborating on a high-priority item to bring it to completion. How to Pair Program As mentioned, every story is intentionally assigned to a pair. The pairing assignee may be in the process of upskilling, nevertheless, they are equal partners in the development effort. Below are some general guidelines for pairing: Upon assignment of the story/product backlog item (PBI), the pair needs to be deliberate about defining how to work together and have a firm definition of the work to be completed. This information should be expressed clearly in the story\u2019s description and acceptance criteria. The expectations about this need to be communicated and agreed upon by both engineers and should be done prior to any actual working sessions. The story owner and pairing assignee do not merely split the work up and sync regularly \u2013 they actively work together on the same tasks, and might share their screens via a Teams online session. Collaborative tools like VS Live Share can be preferable to sharing screens. Not all collaboration needs to be screen-share based. During the collaborative sessions, one engineer provides the development environment while the other actively views and comments verbally. 
Engineers trade places often from one session to the next so that everyone has time in control of the keyboard. Engineers leverage feature branches for the collaboration during the development of each story to have small Pull Requests (PRs) (as opposed to a single giant PR) at the end of the sprint. Code is committed to the repository by both members of the assigned pair where and when it makes sense as tasks were completed. The pairing assignee is the voice representing the pair during the daily standup while being supported by the story owner. Having the names of both individuals (owner and pair assignee) visible on the PBI can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. An example of this using Azure DevOps cards can be found here . Why Pair Programming Helps Collaboration Pair programming helps collaboration because both engineers share equal responsibility for bringing the story to completion. This is a mutually beneficial exercise because, while the story owner often has more experience to lean on, the pairing assignee brings a fresh view that is unclouded by repetition. Some other benefits include: Fewer defects and increased accountability. Having two sets of eyes allows the engineers more opportunity to catch errors and to remember often-overlooked tasks such as writing unit and integration tests. Pairing allows engineers with different experience and expertise to learn from one another by collaborating and receiving feedback in real-time. Instead of having an engineer work alone on a task for long hours and hit an isolation breaking point, pairing allows the pair to check in with one another. Even something as simple as describing the problem out loud can help uncover issues or bugs in the code. Pairing can help brainstorming as well as validating details such as making the variable names consistent. When to Swarm Program It is important to know that not every PBI needs to use swarming. Some sprints may not even warrant swarming at all. Swarm when: The work is complex enough to have collective minds collaborating (not because the quantity of work is more than what would be completed in one sprint). The task that the swarm works on has become (or is in imminent danger of becoming) a blocker to other stories. An unknown is discovered that needs a collaborative effort to form a decision on how to move forward. The collective knowledge and expertise help move the story forward more quickly and ultimately produced better quality code. A conflict or unresolved difference of opinion arises during a pairing session. Promote the work to become a swarming session to help resolve the conflict. How to Swarm Program As soon the pair finds out that the PBI will warrant swarming, the pair brings it up to the rest of the team (via parking lot during stand-up or asynchronously). Members of the team agree or volunteer to assist. The story owner (or pairing assignee) sends Teams call invite to the interested parties. This allows the swarm to have dedicated focus time by blocking time in calendars. During a swarming session, an engineer can branch out if there is something that needs to be handled while the swarm tackles the main problem at hand, then reconnects and reports back. This allows the swarm to focus on a core aspect and to be all on the same page. The Teams call is repeated until resolution is found or alternative path forward is formulated. 
Why Swarm Programming Helps Collaboration Swarming allows the collective knowledge and expertise of the team to come together in a focused and unified way. Not only does swarming help close out the item faster, but it also helps the team understand each other\u2019s strengths and weaknesses. Allows the team to build a higher level of trust and work as a cohesive unit. When to Decide to Swarm, Pair, and/or Split While a lot of time can be spent on pair programming, it does make sense to split the work when folks understand how the work will be carried out, and the work to be done is largely prescriptive. Once the story has been jointly tasked out by both engineers, the engineers may choose to tackle some tasks separately and then combine the work together at the end. Pair programming is more helpful when the engineers do not have perfect clarity about what is needed to be done or how it can be done. Swarming is done when the two engineers assigned to the story need an additional sounding board or need expertise that other team members could provide. Benefits of Increased Collaboration Knowledge sharing and bringing ISE and customer engineers together in a \u2018code-with\u2019 manner is an important aspect of ISE engagements. This grows both our customers\u2019 and our ISE team\u2019s capability to build on Azure. We are responsible for demonstrating engineering fundamentals and leaving the customer in a better place after we disengage. This can only happen if we collaborate and engage together as a team. In addition to improved software quality, this also adds a beneficial social aspect to the engagements. Resources How to add a pairing custom field in Azure DevOps User Stories - adding a custom field of type Identity in Azure DevOps for pairing On Pair Programming - Martin Fowler Pair Programming hands-on lessons - these can be used (and adapted) to support bringing pair programming into your team (MS internal or including customers) Effortless Pair Programming with GitHub Codespaces and VSCode","title":"Why Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-collaboration","text":"","title":"Why Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-is-collaboration-important","text":"In engagements, we aim to be highly collaborative because when we code together, we perform better, have a higher sprint velocity, and have a greater degree of knowledge sharing across the team. There are two common patterns we use for collaboration: Pairing and swarming. Pair programming (\u201cpairing\u201d) - two software engineers assigned to, and working on, one shared story at a time during the sprint. The Dev Lead assigns a user story to two engineers -- one primary engineer (story owner) and one secondary engineer (pairing assignee). Swarm programming (\u201cswarming\u201d) - three or more software engineers collaborating on a high-priority item to bring it to completion.","title":"Why is Collaboration Important"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#how-to-pair-program","text":"As mentioned, every story is intentionally assigned to a pair. The pairing assignee may be in the process of upskilling, nevertheless, they are equal partners in the development effort. 
Below are some general guidelines for pairing: Upon assignment of the story/product backlog item (PBI), the pair needs to be deliberate about defining how to work together and have a firm definition of the work to be completed. This information should be expressed clearly in the story\u2019s description and acceptance criteria. The expectations about this need to be communicated and agreed upon by both engineers and should be done prior to any actual working sessions. The story owner and pairing assignee do not merely split the work up and sync regularly \u2013 they actively work together on the same tasks, and might share their screens via a Teams online session. Collaborative tools like VS Live Share can be preferable to sharing screens. Not all collaboration needs to be screen-share based. During the collaborative sessions, one engineer provides the development environment while the other actively views and comments verbally. Engineers trade places often from one session to the next so that everyone has time in control of the keyboard. Engineers leverage feature branches for the collaboration during the development of each story to have small Pull Requests (PRs) (as opposed to a single giant PR) at the end of the sprint. Code is committed to the repository by both members of the assigned pair where and when it makes sense as tasks were completed. The pairing assignee is the voice representing the pair during the daily standup while being supported by the story owner. Having the names of both individuals (owner and pair assignee) visible on the PBI can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. An example of this using Azure DevOps cards can be found here .","title":"How to Pair Program"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-pair-programming-helps-collaboration","text":"Pair programming helps collaboration because both engineers share equal responsibility for bringing the story to completion. This is a mutually beneficial exercise because, while the story owner often has more experience to lean on, the pairing assignee brings a fresh view that is unclouded by repetition. Some other benefits include: Fewer defects and increased accountability. Having two sets of eyes allows the engineers more opportunity to catch errors and to remember often-overlooked tasks such as writing unit and integration tests. Pairing allows engineers with different experience and expertise to learn from one another by collaborating and receiving feedback in real-time. Instead of having an engineer work alone on a task for long hours and hit an isolation breaking point, pairing allows the pair to check in with one another. Even something as simple as describing the problem out loud can help uncover issues or bugs in the code. Pairing can help brainstorming as well as validating details such as making the variable names consistent.","title":"Why Pair Programming Helps Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#when-to-swarm-program","text":"It is important to know that not every PBI needs to use swarming. Some sprints may not even warrant swarming at all. Swarm when: The work is complex enough to have collective minds collaborating (not because the quantity of work is more than what would be completed in one sprint). The task that the swarm works on has become (or is in imminent danger of becoming) a blocker to other stories. 
An unknown is discovered that needs a collaborative effort to form a decision on how to move forward. The collective knowledge and expertise help move the story forward more quickly and ultimately produced better quality code. A conflict or unresolved difference of opinion arises during a pairing session. Promote the work to become a swarming session to help resolve the conflict.","title":"When to Swarm Program"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#how-to-swarm-program","text":"As soon the pair finds out that the PBI will warrant swarming, the pair brings it up to the rest of the team (via parking lot during stand-up or asynchronously). Members of the team agree or volunteer to assist. The story owner (or pairing assignee) sends Teams call invite to the interested parties. This allows the swarm to have dedicated focus time by blocking time in calendars. During a swarming session, an engineer can branch out if there is something that needs to be handled while the swarm tackles the main problem at hand, then reconnects and reports back. This allows the swarm to focus on a core aspect and to be all on the same page. The Teams call is repeated until resolution is found or alternative path forward is formulated.","title":"How to Swarm Program"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-swarm-programming-helps-collaboration","text":"Swarming allows the collective knowledge and expertise of the team to come together in a focused and unified way. Not only does swarming help close out the item faster, but it also helps the team understand each other\u2019s strengths and weaknesses. Allows the team to build a higher level of trust and work as a cohesive unit.","title":"Why Swarm Programming Helps Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#when-to-decide-to-swarm-pair-andor-split","text":"While a lot of time can be spent on pair programming, it does make sense to split the work when folks understand how the work will be carried out, and the work to be done is largely prescriptive. Once the story has been jointly tasked out by both engineers, the engineers may choose to tackle some tasks separately and then combine the work together at the end. Pair programming is more helpful when the engineers do not have perfect clarity about what is needed to be done or how it can be done. Swarming is done when the two engineers assigned to the story need an additional sounding board or need expertise that other team members could provide.","title":"When to Decide to Swarm, Pair, and/or Split"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#benefits-of-increased-collaboration","text":"Knowledge sharing and bringing ISE and customer engineers together in a \u2018code-with\u2019 manner is an important aspect of ISE engagements. This grows both our customers\u2019 and our ISE team\u2019s capability to build on Azure. We are responsible for demonstrating engineering fundamentals and leaving the customer in a better place after we disengage. This can only happen if we collaborate and engage together as a team. 
In addition to improved software quality, this also adds a beneficial social aspect to the engagements.","title":"Benefits of Increased Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#resources","text":"How to add a pairing custom field in Azure DevOps User Stories - adding a custom field of type Identity in Azure DevOps for pairing On Pair Programming - Martin Fowler Pair Programming hands-on lessons - these can be used (and adapted) to support bringing pair programming into your team (MS internal or including customers) Effortless Pair Programming with GitHub Codespaces and VSCode","title":"Resources"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/","text":"Delivery Plan Goals While Scrum does not require, and in fact discourages, planning more than one sprint at a time, most of us work in enterprises where we depend on outside teams (for example: marketing, sales, support). A rough assessment of whether the planned project scope is achievable within a reasonable time frame and with the available resources is therefore valuable. The goal is to have a rough plan and estimate as a starting point, not to implement \"Agilefall.\" Note that this is just a starting point to enable planning discussions. We expect the actual schedule to evolve and shift over time and that you will update the scope and timeline as you progress. Delivery Plans ensure your teams are aligned with your organizational goals. Benefits As you complete the assessment, you can push back on the scope, time frame or ask for more resources. As you progress in your project/product delivery, you can highlight risks to the scope, time frame, and resources. Approach One approach you can take to accomplish this is with stickies and a spreadsheet. Stack rank the features for everything in your backlog - Functional Features - Non-functional Features - User Research and Design - Testing - Documentation - Knowledge Transfer/Support Processes T-shirt size the features in terms of working weeks per person. In some scenarios, you have no idea how complex the work is. In this situation, you can ask for time to conduct a spike (timebox the effort so you can get back on time). Calculate the capacity for the team based on each person's number of working weeks (from their start date to their end date), minus holidays, vacation, conferences, training, and onboarding days. Also subtract time if the person is also working on defects and support. Based on your capacity, you now have the following options: Ask for more resources. Caution: onboarding new resources takes time. Reduce the scope to the leanest MVP. Caution: as you trim more of the scope, it might not be valuable anymore to the customer. Consider a cupcake, which is everything you need; you don't want to skim off the frosting. Ask for more time. Usually, this is the most flexible option, but if there is a marketing date that you need to hit, this might not be as flexible. Tools You can also leverage one of these tools by creating your epics and features and adding the week estimates. The Plans (Preview) feature on Azure DevOps will help you make a plan. Delivery Plans provide a schedule of stories or features your team plans to deliver. Delivery Plans show the scheduled work items by a sprint (iteration path) of selected teams against a calendar view.
Confluence, JIRA, Trello, Rally, Asana, Basecamp, and GitHub Issues are other similar tools in the market (some are free, others charge a monthly fee, or you can install them on-prem) that you can leverage.","title":"Delivery Plan"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#delivery-plan","text":"","title":"Delivery Plan"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#goals","text":"While Scrum does not require, and in fact discourages, planning more than one sprint at a time, most of us work in enterprises where we depend on outside teams (for example: marketing, sales, support). A rough assessment of whether the planned project scope is achievable within a reasonable time frame and with the available resources is therefore valuable. The goal is to have a rough plan and estimate as a starting point, not to implement \"Agilefall.\" Note that this is just a starting point to enable planning discussions. We expect the actual schedule to evolve and shift over time and that you will update the scope and timeline as you progress. Delivery Plans ensure your teams are aligned with your organizational goals.","title":"Goals"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#benefits","text":"As you complete the assessment, you can push back on the scope, time frame or ask for more resources. As you progress in your project/product delivery, you can highlight risks to the scope, time frame, and resources.","title":"Benefits"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#approach","text":"One approach you can take to accomplish this is with stickies and a spreadsheet. Stack rank the features for everything in your backlog - Functional Features - Non-functional Features - User Research and Design - Testing - Documentation - Knowledge Transfer/Support Processes T-shirt size the features in terms of working weeks per person. In some scenarios, you have no idea how complex the work is. In this situation, you can ask for time to conduct a spike (timebox the effort so you can get back on time). Calculate the capacity for the team based on each person's number of working weeks (from their start date to their end date), minus holidays, vacation, conferences, training, and onboarding days. Also subtract time if the person is also working on defects and support. Based on your capacity, you now have the following options: Ask for more resources. Caution: onboarding new resources takes time. Reduce the scope to the leanest MVP. Caution: as you trim more of the scope, it might not be valuable anymore to the customer. Consider a cupcake, which is everything you need; you don't want to skim off the frosting. Ask for more time. Usually, this is the most flexible option, but if there is a marketing date that you need to hit, this might not be as flexible.","title":"Approach"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#tools","text":"You can also leverage one of these tools by creating your epics and features and adding the week estimates. The Plans (Preview) feature on Azure DevOps will help you make a plan. Delivery Plans provide a schedule of stories or features your team plans to deliver. Delivery Plans show the scheduled work items by a sprint (iteration path) of selected teams against a calendar view.
Confluence, JIRA, Trello, Rally, Asana, Basecamp, and GitHub Issues are other similar tools in the market (some are free, others charge a monthly fee, or you can install them on-prem) that you can leverage.","title":"Tools"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/","text":"Scrum of Scrums Scrum of scrums is a technique used to scale Scrum to a larger group working towards the same project goal. In Scrum, we consider a team to be too big when it goes over 10-12 individuals. This should be decided on a case by case basis. If the project is set up in multiple work streams that contain a fixed group of people and a common stand-up meeting is slowing down productivity, scrum of scrums should be considered. The team would identify the different subgroups that would act as separate scrum teams with their own backlog, board and stand-up. Goals The goal of the scrum of scrums ceremony is to give sub-teams the agility they need while not losing visibility and coordination. It also helps to ensure that the sub-teams are achieving their sprint goals, and they are going in the right direction to achieve the overall project goal. The scrum of scrums ceremony happens every day and can be seen as a regular stand-up: What was done the day before by the sub-team. What will be done today by the sub-team. What are blockers or other issues for the sub-team. What are the blockers or issues that may impact other sub-teams. The outcome of the meeting is a list of impediments related to coordination of the whole project. Solutions could be: agreeing on interfaces between teams, discussing architecture changes, evolving responsibility boundaries, etc. This list of impediments is usually managed in a separate backlog, but it does not have to be. Participation The common guideline is to have on average one person per sub-team participate in the scrum of scrums. Ideally, the Process Lead of each sub-team would represent them in this ceremony. In some instances, the representative for the day is selected at the end of each sub-team's daily stand-up and could change every day. In practice, having a fixed representative tends to be more efficient in the long term. Impact This practice is helpful for longer projects with a larger scope, requiring more people. With more people, it is usually easier to divide the project into sub-teams. Having a daily scrum of scrums improves communication, lowers the risk of integration issues and increases the project's chances of success. When choosing to implement Scrum of Scrums, you need to keep in mind that some team members will have additional meetings to coordinate and participate in. Also, all team members of each sub-team need to be updated on the decisions at a later point to ensure a good flow of information. Measures The easiest way to measure the impact is by tracking the time to resolve issues in the scrum of scrums backlog. You can also track issues reported during the retrospective related to global coordination (is it well done? can it be improved?). Facilitation Guidance This should be facilitated like a regular stand-up.","title":"Scrum of Scrums"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#scrum-of-scrums","text":"Scrum of scrums is a technique used to scale Scrum to a larger group working towards the same project goal. In Scrum, we consider a team to be too big when it goes over 10-12 individuals. This should be decided on a case by case basis.
If the project is set up in multiple work streams that contain a fixed group of people and a common stand-up meeting is slowing down productivity, scrum of scrums should be considered. The team would identify the different subgroups that would act as separate scrum teams with their own backlog, board and stand-up.","title":"Scrum of Scrums"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#goals","text":"The goal of the scrum of scrums ceremony is to give sub-teams the agility they need while not losing visibility and coordination. It also helps to ensure that the sub-teams are achieving their sprint goals, and they are going in the right direction to achieve the overall project goal. The scrum of scrums ceremony happens every day and can be seen as a regular stand-up: What was done the day before by the sub-team. What will be done today by the sub-team. What are blockers or other issues for the sub-team. What are the blockers or issues that may impact other sub-teams. The outcome of the meeting is a list of impediments related to coordination of the whole project. Solutions could be: agreeing on interfaces between teams, discussing architecture changes, evolving responsibility boundaries, etc. This list of impediments is usually managed in a separate backlog, but it does not have to be.","title":"Goals"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#participation","text":"The common guideline is to have on average one person per sub-team participate in the scrum of scrums. Ideally, the Process Lead of each sub-team would represent them in this ceremony. In some instances, the representative for the day is selected at the end of each sub-team's daily stand-up and could change every day. In practice, having a fixed representative tends to be more efficient in the long term.","title":"Participation"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#impact","text":"This practice is helpful for longer projects with a larger scope, requiring more people. With more people, it is usually easier to divide the project into sub-teams. Having a daily scrum of scrums improves communication, lowers the risk of integration issues and increases the project's chances of success. When choosing to implement Scrum of Scrums, you need to keep in mind that some team members will have additional meetings to coordinate and participate in. Also, all team members of each sub-team need to be updated on the decisions at a later point to ensure a good flow of information.","title":"Impact"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#measures","text":"The easiest way to measure the impact is by tracking the time to resolve issues in the scrum of scrums backlog. You can also track issues reported during the retrospective related to global coordination (is it well done? can it be improved?).","title":"Measures"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#facilitation-guidance","text":"This should be facilitated like a regular stand-up.","title":"Facilitation Guidance"},{"location":"agile-development/team-agreements/definition-of-done/","text":"Definition of Done To close a user story, a sprint, or a milestone, it is important to verify that the tasks are complete. The development team should decide together what their Definition of Done is and document this in the project.
Below are some examples of checks to verify that the user story, sprint, task is completed. Feature/User Story Acceptance criteria are met Refactoring is complete Code builds with no error Unit tests are written and pass Existing Unit Tests pass Sufficient diagnostics/telemetry are logged Code review is complete UX review is complete (if applicable) Documentation is updated The feature is merged into the develop branch The feature is signed off by the product owner Sprint Goal Definition of Done for all user stories included in the sprint are met Product backlog is updated Functional and Integration tests pass Performance tests pass End 2 End tests pass All bugs are fixed The sprint is signed off from developers, software architects, project manager, product owner etc. Release/Milestone Code Complete (goals of sprints are met) Release is marked as ready for production deployment by product owner","title":"Definition of Done"},{"location":"agile-development/team-agreements/definition-of-done/#definition-of-done","text":"To close a user story, a sprint, or a milestone it is important to verify that the tasks are complete. The development team should decide together what their Definition of Done is and document this in the project. Below are some examples of checks to verify that the user story, sprint, task is completed.","title":"Definition of Done"},{"location":"agile-development/team-agreements/definition-of-done/#featureuser-story","text":"Acceptance criteria are met Refactoring is complete Code builds with no error Unit tests are written and pass Existing Unit Tests pass Sufficient diagnostics/telemetry are logged Code review is complete UX review is complete (if applicable) Documentation is updated The feature is merged into the develop branch The feature is signed off by the product owner","title":"Feature/User Story"},{"location":"agile-development/team-agreements/definition-of-done/#sprint-goal","text":"Definition of Done for all user stories included in the sprint are met Product backlog is updated Functional and Integration tests pass Performance tests pass End 2 End tests pass All bugs are fixed The sprint is signed off from developers, software architects, project manager, product owner etc.","title":"Sprint Goal"},{"location":"agile-development/team-agreements/definition-of-done/#releasemilestone","text":"Code Complete (goals of sprints are met) Release is marked as ready for production deployment by product owner","title":"Release/Milestone"},{"location":"agile-development/team-agreements/definition-of-ready/","text":"Definition of Ready When the development team picks a user story from the top of the backlog, the user story needs to have enough detail to estimate the work needed to complete the story within the sprint. If it has enough detail to estimate, it is Ready to be developed. If a user story is not Ready in the beginning of the Sprint it increases the chance that the story will not be done at the end of this sprint. What it is Definition of Ready is the agreement made by the scrum team around how complete a user story should be in order to be selected as candidate for estimation in the sprint planning. These can be codified as a checklist in user stories using GitHub Issue Templates or Azure DevOps Work Item Templates . It can be understood as a checklist that helps the Product Owner to ensure that the user story they wrote contains all the necessary details for the scrum team to understand the work to be done. 
Examples of Ready Checklist Items Does the description have the details including any input values required to implement the user story? Does the user story have clear and complete acceptance criteria? Does the user story address the business need? Can we measure the acceptance criteria? Is the user story small enough to be implemented in a short amount of time, but large enough to provide value to the customer? Is the user story blocked? For example, does it depend on any of the following: The completion of unfinished work A deliverable provided by another team (code artifact, data, etc...) Who Writes it The ready checklist can be written by a Product Owner in agreement with the development team and the Process Lead. When Should a Definition of Ready be Updated Update or change the definition of ready anytime the scrum team observes that there is missing information in the user stories that recurrently impacts the planning. What Should be Avoided The ready checklist should contain items that apply broadly. Don't include items or details that only apply to one or two user stories. This may become an overhead when writing the user stories. How to get Stories Ready In the case that the highest priority work is not yet ready, it still may be possible to make forward progress. Here are some strategies that may help: Backlog Refinement sessions are a good time to validate that high priority user stories have a clear description, acceptance criteria and demonstrable business value. It is also a good time to break down large stories that will likely not be completable in a single sprint. Prioritization sessions are a good time to prioritize user stories that unblock other blocked high priority work. Blocked user stories can often be broken down in a way that unblocks a portion of the original story's scope. This is a good way to make forward progress even when some work is blocked.","title":"Definition of Ready"},{"location":"agile-development/team-agreements/definition-of-ready/#definition-of-ready","text":"When the development team picks a user story from the top of the backlog, the user story needs to have enough detail to estimate the work needed to complete the story within the sprint. If it has enough detail to estimate, it is Ready to be developed. If a user story is not Ready in the beginning of the Sprint it increases the chance that the story will not be done at the end of this sprint.","title":"Definition of Ready"},{"location":"agile-development/team-agreements/definition-of-ready/#what-it-is","text":"Definition of Ready is the agreement made by the scrum team around how complete a user story should be in order to be selected as candidate for estimation in the sprint planning. These can be codified as a checklist in user stories using GitHub Issue Templates or Azure DevOps Work Item Templates . It can be understood as a checklist that helps the Product Owner to ensure that the user story they wrote contains all the necessary details for the scrum team to understand the work to be done.","title":"What it is"},{"location":"agile-development/team-agreements/definition-of-ready/#examples-of-ready-checklist-items","text":"Does the description have the details including any input values required to implement the user story? Does the user story have clear and complete acceptance criteria? Does the user story address the business need? Can we measure the acceptance criteria?
Is the user story small enough to be implemented in a short amount of time, but large enough to provide value to the customer? Is the user story blocked? For example, does it depend on any of the following: The completion of unfinished work A deliverable provided by another team (code artifact, data, etc...)","title":"Examples of Ready Checklist Items"},{"location":"agile-development/team-agreements/definition-of-ready/#who-writes-it","text":"The ready checklist can be written by a Product Owner in agreement with the development team and the Process Lead.","title":"Who Writes it"},{"location":"agile-development/team-agreements/definition-of-ready/#when-should-a-definition-of-ready-be-updated","text":"Update or change the definition of ready anytime the scrum team observes that there are missing information in the user stories that recurrently impacts the planning.","title":"When Should a Definition of Ready be Updated"},{"location":"agile-development/team-agreements/definition-of-ready/#what-should-be-avoided","text":"The ready checklist should contain items that apply broadly. Don't include items or details that only apply to one or two user stories. This may become an overhead when writing the user stories.","title":"What Should be Avoided"},{"location":"agile-development/team-agreements/definition-of-ready/#how-to-get-stories-ready","text":"In the case that the highest priority work is not yet ready, it still may be possible to make forward progress. Here are some strategies that may help: Backlog Refinement sessions are a good time to validate that high priority user stories are verified to have a clear description, acceptance criteria and demonstrable business value. It is also a good time to breakdown large stories will likely not be completable in a single sprint. Prioritization sessions are a good time to prioritize user stories that unblock other blocked high priority work. Blocked user stories can often be broken down in a way that unblocks a portion of the original stories scope. This is a good way to make forward progress even when some work is blocked.","title":"How to get Stories Ready"},{"location":"agile-development/team-agreements/team-manifesto/","text":"Team Manifesto Introduction ISE teams work with a new development team in each customer engagement which requires a phase of introduction & knowledge transfer before starting an engagement. Completion of this phase of ice-breakers and discussions about the standards takes time, but is required to start increasing the learning curve of the new team. A team manifesto is a light-weight one page agile document among team members which summarizes the basic principles and values of the team and aiming to provide a consensus about technical expectations from each team member in order to deliver high quality output at the end of each engagement. It aims to reduce the time on setting the right expectations without arranging longer \"team document reading\" meetings and provide a consensus among team members to answer the question - \"How does the new team develop the software?\" - by covering all engineering fundamentals and excellence topics such as release process, clean coding, testing. Another main goal of writing the manifesto is to start a conversation during the \"manifesto building session\" to detect any differences of opinion around how the team should work. It also serves in the same way when a new team member joins to the team. New joiners can quickly get up to speed on the agreed standards. 
How to Build a Team Manifesto It can be said that the best time to start building it is at the very early phase of the engagement when teams meet with each other for swarming or during the preparation phase. It is recommended to keep team manifesto as simple as possible, so preferably, one-page simple document which doesn't include any references or links is a nice format for it. If there is a need for providing knowledge on certain topics, the way to do is delivering brown-bag sessions, technical katas, team practices, documentations and others later on. A few important points about the team manifesto The team manifesto is built by the development team itself It should cover all required technical engineering points for the excellence as well as behavioral agility mindset items that the team finds relevant It aims to give a common understanding about the desired expertise, practices and/or mindset within the team Based on the needs of the team and retrospective results, it can be modified during the engagement. In ISE, we aim for quality over quantity, and well-crafted software as well as to a comfortable/transparent environment where each team member can reach their highest potential. The difference between the team manifesto and other team documents is that it is used to give a short summary of expectations around the technical way of working and supported mindset in the team, before code-with sprints starts. Below, you can find some including, but not limited, topics many teams touch during engagements, Topic What is it about ? Collective Ownership Does team own the code rather than individuals? What is the expectation? Respect Any preferred statement about it's a \"must-have\" team value Collaboration Any preferred statement about how does team want to collaborate ? Transparency A simple statement about it's a \"must-have\" team value and if preferred, how does this being provided by the team ? meetings, retrospective, feedback mechanisms etc. Craftspersonship Which tools such as Git, VS Code LiveShare, etc. are being used ? What is the definition of expected best usage of them? PR sizing What does team prefer in PRs ? Branching Team's branching strategy and standards Commit standards Preferred format in commit messages, rules and more Clean Code Does team follow clean code principles ? Pair/Mob Programming Will team apply pair/mob programming ? If yes, what programming styles are suitable for the team ? Release Process Principles around release process such as quality gates, reviewing process ...etc. Code Review Any rule for code reviewing such as min number of reviewers, team rules ...etc. Action Readiness How the backlog will be refined? How do we ensure clear Definition of Done and Acceptance Criteria ? TDD Will the team follow TDD ? Test Coverage Is there any expected number, percentage or measurement ? Dimensions in Testing Required tests for high quality software, eg : unit, integration, functional, performance, regression, acceptance Build process build for all? or not; The clear statement of where code and under what conditions code should work ? eg : OS, DevOps, tool dependency Bug fix The rules of bug fixing in the team ? eg: contact people, attaching PR to the issue etc. Technical debt How does team manage/follow it? Refactoring How does team manage/follow it? Agile Documentation Does team want to use diagrams and tables more rather than detailed KB articles ? Efficient Documentation When is it necessary ? Is it a prerequisite to complete tasks/PRs etc.? 
Definition of Fun How will we have fun for relaxing/enjoying the team spirit during the engagement? Tools Generally team sessions are enough for building a manifesto and having a consensus around it, and if there is a need for improving it in a structured way, there are many blogs and tools online, any retrospective tool can be used. Resources Technical Agility*","title":"Team Manifesto"},{"location":"agile-development/team-agreements/team-manifesto/#team-manifesto","text":"","title":"Team Manifesto"},{"location":"agile-development/team-agreements/team-manifesto/#introduction","text":"ISE teams work with a new development team in each customer engagement which requires a phase of introduction & knowledge transfer before starting an engagement. Completion of this phase of ice-breakers and discussions about the standards takes time, but is required to start increasing the learning curve of the new team. A team manifesto is a light-weight one page agile document among team members which summarizes the basic principles and values of the team and aiming to provide a consensus about technical expectations from each team member in order to deliver high quality output at the end of each engagement. It aims to reduce the time on setting the right expectations without arranging longer \"team document reading\" meetings and provide a consensus among team members to answer the question - \"How does the new team develop the software?\" - by covering all engineering fundamentals and excellence topics such as release process, clean coding, testing. Another main goal of writing the manifesto is to start a conversation during the \"manifesto building session\" to detect any differences of opinion around how the team should work. It also serves in the same way when a new team member joins to the team. New joiners can quickly get up to speed on the agreed standards.","title":"Introduction"},{"location":"agile-development/team-agreements/team-manifesto/#how-to-build-a-team-manifesto","text":"It can be said that the best time to start building it is at the very early phase of the engagement when teams meet with each other for swarming or during the preparation phase. It is recommended to keep team manifesto as simple as possible, so preferably, one-page simple document which doesn't include any references or links is a nice format for it. If there is a need for providing knowledge on certain topics, the way to do is delivering brown-bag sessions, technical katas, team practices, documentations and others later on. A few important points about the team manifesto The team manifesto is built by the development team itself It should cover all required technical engineering points for the excellence as well as behavioral agility mindset items that the team finds relevant It aims to give a common understanding about the desired expertise, practices and/or mindset within the team Based on the needs of the team and retrospective results, it can be modified during the engagement. In ISE, we aim for quality over quantity, and well-crafted software as well as to a comfortable/transparent environment where each team member can reach their highest potential. The difference between the team manifesto and other team documents is that it is used to give a short summary of expectations around the technical way of working and supported mindset in the team, before code-with sprints starts. Below, you can find some including, but not limited, topics many teams touch during engagements, Topic What is it about ? 
Collective Ownership Does team own the code rather than individuals? What is the expectation? Respect Any preferred statement about it's a \"must-have\" team value Collaboration Any preferred statement about how does team want to collaborate ? Transparency A simple statement about it's a \"must-have\" team value and if preferred, how does this being provided by the team ? meetings, retrospective, feedback mechanisms etc. Craftspersonship Which tools such as Git, VS Code LiveShare, etc. are being used ? What is the definition of expected best usage of them? PR sizing What does team prefer in PRs ? Branching Team's branching strategy and standards Commit standards Preferred format in commit messages, rules and more Clean Code Does team follow clean code principles ? Pair/Mob Programming Will team apply pair/mob programming ? If yes, what programming styles are suitable for the team ? Release Process Principles around release process such as quality gates, reviewing process ...etc. Code Review Any rule for code reviewing such as min number of reviewers, team rules ...etc. Action Readiness How the backlog will be refined? How do we ensure clear Definition of Done and Acceptance Criteria ? TDD Will the team follow TDD ? Test Coverage Is there any expected number, percentage or measurement ? Dimensions in Testing Required tests for high quality software, eg : unit, integration, functional, performance, regression, acceptance Build process build for all? or not; The clear statement of where code and under what conditions code should work ? eg : OS, DevOps, tool dependency Bug fix The rules of bug fixing in the team ? eg: contact people, attaching PR to the issue etc. Technical debt How does team manage/follow it? Refactoring How does team manage/follow it? Agile Documentation Does team want to use diagrams and tables more rather than detailed KB articles ? Efficient Documentation When is it necessary ? Is it a prerequisite to complete tasks/PRs etc.? Definition of Fun How will we have fun for relaxing/enjoying the team spirit during the engagement?","title":"How to Build a Team Manifesto"},{"location":"agile-development/team-agreements/team-manifesto/#tools","text":"Generally team sessions are enough for building a manifesto and having a consensus around it, and if there is a need for improving it in a structured way, there are many blogs and tools online, any retrospective tool can be used.","title":"Tools"},{"location":"agile-development/team-agreements/team-manifesto/#resources","text":"Technical Agility*","title":"Resources"},{"location":"agile-development/team-agreements/working-agreement/","text":"Sections of a Working Agreement A working agreement is a document, or a set of documents that describe how we work together as a team and what our expectations and principles are. The working agreement created by the team at the beginning of the project, and is stored in the repository so that it is readily available for everyone working on the project. The following are examples of sections and points that can be part of a working agreement but each team should compose their own, and adjust times, communication channels, branch naming policies etc. to fit their team needs. General We work as one team towards a common goal and clear scope We make sure everyone's voice is heard, listened to We show all team members equal respect We work as a team to have common expectations for technical delivery that are documented in a Team Manifesto . 
We make sure to spread our expertise and skills in the team, so no single person is relied on for one skill All times below are listed in CET Communication We communicate all information relevant to the team through the Project Teams channel We add all technical spikes , trade studies , and other technical documentation to the project repository through async design reviews in PRs Work-life Balance Our office hours, when we can expect to collaborate via Microsoft Teams, phone or face-to-face are Monday to Friday 10AM - 5PM We are not expected to answer emails past 6PM, on weekends or when we are on holidays or vacation. We work in different time zones and respect this, especially when setting up recurring meetings. We record meetings when possible, so that team members who could not attend live can listen later. Quality and not Quantity We agree on a Definition of Done for our user story's and sprints and live by it. We follow engineering best practices like the Engineering Fundamentals Engineering Playbook Scrum Rhythm Activity When Duration Who Accountable Goal Project Standup Tue-Fri 9AM 15 min Everyone Process Lead What has been accomplished, next steps, blockers Sprint Demo Monday 9AM 1 hour Everyone Dev Lead Present work done and sign off on user story completion Sprint Retro Monday 10AM 1 hour Everyone Process Lead Dev Teams shares learnings and what can be improved Sprint Planning Monday 11AM 1 hour Everyone PO Size and plan user stories for the sprint Task Creation After Sprint Planning - Dev Team Dev Lead Create tasks to clarify and determine velocity Backlog refinement Wednesday 2PM 1 hour Dev Lead, PO PO Prepare for next sprint and ensure that stories are ready for next sprint. Process Lead The Process Lead is responsible for leading any scrum or agile practices to enable the project to move forward. Facilitate standup meetings and hold team accountable for attendance and participation. Keep the meeting moving as described in the Project Standup page. Make sure all action items are documented and ensure each has an owner and a due date and tracks the open issues. Notes as needed after planning / stand-ups. Make sure that items are moved to the parking lot and ensure follow-up afterwards. Maintain a location showing team\u2019s work and status and removing impediments that are blocking the team. Hold the team accountable for results in a supportive fashion. Make sure that project and program documentation are up-to-date. Guarantee the tracking/following up on action items from retrospectives (iteration and release planning) and from daily standup meetings. Facilitate the sprint retrospective. Coach Product Owner and the team in the process, as needed. Backlog Management We work together on a Definition of Ready and all user stories assigned to a sprint need to follow this We communicate what we are working on through the board We assign ourselves a task when we are ready to work on it (not before) and move it to active We capture any work we do related to the project in a user story/task We close our tasks/user stories only when they are done (as described in the Definition of Done ) We work with the PM if we want to add a new user story to the sprint If we add new tasks to the board, we make sure it matches the acceptance criteria of the user story (to avoid scope creep). If it doesn't match the acceptance criteria we should discuss with the PM to see if we need a new user story for the task or if we should adjust the acceptance criteria. 
Code Management We follow the git flow branch naming convention for branches and identify the task number e.g. feature/123-add-working-agreement We merge all code into main branches through PRs All PRs are reviewed by one person from and one from Microsoft (for knowledge transfer and to ensure code and security standards are met) We always review existing PRs before starting work on a new task We look through open PRs at the end of stand-up to make sure all PRs have reviewers. We treat documentation as code and apply the same standards to Markdown as code","title":"Sections of a Working Agreement"},{"location":"agile-development/team-agreements/working-agreement/#sections-of-a-working-agreement","text":"A working agreement is a document, or a set of documents that describe how we work together as a team and what our expectations and principles are. The working agreement created by the team at the beginning of the project, and is stored in the repository so that it is readily available for everyone working on the project. The following are examples of sections and points that can be part of a working agreement but each team should compose their own, and adjust times, communication channels, branch naming policies etc. to fit their team needs.","title":"Sections of a Working Agreement"},{"location":"agile-development/team-agreements/working-agreement/#general","text":"We work as one team towards a common goal and clear scope We make sure everyone's voice is heard, listened to We show all team members equal respect We work as a team to have common expectations for technical delivery that are documented in a Team Manifesto . We make sure to spread our expertise and skills in the team, so no single person is relied on for one skill All times below are listed in CET","title":"General"},{"location":"agile-development/team-agreements/working-agreement/#communication","text":"We communicate all information relevant to the team through the Project Teams channel We add all technical spikes , trade studies , and other technical documentation to the project repository through async design reviews in PRs","title":"Communication"},{"location":"agile-development/team-agreements/working-agreement/#work-life-balance","text":"Our office hours, when we can expect to collaborate via Microsoft Teams, phone or face-to-face are Monday to Friday 10AM - 5PM We are not expected to answer emails past 6PM, on weekends or when we are on holidays or vacation. We work in different time zones and respect this, especially when setting up recurring meetings. We record meetings when possible, so that team members who could not attend live can listen later.","title":"Work-life Balance"},{"location":"agile-development/team-agreements/working-agreement/#quality-and-not-quantity","text":"We agree on a Definition of Done for our user story's and sprints and live by it. 
We follow engineering best practices like the Engineering Fundamentals Engineering Playbook","title":"Quality and not Quantity"},{"location":"agile-development/team-agreements/working-agreement/#scrum-rhythm","text":"Activity When Duration Who Accountable Goal Project Standup Tue-Fri 9AM 15 min Everyone Process Lead What has been accomplished, next steps, blockers Sprint Demo Monday 9AM 1 hour Everyone Dev Lead Present work done and sign off on user story completion Sprint Retro Monday 10AM 1 hour Everyone Process Lead Dev Teams shares learnings and what can be improved Sprint Planning Monday 11AM 1 hour Everyone PO Size and plan user stories for the sprint Task Creation After Sprint Planning - Dev Team Dev Lead Create tasks to clarify and determine velocity Backlog refinement Wednesday 2PM 1 hour Dev Lead, PO PO Prepare for next sprint and ensure that stories are ready for next sprint.","title":"Scrum Rhythm"},{"location":"agile-development/team-agreements/working-agreement/#process-lead","text":"The Process Lead is responsible for leading any scrum or agile practices to enable the project to move forward. Facilitate standup meetings and hold team accountable for attendance and participation. Keep the meeting moving as described in the Project Standup page. Make sure all action items are documented and ensure each has an owner and a due date and tracks the open issues. Notes as needed after planning / stand-ups. Make sure that items are moved to the parking lot and ensure follow-up afterwards. Maintain a location showing team\u2019s work and status and removing impediments that are blocking the team. Hold the team accountable for results in a supportive fashion. Make sure that project and program documentation are up-to-date. Guarantee the tracking/following up on action items from retrospectives (iteration and release planning) and from daily standup meetings. Facilitate the sprint retrospective. Coach Product Owner and the team in the process, as needed.","title":"Process Lead"},{"location":"agile-development/team-agreements/working-agreement/#backlog-management","text":"We work together on a Definition of Ready and all user stories assigned to a sprint need to follow this We communicate what we are working on through the board We assign ourselves a task when we are ready to work on it (not before) and move it to active We capture any work we do related to the project in a user story/task We close our tasks/user stories only when they are done (as described in the Definition of Done ) We work with the PM if we want to add a new user story to the sprint If we add new tasks to the board, we make sure it matches the acceptance criteria of the user story (to avoid scope creep). If it doesn't match the acceptance criteria we should discuss with the PM to see if we need a new user story for the task or if we should adjust the acceptance criteria.","title":"Backlog Management"},{"location":"agile-development/team-agreements/working-agreement/#code-management","text":"We follow the git flow branch naming convention for branches and identify the task number e.g. feature/123-add-working-agreement We merge all code into main branches through PRs All PRs are reviewed by one person from and one from Microsoft (for knowledge transfer and to ensure code and security standards are met) We always review existing PRs before starting work on a new task We look through open PRs at the end of stand-up to make sure all PRs have reviewers. 
We treat documentation as code and apply the same standards to Markdown as code","title":"Code Management"},{"location":"automated-testing/","text":"Testing Why Testing Tests allow us to find flaws in our software Good tests document the code by describing the intent Automated tests save time compared to manual tests Automated tests allow us to safely change and refactor our code without introducing regressions The Fundamentals We consider code to be incomplete if it is not accompanied by tests We write unit tests (tests without external dependencies) that can run before every PR merge to validate that we don\u2019t have regressions We write Integration tests/E2E tests that test the whole system end to end, and run them regularly We write our tests early and block any further code merging if tests fail. We run load tests/performance tests where appropriate to validate that the system performs under stress Build for Testing Testing is a critical part of the development process. It is important to build your application with testing in mind. Here are some tips to help you build for testing: Parameterize everything. Rather than hard-code any variables, consider making everything a configurable parameter with a reasonable default. This will allow you to easily change the behavior of your application during testing. Particularly during performance testing, it is common to test different values to see what impact that has on performance. If a range of defaults need to change together, consider one or more parameters which set \"modes\", changing the defaults of a group of parameters together. Document at startup. When your application starts up, it should log all parameters. This ensures the person reviewing the logs and application behavior knows exactly how the application is configured. Log to console. Logging to external systems like Azure Monitor is desirable for traceability across services. This requires logs to be dispatched from the local system to the external system and that is a dependency that can fail. It is important that someone be able to view console logs directly on the local system. Log to external system. In addition to console logs, logging to an external system like Azure Monitor is desirable for traceability across services and durability of logs. Log all activity. If the system is performing some activity (reading data from a database, calling an external service, etc.), it should log that activity. Ideally, there should be a log message saying the activity is starting and another log message saying the activity is complete. This allows someone reviewing the logs to understand what the application is doing and how long it is taking. Depending on how noisy this is, different messages can be associated with different log levels, but it is important to have the information available when it comes to debugging a deployed system. Correlate distributed activities. If the system is performing some activity that is distributed across multiple systems, it is important to correlate the activity across those systems. This can be done using a Correlation ID that is passed from system to system. This allows someone reviewing the logs to understand the entire flow of activity. For more information, please see Observability in Microservices . Log metadata. When logging, it is important to include metadata that is relevant to the activity. For example, a Tenant ID, Customer ID, or Order ID.
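As an illustration of the Build for Testing tips above (parameterize everything, document at startup, log all activity, correlate distributed activities, log metadata), here is a minimal sketch; the parameter names, the order ID and the correlation-ID handling are illustrative assumptions, not prescribed conventions.

```python
"""Minimal sketch of the tips above: configurable parameters with defaults,
parameters logged at startup, and activity logged start-to-finish with a
correlation ID and business metadata. All names are illustrative."""
import argparse
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("sample-app")


def parse_args() -> argparse.Namespace:
    # Parameterize everything: reasonable defaults, easy to override in tests.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--request-timeout-seconds", type=float, default=5.0)
    parser.add_argument("--mode", choices=["default", "perf-test"], default="default")
    return parser.parse_args()


def process_order(order_id: str, correlation_id: str) -> None:
    # Log all activity: a start and a completion message, carrying metadata and
    # the correlation ID so the call can be traced across systems.
    log.info("start process_order order_id=%s correlation_id=%s", order_id, correlation_id)
    started = time.monotonic()
    # ... the real work would happen here ...
    elapsed_ms = (time.monotonic() - started) * 1000
    log.info("done process_order order_id=%s correlation_id=%s elapsed_ms=%.1f",
             order_id, correlation_id, elapsed_ms)


def main() -> None:
    args = parse_args()
    # Document at startup: log every effective parameter value.
    log.info("startup parameters: %s", vars(args))
    # Correlate distributed activities: accept an incoming correlation ID when
    # there is one, otherwise generate a new one and pass it downstream.
    correlation_id = str(uuid.uuid4())
    process_order(order_id="ORDER-42", correlation_id=correlation_id)


if __name__ == "__main__":
    main()
```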
This allows someone reviewing the logs to understand the context of the activity and filter to a manageable set of logs. Log performance metrics. Even if you are using App Insights to capture how long dependency calls are taking, it is often useful to know long certain functions of your application took. It then becomes possible to evaluate the performance characteristics of your application as it is deployed on different compute platforms with different limitations on CPU, memory, and network bandwidth. For more information, please see Metrics . Map of Outcomes to Testing Techniques The table below maps outcomes (the results that you may want to achieve in your validation efforts) to one or more techniques that can be used to accomplish that outcome. When I am working on... I want to get this outcome... ...so I should consider Development Prove backward compatibility with existing callers and clients Shadow testing Development Ensure telemetry is sufficiently detailed and complete to trace and diagnose malfunction in End-to-End testing flows Distributed Debug challenges; Orphaned call chain analysis Development Ensure program logic is correct for a variety of expected, mainline, edge and unexpected inputs Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing Development Prevent regressions in logical correctness; earlier is better Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing ; Rings (each of these are expanding scopes of coverage) Development Quickly validate mainline correctness of a point of functionality (e.g. single API), manually Manual smoke testing Tools: postman, powershell, curl Development Validate interactions between components in isolation, ensuring that consumer and provider components are compatible and conform to a shared understanding documented in a contract Consumer-driven Contract Testing Development Validate that multiple components function together across multiple interfaces in a call chain, incl network hops Integration testing ; End-to-end ( End-to-End testing ) tests; Segmented end-to-end ( End-to-End testing ) Development Prove disaster recoverability \u2013 recover from corruption of data DR drills Development Find vulnerabilities in service Authentication or Authorization Scenario (security) Development Prove correct RBAC and claims interpretation of Authorization code Scenario (security) Development Document and/or enforce valid API usage Unit testing ; Functional tests; Consumer-driven Contract Testing Development Prove implementation correctness in advance of a dependency or absent a dependency Unit testing (with mocks); Unit testing (with emulators); Consumer-driven Contract Testing Development Ensure that the user interface is accessible Accessibility Development Ensure that users can operate the interface UI testing (automated) (human usability observation) Development Prevent regression in user experience UI automation; End-to-End testing Development Detect and prevent 'noisy neighbor' phenomena Load testing Development Detect availability drops Synthetic Transaction testing ; Outside-in probes Development Prevent regression in 'composite' scenario use cases / workflows (e.g. an e-commerce system might have many APIs that used together in a sequence perform a \"shop-and-buy\" scenario) End-to-End testing ; Scenario Development; Operations Prevent regressions in runtime performance metrics e.g. 
latency / cost / resource consumption; earlier is better Rings; Synthetic Transaction testing / Transaction; Rollback Watchdogs Development; Optimization Compare any given metric between 2 candidate implementations or variations in functionality Flighting; A/B testing Development; Staging Prove production system of provisioned capacity meets goals for reliability, availability, resource consumption, performance Load testing (stress) ; Spike; Soak; Performance testing Development; Staging Understand key user experience performance characteristics \u2013 latency, chattiness, resiliency to network errors Load; Performance testing ; Scenario (network partitioning) Development; Staging; Operation Discover melt points (the loads at which failure or maximum tolerable resource consumption occurs) for each individual component in the stack Squeeze; Load testing (stress) Development; Staging; Operation Discover overall system melt point (the loads at which the end-to-end system fails) and which component is the weakest link in the whole stack Squeeze; Load testing (stress) Development; Staging; Operation Measure capacity limits for given provisioning to predict or satisfy future provisioning needs Squeeze; Load testing (stress) Development; Staging; Operation Create / exercise failover runbook Failover drills Development; Staging; Operation Prove disaster recoverability \u2013 loss of data center (the meteor scenario); measure MTTR DR drills Development; Staging; Operation Understand whether observability dashboards are correct, and telemetry is complete; flowing Trace Validation; Load testing (stress) ; Scenario; End-to-End testing Development; Staging; Operation Measure impact of seasonality of traffic Load testing Development; Staging; Operation Prove Transaction and alerts correctly notify / take action Synthetic Transaction testing (negative cases); Load testing Development; Staging; Operation; Optimizing Understand scalability curve, i.e. how the system consumes resources with load Load testing (stress) ; Performance testing Operation; Optimizing Discover system behavior over long-haul time Soak Optimizing Find cost savings opportunities Squeeze Staging; Operation Measure impact of failover / scale-out (repartitioning, increasing provisioning) / scale-down Failover drills; Scale drills Staging; Operation Create/Exercise runbook for increasing/reducing provisioning Scale drills Staging; Operation Measure behavior under rapid changes in traffic Spike Staging; Optimizing Discover cost metrics per unit load volume (what factors influence cost at what load points, e.g. 
cost per million concurrent users) Load (stress) Development; Operation Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, \u2026) Chaos Development Perform unit testing on Power platform custom connectors Custom Connector Testing Technology Specific Testing Using DevTest Pattern for building containers with AzDO Using Azurite to run blob storage tests in pipeline","title":"Testing"},{"location":"automated-testing/#testing","text":"","title":"Testing"},{"location":"automated-testing/#why-testing","text":"Tests allow us to find flaws in our software Good tests document the code by describing the intent Automated tests saves time, compared to manual tests Automated tests allow us to safely change and refactor our code without introducing regressions","title":"Why Testing"},{"location":"automated-testing/#the-fundamentals","text":"We consider code to be incomplete if it is not accompanied by tests We write unit tests (tests without external dependencies) that can run before every PR merge to validate that we don\u2019t have regressions We write Integration tests/E2E tests that test the whole system end to end, and run them regularly We write our tests early and block any further code merging if tests fail. We run load tests/performance tests where appropriate to validate that the system performs under stress","title":"The Fundamentals"},{"location":"automated-testing/#build-for-testing","text":"Testing is a critical part of the development process. It is important to build your application with testing in mind. Here are some tips to help you build for testing: Parameterize everything. Rather than hard-code any variables, consider making everything a configurable parameter with a reasonable default. This will allow you to easily change the behavior of your application during testing. Particularly during performance testing, it is common to test different values to see what impact that has on performance. If a range of defaults need to change together, consider one or more parameters which set \"modes\", changing the defaults of a group of parameters together. Document at startup. When your application starts up, it should log all parameters. This ensures the person reviewing the logs and application behavior know exactly how the application is configured. Log to console. Logging to external systems like Azure Monitor is desirable for traceability across services. This requires logs to be dispatched from the local system to the external system and that is a dependency that can fail. It is important that someone be able to console logs directly on the local system. Log to external system. In addition to console logs, logging to an external system like Azure Monitor is desirable for traceability across services and durability of logs. Log all activity. If the system is performing some activity (reading data from a database, calling an external service, etc.), it should log that activity. Ideally, there should be a log message saying the activity is starting and another log message saying the activity is complete. This allows someone reviewing the logs to understand what the application is doing and how long it is taking. 
Depending on how noisy this is, different messages can be associated with different log levels, but it is important to have the information available when it comes to debugging a deployed system. Correlate distributed activities. If the system is performing some activity that is distributed across multiple systems, it is important to correlate the activity across those systems. This can be done using a Correlation ID that is passed from system to system. This allows someone reviewing the logs to understand the entire flow of activity. For more information, please see Observability in Microservices . Log metadata. When logging, it is important to include metadata that is relevant to the activity. For example, a Tenant ID, Customer ID, or Order ID. This allows someone reviewing the logs to understand the context of the activity and filter to a manageable set of logs. Log performance metrics. Even if you are using App Insights to capture how long dependency calls are taking, it is often useful to know long certain functions of your application took. It then becomes possible to evaluate the performance characteristics of your application as it is deployed on different compute platforms with different limitations on CPU, memory, and network bandwidth. For more information, please see Metrics .","title":"Build for Testing"},{"location":"automated-testing/#map-of-outcomes-to-testing-techniques","text":"The table below maps outcomes (the results that you may want to achieve in your validation efforts) to one or more techniques that can be used to accomplish that outcome. When I am working on... I want to get this outcome... ...so I should consider Development Prove backward compatibility with existing callers and clients Shadow testing Development Ensure telemetry is sufficiently detailed and complete to trace and diagnose malfunction in End-to-End testing flows Distributed Debug challenges; Orphaned call chain analysis Development Ensure program logic is correct for a variety of expected, mainline, edge and unexpected inputs Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing Development Prevent regressions in logical correctness; earlier is better Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing ; Rings (each of these are expanding scopes of coverage) Development Quickly validate mainline correctness of a point of functionality (e.g. 
single API), manually Manual smoke testing Tools: postman, powershell, curl Development Validate interactions between components in isolation, ensuring that consumer and provider components are compatible and conform to a shared understanding documented in a contract Consumer-driven Contract Testing Development Validate that multiple components function together across multiple interfaces in a call chain, incl network hops Integration testing ; End-to-end ( End-to-End testing ) tests; Segmented end-to-end ( End-to-End testing ) Development Prove disaster recoverability \u2013 recover from corruption of data DR drills Development Find vulnerabilities in service Authentication or Authorization Scenario (security) Development Prove correct RBAC and claims interpretation of Authorization code Scenario (security) Development Document and/or enforce valid API usage Unit testing ; Functional tests; Consumer-driven Contract Testing Development Prove implementation correctness in advance of a dependency or absent a dependency Unit testing (with mocks); Unit testing (with emulators); Consumer-driven Contract Testing Development Ensure that the user interface is accessible Accessibility Development Ensure that users can operate the interface UI testing (automated) (human usability observation) Development Prevent regression in user experience UI automation; End-to-End testing Development Detect and prevent 'noisy neighbor' phenomena Load testing Development Detect availability drops Synthetic Transaction testing ; Outside-in probes Development Prevent regression in 'composite' scenario use cases / workflows (e.g. an e-commerce system might have many APIs that used together in a sequence perform a \"shop-and-buy\" scenario) End-to-End testing ; Scenario Development; Operations Prevent regressions in runtime performance metrics e.g. 
latency / cost / resource consumption; earlier is better Rings; Synthetic Transaction testing / Transaction; Rollback Watchdogs Development; Optimization Compare any given metric between 2 candidate implementations or variations in functionality Flighting; A/B testing Development; Staging Prove production system of provisioned capacity meets goals for reliability, availability, resource consumption, performance Load testing (stress) ; Spike; Soak; Performance testing Development; Staging Understand key user experience performance characteristics \u2013 latency, chattiness, resiliency to network errors Load; Performance testing ; Scenario (network partitioning) Development; Staging; Operation Discover melt points (the loads at which failure or maximum tolerable resource consumption occurs) for each individual component in the stack Squeeze; Load testing (stress) Development; Staging; Operation Discover overall system melt point (the loads at which the end-to-end system fails) and which component is the weakest link in the whole stack Squeeze; Load testing (stress) Development; Staging; Operation Measure capacity limits for given provisioning to predict or satisfy future provisioning needs Squeeze; Load testing (stress) Development; Staging; Operation Create / exercise failover runbook Failover drills Development; Staging; Operation Prove disaster recoverability \u2013 loss of data center (the meteor scenario); measure MTTR DR drills Development; Staging; Operation Understand whether observability dashboards are correct, and telemetry is complete; flowing Trace Validation; Load testing (stress) ; Scenario; End-to-End testing Development; Staging; Operation Measure impact of seasonality of traffic Load testing Development; Staging; Operation Prove Transaction and alerts correctly notify / take action Synthetic Transaction testing (negative cases); Load testing Development; Staging; Operation; Optimizing Understand scalability curve, i.e. how the system consumes resources with load Load testing (stress) ; Performance testing Operation; Optimizing Discover system behavior over long-haul time Soak Optimizing Find cost savings opportunities Squeeze Staging; Operation Measure impact of failover / scale-out (repartitioning, increasing provisioning) / scale-down Failover drills; Scale drills Staging; Operation Create/Exercise runbook for increasing/reducing provisioning Scale drills Staging; Operation Measure behavior under rapid changes in traffic Spike Staging; Optimizing Discover cost metrics per unit load volume (what factors influence cost at what load points, e.g. 
cost per million concurrent users) Load (stress) Development; Operation Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, \u2026) Chaos Development Perform unit testing on Power platform custom connectors Custom Connector Testing","title":"Map of Outcomes to Testing Techniques"},{"location":"automated-testing/#technology-specific-testing","text":"Using DevTest Pattern for building containers with AzDO Using Azurite to run blob storage tests in pipeline","title":"Technology Specific Testing"},{"location":"automated-testing/cdc-testing/","text":"Consumer-Driven Contract Testing (CDC) Consumer-driven Contract Testing (or CDC for short) is a software testing methodology used to test components of a system in isolation while ensuring that provider components are compatible with the expectations that consumer components have of them. Why Consumer-Driven Contract Testing CDC tries to overcome the several painful drawbacks of automated E2E tests with components interacting together: E2E tests are slow E2E tests break easily E2E tests are expensive and hard to maintain E2E tests of larger systems may be hard or impossible to run outside a dedicated testing environment Although testing best practices suggest to write just a few E2E tests compared to the cheaper, faster and more stable integration and unit tests as pictured in the testing pyramid below, experience shows many teams end up writing too many E2E tests . A reason for this is that E2E tests give developers the highest confidence to release as they are testing the \"real\" system. CDC addresses these issues by testing interactions between components in isolation using mocks that conform to a shared understanding documented in a \"contract\". Contracts are agreed between consumer and provider, and are regularly verified against a real instance of the provider component. This effectively partitions a larger system into smaller pieces that can be tested individually in isolation of each other, leading to simpler, fast and stable tests that also give confidence to release. Some E2E tests are still required to verify the system as a whole when deployed in the real environment, but most functional interactions between components can be covered with CDC tests. CDC testing was initially developed for testing RESTful API's, but the pattern scales to all consumer-provider systems and tooling for other messaging protocols besides HTTP does exist. Consumer-Driven Contract Testing Design Blocks In a consumer-driven approach the consumer drives changes to contracts between a consumer (the client) and a provider (the server). This may sound counterintuitive, but it helps providers create APIs that fit the real requirements of the consumers rather than trying to guess these in advance. Next we describe the CDC building blocks ordered by their occurrence in the development cycle. Consumer Tests with Provider Mock The consumers start by creating integration tests against a provider mock and running them as part of their CI pipeline. Expected responses are defined in the provider mock for requests fired from the tests. Through this, the consumer essentially defines the contract they expect the provider to fulfill. 
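To make the consumer-side flow above concrete, here is a hedged sketch of a consumer test written with the pact-python library; the service names, port, endpoint and payload are placeholders, and the exact API surface may differ between pact-python versions.

```python
"""Sketch of a consumer test against a Pact provider mock using pact-python.
Service names, the port and the /orders endpoint are illustrative placeholders."""
import atexit
import unittest

import requests
from pact import Consumer, Provider  # pip install pact-python

pact = Consumer("OrderWebClient").has_pact_with(Provider("OrderService"), port=1234)
pact.start_service()
atexit.register(pact.stop_service)


class GetOrderContract(unittest.TestCase):
    def test_get_existing_order(self):
        expected = {"id": 1, "status": "shipped"}

        # The expectation defined here is what ends up in the generated contract.
        (pact
         .given("order 1 exists")
         .upon_receiving("a request for order 1")
         .with_request(method="GET", path="/orders/1")
         .will_respond_with(200, body=expected))

        with pact:
            # Real consumer code would live in a client class; a raw HTTP call
            # keeps the sketch short.
            response = requests.get(pact.uri + "/orders/1", timeout=5)

        self.assertEqual(response.json(), expected)


if __name__ == "__main__":
    unittest.main()
```

On a successful run the framework writes the agreed interactions to a pact (contract) file, which is what gets versioned and shared with the provider team.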
Contract Contracts are generated from the expectations defined in the provider mock as a result of a successful test run. CDC frameworks like Pact provide a specification for contracts in json format consisting of the list of request/responses generated from the consumer tests plus some additional metadata. Contracts are not a replacement for a discussion between the consumer and provider team. This is the moment where this discussion should take place (if not already done before). The consumer tests and generated contract are refined with the feedback and cooperation of the provider team. Lastly the finalized contract is versioned and stored in a central place accessible by both consumer and provider. Contracts are complementary to API specification documents like OpenAPI. API specifications describe the structure and the format of the API. A contract instead specifies that for a given request, a given response is expected. An API specifications document is helpful in writing an API contract and can be used to validate that the contract conforms to the API specification. Provider Contract Verification On the provider side tests are also executed as part of a separate pipeline which verifies contracts against real responses of the provider. Contract verification fails if real responses differ from the expected responses as specified in the contract. The cause of this can be: Invalid expectations on the consumer side leading to incompatibility with the current provider implementation Broken provider implementation due to some missing functionality or a regression Either way, thanks to CDC it is easy to pinpoint integration issues down to the consumer/provider of the affected interaction. This is a big advantage compared to the debugging pain this could have been with an E2E test approach. CDC Testing Frameworks and Tools Pact is an implementation of CDC testing that allows mocking of responses in the consumer codebase, and verification of the interactions in the provider codebase, while defining a specification for contracts . It was originally written in Ruby but has available wrappers for multiple languages. Pact is the de-facto standard to use when working with CDC. Spring Cloud Contract is an implementation of CDC testing from Spring, and offers easy integration in the Spring ecosystem. Support for non-Spring and non-JVM providers and consumers also exists. Conclusion CDC has several benefits that make it an approach worth considering when dealing with systems composed of multiple components interacting together. Maintenance efforts can be reduced by testing consumer-provider interactions in isolation without the need of a complex integrated environment, specially as the interactions between components grow in number and become more complex. Additionally, a close collaboration between consumer and provider teams is strongly encouraged through the CDC development process, which can bring many other benefits. Contracts offer a formal way to document the shared understanding how components interact with each other, and serve as a base for the communication between teams. In a way, the contract repository serves as a live documentation of all consumer-provider interactions of a system. CDC has some drawbacks as well. An extra layer of testing is added requiring a proper investment in education for team members to understand and use CDC correctly. Additionally, the CDC test scope should be considered carefully to prevent blurring CDC with other higher level functional testing layers. 
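Before closing, the provider-side step described in the Provider Contract Verification section above could look roughly like the sketch below, using pact-python's Verifier; the provider URL and pact file path are placeholders, and in practice contracts are usually fetched from a Pact Broker rather than a local file.

```python
"""Sketch of provider-side contract verification with pact-python's Verifier.
The provider URL and the pact file path are placeholders."""
from pact import Verifier

verifier = Verifier(
    provider="OrderService",
    provider_base_url="http://localhost:8080",  # a running instance of the real provider
)

# Replays every interaction recorded in the contract against the running provider
# and fails if any real response differs from what the consumer expects.
exit_code, _logs = verifier.verify_pacts("./pacts/orderwebclient-orderservice.json")
assert exit_code == 0, "Provider does not satisfy the consumer contract"
```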
Contract tests are not the place to verify internal business logic and correctness of the consumer. Resources Testing pyramid from Kent C. Dodd's blog Pact , a code-first consumer-driven contract testing tool with support for several different programming languages Consumer-driven contracts from Ian Robinson Contract test from Martin Fowler A simple example of using Pact consumer-driven contract testing in a Java client-server application Pact dotnet workshop","title":"Consumer-Driven Contract Testing (CDC)"},{"location":"automated-testing/cdc-testing/#consumer-driven-contract-testing-cdc","text":"Consumer-driven Contract Testing (or CDC for short) is a software testing methodology used to test components of a system in isolation while ensuring that provider components are compatible with the expectations that consumer components have of them.","title":"Consumer-Driven Contract Testing (CDC)"},{"location":"automated-testing/cdc-testing/#why-consumer-driven-contract-testing","text":"CDC tries to overcome the several painful drawbacks of automated E2E tests with components interacting together: E2E tests are slow E2E tests break easily E2E tests are expensive and hard to maintain E2E tests of larger systems may be hard or impossible to run outside a dedicated testing environment Although testing best practices suggest to write just a few E2E tests compared to the cheaper, faster and more stable integration and unit tests as pictured in the testing pyramid below, experience shows many teams end up writing too many E2E tests . A reason for this is that E2E tests give developers the highest confidence to release as they are testing the \"real\" system. CDC addresses these issues by testing interactions between components in isolation using mocks that conform to a shared understanding documented in a \"contract\". Contracts are agreed between consumer and provider, and are regularly verified against a real instance of the provider component. This effectively partitions a larger system into smaller pieces that can be tested individually in isolation of each other, leading to simpler, fast and stable tests that also give confidence to release. Some E2E tests are still required to verify the system as a whole when deployed in the real environment, but most functional interactions between components can be covered with CDC tests. CDC testing was initially developed for testing RESTful API's, but the pattern scales to all consumer-provider systems and tooling for other messaging protocols besides HTTP does exist.","title":"Why Consumer-Driven Contract Testing"},{"location":"automated-testing/cdc-testing/#consumer-driven-contract-testing-design-blocks","text":"In a consumer-driven approach the consumer drives changes to contracts between a consumer (the client) and a provider (the server). This may sound counterintuitive, but it helps providers create APIs that fit the real requirements of the consumers rather than trying to guess these in advance. Next we describe the CDC building blocks ordered by their occurrence in the development cycle.","title":"Consumer-Driven Contract Testing Design Blocks"},{"location":"automated-testing/cdc-testing/#consumer-tests-with-provider-mock","text":"The consumers start by creating integration tests against a provider mock and running them as part of their CI pipeline. Expected responses are defined in the provider mock for requests fired from the tests. 
Through this, the consumer essentially defines the contract they expect the provider to fulfill.","title":"Consumer Tests with Provider Mock"},{"location":"automated-testing/cdc-testing/#contract","text":"Contracts are generated from the expectations defined in the provider mock as a result of a successful test run. CDC frameworks like Pact provide a specification for contracts in json format consisting of the list of request/responses generated from the consumer tests plus some additional metadata. Contracts are not a replacement for a discussion between the consumer and provider team. This is the moment where this discussion should take place (if not already done before). The consumer tests and generated contract are refined with the feedback and cooperation of the provider team. Lastly the finalized contract is versioned and stored in a central place accessible by both consumer and provider. Contracts are complementary to API specification documents like OpenAPI. API specifications describe the structure and the format of the API. A contract instead specifies that for a given request, a given response is expected. An API specifications document is helpful in writing an API contract and can be used to validate that the contract conforms to the API specification.","title":"Contract"},{"location":"automated-testing/cdc-testing/#provider-contract-verification","text":"On the provider side tests are also executed as part of a separate pipeline which verifies contracts against real responses of the provider. Contract verification fails if real responses differ from the expected responses as specified in the contract. The cause of this can be: Invalid expectations on the consumer side leading to incompatibility with the current provider implementation Broken provider implementation due to some missing functionality or a regression Either way, thanks to CDC it is easy to pinpoint integration issues down to the consumer/provider of the affected interaction. This is a big advantage compared to the debugging pain this could have been with an E2E test approach.","title":"Provider Contract Verification"},{"location":"automated-testing/cdc-testing/#cdc-testing-frameworks-and-tools","text":"Pact is an implementation of CDC testing that allows mocking of responses in the consumer codebase, and verification of the interactions in the provider codebase, while defining a specification for contracts . It was originally written in Ruby but has available wrappers for multiple languages. Pact is the de-facto standard to use when working with CDC. Spring Cloud Contract is an implementation of CDC testing from Spring, and offers easy integration in the Spring ecosystem. Support for non-Spring and non-JVM providers and consumers also exists.","title":"CDC Testing Frameworks and Tools"},{"location":"automated-testing/cdc-testing/#conclusion","text":"CDC has several benefits that make it an approach worth considering when dealing with systems composed of multiple components interacting together. Maintenance efforts can be reduced by testing consumer-provider interactions in isolation without the need of a complex integrated environment, specially as the interactions between components grow in number and become more complex. Additionally, a close collaboration between consumer and provider teams is strongly encouraged through the CDC development process, which can bring many other benefits. 
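To complement the consumer-side sketch above, provider-side verification of a generated contract (as described in the Provider Contract Verification section) might look roughly like the following with pact-python's Verifier. The provider name, base URL and pact file path are illustrative, and the exact method signatures and return values may differ between library versions.

# Rough sketch of provider-side contract verification using pact-python's Verifier.
# "OrderService", the base URL and the pact file path are assumptions for illustration.
from pact import Verifier

verifier = Verifier(
    provider="OrderService",
    provider_base_url="http://localhost:8000",  # a running instance of the real provider
)

# Replays every interaction from the contract against the real provider
# and fails if any real response differs from what the contract expects.
return_code, _logs = verifier.verify_pacts("./pacts/orderclient-orderservice.json")
assert return_code == 0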
Contracts offer a formal way to document the shared understanding how components interact with each other, and serve as a base for the communication between teams. In a way, the contract repository serves as a live documentation of all consumer-provider interactions of a system. CDC has some drawbacks as well. An extra layer of testing is added requiring a proper investment in education for team members to understand and use CDC correctly. Additionally, the CDC test scope should be considered carefully to prevent blurring CDC with other higher level functional testing layers. Contract tests are not the place to verify internal business logic and correctness of the consumer.","title":"Conclusion"},{"location":"automated-testing/cdc-testing/#resources","text":"Testing pyramid from Kent C. Dodd's blog Pact , a code-first consumer-driven contract testing tool with support for several different programming languages Consumer-driven contracts from Ian Robinson Contract test from Martin Fowler A simple example of using Pact consumer-driven contract testing in a Java client-server application Pact dotnet workshop","title":"Resources"},{"location":"automated-testing/e2e-testing/","text":"E2E Testing End-to-end (E2E) testing is a Software testing methodology to test a functional and data application flow consisting of several sub-systems working together from start to end. At times, these systems are developed in different technologies by different teams or organizations. Finally, they come together to form a functional business application. Hence, testing a single system would not suffice. Therefore, end-to-end testing verifies the application from start to end putting all its components together. Why E2E Testing In many commercial software application scenarios, a modern software system consists of its interconnection with multiple sub-systems. These sub-systems can be within the same organization or can be components of different organizations. Also, these sub-systems can have somewhat similar or different lifetime release cycle from the current system. As a result, if there is any failure or fault in any sub-system, it can adversely affect the whole software system leading to its collapse. The above illustration is a testing pyramid from Kent C. Dodd's blog which is a combination of the pyramids from Martin Fowler\u2019s blog and the Google Testing Blog . The majority of your tests are at the bottom of the pyramid. As you move up the pyramid, the number of tests gets smaller. Also, going up the pyramid, tests get slower and more expensive to write, run, and maintain. Each type of testing vary for its purpose, application and the areas it's supposed to cover. For more information on comparison analysis of different testing types, please see this ## Unit vs Integration vs System vs E2E Testing document. E2E Testing Design Blocks We will look into all the 3 categories one by one: User Functions Following actions should be performed as a part of building user functions: List user initiated functions of the software systems, and their interconnected sub-systems. For any function, keep track of the actions performed as well as Input and Output data. Find the relations, if any between different Users functions. Find out the nature of different user functions i.e. if they are independent or are reusable. Conditions Following activities should be performed as a part of building conditions based on user functions: For each and every user functions, a set of conditions should be prepared. 
Timing, data conditions and other factors that affect user functions can be considered as parameters. Test Cases Following factors should be considered for building test cases: For every scenario, one or more test cases should be created to test each and every functionality of the user functions. If possible, these test cases should be automated through the standard CI/CD build pipeline processes, keeping track of each successful and failed build in AzDO. Every single condition should be enlisted as a separate test case. Applying the E2E Testing Like any other testing, E2E testing also goes through formal planning, test execution, and closure phases. E2E testing is done with the following steps: Planning Business and Functional Requirement analysis Test plan development Test case development Production-like environment setup for the testing Test data setup Decide exit criteria Choose the testing methods that are most applicable to your system. For the definition of the various testing methods, please see the Testing Methods document. Pre-requisite System Testing should be complete for all the participating systems. All subsystems should be combined to work as a complete application. A production-like test environment should be ready. Test Execution Execute the test cases Register the test results and decide on pass and failure Report the bugs in the bug reporting tool Re-verify the bug fixes Test Closure Test report preparation Evaluation of exit criteria Test phase closure Test Metrics Tracking quality metrics gives insight into the current status of testing. Some common metrics of E2E testing are: Test case preparation status : Number of test cases ready versus the total number of test cases. Frequent Test progress : Number of test cases executed in a consistent, frequent manner, e.g. weekly, versus a target number of test cases in the same time period. Defects Status : This metric represents the status of the defects found during testing. Defects should be logged into a defect tracking tool (e.g. AzDO backlog) and resolved as per their severity and priority. Therefore, the percentage of open and closed defects as per their severity and priority should be calculated to track this metric. The AzDO Dashboard Query can be used to track this metric. Test environment availability : This metric tracks the duration of the test environment used for end-to-end testing versus its scheduled allocation duration. E2E Testing Frameworks and Tools 1. Gauge Framework Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that make it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support. Supports data driven execution and external data sources. Helps you create maintainable test suites. Supports Visual Studio Code, Intellij IDEA, IDE Support. Supports html, json and XML reporting. Gauge Framework Website 2. Robot Framework Robot Framework is a generic open source automation framework. The framework has easy syntax, utilizing human-readable keywords. Its capabilities can be extended by libraries implemented with Python or Java. Robot shares a lot of the same \"pros\" as Gauge, except the developer tooling and the syntax. In our usage, we found the VS Code Intellisense offered with Gauge to be much more stable than the offerings for Robot. We also found the syntax to be less readable than what Gauge offered. 
While both frameworks allow for markup based test case definitions, the Gauge syntax reads much more like an English sentence than Robot. Finally, Intellisense is baked into the markup files for Gauge test cases, which will create a function stub for the actual test definition if the developer allows it. The same cannot be said of the Robot Framework. Robot Framework Website 3. TestCraft TestCraft is a codeless Selenium test automation platform. Its revolutionary AI technology and unique visual modeling allow for faster test creation and execution while eliminating test maintenance overhead. The testers create fully automated test scenarios without coding. Customers find bugs faster, release more frequently, integrate with the CI/CD approach and improve the overall quality of their digital products. This all creates a complete end-to-end testing experience. Perfecto (TestCraft) Website or get it from the Visual Studio Marketplace 4. Ranorex Studio Ranorex Studio is a complete end-to-end test automation tool for desktop, web, and mobile applications. Create reliable tests fast without any coding at all, or using the full IDE. Use external CSV or Excel files, or a SQL database as inputs to your tests. Run tests in parallel or on a Selenium Grid with built-in Selenium WebDriver. Ranorex Studio integrates with your CI/CD process to shorten your release cycles without sacrificing quality. Ranorex Studio tests also integrate with Azure DevOps (AzDO), which can be run as part of a build pipeline in AzDO. Ranorex Studio Website 5. Katalon Studio Katalon Studio is an excellent end-to-end automation solution for web, API, mobile, and desktop testing with DevOps support. With Katalon Studio, automated testing can be easily integrated into any CI/CD pipeline to release products faster while guaranteeing high quality. Katalon Studio customizes for users from beginners to experts. Robust functions such as Spying, Recording, Dual-editor interface and Custom Keywords make setting up, creating and maintaining tests possible for users. Built on top of Selenium and Appium, Katalon Studio helps standardize your end-to-end tests standardized. It also complies with the most popular frameworks to work seamlessly with other tools in the automated testing ecosystem. Katalon is endorsed by Gartner, IT professionals, and a large testing community. Note: At the time of this writing, Katalon Studio extension for AzDO was NOT available for Linux. Katalon Studio Website or read about its integration with AzDO 6. BugBug.io BugBug is an easy way to automate tests for web applications. The tool focuses on simplicity, yet allows you to cover all essential test cases without coding. It's an all-in-one solution - you can easily create tests and use the built-in cloud to run them on schedule or from your CI/CD, without changes to your own infrastructure. BugBug is an interesting alternative to Selenium because it's actually a completely different technology. It is based on a Chrome extension that allows BugBug to record and run tests faster than old-school frameworks. The biggest advantage of BugBug is its user-friendliness. Most tests created with BugBug simply work out of the box. This makes it easier for non-technical people to maintain tests - with BugBug you can save money on hiring a QA engineer. BugBug Website Conclusion Hope you learned various aspects of E2E testing like its processes, metrics, the difference between Unit, Integration and E2E testing, and the various recommended E2E test frameworks and tools. 
For any commercial release of the software, E2E test verification plays an important role as it tests the entire application in an environment that closely imitates real-world use, including network communication, middleware and backend services interaction, etc. Finally, the E2E test is often performed manually, as the cost of automating such test cases can be too high for an organization to afford. Having said that, the ultimate goal of each organization is to make E2E testing as streamlined as possible by adding fully and semi-automated testing components into the process. Hence, the various E2E testing frameworks and tools listed in this article come to the rescue. Resources Wikipedia: Software testing Wikipedia: Unit testing Wikipedia: Integration testing Wikipedia: System testing\",\"title\":\"E2E Testing\"},{\"location\":\"automated-testing/e2e-testing/#e2e-testing\",\"text\":\"End-to-end (E2E) testing is a software testing methodology to test a functional and data application flow consisting of several sub-systems working together from start to end. At times, these systems are developed in different technologies by different teams or organizations. Finally, they come together to form a functional business application. Hence, testing a single system would not suffice. Therefore, end-to-end testing verifies the application from start to end putting all its components together.\",\"title\":\"E2E Testing\"},{\"location\":\"automated-testing/e2e-testing/#why-e2e-testing\",\"text\":\"In many commercial software application scenarios, a modern software system consists of its interconnection with multiple sub-systems. These sub-systems can be within the same organization or can be components of different organizations. Also, these sub-systems can have release cycles that are similar to or different from that of the current system. As a result, if there is any failure or fault in any sub-system, it can adversely affect the whole software system leading to its collapse. The above illustration is a testing pyramid from Kent C. Dodd's blog which is a combination of the pyramids from Martin Fowler\u2019s blog and the Google Testing Blog . The majority of your tests are at the bottom of the pyramid. As you move up the pyramid, the number of tests gets smaller. Also, going up the pyramid, tests get slower and more expensive to write, run, and maintain. Each type of testing varies in its purpose, application and the areas it's supposed to cover. For more information on a comparison analysis of different testing types, please see the Unit vs Integration vs System vs E2E Testing document.\",\"title\":\"Why E2E Testing\"},{\"location\":\"automated-testing/e2e-testing/#e2e-testing-design-blocks\",\"text\":\"We will look into all three categories one by one:\",\"title\":\"E2E Testing Design Blocks\"},{\"location\":\"automated-testing/e2e-testing/#user-functions\",\"text\":\"Following actions should be performed as a part of building user functions: List user initiated functions of the software systems, and their interconnected sub-systems. For any function, keep track of the actions performed as well as Input and Output data. Find the relations, if any, between different user functions. Find out the nature of different user functions i.e. if they are independent or are reusable.\",\"title\":\"User Functions\"},{\"location\":\"automated-testing/e2e-testing/#conditions\",\"text\":\"Following activities should be performed as a part of building conditions based on user functions: For each and every user function, a set of conditions should be prepared. 
Timing, data conditions and other factors that affect user functions can be considered as parameters.","title":"Conditions"},{"location":"automated-testing/e2e-testing/#test-cases","text":"Following factors should be considered for building test cases: For every scenario, one or more test cases should be created to test each and every functionality of the user functions. If possible, these test cases should be automated through the standard CI/CD build pipeline processes with the track of each successful and failed build in AzDO. Every single condition should be enlisted as a separate test case.","title":"Test Cases"},{"location":"automated-testing/e2e-testing/#applying-the-e2e-testing","text":"Like any other testing, E2E testing also goes through formal planning, test execution, and closure phases. E2E testing is done with the following steps:","title":"Applying the E2E Testing"},{"location":"automated-testing/e2e-testing/#planning","text":"Business and Functional Requirement analysis Test plan development Test case development Production like Environment setup for the testing Test data setup Decide exit criteria Choose the testing methods that most applicable to your system. For the definition of the various testing methods, please see Testing Methods document.","title":"Planning"},{"location":"automated-testing/e2e-testing/#pre-requisite","text":"System Testing should be complete for all the participating systems. All subsystems should be combined to work as a complete application. Production like test environment should be ready.","title":"Pre-requisite"},{"location":"automated-testing/e2e-testing/#test-execution","text":"Execute the test cases Register the test results and decide on pass and failure Report the Bugs in the bug reporting tool Re-verify the bug fixes","title":"Test Execution"},{"location":"automated-testing/e2e-testing/#test-closure","text":"Test report preparation Evaluation of exit criteria Test phase closure","title":"Test Closure"},{"location":"automated-testing/e2e-testing/#test-metrics","text":"The tracing the quality metrics gives insight about the current status of testing. Some common metrics of E2E testing are: Test case preparation status : Number of test cases ready versus the total number of test cases. Frequent Test progress : Number of test cases executed in the consistent frequent manner, e.g. weekly, versus a target number of the test cases in the same time period. Defects Status : This metric represents the status of the defects found during testing. Defects should be logged into defect tracking tool (e.g. AzDO backlog) and resolved as per their severity and priority. Therefore, the percentage of open and closed defects as per their severity and priority should be calculated to track this metric. The AzDO Dashboard Query can be used to track this metric. Test environment availability : This metric tracks the duration of the test environment used for end-to-end testing versus its scheduled allocation duration.","title":"Test Metrics"},{"location":"automated-testing/e2e-testing/#e2e-testing-frameworks-and-tools","text":"","title":"E2E Testing Frameworks and Tools"},{"location":"automated-testing/e2e-testing/#1-gauge-framework","text":"Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that makes it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support. 
Supports data driven execution and external data sources. Helps you create maintainable test suites. Supports Visual Studio Code, Intellij IDEA, IDE Support. Supports html, json and XML reporting. Gauge Framework Website","title":"1. Gauge Framework"},{"location":"automated-testing/e2e-testing/#2-robot-framework","text":"Robot Framework is a generic open source automation framework. The framework has easy syntax, utilizing human-readable keywords. Its capabilities can be extended by libraries implemented with Python or Java. Robot shares a lot of the same \"pros\" as Gauge, except the developer tooling and the syntax. In our usage, we found the VS Code Intellisense offered with Gauge to be much more stable than the offerings for Robot. We also found the syntax to be less readable than what Gauge offered. While both frameworks allow for markup based test case definitions, the Gauge syntax reads much more like an English sentence than Robot. Finally, Intellisense is baked into the markup files for Gauge test cases, which will create a function stub for the actual test definition if the developer allows it. The same cannot be said of the Robot Framework. Robot Framework Website","title":"2. Robot Framework"},{"location":"automated-testing/e2e-testing/#3-testcraft","text":"TestCraft is a codeless Selenium test automation platform. Its revolutionary AI technology and unique visual modeling allow for faster test creation and execution while eliminating test maintenance overhead. The testers create fully automated test scenarios without coding. Customers find bugs faster, release more frequently, integrate with the CI/CD approach and improve the overall quality of their digital products. This all creates a complete end-to-end testing experience. Perfecto (TestCraft) Website or get it from the Visual Studio Marketplace","title":"3. TestCraft"},{"location":"automated-testing/e2e-testing/#4-ranorex-studio","text":"Ranorex Studio is a complete end-to-end test automation tool for desktop, web, and mobile applications. Create reliable tests fast without any coding at all, or using the full IDE. Use external CSV or Excel files, or a SQL database as inputs to your tests. Run tests in parallel or on a Selenium Grid with built-in Selenium WebDriver. Ranorex Studio integrates with your CI/CD process to shorten your release cycles without sacrificing quality. Ranorex Studio tests also integrate with Azure DevOps (AzDO), which can be run as part of a build pipeline in AzDO. Ranorex Studio Website","title":"4. Ranorex Studio"},{"location":"automated-testing/e2e-testing/#5-katalon-studio","text":"Katalon Studio is an excellent end-to-end automation solution for web, API, mobile, and desktop testing with DevOps support. With Katalon Studio, automated testing can be easily integrated into any CI/CD pipeline to release products faster while guaranteeing high quality. Katalon Studio customizes for users from beginners to experts. Robust functions such as Spying, Recording, Dual-editor interface and Custom Keywords make setting up, creating and maintaining tests possible for users. Built on top of Selenium and Appium, Katalon Studio helps standardize your end-to-end tests standardized. It also complies with the most popular frameworks to work seamlessly with other tools in the automated testing ecosystem. Katalon is endorsed by Gartner, IT professionals, and a large testing community. Note: At the time of this writing, Katalon Studio extension for AzDO was NOT available for Linux. 
Katalon Studio Website or read about its integration with AzDO","title":"5. Katalon Studio"},{"location":"automated-testing/e2e-testing/#6-bugbugio","text":"BugBug is an easy way to automate tests for web applications. The tool focuses on simplicity, yet allows you to cover all essential test cases without coding. It's an all-in-one solution - you can easily create tests and use the built-in cloud to run them on schedule or from your CI/CD, without changes to your own infrastructure. BugBug is an interesting alternative to Selenium because it's actually a completely different technology. It is based on a Chrome extension that allows BugBug to record and run tests faster than old-school frameworks. The biggest advantage of BugBug is its user-friendliness. Most tests created with BugBug simply work out of the box. This makes it easier for non-technical people to maintain tests - with BugBug you can save money on hiring a QA engineer. BugBug Website","title":"6. BugBug.io"},{"location":"automated-testing/e2e-testing/#conclusion","text":"Hope you learned various aspects of E2E testing like its processes, metrics, the difference between Unit, Integration and E2E testing, and the various recommended E2E test frameworks and tools. For any commercial release of the software, E2E test verification plays an important role as it tests the entire application in an environment that exactly imitates real-world users like network communication, middleware and backend services interaction, etc. Finally, the E2E test is often performed manually as the cost of automating such test cases is too high to be afforded by any organization. Having said that, the ultimate goal of each organization is to make the e2e testing as streamlined as possible adding full and semi-automation testing components into the process. Hence, the various E2E testing frameworks and tools listed in this article come to the rescue.","title":"Conclusion"},{"location":"automated-testing/e2e-testing/#resources","text":"Wikipedia: Software testing Wikipedia: Unit testing Wikipedia: Integration testing Wikipedia: System testing","title":"Resources"},{"location":"automated-testing/e2e-testing/testing-comparison/","text":"Unit vs Integration vs System vs E2E Testing The table below illustrates the most critical characteristics and differences among Unit, Integration, System, and End-to-End Testing, and when to apply each methodology in a project. 
Unit Test Integration Test System Testing E2E Test Scope Modules, APIs Modules, interfaces Application, system All sub-systems, network dependencies, services and databases Size Tiny Small to medium Large X-Large Environment Development Integration test QA test Production like Data Mock data Test data Test data Copy of real production data System Under Test Isolated unit test Interfaces and flow data between the modules Particular system as a whole Application flow from start to end Scenarios Developer perspectives Developers and IT Pro tester perspectives Developer and QA tester perspectives End-user perspectives When After each build After Unit testing Before E2E testing and after Unit and Integration testing After System testing Automated or Manual Automated Manual or automated Manual or automated Manual","title":"Unit vs Integration vs System vs E2E Testing"},{"location":"automated-testing/e2e-testing/testing-comparison/#unit-vs-integration-vs-system-vs-e2e-testing","text":"The table below illustrates the most critical characteristics and differences among Unit, Integration, System, and End-to-End Testing, and when to apply each methodology in a project. Unit Test Integration Test System Testing E2E Test Scope Modules, APIs Modules, interfaces Application, system All sub-systems, network dependencies, services and databases Size Tiny Small to medium Large X-Large Environment Development Integration test QA test Production like Data Mock data Test data Test data Copy of real production data System Under Test Isolated unit test Interfaces and flow data between the modules Particular system as a whole Application flow from start to end Scenarios Developer perspectives Developers and IT Pro tester perspectives Developer and QA tester perspectives End-user perspectives When After each build After Unit testing Before E2E testing and after Unit and Integration testing After System testing Automated or Manual Automated Manual or automated Manual or automated Manual","title":"Unit vs Integration vs System vs E2E Testing"},{"location":"automated-testing/e2e-testing/testing-methods/","text":"E2E Testing Methods Horizontal Test This method is used very commonly. It occurs horizontally across the context of multiple applications. Take an example of a data ingest management system. The inbound data may be injected from various sources, but it then \"flatten\" into a horizontal processing pipeline that may include various components, such as a gateway API, data transformation, data validation, storage, etc... Throughout the entire Extract-Transform-Load (ETL) processing, the data flow can be tracked and monitored under the horizontal spectrum with little sprinkles of optional, and thus not important for the overall E2E test case, services, like logging, auditing, authentication. Vertical Test In this method, all most critical transactions of any application are verified and evaluated right from the start to finish. Each individual layer of the application is tested starting from top to bottom. Take an example of a web-based application that uses middleware services for reaching back-end resources. In such case, each layer (tier) is required to be fully tested in conjunction with the \"connected\" layers above and beneath, in which services \"talk\" to each other during the end to end data flow. All these complex testing scenarios will require proper validation and dedicated automated testing. Thus, this method is much more difficult. 
E2E Test Cases Design Guidelines Below enlisted are few guidelines that should be kept in mind while designing the test cases for performing E2E testing: Test cases should be designed from the end user\u2019s perspective. Should focus on testing some existing features of the system. Multiple scenarios should be considered for creating multiple test cases. Different sets of test cases should be created to focus on multiple scenarios of the system.","title":"E2E Testing Methods"},{"location":"automated-testing/e2e-testing/testing-methods/#e2e-testing-methods","text":"","title":"E2E Testing Methods"},{"location":"automated-testing/e2e-testing/testing-methods/#horizontal-test","text":"This method is used very commonly. It occurs horizontally across the context of multiple applications. Take an example of a data ingest management system. The inbound data may be injected from various sources, but it then \"flatten\" into a horizontal processing pipeline that may include various components, such as a gateway API, data transformation, data validation, storage, etc... Throughout the entire Extract-Transform-Load (ETL) processing, the data flow can be tracked and monitored under the horizontal spectrum with little sprinkles of optional, and thus not important for the overall E2E test case, services, like logging, auditing, authentication.","title":"Horizontal Test"},{"location":"automated-testing/e2e-testing/testing-methods/#vertical-test","text":"In this method, all most critical transactions of any application are verified and evaluated right from the start to finish. Each individual layer of the application is tested starting from top to bottom. Take an example of a web-based application that uses middleware services for reaching back-end resources. In such case, each layer (tier) is required to be fully tested in conjunction with the \"connected\" layers above and beneath, in which services \"talk\" to each other during the end to end data flow. All these complex testing scenarios will require proper validation and dedicated automated testing. Thus, this method is much more difficult.","title":"Vertical Test"},{"location":"automated-testing/e2e-testing/testing-methods/#e2e-test-cases-design-guidelines","text":"Below enlisted are few guidelines that should be kept in mind while designing the test cases for performing E2E testing: Test cases should be designed from the end user\u2019s perspective. Should focus on testing some existing features of the system. Multiple scenarios should be considered for creating multiple test cases. Different sets of test cases should be created to focus on multiple scenarios of the system.","title":"E2E Test Cases Design Guidelines"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/","text":"Gauge Framework Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that makes it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support Extensible through plugins and hackable. Supports data driven execution and external data sources Helps you create maintainable test suites Supports Visual Studio Code, Intellij IDEA, IDE Support What is a Specification Gauge specifications are written using a Markdown syntax. 
For example # Search for the data blob ## Look for file * Goto Azure blob In this specification Search for the data blob is the specification heading , Look for file is a scenario with a step Goto Azure blob What is an Implementation You can implement the steps in a specification using a programming language, for example: from getgauge.python import step import os from step_impl.utils.driver import Driver @step ( \"Goto Azure blob\" ) def gotoAzureStorage () : URL = os.getenv ( 'STORAGE_ENDPOINT' ) Driver.driver.get ( URL ) The Gauge runner reads and runs steps and its implementation for every scenario in the specification and generates a report of passing or failing scenarios. # Search for the data blob ## Look for file \u2714 Successfully generated html-report to = > reports/html-report/index.html Specifications: 1 executed 1 passed 0 failed 0 skipped Scenarios: 1 executed 1 passed 0 failed 0 skipped Re-using Steps Gauge helps you focus on testing the flow of an application. Gauge does this by making steps as re-usable as possible. With Gauge, you don\u2019t need to build custom frameworks using a programming language. For example, Gauge steps can pass parameters to an implementation by using a text with quotes. # Search for the data blob ## Look for file * Goto Azure blob * Search for \"store_data.csv\" The implementation can now use \u201cstore_data.csv\u201d as follows from getgauge.python import step import os @step ( \"Search for \" ) def searchForQuery ( query ) : write ( query ) press ( \"Enter\" ) step ( \"Search for \" , ( query ) = > { write ( query ) ; press ( \"Enter\" ) ; You can then re-use this step within or across scenarios with different parameters: # Search for the data blob ## Look for Store data #1 * Goto Azure blob * Search for \"store_1.csv\" ## Look for Store data #2 * Goto Azure blob * Search for \"store_2.csv\" Or combine more than one step into concepts # Search Azure Storage for * Goto Azure blob * Search for \"store_1.csv\" The concept, Search Azure Storage for can be used like a step in a specification # Search for the data blob ## Look for Store data #1 * Search Azure Storage for \"store_1.csv\" ## Look for Store data #2 * Search Azure Storage for \"store_2.csv\" Data-Driven Testing Gauge also supports data driven testing using Markdown tables as well as external csv files for example # Search for the data blob | query | | --------- | | store_1 | | store_2 | | store_3 | ## Look for stores data * Search Azure Storage for This will execute the scenario for all rows in the table. In the examples above, we refactored a specification to be concise and flexible without changing the implementation. Other Features This is brief introduction to a few Gauge features. Please refer to the Gauge documentation for additional features such as: Reports Tags Parallel execution Environments Screenshots Plugins And much more Installing Gauge This getting started guide takes you through the core features of Gauge. By the end of this guide, you\u2019ll be able to install Gauge and learn how to create your first Gauge test automation project. Installation Instructions for Windows OS Step 1: Installing Gauge on Windows This section gives specific instructions on setting up Gauge in a Microsoft Windows environment. Download the following installation bundle to get the latest stable release of Gauge. 
Step 2: Installing Gauge Extension for Visual Studio Code Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code . Troubleshooting Installation If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user Installation Instructions for macOS Step 1: Installing Gauge on macOS This section gives specific instructions on setting up Gauge in a macOS environment. Install brew if you haven\u2019t already: Go to the brew website , and follow the directions there. Run the brew command to install Gauge > brew install gauge if HomeBrew is working properly, you should see something similar to the following: == > Fetching gauge == > Downloading https://ghcr.io/v2/homebrew/core/gauge/manifests/1.4.3 ######################################################################## 100.0% == > Downloading https://ghcr.io/v2/homebrew/core/gauge/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893 == > Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893?se = 2022 -12-13T12%3A35%3A00Z & sig = I78SuuwNgSMFoBTT ######################################################################## 100.0% == > Pouring gauge--1.4.3.ventura.bottle.tar.gz /usr/local/Cellar/gauge/1.4.3: 6 files, 18 .9MB Step 2 : Installing Gauge Extension for Visual Studio Code Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code . Post-Installation Troubleshooting If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user","title":"Gauge Framework"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#gauge-framework","text":"Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that makes it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support Extensible through plugins and hackable. Supports data driven execution and external data sources Helps you create maintainable test suites Supports Visual Studio Code, Intellij IDEA, IDE Support","title":"Gauge Framework"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#what-is-a-specification","text":"Gauge specifications are written using a Markdown syntax. For example # Search for the data blob ## Look for file * Goto Azure blob In this specification Search for the data blob is the specification heading , Look for file is a scenario with a step Goto Azure blob","title":"What is a Specification"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#what-is-an-implementation","text":"You can implement the steps in a specification using a programming language, for example: from getgauge.python import step import os from step_impl.utils.driver import Driver @step ( \"Goto Azure blob\" ) def gotoAzureStorage () : URL = os.getenv ( 'STORAGE_ENDPOINT' ) Driver.driver.get ( URL ) The Gauge runner reads and runs steps and its implementation for every scenario in the specification and generates a report of passing or failing scenarios. 
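Laid out as a regular getgauge-python module (the search index flattens it onto one line), that step implementation looks roughly like the sketch below; the Driver helper and the STORAGE_ENDPOINT environment variable are carried over from the example above, and the parameterized search step is an assumption of how the later search examples could be implemented if Driver wraps a Selenium WebDriver. The sample runner output follows right after.

# Sketch of Gauge step implementations with getgauge-python.
# Driver and STORAGE_ENDPOINT come from the example above; the search locator is illustrative.
import os

from getgauge.python import step
from step_impl.utils.driver import Driver

@step("Goto Azure blob")
def goto_azure_storage():
    url = os.getenv("STORAGE_ENDPOINT")  # base URL of the storage endpoint under test
    Driver.driver.get(url)               # navigate the shared web driver to it

@step("Search for <query>")              # <query> receives the quoted value from the spec
def search_for_query(query):
    search_box = Driver.driver.find_element("name", "q")  # illustrative locator
    search_box.send_keys(query)
    search_box.submit()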
# Search for the data blob ## Look for file \u2714 Successfully generated html-report to = > reports/html-report/index.html Specifications: 1 executed 1 passed 0 failed 0 skipped Scenarios: 1 executed 1 passed 0 failed 0 skipped","title":"What is an Implementation"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#re-using-steps","text":"Gauge helps you focus on testing the flow of an application. Gauge does this by making steps as re-usable as possible. With Gauge, you don\u2019t need to build custom frameworks using a programming language. For example, Gauge steps can pass parameters to an implementation by using a text with quotes. # Search for the data blob ## Look for file * Goto Azure blob * Search for \"store_data.csv\" The implementation can now use \u201cstore_data.csv\u201d as follows from getgauge.python import step import os @step ( \"Search for \" ) def searchForQuery ( query ) : write ( query ) press ( \"Enter\" ) step ( \"Search for \" , ( query ) = > { write ( query ) ; press ( \"Enter\" ) ; You can then re-use this step within or across scenarios with different parameters: # Search for the data blob ## Look for Store data #1 * Goto Azure blob * Search for \"store_1.csv\" ## Look for Store data #2 * Goto Azure blob * Search for \"store_2.csv\" Or combine more than one step into concepts # Search Azure Storage for * Goto Azure blob * Search for \"store_1.csv\" The concept, Search Azure Storage for can be used like a step in a specification # Search for the data blob ## Look for Store data #1 * Search Azure Storage for \"store_1.csv\" ## Look for Store data #2 * Search Azure Storage for \"store_2.csv\"","title":"Re-using Steps"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#data-driven-testing","text":"Gauge also supports data driven testing using Markdown tables as well as external csv files for example # Search for the data blob | query | | --------- | | store_1 | | store_2 | | store_3 | ## Look for stores data * Search Azure Storage for This will execute the scenario for all rows in the table. In the examples above, we refactored a specification to be concise and flexible without changing the implementation.","title":"Data-Driven Testing"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#other-features","text":"This is brief introduction to a few Gauge features. Please refer to the Gauge documentation for additional features such as: Reports Tags Parallel execution Environments Screenshots Plugins And much more","title":"Other Features"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#installing-gauge","text":"This getting started guide takes you through the core features of Gauge. By the end of this guide, you\u2019ll be able to install Gauge and learn how to create your first Gauge test automation project.","title":"Installing Gauge"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#installation-instructions-for-windows-os","text":"","title":"Installation Instructions for Windows OS"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-1-installing-gauge-on-windows","text":"This section gives specific instructions on setting up Gauge in a Microsoft Windows environment. 
Download the following installation bundle to get the latest stable release of Gauge.","title":"Step 1: Installing Gauge on Windows"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-2-installing-gauge-extension-for-visual-studio-code","text":"Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code .","title":"Step 2: Installing Gauge Extension for Visual Studio Code"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#troubleshooting-installation","text":"If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user","title":"Troubleshooting Installation"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#installation-instructions-for-macos","text":"","title":"Installation Instructions for macOS"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-1-installing-gauge-on-macos","text":"This section gives specific instructions on setting up Gauge in a macOS environment. Install brew if you haven\u2019t already: Go to the brew website , and follow the directions there. Run the brew command to install Gauge > brew install gauge if HomeBrew is working properly, you should see something similar to the following: == > Fetching gauge == > Downloading https://ghcr.io/v2/homebrew/core/gauge/manifests/1.4.3 ######################################################################## 100.0% == > Downloading https://ghcr.io/v2/homebrew/core/gauge/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893 == > Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893?se = 2022 -12-13T12%3A35%3A00Z & sig = I78SuuwNgSMFoBTT ######################################################################## 100.0% == > Pouring gauge--1.4.3.ventura.bottle.tar.gz /usr/local/Cellar/gauge/1.4.3: 6 files, 18 .9MB","title":"Step 1: Installing Gauge on macOS"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-2-installing-gauge-extension-for-visual-studio-code_1","text":"Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code .","title":"Step 2 : Installing Gauge Extension for Visual Studio Code"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#post-installation-troubleshooting","text":"If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user","title":"Post-Installation Troubleshooting"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/","text":"Postman Testing This purpose of this document is to provide guidance on how to use Newman in your CI/CD pipeline to run End-to-end (E2E) tests defined in Postman Collections while following security best practices. First, we'll introduce Postman and Newman and then outline several Postman testing use cases that answer why you may want to go beyond local testing with Postman Collections. In the final use case, we are looking to use a shell script that references the Postman Collection file path and Environment file path as inputs to Newman. 
Below is a flow diagram representing the outcome of the final use case: Postman and Newman Postman is a free API platform for testing APIs. Key features highlighted in this guidance include: Postman Collections Postman Environment Files Postman Scripts Newman is a command-line Collection Runner for Postman. It enables you to run and test a Postman Collection directly from the command line. Key features highlighted in this guidance include: Newman Run Command What is a Collection A Postman Collection is a group of executable saved requests. A collection can be exported as a json file. What is an Environment File A Postman Environment file holds environment variables that can be referenced by a valid Postman Collection. What is a Postman Script A Postman Script is Javascript hosted within a Postman Collection that can be written to execute against your Postman Collection and Environment File. What is the Newman Run Command A Newman CLI command that allows you to specify a Postman Collection to be run. Installing Postman and Newman For specific instruction on installing Postman, visit the Downloads Postman page. For specific instruction on installing Newman, visit the NPMJS Newman package page. Implementing Automated End-to-end (E2E) Tests With Postman Collections In order to provide guidance on implementing automated E2E tests with Postman, the section below begins with a use case that explains the trade-offs a dev or QA analyst might face when intending to use Postman for early testing. Each use case represents scenarios that facilitate the end goal of automated E2E tests. Use Case - Hands-on Functional Testing Of Endpoints A developer or QA analyst would like to locally test input data against API services all sharing a common oauth2 token. As a result, they use Postman to craft an API test suite of Postman Collections that can be locally executed against individual endpoints across environments. After validating that their Postman Collection works, they share it with their team. Steps may look like the following: For each of your existing API services, use the Postman IDE's import feature to import its OpenAPI Spec (Swagger) as a Postman Collection. If a service is not already using Swagger, look for language specific guidance on how to use Swagger to generate an OpenAPI Spec for your service. Finally, if your service only has a few endpoints, read Postman docs for guidance on how to manually build a Postman Collection. Provide extra clarity about a request in a Postman Collection by using Postman's Example feature to save its responses as examples. You can also simply add an example manually. Please read Postman docs for guidance on how to specify examples. Combine each Postman Collection into a centralized Postman Collection. Build Postman Environment files (local, Dev and/or QA) and parameterize all saved requests of the Postman Collection in a way that references the Postman Environment files. Use the Postman Script feature to create a shared prefetch script that automatically refreshes expired auth tokens per saved request. This would require referencing secrets from a Postman Environment file. // Please treat this as pseudocode, and adjust as necessary. /* The request to an oauth2 authorization endpoint that will issue a token based on provided credentials.*/ const oauth2Request = POST {...}; var getToken = true ; if ( pm . environment . get ( 'ACCESS_TOKEN_EXPIRY' ) <= ( new Date ()). getTime ()) { console . log ( 'Token is expired' ) } else { getToken = false ; console . 
log ( 'Token and expiry date are all good' ); } if ( getToken === true ) { pm . sendRequest ( oauth2Request , function ( _ , res ) { console . log ( 'Save the token' ) var responseJson = res . json (); pm . environment . set ( 'token' , responseJson . access_token ) console . log ( 'Save the expiry date' ) var expiryDate = new Date (); expiryDate . setSeconds ( expiryDate . getSeconds () + responseJson . expires_in ); pm . environment . set ( 'ACCESS_TOKEN_EXPIRY' , expiryDate . getTime ()); }); } Use Postman IDE to exercise endpoints. Export collection and environment files then remove any secrets before committing to your repo. Starting with this approach has the following upsides: You've set yourself up for the beginning stages of an E2E postman collection by aggregating the collections into a single file and using environment files to make it easier to switch environments. Token is refreshed automatically on every call in the collection. This saves you time normally lost from manually having to request a token that expired. Grants QA/Dev granular control of submitting combinations of input data per endpoint. Grants developers a common experience via Postman IDE features. Ending with this approach has the following downsides: Promotes unsafe sharing of secrets. Credentials needed to request JWT token in the prefetch script are being manually shared. Secrets may happen to get exposed in the git commit history for various reasons (ex. Sharing the exported Postman Environment files). Collections can only be used locally to hit APIs (local or deployed). Not CI based. Each developer has to keep both their Postman Collection and Postman environment file(s) updated in order to keep up with latest changes to deployed services. Use Case - Hands-on Functional Testing Of Endpoints with Azure Key Vault and Azure App Config A developer or QA analyst may have an existing API test suite of Postman Collections, however, they now want to discourage unsafe sharing of secrets. As a result, they build a script that connects to both Key Vault and Azure App Config in order to automatically generate Postman Environment files instead of checking them into a shared repository. Steps may look like the following: Create an Azure Key Vault and store authentication secrets per environment: - \"Key:value\" (ex. \"dev-auth-password:12345\" ) - \"Key:value\" (ex. \"qa-auth-password:12345\" ) Create a shared Azure App Configuration instance and save all your Postman environment variables. This instance will be dedicated to holding all your Postman environment variables: > NOTE: Use the Label feature to delineate between environments. - \"Key:value\" -> \"apiRoute:url\" (ex. \"servicename:https://servicename.net\" & Label = \"QA\" ) - \"Key:value\" -> \"Header:value\" (ex. \"token: \" & Label = \"QA\" ) - \"Key:value\" -> \"KeyVaultKey:KeyVaultSecret\" (ex. \"authpassword:qa-auth-password\" & Label = \"QA\" ) Install Powershell or Bash. Powershell works for both Azure Powershell and Azure CLI. Download Azure CLI, login to the appropriate subscription and ensure you have access to the appropriate resources. Some helpful commands are below: # login to the appropriate subscription az login # validate login az account show # validate access to Key Vault az keyvault secret list - -vault-name \"$KeyvaultName\" # validate access to App Configuration az appconfig kv list - -name \"$AppConfigName\" Build a script that automatically generates your environment files. 
> Note: App Configuration references Key Vault, however, your script is responsible for authenticating properly to both App Configuration and Key Vault. The two services don't communicate directly. ```powershell (CreatePostmanEnvironmentFiles.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ env = $arg1 # 1. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 2. step through envVars array to get Key Vault uris keyvaultURI = \"\" $envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 3. parse uris for Key Vault name and secret names # 4. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 5. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 6. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII -FilePath .\\$env.postman_environment.json ``` Use Postman IDE to import the Postman Environment files to be referenced by your collection. This approach has the following upsides: Inherits all the upsides of the previous case. Discourages unsafe sharing of secrets. Secrets are now pulled from Key Vault via Azure CLI. Key Vault Uri also no longer needs to be shared for access to auth tokens. Single source of truth for Postman Environment files. There's no longer a need to share them via repo. Developer only has to manage a single Postman Collection. Ending with this approach has the following downsides: Secrets may happen to get exposed in the git commit history if .gitIgnore is not updated to ignore Postman Environment files. Collections can only be used locally to hit APIs (local or deployed). Not CI based. Use Case - E2E Testing with Continuous Integration and Newman A developer or QA analyst may have an existing API test suite of local Postman Collections that follow security best practices for development, however, they now want E2E tests to run as part of automated CI pipeline. With the advent of Newman, you can now more readily use Postman to craft an API test suite executable in your CI. Steps may look like the following: Update your Postman Collection to use the Postman Test feature in order to craft test assertions that will cover all saved requests E2E. Read Postman docs for guidance on how to use the Postman Test feature. Locally use Newman to validate tests are working as intended newman run tests \\ e2e_Postman_collection . json -e qa . postman_environment . json Build a script that automatically executes Postman Test assertions via Newman and Azure CLI. > NOTE: An Azure Service Principal must be setup to continue using azure cli in this CI pipeline example. ```powershell (RunPostmanE2eTests.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ # 1. login to Azure using a Service Principal az login --service-principal -u $APP_ID -p $AZURE_SECRET --tenant $AZURE_TENANT # 2. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 3. 
step through envVars array to get Key Vault uris keyvaultURI = \"\" @envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 4. parse uris for Key Vault name and secret names # 5. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 6. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 7. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII $env.postman_environment.json # 8. install Newman npm install --save-dev newman # 9. run automated E2E tests via Newman node_modules.bin\\newman run tests\\e2e_Postman_collection.json -e $env.postman_environment.json ``` Create a yaml file and define a step that will run your test script. (ex. A yaml file targeting Azure Devops that runs a Powershell script.) # Please treat this as pseudocode, and adjust as necessary. ############################################################ displayName : 'Run Postman E2E tests' inputs : targetType : 'filePath' filePath : RunPostmanE2eTests.ps1 env : APP_ID : $(environment.appId) # credentials for az cli AZURE_SECRET : $(environment.secret) AZURE_TENANT : $(environment.tenant) This approach has the following upside: E2E tests can now be run automatically as part of a CI pipeline. Ending with this approach has the following downside: Postman Environment files are no longer being output to a local environment for hands-on manual testing. However, this can be solved by managing 2 scripts.","title":"Postman Testing"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#postman-testing","text":"This purpose of this document is to provide guidance on how to use Newman in your CI/CD pipeline to run End-to-end (E2E) tests defined in Postman Collections while following security best practices. First, we'll introduce Postman and Newman and then outline several Postman testing use cases that answer why you may want to go beyond local testing with Postman Collections. In the final use case, we are looking to use a shell script that references the Postman Collection file path and Environment file path as inputs to Newman. Below is a flow diagram representing the outcome of the final use case:","title":"Postman Testing"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#postman-and-newman","text":"Postman is a free API platform for testing APIs. Key features highlighted in this guidance include: Postman Collections Postman Environment Files Postman Scripts Newman is a command-line Collection Runner for Postman. It enables you to run and test a Postman Collection directly from the command line. Key features highlighted in this guidance include: Newman Run Command","title":"Postman and Newman"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-a-collection","text":"A Postman Collection is a group of executable saved requests. 
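To make the definition concrete, a collection groups its saved requests under item entries. Below is a heavily abridged sketch of that structure; the name, URL, and variables are illustrative assumptions, and the real artifact is the JSON that Postman exports (shown here as a JavaScript object for readability).

```javascript
// Abridged sketch of a Postman Collection (hypothetical values).
// The actual export is a JSON file with this general shape.
const collection = {
  info: {
    name: 'orders-service',            // illustrative collection name
    schema: 'https://schema.getpostman.com/json/collection/v2.1.0/collection.json'
  },
  item: [
    {
      name: 'Get order by id',         // one saved, executable request
      request: {
        method: 'GET',
        header: [{ key: 'Authorization', value: 'Bearer {{token}}' }],
        url: '{{apiRoute}}/orders/123' // {{apiRoute}} resolves from an environment file
      }
    }
  ]
};
```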
A collection can be exported as a json file.","title":"What is a Collection"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-an-environment-file","text":"A Postman Environment file holds environment variables that can be referenced by a valid Postman Collection.","title":"What is an Environment File"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-a-postman-script","text":"A Postman Script is Javascript hosted within a Postman Collection that can be written to execute against your Postman Collection and Environment File.","title":"What is a Postman Script"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-the-newman-run-command","text":"A Newman CLI command that allows you to specify a Postman Collection to be run.","title":"What is the Newman Run Command"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#installing-postman-and-newman","text":"For specific instruction on installing Postman, visit the Downloads Postman page. For specific instruction on installing Newman, visit the NPMJS Newman package page.","title":"Installing Postman and Newman"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#implementing-automated-end-to-end-e2e-tests-with-postman-collections","text":"In order to provide guidance on implementing automated E2E tests with Postman, the section below begins with a use case that explains the trade-offs a dev or QA analyst might face when intending to use Postman for early testing. Each use case represents scenarios that facilitate the end goal of automated E2E tests.","title":"Implementing Automated End-to-end (E2E) Tests With Postman Collections"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#use-case-hands-on-functional-testing-of-endpoints","text":"A developer or QA analyst would like to locally test input data against API services all sharing a common oauth2 token. As a result, they use Postman to craft an API test suite of Postman Collections that can be locally executed against individual endpoints across environments. After validating that their Postman Collection works, they share it with their team. Steps may look like the following: For each of your existing API services, use the Postman IDE's import feature to import its OpenAPI Spec (Swagger) as a Postman Collection. If a service is not already using Swagger, look for language specific guidance on how to use Swagger to generate an OpenAPI Spec for your service. Finally, if your service only has a few endpoints, read Postman docs for guidance on how to manually build a Postman Collection. Provide extra clarity about a request in a Postman Collection by using Postman's Example feature to save its responses as examples. You can also simply add an example manually. Please read Postman docs for guidance on how to specify examples. Combine each Postman Collection into a centralized Postman Collection. Build Postman Environment files (local, Dev and/or QA) and parameterize all saved requests of the Postman Collection in a way that references the Postman Environment files. Use the Postman Script feature to create a shared prefetch script that automatically refreshes expired auth tokens per saved request. This would require referencing secrets from a Postman Environment file. // Please treat this as pseudocode, and adjust as necessary. 
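/* What the script below does: before each request in the collection it checks the
   cached ACCESS_TOKEN_EXPIRY value in the active Postman environment; if the token
   has expired it calls the oauth2 endpoint via pm.sendRequest and writes the fresh
   token and its new expiry back into the environment for subsequent requests. */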
/* The request to an oauth2 authorization endpoint that will issue a token based on provided credentials.*/ const oauth2Request = POST {...}; var getToken = true ; if ( pm . environment . get ( 'ACCESS_TOKEN_EXPIRY' ) <= ( new Date ()). getTime ()) { console . log ( 'Token is expired' ) } else { getToken = false ; console . log ( 'Token and expiry date are all good' ); } if ( getToken === true ) { pm . sendRequest ( oauth2Request , function ( _ , res ) { console . log ( 'Save the token' ) var responseJson = res . json (); pm . environment . set ( 'token' , responseJson . access_token ) console . log ( 'Save the expiry date' ) var expiryDate = new Date (); expiryDate . setSeconds ( expiryDate . getSeconds () + responseJson . expires_in ); pm . environment . set ( 'ACCESS_TOKEN_EXPIRY' , expiryDate . getTime ()); }); } Use Postman IDE to exercise endpoints. Export collection and environment files then remove any secrets before committing to your repo. Starting with this approach has the following upsides: You've set yourself up for the beginning stages of an E2E postman collection by aggregating the collections into a single file and using environment files to make it easier to switch environments. Token is refreshed automatically on every call in the collection. This saves you time normally lost from manually having to request a token that expired. Grants QA/Dev granular control of submitting combinations of input data per endpoint. Grants developers a common experience via Postman IDE features. Ending with this approach has the following downsides: Promotes unsafe sharing of secrets. Credentials needed to request JWT token in the prefetch script are being manually shared. Secrets may happen to get exposed in the git commit history for various reasons (ex. Sharing the exported Postman Environment files). Collections can only be used locally to hit APIs (local or deployed). Not CI based. Each developer has to keep both their Postman Collection and Postman environment file(s) updated in order to keep up with latest changes to deployed services.","title":"Use Case - Hands-on Functional Testing Of Endpoints"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#use-case-hands-on-functional-testing-of-endpoints-with-azure-key-vault-and-azure-app-config","text":"A developer or QA analyst may have an existing API test suite of Postman Collections, however, they now want to discourage unsafe sharing of secrets. As a result, they build a script that connects to both Key Vault and Azure App Config in order to automatically generate Postman Environment files instead of checking them into a shared repository. Steps may look like the following: Create an Azure Key Vault and store authentication secrets per environment: - \"Key:value\" (ex. \"dev-auth-password:12345\" ) - \"Key:value\" (ex. \"qa-auth-password:12345\" ) Create a shared Azure App Configuration instance and save all your Postman environment variables. This instance will be dedicated to holding all your Postman environment variables: > NOTE: Use the Label feature to delineate between environments. - \"Key:value\" -> \"apiRoute:url\" (ex. \"servicename:https://servicename.net\" & Label = \"QA\" ) - \"Key:value\" -> \"Header:value\" (ex. \"token: \" & Label = \"QA\" ) - \"Key:value\" -> \"KeyVaultKey:KeyVaultSecret\" (ex. \"authpassword:qa-auth-password\" & Label = \"QA\" ) Install Powershell or Bash. Powershell works for both Azure Powershell and Azure CLI. 
Download Azure CLI, login to the appropriate subscription and ensure you have access to the appropriate resources. Some helpful commands are below: # login to the appropriate subscription az login # validate login az account show # validate access to Key Vault az keyvault secret list - -vault-name \"$KeyvaultName\" # validate access to App Configuration az appconfig kv list - -name \"$AppConfigName\" Build a script that automatically generates your environment files. > Note: App Configuration references Key Vault, however, your script is responsible for authenticating properly to both App Configuration and Key Vault. The two services don't communicate directly. ```powershell (CreatePostmanEnvironmentFiles.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ env = $arg1 # 1. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 2. step through envVars array to get Key Vault uris keyvaultURI = \"\" $envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 3. parse uris for Key Vault name and secret names # 4. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 5. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 6. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII -FilePath .\\$env.postman_environment.json ``` Use Postman IDE to import the Postman Environment files to be referenced by your collection. This approach has the following upsides: Inherits all the upsides of the previous case. Discourages unsafe sharing of secrets. Secrets are now pulled from Key Vault via Azure CLI. Key Vault Uri also no longer needs to be shared for access to auth tokens. Single source of truth for Postman Environment files. There's no longer a need to share them via repo. Developer only has to manage a single Postman Collection. Ending with this approach has the following downsides: Secrets may happen to get exposed in the git commit history if .gitIgnore is not updated to ignore Postman Environment files. Collections can only be used locally to hit APIs (local or deployed). Not CI based.","title":"Use Case - Hands-on Functional Testing Of Endpoints with Azure Key Vault and Azure App Config"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#use-case-e2e-testing-with-continuous-integration-and-newman","text":"A developer or QA analyst may have an existing API test suite of local Postman Collections that follow security best practices for development, however, they now want E2E tests to run as part of automated CI pipeline. With the advent of Newman, you can now more readily use Postman to craft an API test suite executable in your CI. Steps may look like the following: Update your Postman Collection to use the Postman Test feature in order to craft test assertions that will cover all saved requests E2E. Read Postman docs for guidance on how to use the Postman Test feature. Locally use Newman to validate tests are working as intended newman run tests \\ e2e_Postman_collection . json -e qa . postman_environment . 
json Build a script that automatically executes Postman Test assertions via Newman and Azure CLI. > NOTE: An Azure Service Principal must be setup to continue using azure cli in this CI pipeline example. ```powershell (RunPostmanE2eTests.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ # 1. login to Azure using a Service Principal az login --service-principal -u $APP_ID -p $AZURE_SECRET --tenant $AZURE_TENANT # 2. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 3. step through envVars array to get Key Vault uris keyvaultURI = \"\" @envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 4. parse uris for Key Vault name and secret names # 5. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 6. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 7. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII $env.postman_environment.json # 8. install Newman npm install --save-dev newman # 9. run automated E2E tests via Newman node_modules.bin\\newman run tests\\e2e_Postman_collection.json -e $env.postman_environment.json ``` Create a yaml file and define a step that will run your test script. (ex. A yaml file targeting Azure Devops that runs a Powershell script.) # Please treat this as pseudocode, and adjust as necessary. ############################################################ displayName : 'Run Postman E2E tests' inputs : targetType : 'filePath' filePath : RunPostmanE2eTests.ps1 env : APP_ID : $(environment.appId) # credentials for az cli AZURE_SECRET : $(environment.secret) AZURE_TENANT : $(environment.tenant) This approach has the following upside: E2E tests can now be run automatically as part of a CI pipeline. Ending with this approach has the following downside: Postman Environment files are no longer being output to a local environment for hands-on manual testing. However, this can be solved by managing 2 scripts.","title":"Use Case - E2E Testing with Continuous Integration and Newman"},{"location":"automated-testing/fault-injection-testing/","text":"Fault Injection Testing Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability . The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time. When To Use Problem Addressed Systems need to be resilient to the conditions that caused inevitable production disruptions. Modern applications are built with an increasing number of dependencies; on infrastructure, platform, network, 3rd party software or APIs, etc. Such systems increase the risk of impact from dependency disruptions. Each dependent component may fail. Furthermore, its interactions with other components may propagate the failure. Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build-time or at run-time, with the intention of \"embracing failure\" as part of the development lifecycle. 
These methods assist engineering teams in designing and continuously validating for failure, accounting for known and unknown failure conditions, architect for redundancy, employ retry and back-off mechanisms, etc. Applicable to Software - Error handling code paths, in-process memory management. Example tests: Edge-case unit/integration tests and/or load tests (i.e. stress and soak). Protocol - Vulnerabilities in communication interfaces such as command line parameters or APIs. Example tests: Fuzzing provides invalid, unexpected, or random data as input we can assess the level of protocol stability of a component. Infrastructure - Outages, networking issues, hardware failures. Example tests: Using different methods to cause fault in the underlying infrastructure such as Shut down virtual machine (VM) instances, crash processes, expire certificates, introduce network latency, etc. This level of testing relies on statistical metrics observations over time and measuring the deviations of its observed behavior during fault, or its recovery time. How to Use Architecture Terminology Fault - The adjudged or hypothesized cause of an error. Error - That part of the system state that may cause a subsequent failure. Failure - An event that occurs when the delivered service deviates from correct state. Fault-Error-Failure cycle - A key mechanism in dependability : A fault may cause an error. An error may cause further errors within the system boundary; therefore each new error acts as a fault. When error states are observed at the system boundary, they are termed failures. Fault Injection Testing Basics Fault injection is an advanced form of testing where the system is subjected to different failure modes , and where the testing engineer may know in advance what is the expected outcome, as in the case of release validation tests, or in an exploration to find potential issues in the product, which should be mitigated. Fault Injection and Chaos Engineering Fault injection testing is a specific approach to testing one condition. It introduces a failure into a system to validate its robustness. Chaos engineering, coined by Netflix, is a practice for generating new information. There is an overlap in concerns and often in tooling between the terms, and many times chaos engineering uses fault injection to introduce the required effects to the system. High-level Step-by-Step Fault injection testing in the development cycle Fault injection is an effective way to find security bugs in software, so much so that the Microsoft Security Development Lifecycle requires fuzzing at every untrusted interface of every product and penetration testing which includes introducing faults to the system, to uncover potential vulnerabilities resulting from coding errors, system configuration faults, or other operational deployment weaknesses. Automated fault injection coverage in a CI pipeline promotes a Shift-Left approach of testing earlier in the lifecycle for potential issues. Examples of performing fault injection during the development lifecycle: Using fuzzing tools in CI. Execute existing end-to-end scenario tests (such as integration or stress tests), which are augmented with fault injection. Write regression and acceptance tests based on issues that were found and fixed or based on resolved service incidents. Ad-hoc (manual) validations of fault in the dev environment for new features. 
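One of the examples above, augmenting existing scenario tests with fault injection, can start as simply as wrapping a dependency call so that it sometimes slows down or fails. The sketch below is illustrative only: the wrapper, names, and rates are assumptions rather than project code, and dedicated tooling is listed later on this page.

```javascript
// Minimal fault-injection wrapper (illustrative; names and rates are hypothetical).
function withFaults(fn, { errorRate = 0.2, maxLatencyMs = 500 } = {}) {
  return async (...args) => {
    // Inject artificial latency before the real call.
    await new Promise((resolve) => setTimeout(resolve, Math.random() * maxLatencyMs));
    // Occasionally fail instead of calling through, to exercise error-handling paths.
    if (Math.random() < errorRate) {
      throw new Error('injected fault: simulated dependency outage');
    }
    return fn(...args);
  };
}

// Stand-in for a real dependency call used by an existing scenario test.
async function getOrder(id) {
  return { id, status: 'shipped' };
}

const flakyGetOrder = withFaults(getOrder, { errorRate: 0.3 });

(async () => {
  for (let i = 0; i < 10; i++) {
    try {
      console.log(await flakyGetOrder(i));
    } catch (err) {
      console.log(`call ${i} failed as injected: ${err.message}`);
    }
  }
})();
```

The same wrapper can be dropped into an integration or stress test to confirm that retry, back-off, and error-handling logic behaves as expected when the dependency misbehaves.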
Fault Injection Testing in the Release Cycle Much like Synthetic Monitoring Tests , fault injection testing in the release cycle is a part of Shift-Right testing approach, which uses safe methods to perform tests in a production or pre-production environment. Given the nature of distributed, cloud-based applications, it is very difficult to simulate the real behavior of services outside their production environment. Testers are encouraged to run tests where it really matters, on a live system with customer traffic. Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering: Measure and define a steady (healthy) state for the system's interoperability. Create hypotheses based on predicted behavior when a fault is introduced. Introduce real-world fault-events to the system. Measure the state and compare it to the baseline state. Document the process and the observations. Identify and act on the result. Fault Injection Testing in Kubernetes With the advancement of kubernetes (k8s) as the infrastructure platform, fault injection testing in kubernetes has become inevitable to ensure that system behaves in a reliable manner in the event of a fault or failure. There could be different type of workloads running within a k8s cluster which are written in different languages. For eg. within a K8s cluster, you can run a micro service, a web app and/or a scheduled job. Hence you need to have mechanism to inject fault into any kind of workloads running within the cluster. In addition, kubernetes clusters are managed differently from traditional infrastructure. The tools used for fault injection testing within kubernetes should have compatibility with k8s infrastructure. These are the main characteristics which are required: Ease of injecting fault into kubernetes pods. Support for faster tool installation within the cluster. Support for YAML based configurations which works well with kubernetes. Ease of customization to add custom resources. Support for workflows to deploy various workloads and faults. Ease of maintainability of the tool Ease of integration with telemetry Best Practices and Advice Experimenting in production has the benefit of running tests against a live system with real user traffic, ensuring its health, or building confidence in its ability to handle errors gracefully. However, it has the potential to cause unnecessary customer pain. A test can either succeed or fail. In the event of failure, there will likely be some impact on the production environment. Thinking about the Blast Radius of the effect, should the test fail, is a crucial step to conduct beforehand. The following practices may help minimize such risk: Run tests in a non-production environment first. Understand how the system behaves in a safe environment, using synthetic workload, before introducing potential risk to customer traffic. Use fault injection as gates in different stages through the CD pipeline. Deploy and test on Blue/Green and Canary deployments. Use methods such as traffic shadowing (a.k.a. Dark Traffic ) to get customer traffic to the staging slot. Strive to achieve a balance between collecting actual result data while affecting as few production users as possible. Use defensive design principles such as circuit breaking and the bulkhead patterns. Agreed on a budget (in terms of Service Level Objective (SLO)) as an investment in chaos and fault injection. 
Grow the risk incrementally - Start with hardening the core and expand out in layers. At each point, progress should be locked in with automated regression tests. Fault Injection Testing Frameworks and Tools Fuzzing OneFuzz - is a Microsoft open-source self-hosted fuzzing-as-a-service platform which is easy to integrate into CI pipelines. AFL and WinAFL - Popular fuzz tools by Google's project zero team which is used locally to target binaries on Linux or Windows. WebScarab - A web-focused fuzzer owned by OWASP which can be found in Kali linux distributions. Chaos Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An Openshift-specific chaos tool, maintained by Redhat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. Litmus - A CNCF open source tool for chaos testing and fault injection for kubernetes cluster. This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing. Conclusion From the principals of chaos: \"The harder it is to disrupt the steady-state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large\". Fault injection techniques increase resilience and confidence in the products we ship. They are used across the industry to validate applications and platforms before and while they are delivered to customers. Fault injection is a powerful tool and should be used with caution. Cases such as the Cloudflare 30 minute global outage , which was caused due to a deployment of code that was meant to be \u201cdark launched\u201d, entail the importance of curtailing the blast radius in the system during experiments. Resources Mark Russinovich's fault injection and chaos engineering blog post Cindy Sridharan's Testing in production blog post Cindy Sridharan's Testing in production blog post cont. Fault injection in Azure Search Azure Architecture Framework - Chaos engineering Azure Architecture Framework - Testing resilience Landscape of Software Failure Cause Models","title":"Fault Injection Testing"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing","text":"Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability . The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.","title":"Fault Injection Testing"},{"location":"automated-testing/fault-injection-testing/#when-to-use","text":"","title":"When To Use"},{"location":"automated-testing/fault-injection-testing/#problem-addressed","text":"Systems need to be resilient to the conditions that caused inevitable production disruptions. Modern applications are built with an increasing number of dependencies; on infrastructure, platform, network, 3rd party software or APIs, etc. Such systems increase the risk of impact from dependency disruptions. Each dependent component may fail. Furthermore, its interactions with other components may propagate the failure. 
Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build-time or at run-time, with the intention of \"embracing failure\" as part of the development lifecycle. These methods assist engineering teams in designing and continuously validating for failure, accounting for known and unknown failure conditions, architect for redundancy, employ retry and back-off mechanisms, etc.","title":"Problem Addressed"},{"location":"automated-testing/fault-injection-testing/#applicable-to","text":"Software - Error handling code paths, in-process memory management. Example tests: Edge-case unit/integration tests and/or load tests (i.e. stress and soak). Protocol - Vulnerabilities in communication interfaces such as command line parameters or APIs. Example tests: Fuzzing provides invalid, unexpected, or random data as input we can assess the level of protocol stability of a component. Infrastructure - Outages, networking issues, hardware failures. Example tests: Using different methods to cause fault in the underlying infrastructure such as Shut down virtual machine (VM) instances, crash processes, expire certificates, introduce network latency, etc. This level of testing relies on statistical metrics observations over time and measuring the deviations of its observed behavior during fault, or its recovery time.","title":"Applicable to"},{"location":"automated-testing/fault-injection-testing/#how-to-use","text":"","title":"How to Use"},{"location":"automated-testing/fault-injection-testing/#architecture","text":"","title":"Architecture"},{"location":"automated-testing/fault-injection-testing/#terminology","text":"Fault - The adjudged or hypothesized cause of an error. Error - That part of the system state that may cause a subsequent failure. Failure - An event that occurs when the delivered service deviates from correct state. Fault-Error-Failure cycle - A key mechanism in dependability : A fault may cause an error. An error may cause further errors within the system boundary; therefore each new error acts as a fault. When error states are observed at the system boundary, they are termed failures.","title":"Terminology"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-basics","text":"Fault injection is an advanced form of testing where the system is subjected to different failure modes , and where the testing engineer may know in advance what is the expected outcome, as in the case of release validation tests, or in an exploration to find potential issues in the product, which should be mitigated.","title":"Fault Injection Testing Basics"},{"location":"automated-testing/fault-injection-testing/#fault-injection-and-chaos-engineering","text":"Fault injection testing is a specific approach to testing one condition. It introduces a failure into a system to validate its robustness. Chaos engineering, coined by Netflix, is a practice for generating new information. 
There is an overlap in concerns and often in tooling between the terms, and many times chaos engineering uses fault injection to introduce the required effects to the system.","title":"Fault Injection and Chaos Engineering"},{"location":"automated-testing/fault-injection-testing/#high-level-step-by-step","text":"","title":"High-level Step-by-Step"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-in-the-development-cycle","text":"Fault injection is an effective way to find security bugs in software, so much so that the Microsoft Security Development Lifecycle requires fuzzing at every untrusted interface of every product and penetration testing which includes introducing faults to the system, to uncover potential vulnerabilities resulting from coding errors, system configuration faults, or other operational deployment weaknesses. Automated fault injection coverage in a CI pipeline promotes a Shift-Left approach of testing earlier in the lifecycle for potential issues. Examples of performing fault injection during the development lifecycle: Using fuzzing tools in CI. Execute existing end-to-end scenario tests (such as integration or stress tests), which are augmented with fault injection. Write regression and acceptance tests based on issues that were found and fixed or based on resolved service incidents. Ad-hoc (manual) validations of fault in the dev environment for new features.","title":"Fault injection testing in the development cycle"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-in-the-release-cycle","text":"Much like Synthetic Monitoring Tests , fault injection testing in the release cycle is a part of Shift-Right testing approach, which uses safe methods to perform tests in a production or pre-production environment. Given the nature of distributed, cloud-based applications, it is very difficult to simulate the real behavior of services outside their production environment. Testers are encouraged to run tests where it really matters, on a live system with customer traffic. Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering: Measure and define a steady (healthy) state for the system's interoperability. Create hypotheses based on predicted behavior when a fault is introduced. Introduce real-world fault-events to the system. Measure the state and compare it to the baseline state. Document the process and the observations. Identify and act on the result.","title":"Fault Injection Testing in the Release Cycle"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-in-kubernetes","text":"With the advancement of kubernetes (k8s) as the infrastructure platform, fault injection testing in kubernetes has become inevitable to ensure that system behaves in a reliable manner in the event of a fault or failure. There could be different type of workloads running within a k8s cluster which are written in different languages. For eg. within a K8s cluster, you can run a micro service, a web app and/or a scheduled job. Hence you need to have mechanism to inject fault into any kind of workloads running within the cluster. In addition, kubernetes clusters are managed differently from traditional infrastructure. The tools used for fault injection testing within kubernetes should have compatibility with k8s infrastructure. 
These are the main characteristics which are required: Ease of injecting fault into kubernetes pods. Support for faster tool installation within the cluster. Support for YAML based configurations which works well with kubernetes. Ease of customization to add custom resources. Support for workflows to deploy various workloads and faults. Ease of maintainability of the tool Ease of integration with telemetry","title":"Fault Injection Testing in Kubernetes"},{"location":"automated-testing/fault-injection-testing/#best-practices-and-advice","text":"Experimenting in production has the benefit of running tests against a live system with real user traffic, ensuring its health, or building confidence in its ability to handle errors gracefully. However, it has the potential to cause unnecessary customer pain. A test can either succeed or fail. In the event of failure, there will likely be some impact on the production environment. Thinking about the Blast Radius of the effect, should the test fail, is a crucial step to conduct beforehand. The following practices may help minimize such risk: Run tests in a non-production environment first. Understand how the system behaves in a safe environment, using synthetic workload, before introducing potential risk to customer traffic. Use fault injection as gates in different stages through the CD pipeline. Deploy and test on Blue/Green and Canary deployments. Use methods such as traffic shadowing (a.k.a. Dark Traffic ) to get customer traffic to the staging slot. Strive to achieve a balance between collecting actual result data while affecting as few production users as possible. Use defensive design principles such as circuit breaking and the bulkhead patterns. Agreed on a budget (in terms of Service Level Objective (SLO)) as an investment in chaos and fault injection. Grow the risk incrementally - Start with hardening the core and expand out in layers. At each point, progress should be locked in with automated regression tests.","title":"Best Practices and Advice"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-frameworks-and-tools","text":"","title":"Fault Injection Testing Frameworks and Tools"},{"location":"automated-testing/fault-injection-testing/#fuzzing","text":"OneFuzz - is a Microsoft open-source self-hosted fuzzing-as-a-service platform which is easy to integrate into CI pipelines. AFL and WinAFL - Popular fuzz tools by Google's project zero team which is used locally to target binaries on Linux or Windows. WebScarab - A web-focused fuzzer owned by OWASP which can be found in Kali linux distributions.","title":"Fuzzing"},{"location":"automated-testing/fault-injection-testing/#chaos","text":"Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An Openshift-specific chaos tool, maintained by Redhat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. Litmus - A CNCF open source tool for chaos testing and fault injection for kubernetes cluster. 
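Whichever tool is chosen, most chaos experiments follow the same hypothesis-driven loop: measure a steady state, inject a fault, measure again, and compare. The sketch below outlines only that loop; the endpoint, sample size, and threshold are assumptions for illustration, the fault itself would be injected by one of the tools above, and the built-in fetch assumes Node 18+.

```javascript
// Hypothesis-driven experiment skeleton (illustrative; endpoint and thresholds are assumptions).
const TARGET = 'https://example.com/health';

async function p95LatencyMs(samples = 20) {
  const timings = [];
  for (let i = 0; i < samples; i++) {
    const start = Date.now();
    await fetch(TARGET);                       // built-in fetch, Node 18+
    timings.push(Date.now() - start);
  }
  timings.sort((a, b) => a - b);
  return timings[Math.floor(0.95 * (timings.length - 1))];
}

(async () => {
  const baseline = await p95LatencyMs();       // 1. measure the steady state
  // 2. hypothesis: p95 latency stays under 2x baseline while one dependency is degraded
  // 3. a chaos tool would inject the fault here
  const underFault = await p95LatencyMs();     // 4. measure again and compare to the baseline
  console.log({ baseline, underFault, hypothesisHolds: underFault < 2 * baseline });
})();
```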
This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing.","title":"Chaos"},{"location":"automated-testing/fault-injection-testing/#conclusion","text":"From the principals of chaos: \"The harder it is to disrupt the steady-state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large\". Fault injection techniques increase resilience and confidence in the products we ship. They are used across the industry to validate applications and platforms before and while they are delivered to customers. Fault injection is a powerful tool and should be used with caution. Cases such as the Cloudflare 30 minute global outage , which was caused due to a deployment of code that was meant to be \u201cdark launched\u201d, entail the importance of curtailing the blast radius in the system during experiments.","title":"Conclusion"},{"location":"automated-testing/fault-injection-testing/#resources","text":"Mark Russinovich's fault injection and chaos engineering blog post Cindy Sridharan's Testing in production blog post Cindy Sridharan's Testing in production blog post cont. Fault injection in Azure Search Azure Architecture Framework - Chaos engineering Azure Architecture Framework - Testing resilience Landscape of Software Failure Cause Models","title":"Resources"},{"location":"automated-testing/integration-testing/","text":"Integration Testing Integration testing is a software testing methodology used to determine how well individually developed components, or modules of a system communicate with each other. This method of testing confirms that an aggregate of a system, or sub-system, works together correctly or otherwise exposes erroneous behavior between two or more units of code. Why Integration Testing Because one component of a system may be developed independently or in isolation of another it is important to verify the interaction of some or all components. A complex system may be composed of databases, APIs, interfaces, and more, that all interact with each other or additional external systems. Integration tests expose system-level issues such as broken database schemas or faulty third-party API integration. It ensures higher test coverage and serves as an important feedback loop throughout development. Integration Testing Design Blocks Consider a banking application with three modules: login, transfers, and current balance, all developed independently. An integration test may verify when a user logs in they are re-directed to their current balance with the correct amount for the specific mock user. Another integration test may perform a transfer of a specified amount of money. The test may confirm there are sufficient funds in the account to perform the transfer, and after the transfer the current balance is updated appropriately for the mock user. The login page may be mocked with a test user and mock credentials if this module is not completed when testing the transfers module. Integration testing is done by the developer or QA tester. In the past, integration testing always happened after unit and before system and E2E testing. Compared to unit-tests, integration tests are fewer in quantity, usually run slower, and are more expensive to set up and develop. 
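As a sketch of the banking example above, an integration test might drive a transfer and then assert on the current balance for a mock user, with the login module mocked. Everything here is hypothetical and stubbed in-line to keep the sample self-contained; in a real suite the transfer and balance modules would be the actual components under test, and any test framework would do (Node's built-in node:test runner is used here, which assumes Node 18+).

```javascript
// Illustrative integration test for the banking scenario (hypothetical modules and data).
const test = require('node:test');
const assert = require('node:assert');

// Stand-ins for the transfers and current-balance modules; login is mocked with a test user.
const bank = {
  balances: { 'mock-user': 100 },
  login: async () => ({ userId: 'mock-user', token: 'mock-token' }),   // mocked login module
  currentBalance: async (userId) => bank.balances[userId],
  transfer: async (userId, amount) => {
    if (bank.balances[userId] < amount) throw new Error('insufficient funds');
    bank.balances[userId] -= amount;
  },
};

test('transfer updates the current balance for the mock user', async () => {
  const { userId } = await bank.login();
  await bank.transfer(userId, 40);
  assert.strictEqual(await bank.currentBalance(userId), 60);
});

test('transfer is rejected when funds are insufficient', async () => {
  const { userId } = await bank.login();
  await assert.rejects(() => bank.transfer(userId, 1000), /insufficient funds/);
});
```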
Now, if a team is following agile principles, integration tests can be performed before or after unit tests, early and often, as there is no need to wait for sequential processes. Additionally, integration tests can utilize mock data in order to simulate a complete system. There is an abundance of language-specific testing frameworks that can be used throughout the entire development lifecycle. It is important to note the difference between integration and acceptance testing. Integration testing confirms a group of components work together as intended from a technical perspective, while acceptance testing confirms a group of components work together as intended from a business scenario. Applying Integration Testing Prior to writing integration tests, the engineers must identify the different components of the system, and their intended behaviors and inputs and outputs. The architecture of the project must be fully documented or specified somewhere that can be readily referenced (e.g., the architecture diagram). There are two main techniques for integration testing. Big Bang Big Bang integration testing is when all components are tested as a single unit. This is best for small system as a system too large may be difficult to localize for potential errors from failed tests. This approach also requires all components in the system under test to be completed which may delay when testing begins. Incremental Testing Incremental testing is when two or more components that are logically related are tested as a unit. After testing the unit, additional components are combined and tested all together. This process repeats until all necessary components are tested. Top Down Top down testing is when higher level components are tested following the control flow of a software system. In the scenario, what is commonly referred to as stubs are used to emulate the behavior of lower level modules not yet complete or merged in the integration test. Bottom Up Bottom up testing is when lower level modules are tested together. In the scenario, what is commonly referred to as drivers are used to emulate the behavior of higher level modules not yet complete or included in the integration test. A third approach known as the sandwich or hybrid model combines the bottom up and town down approaches to test lower and higher level components at the same time. Things to Avoid There is a tradeoff a developer must make between integration test code coverage and engineering cycles. With mock dependencies, test data, and multiple environments at test, too many integration tests are infeasible to maintain and become increasingly less meaningful. Too much mocking will slow down the test suite, make scaling difficult, and may be a sign the developer should consider other tests for the scenario such as acceptance or E2E. Integration tests of complex systems require high maintenance. Avoid testing business logic in integration tests by keeping test suites separate. Do not test beyond the acceptance criteria of the task and be sure to clean up any resources created for a given test. Additionally, avoid writing tests in a production environment. Instead, write them in a scaled-down copy environment. Integration Testing Frameworks and Tools Many tools and frameworks can be used to write both unit and integration tests. The following tools are for automating integration tests. 
JUnit Robot Framework moq Cucumber Selenium Behave (Python) Conclusion Integration testing demonstrates how one module of a system, or external system, interfaces with another. This can be a test of two components, a sub-system, a whole system, or a collection of systems. Tests should be written frequently and throughout the entire development lifecycle using an appropriate amount of mocked dependencies and test data. Because integration tests prove that independently developed modules interface as technically designed, it increases confidence in the development cycle providing a path for a system that deploys and scales. Resources Integration testing approaches Integration testing pros and cons Integration tests mocks and stubs Software Testing: Principles and Practices Integration testing Behave test quick start","title":"Integration Testing"},{"location":"automated-testing/integration-testing/#integration-testing","text":"Integration testing is a software testing methodology used to determine how well individually developed components, or modules of a system communicate with each other. This method of testing confirms that an aggregate of a system, or sub-system, works together correctly or otherwise exposes erroneous behavior between two or more units of code.","title":"Integration Testing"},{"location":"automated-testing/integration-testing/#why-integration-testing","text":"Because one component of a system may be developed independently or in isolation of another it is important to verify the interaction of some or all components. A complex system may be composed of databases, APIs, interfaces, and more, that all interact with each other or additional external systems. Integration tests expose system-level issues such as broken database schemas or faulty third-party API integration. It ensures higher test coverage and serves as an important feedback loop throughout development.","title":"Why Integration Testing"},{"location":"automated-testing/integration-testing/#integration-testing-design-blocks","text":"Consider a banking application with three modules: login, transfers, and current balance, all developed independently. An integration test may verify when a user logs in they are re-directed to their current balance with the correct amount for the specific mock user. Another integration test may perform a transfer of a specified amount of money. The test may confirm there are sufficient funds in the account to perform the transfer, and after the transfer the current balance is updated appropriately for the mock user. The login page may be mocked with a test user and mock credentials if this module is not completed when testing the transfers module. Integration testing is done by the developer or QA tester. In the past, integration testing always happened after unit and before system and E2E testing. Compared to unit-tests, integration tests are fewer in quantity, usually run slower, and are more expensive to set up and develop. Now, if a team is following agile principles, integration tests can be performed before or after unit tests, early and often, as there is no need to wait for sequential processes. Additionally, integration tests can utilize mock data in order to simulate a complete system. There is an abundance of language-specific testing frameworks that can be used throughout the entire development lifecycle. It is important to note the difference between integration and acceptance testing. 
Integration testing confirms a group of components work together as intended from a technical perspective, while acceptance testing confirms a group of components work together as intended from a business scenario.","title":"Integration Testing Design Blocks"},{"location":"automated-testing/integration-testing/#applying-integration-testing","text":"Prior to writing integration tests, the engineers must identify the different components of the system, and their intended behaviors and inputs and outputs. The architecture of the project must be fully documented or specified somewhere that can be readily referenced (e.g., the architecture diagram). There are two main techniques for integration testing.","title":"Applying Integration Testing"},{"location":"automated-testing/integration-testing/#big-bang","text":"Big Bang integration testing is when all components are tested as a single unit. This is best for small system as a system too large may be difficult to localize for potential errors from failed tests. This approach also requires all components in the system under test to be completed which may delay when testing begins.","title":"Big Bang"},{"location":"automated-testing/integration-testing/#incremental-testing","text":"Incremental testing is when two or more components that are logically related are tested as a unit. After testing the unit, additional components are combined and tested all together. This process repeats until all necessary components are tested.","title":"Incremental Testing"},{"location":"automated-testing/integration-testing/#top-down","text":"Top down testing is when higher level components are tested following the control flow of a software system. In the scenario, what is commonly referred to as stubs are used to emulate the behavior of lower level modules not yet complete or merged in the integration test.","title":"Top Down"},{"location":"automated-testing/integration-testing/#bottom-up","text":"Bottom up testing is when lower level modules are tested together. In the scenario, what is commonly referred to as drivers are used to emulate the behavior of higher level modules not yet complete or included in the integration test. A third approach known as the sandwich or hybrid model combines the bottom up and town down approaches to test lower and higher level components at the same time.","title":"Bottom Up"},{"location":"automated-testing/integration-testing/#things-to-avoid","text":"There is a tradeoff a developer must make between integration test code coverage and engineering cycles. With mock dependencies, test data, and multiple environments at test, too many integration tests are infeasible to maintain and become increasingly less meaningful. Too much mocking will slow down the test suite, make scaling difficult, and may be a sign the developer should consider other tests for the scenario such as acceptance or E2E. Integration tests of complex systems require high maintenance. Avoid testing business logic in integration tests by keeping test suites separate. Do not test beyond the acceptance criteria of the task and be sure to clean up any resources created for a given test. Additionally, avoid writing tests in a production environment. Instead, write them in a scaled-down copy environment.","title":"Things to Avoid"},{"location":"automated-testing/integration-testing/#integration-testing-frameworks-and-tools","text":"Many tools and frameworks can be used to write both unit and integration tests. The following tools are for automating integration tests. 
JUnit Robot Framework moq Cucumber Selenium Behave (Python)","title":"Integration Testing Frameworks and Tools"},{"location":"automated-testing/integration-testing/#conclusion","text":"Integration testing demonstrates how one module of a system, or external system, interfaces with another. This can be a test of two components, a sub-system, a whole system, or a collection of systems. Tests should be written frequently and throughout the entire development lifecycle using an appropriate amount of mocked dependencies and test data. Because integration tests prove that independently developed modules interface as technically designed, it increases confidence in the development cycle providing a path for a system that deploys and scales.","title":"Conclusion"},{"location":"automated-testing/integration-testing/#resources","text":"Integration testing approaches Integration testing pros and cons Integration tests mocks and stubs Software Testing: Principles and Practices Integration testing Behave test quick start","title":"Resources"},{"location":"automated-testing/performance-testing/","text":"Performance Testing Performance Testing is an overloaded term that is used to refer to several subcategories of performance related testing, each of which has different purpose. A good description of overall performance testing is as follows: Performance testing is a type of testing intended to determine the responsiveness, throughput, reliability, and/or scalability of a system under a given workload. Performance Testing Guidance for Web Applications . Before getting into the different subcategories of performance tests let us understand why performance testing is typically done. Why Performance Testing Performance testing is commonly conducted to accomplish one or more the following: Tune the system's performance Identifying bottlenecks and issues with the system at different load levels. Comparing performance characteristics of the system for different system configurations. Come up with a scaling strategy for the system. Assist in capacity planning Capacity planning is the process of determining what type of hardware and software resources are required to run an application to support pre-defined performance goals. Capacity planning involves identifying business expectations, the periodic fluctuations of application usage, considering the cost of running the hardware and software infrastructure. Assess the system's readiness for release: Evaluating the system's performance characteristics (response time, throughput) in a production-like environment. The goal is to ensure that performance goals can be achieved upon release. Evaluate the performance impact of application changes Comparing the performance characteristics of an application after a change to the values of performance characteristics during previous runs (or baseline values), can provide an indication of performance issues (performance regression) or enhancements introduced due to a change Key Performance Testing Categories Performance testing is a broad topic. There are many areas where you can perform tests. In broad strokes you can perform tests on the backend and on the front end. You can test the performance of individual components as well as testing the end-to-end functionality. 
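Before looking at the individual categories, a toy measurement helps anchor the two terms from the definition above, responsiveness and throughput: send a small batch of concurrent requests and record latency and requests per second. The endpoint and numbers below are placeholders, the built-in fetch assumes Node 18+, and a real run would use a dedicated load-testing tool rather than a hand-rolled script.

```javascript
// Toy measurement of responsiveness (latency) and throughput (requests/sec).
// The endpoint is a placeholder; real tests should use a dedicated load-testing tool.
const TARGET = 'https://example.com/api/orders';
const CONCURRENCY = 20;

(async () => {
  const started = Date.now();
  const latencies = await Promise.all(
    Array.from({ length: CONCURRENCY }, async () => {
      const t0 = Date.now();
      await fetch(TARGET);                                  // built-in fetch, Node 18+
      return Date.now() - t0;
    })
  );
  const elapsedSec = (Date.now() - started) / 1000;
  const avgLatencyMs = latencies.reduce((a, b) => a + b, 0) / latencies.length;
  console.log({
    avgLatencyMs: Math.round(avgLatencyMs),                         // responsiveness
    requestsPerSec: Number((CONCURRENCY / elapsedSec).toFixed(1)),  // throughput
  });
})();
```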
There are several categories of tests as well: Load Testing This is the subcategory of performance testing that focuses on validating the performance characteristics of a system, when the system faces the load volumes which are expected during production operation. An Endurance Test or a Soak Test is a load test carried over a long duration ranging from several hours to days. Stress Testing This is the subcategory of performance testing that focuses on validating the performance characteristics of a system when the system faces extreme load. The goal is to evaluate how does the system handles being pressured to its limits, does it recover (i.e., scale-out) or does it just break and fail? Endurance Testing The goal of endurance testing is to make sure that the system can maintain good performance under extended periods of load. Spike Testing The goal of Spike testing is to validate that a software system can respond well to large and sudden spikes. Chaos Testing Chaos testing or Chaos engineering is the practice of experimenting on a system to build confidence that the system can withstand turbulent conditions in production. Its goal is to identify weaknesses before they manifest system wide. Developers often implement fallback procedures for service failure. Chaos testing arbitrarily shuts down different parts of the system to validate that fallback procedures function correctly. Best Practices Consider the following best practices for performance testing: Make one change at a time. Don't make multiple changes to the system between tests. If you do, you won't know which change caused the performance to improve or degrade. Automate testing. Strive to automate the setup and teardown of resources for a performance run as much as possible. Manual execution can lead to misconfigurations. Use different IP addresses. Some systems will throttle requests from a single IP address. If you are testing a system that has this type of restriction, you can use different IP addresses to simulate multiple users. Performance Monitor Metrics When executing the various types of testing approaches, whether it is stress, endurance, spike, or chaos testing, it is important to capture various metrics to see how the system performs. At the basic hardware level, there are four areas to consider. Physical disk Memory Processor Network These four areas are inextricably linked, meaning that poor performance in one area will lead to poor performance in another area. Engineers concerned with understanding application performance, should focus on these four core areas. The classic example of how performance in one area can affect performance in another area is memory pressure. If an application's available memory is running low, the operating system will try to compensate for shortages in memory by transferring pages of data from memory to disk, thus freeing up memory. But this work requires help from the CPU and the physical disk. This means that when you look at performance when there are low amounts of memory, you will also notice spikes in disk activity as well as CPU. Physical Disk Almost all software systems are dependent on the performance of the physical disk. This is especially true for the performance of databases. More modern approaches to using SSDs for physical disk storage can dramatically improve the performance of applications. Here are some of the metrics that you can capture and analyze: Counter Description Avg. 
Disk Queue Length This value is derived using the (Disk Transfers/sec)*(Disk sec/Transfer) counters. This metric describes the disk queue over time, smoothing out any quick spikes. Having any physical disk with an average queue length over 2 for prolonged periods of time can be an indication that your disk is a bottleneck. % Idle Time This is a measure of the percentage of time that the disk was idle, i.e., there are no pending disk requests from the operating system waiting to be completed. A low number here is a positive sign that the disk has excess capacity to service read or write requests from the operating system. Avg. Disk sec/Read and Avg. Disk sec/Write These both measure the latency of your disks. Latency is defined as the average time it takes for a disk transfer to complete. You obviously want numbers as low as possible, but you need to be careful to account for inherent speed differences between SSD and traditional spinning disks. For this counter it is important to define a baseline after the hardware is installed. Then use this value going forward to determine if you are experiencing any latency issues related to the hardware. Disk Reads/sec and Disk Writes/sec These counters each measure the total number of IO requests completed per second. Similar to the latency counters, good and bad values for these counters depend on your disk hardware, but values higher than your initial baseline don't normally point to a hardware issue in this case. This counter can be useful to identify spikes in disk I/O. Processor It is important to understand the amount of time spent in kernel or privileged mode. In general, if code is spending too much time executing operating system calls, that could be an area of concern because it will not allow you to run your user mode applications, such as your databases, Web servers/services, etc. The guideline is that the CPU should only spend about 20% of the total processor time running in kernel mode. Counter Description % Processor time This is the percentage of total elapsed time that the processor was busy executing. This counter can either be too high or too low. If your processor time is consistently below 40%, then there is a question as to whether you have over-provisioned your CPU. 70% is generally considered a good target number and if you start going higher than 70%, you may want to explore why there is high CPU pressure. % Privileged (Kernel Mode) time This measures the percentage of elapsed time the processor spent executing in kernel mode. Since this counter takes into account only kernel operations, a high percentage of privileged time (greater than 25%) may indicate a driver or hardware issue that should be investigated. % User time The percentage of elapsed time the processor spent executing in user mode (your application code). A good guideline is to be consistently below 65% as you want to have some buffer for both the kernel operations mentioned above as well as any other bursts of CPU required by other applications. Queue Length This is the number of threads that are ready to execute but waiting for a core to become available. On single core machines a sustained value greater than 2-3 can mean that you have some CPU pressure. Similarly, for a multicore machine divide the queue length by the number of cores and if that is continuously greater than 2-3 there might be CPU pressure. Network Adapter Network speed is often a hidden culprit of poor performance. Finding the root cause of poor network performance is often difficult.
The source of issues can originate from bandwidth hogs such as videoconferencing, transaction data, network backups, and recreational videos. In fact, the three most common reasons for a network slowdown are: Congestion Data corruption Collisions Some of the tools that can help include: ifconfig netstat iperf tcpretrans tcpdump WireShark Troubleshooting network performance usually begins with checking the hardware. Typical things to explore are whether there are any loose wires and whether all routers are powered up. It is not always possible to do so, but sometimes a simple power cycle of the modem or router can solve many problems. Network specialists often perform the following sequence of troubleshooting steps: Check the hardware Use IP config Use ping and tracert Perform DNS Check More advanced approaches often involve looking at some of the networking performance counters, as explained below. Network Counters The table above gives you some reference points to better understand what you can expect out of your network. Here are some counters that can help you understand where the bottlenecks might exist: Counter Description Bytes Received/sec The rate at which bytes are received over each network adapter. Bytes Sent/sec The rate at which bytes are sent over each network adapter. Bytes Total/sec The number of bytes sent and received over the network. Segments Received/sec The rate at which segments are received for the protocol. Segments Sent/sec The rate at which segments are sent. % Interrupt Time The percentage of time the processor spends receiving and servicing hardware interrupts. This value is an indirect indicator of the activity of devices that generate interrupts, such as network adapters. There is an important distinction between latency and throughput . Latency measures the time it takes for a packet to be transferred across the network, either in terms of a one-way transmission or a round-trip transmission. Throughput is different and attempts to measure the quantity of data being sent and received within a unit of time. Memory Counter Description Available MBs This counter represents the amount of memory that is available to applications that are executing. Low memory can trigger Page Faults, whereby additional pressure is put on the CPU to swap memory to and from the disk. If the amount of available memory dips below 10%, more memory should be obtained. Pages/sec This is actually the sum of the \"Pages Input/sec\" and \"Pages Output/sec\" counters, which is the rate at which pages are being read and written as a result of page faults. Small spikes with this value do not mean there is an issue, but sustained values greater than 50 can mean that system memory is a bottleneck. Paging File(_Total)\\% Usage The percentage of the system page file that is currently in use. This is not directly related to performance, but you can run into serious application issues if the page file becomes completely full and additional memory is still being requested by applications. Key Performance Testing Activities Performance testing activities vary depending on the subcategory of performance testing and the system's requirements and constraints. For specific guidance you can follow the link to the subcategory of performance tests listed above.
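Before walking through those activities, here is a minimal sketch of how the four counter areas above (CPU, memory, disk, network) could be sampled programmatically alongside a test run. It assumes the third-party psutil package, which is not a tool named in this playbook; counter names and thresholds should be adapted to your platform and monitoring stack.

```python
# Minimal sketch, assuming the psutil package is installed (pip install psutil).
# Samples the four core counter areas so readings can be correlated with a test run.
import time

import psutil


def sample_counters(interval_seconds: float = 5.0) -> None:
    """Print one sample of CPU, memory, disk, and network counters per interval."""
    while True:
        cpu_percent = psutil.cpu_percent(interval=None)   # % Processor time
        memory = psutil.virtual_memory()                   # Available memory
        disk = psutil.disk_io_counters()                   # Disk reads/writes since boot
        net = psutil.net_io_counters()                     # Bytes sent/received since boot
        print(
            f"cpu={cpu_percent:.1f}% "
            f"available_mb={memory.available / 1024 / 1024:.0f} "
            f"disk_reads={disk.read_count} disk_writes={disk.write_count} "
            f"net_sent={net.bytes_sent} net_recv={net.bytes_recv}"
        )
        time.sleep(interval_seconds)


if __name__ == "__main__":
    sample_counters()
```

In practice you would ship these readings to your monitoring system rather than print them, so they can be charted against the load applied during the test.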
The following activities might be included depending on the performance test subcategory: Identify the Acceptance Criteria for the Tests This will generally include identifying the goals and constraints for the performance characteristics of the system. Plan and Design the Tests In general we need to consider the following points: Defining the load the application should be tested with Establishing the metrics to be collected Establishing what tools will be used for the tests Establishing the performance test frequency: whether the performance tests will be done as a part of the feature development sprints, or only prior to release to a major environment? Implementation Implement the performance tests according to the designed approach. Instrument the system and ensure that it is emitting the needed performance metrics. Test Execution Execute the tests and collect performance metrics. Result Analysis and Re-testing Analyze the results/performance metrics from the tests. Identify needed changes to tweak the system (i.e., code, infrastructure) to better accommodate the test objectives. Then test again. This cycle continues until the test objective is achieved. The Iterative Performance Test Template can be used to capture details about the test results for every iteration. Resources Patterns and Practices: Performance Testing Guidance for Web Applications","title":"Performance Testing"},{"location":"automated-testing/performance-testing/#performance-testing","text":"Performance Testing is an overloaded term that is used to refer to several subcategories of performance related testing, each of which has a different purpose. A good description of overall performance testing is as follows: Performance testing is a type of testing intended to determine the responsiveness, throughput, reliability, and/or scalability of a system under a given workload. Performance Testing Guidance for Web Applications . Before getting into the different subcategories of performance tests, let us understand why performance testing is typically done.","title":"Performance Testing"},{"location":"automated-testing/performance-testing/#why-performance-testing","text":"Performance testing is commonly conducted to accomplish one or more of the following: Tune the system's performance Identifying bottlenecks and issues with the system at different load levels. Comparing performance characteristics of the system for different system configurations. Coming up with a scaling strategy for the system. Assist in capacity planning Capacity planning is the process of determining what type of hardware and software resources are required to run an application to support pre-defined performance goals. Capacity planning involves identifying business expectations, the periodic fluctuations of application usage, and considering the cost of running the hardware and software infrastructure. Assess the system's readiness for release: Evaluating the system's performance characteristics (response time, throughput) in a production-like environment. The goal is to ensure that performance goals can be achieved upon release.
Evaluate the performance impact of application changes Comparing the performance characteristics of an application after a change to the values of performance characteristics during previous runs (or baseline values), can provide an indication of performance issues (performance regression) or enhancements introduced due to a change","title":"Why Performance Testing"},{"location":"automated-testing/performance-testing/#key-performance-testing-categories","text":"Performance testing is a broad topic. There are many areas where you can perform tests. In broad strokes you can perform tests on the backend and on the front end. You can test the performance of individual components as well as testing the end-to-end functionality. There are several categories of tests as well:","title":"Key Performance Testing Categories"},{"location":"automated-testing/performance-testing/#load-testing","text":"This is the subcategory of performance testing that focuses on validating the performance characteristics of a system, when the system faces the load volumes which are expected during production operation. An Endurance Test or a Soak Test is a load test carried over a long duration ranging from several hours to days.","title":"Load Testing"},{"location":"automated-testing/performance-testing/#stress-testing","text":"This is the subcategory of performance testing that focuses on validating the performance characteristics of a system when the system faces extreme load. The goal is to evaluate how does the system handles being pressured to its limits, does it recover (i.e., scale-out) or does it just break and fail?","title":"Stress Testing"},{"location":"automated-testing/performance-testing/#endurance-testing","text":"The goal of endurance testing is to make sure that the system can maintain good performance under extended periods of load.","title":"Endurance Testing"},{"location":"automated-testing/performance-testing/#spike-testing","text":"The goal of Spike testing is to validate that a software system can respond well to large and sudden spikes.","title":"Spike Testing"},{"location":"automated-testing/performance-testing/#chaos-testing","text":"Chaos testing or Chaos engineering is the practice of experimenting on a system to build confidence that the system can withstand turbulent conditions in production. Its goal is to identify weaknesses before they manifest system wide. Developers often implement fallback procedures for service failure. Chaos testing arbitrarily shuts down different parts of the system to validate that fallback procedures function correctly.","title":"Chaos Testing"},{"location":"automated-testing/performance-testing/#best-practices","text":"Consider the following best practices for performance testing: Make one change at a time. Don't make multiple changes to the system between tests. If you do, you won't know which change caused the performance to improve or degrade. Automate testing. Strive to automate the setup and teardown of resources for a performance run as much as possible. Manual execution can lead to misconfigurations. Use different IP addresses. Some systems will throttle requests from a single IP address. 
If you are testing a system that has this type of restriction, you can use different IP addresses to simulate multiple users.","title":"Best Practices"},{"location":"automated-testing/performance-testing/#performance-monitor-metrics","text":"When executing the various types of testing approaches, whether it is stress, endurance, spike, or chaos testing, it is important to capture various metrics to see how the system performs. At the basic hardware level, there are four areas to consider. Physical disk Memory Processor Network These four areas are inextricably linked, meaning that poor performance in one area will lead to poor performance in another area. Engineers concerned with understanding application performance, should focus on these four core areas. The classic example of how performance in one area can affect performance in another area is memory pressure. If an application's available memory is running low, the operating system will try to compensate for shortages in memory by transferring pages of data from memory to disk, thus freeing up memory. But this work requires help from the CPU and the physical disk. This means that when you look at performance when there are low amounts of memory, you will also notice spikes in disk activity as well as CPU.","title":"Performance Monitor Metrics"},{"location":"automated-testing/performance-testing/#physical-disk","text":"Almost all software systems are dependent on the performance of the physical disk. This is especially true for the performance of databases. More modern approaches to using SSDs for physical disk storage can dramatically improve the performance of applications. Here are some of the metrics that you can capture and analyze: Counter Description Avg. Disk Queue Length This value is derived using the (Disk Transfers/sec)*(Disk sec/Transfer) counters. This metric describes the disk queue over time, smoothing out any quick spikes. Having any physical disk with an average queue length over 2 for prolonged periods of time can be an indication that your disk is a bottleneck. % Idle Time This is a measure of the percentage of time that the disk was idle. ie. there are no pending disk requests from the operating system waiting to be completed. A low number here is a positive sign that disk has excess capacity to service or write requests from the operating system. Avg. Disk sec/Read and Avg. Disk sec/Write These both measure the latency of your disks. Latency is defined as the average time it takes for a disk transfer to complete. You obviously want is low numbers as possible but need to be careful to account for inherent speed differences between SSD and traditional spinning disks. For this counter is important to define a baseline after the hardware is installed. Then use this value going forward to determine if you are experiencing any latency issues related to the hardware. Disk Reads/sec and Disk Writes/sec These counters each measure the total number of IO requests completed per second. Similar to the latency counters, good and bad values for these counters depend on your disk hardware but values higher than your initial baseline don't normally point to a hardware issue in this case. This counter can be useful to identify spikes in disk I/O.","title":"Physical Disk"},{"location":"automated-testing/performance-testing/#processor","text":"It is important to understand the amount of time spent in kernel or privileged mode. 
In general, if code is spending too much time executing operating system calls, that could be an area of concern because it will not allow you to run your user mode applications, such as your databases, Web servers/services, etc. The guideline is that the CPU should only spend about 20% of the total processor time running in kernel mode. Counter Description % Processor time This is the percentage of total elapsed time that the processor was busy executing. This counter can either be too high or too low. If your processor time is consistently below 40%, then there is a question as to whether you have over provisioned your CPU. 70% is generally considered a good target number and if you start going higher than 70%, you may want to explore why there is high CPU pressure. % Privileged (Kernel Mode) time This measures the percentage of elapsed time the processor spent executing in kernel mode. Since this counter takes into account only kernel operations a high percentage of privileged time (greater than 25%) may indicate driver or hardware issue that should be investigated. % User time The percentage of elapsed time the processor spent executing in user mode (your application code). A good guideline is to be consistently below 65% as you want to have some buffer for both the kernel operations mentioned above as well as any other bursts of CPU required by other applications. Queue Length This is the number of threads that are ready to execute but waiting for a core to become available. On single core machines a sustained value greater than 2-3 can mean that you have some CPU pressure. Similarly, for a multicore machine divide the queue length by the number of cores and if that is continuously greater than 2-3 there might be CPU pressure.","title":"Processor"},{"location":"automated-testing/performance-testing/#network-adapter","text":"Network speed is often a hidden culprit of poor performance. Finding the root cause to poor network performance is often difficult. The source of issues can originate from bandwidth hogs such as videoconferencing, transaction data, network backups, recreational videos. In fact, the three most common reasons for a network slow down are: Congestion Data corruption Collisions Some of the tools that can help include: ifconfig netstat iperf tcpretrans tcpdump WireShark Troubleshooting network performance usually begins with checking the hardware. Typical things to explore is whether there are any loose wires or checking that all routers are powered up. It is not always possible to do so, but sometimes a simple case of power recycling of the modem or router can solve many problems. Network specialists often perform the following sequence of troubleshooting steps: Check the hardware Use IP config Use ping and tracert Perform DNS Check More advanced approaches often involve looking at some of the networking performance counters, as explained below.","title":"Network Adapter"},{"location":"automated-testing/performance-testing/#network-counters","text":"The table above gives you some reference points to better understand what you can expect out of your network. Here are some counters that can help you understand where the bottlenecks might exist: Counter Description Bytes Received/sec The rate at which bytes are received over each network adapter. Bytes Sent/sec The rate at which bytes are sent over each network adapter. Bytes Total/sec The number of bytes sent and received over the network. 
Segments Received/sec The rate at which segments are received for the protocol Segments Sent/sec The rate at which segments are sent. % Interrupt Time The percentage of time the processor spends receiving and servicing hardware interrupts. This value is an indirect indicator of the activity of devices that generate interrupts, such as network adapters. There is an important distinction between latency and throughput . Latency measures the time it takes for a packet to be transferred across the network, either in terms of a one-way transmission or a round-trip transmission. Throughput is different and attempts to measure the quantity of data being sent and received within a unit of time.","title":"Network Counters"},{"location":"automated-testing/performance-testing/#memory","text":"Counter Description Available MBs This counter represents the amount of memory that is available to applications that are executing. Low memory can trigger Page Faults, whereby additional pressure is put on the CPU to swap memory to and from the disk. if the amount of available memory dips below 10%, more memory should be obtained. Pages/sec This is actually the sum of \"Pages Input/sec\" and \"Pages Output/sec\" counters which is the rate at which pages are being read and written as a result of pages faults. Small spikes with this value do not mean there is an issue but sustained values of greater than 50 can mean that system memory is a bottleneck. Paging File(_Total)\\% Usage The percentage of the system page file that is currently in use. This is not directly related to performance, but you can run into serious application issues if the page file does become completely full and additional memory is still being requested by applications.","title":"Memory"},{"location":"automated-testing/performance-testing/#key-performance-testing-activities","text":"Performance testing activities vary depending on the subcategory of performance testing and the system's requirements and constraints. For specific guidance you can follow the link to the subcategory of performance tests listed above. The following activities might be included depending on the performance test subcategory:","title":"Key Performance Testing Activities"},{"location":"automated-testing/performance-testing/#identify-the-acceptance-criteria-for-the-tests","text":"This will generally include identifying the goals and constraints for the performance characteristics of the system","title":"Identify the Acceptance Criteria for the Tests"},{"location":"automated-testing/performance-testing/#plan-and-design-the-tests","text":"In general we need to consider the following points: Defining the load the application should be tested with Establishing the metrics to be collected Establish what tools will be used for the tests Establish the performance test frequency: whether the performance tests be done as a part of the feature development sprints, or only prior to release to a major environment?","title":"Plan and Design the Tests"},{"location":"automated-testing/performance-testing/#implementation","text":"Implement the performance tests according to the designed approach. 
Instrument the system and ensure that is emitting the needed performance metrics.","title":"Implementation"},{"location":"automated-testing/performance-testing/#test-execution","text":"Execute the tests and collect performance metrics.","title":"Test Execution"},{"location":"automated-testing/performance-testing/#result-analysis-and-re-testing","text":"Analyze the results/performance metrics from the tests. Identify needed changes to tweak the system (i.e., code, infrastructure) to better accommodate the test objectives. Then test again. This cycle continues until the test objective is achieved. The Iterative Performance Test Template can be used to capture details about the test result for every iterations.","title":"Result Analysis and Re-testing"},{"location":"automated-testing/performance-testing/#resources","text":"Patters and Practices: Performance Testing Guidance for Web Applications","title":"Resources"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/","text":"Performance Test Iteration Template This document provides template for capturing results of performance tests. Performance tests are done in iterations and each iteration should have a clear goal. The results of any iteration is immutable regardless whether the goal was achieved or not. If the iteration failed or the goal is not achieved then a new iteration of testing is carried out with appropriate fixes. It is recommended to keep track of the recorded iterations to maintain a timeline of how system evolved and which changes affected the performance in what way. Feel free to modify this template as needed. Iteration Template Goal Mention in bullet points the goal for this iteration of test. The goal should be small and measurable within this iteration. Test Details Date : Date and time when this iteration started and ended Duration : Time it took to complete this iteration. Application Code : Commit id and link to the commit for the code(s) which are being tested in this iteration Benchmarking Configuration: Application Configuration: In bullet points mention the configuration for application that should be recorded System Configuration: In bullet points mention the configuration of the infrastructure Record different types of configurations. Usually application specific configuration changes between iterations whereas system or infrastructure configurations rarely change Work Items List of links to relevant work items (task, story, bug) being tested in this iteration. Results In bullet points document the results from the test. - Attach any documents supporting the test results. - Add links to the dashboard for metrics and logs such as Application Insights. - Capture screenshots for metrics and include it in the results. Good candidate for this is CPU/Memory/Disk usage. Observations Observations are insights derived from test results. Keep the observations brief and as bullet points. Mention outcomes supporting the goal of the iteration. If any of the observation results in a work item (task, story, bug) then add the link to the work item together with the observation.","title":"Performance Test Iteration Template"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#performance-test-iteration-template","text":"This document provides template for capturing results of performance tests. Performance tests are done in iterations and each iteration should have a clear goal. The results of any iteration is immutable regardless whether the goal was achieved or not. 
If the iteration failed or the goal is not achieved then a new iteration of testing is carried out with appropriate fixes. It is recommended to keep track of the recorded iterations to maintain a timeline of how system evolved and which changes affected the performance in what way. Feel free to modify this template as needed.","title":"Performance Test Iteration Template"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#iteration-template","text":"","title":"Iteration Template"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#goal","text":"Mention in bullet points the goal for this iteration of test. The goal should be small and measurable within this iteration.","title":"Goal"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#test-details","text":"Date : Date and time when this iteration started and ended Duration : Time it took to complete this iteration. Application Code : Commit id and link to the commit for the code(s) which are being tested in this iteration Benchmarking Configuration: Application Configuration: In bullet points mention the configuration for application that should be recorded System Configuration: In bullet points mention the configuration of the infrastructure Record different types of configurations. Usually application specific configuration changes between iterations whereas system or infrastructure configurations rarely change","title":"Test Details"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#work-items","text":"List of links to relevant work items (task, story, bug) being tested in this iteration.","title":"Work Items"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#results","text":"In bullet points document the results from the test. - Attach any documents supporting the test results. - Add links to the dashboard for metrics and logs such as Application Insights. - Capture screenshots for metrics and include it in the results. Good candidate for this is CPU/Memory/Disk usage.","title":"Results"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#observations","text":"Observations are insights derived from test results. Keep the observations brief and as bullet points. Mention outcomes supporting the goal of the iteration. If any of the observation results in a work item (task, story, bug) then add the link to the work item together with the observation.","title":"Observations"},{"location":"automated-testing/performance-testing/load-testing/","text":"Load Testing \" Load testing is performed to determine a system's behavior under both normal and anticipated peak load conditions. \" - Load testing - Wikipedia A load test is designed to determine how a system behaves under expected normal and peak workloads. Specifically its main purpose is to confirm if a system can handle the expected load level. Depending on the target system this could be concurrent users, requests per second or data size. Why Load Testing The main objective is to prove the system can behave normally under the expected normal load before releasing it to production. The criteria that define \"behave normally\" will depend on your target, this may be as simple as \"the system remains available\", but it could also include meeting a response time SLA or error rate. Additionally, the results of a load test can also be used as data to help with capacity planning and calculating scalability. 
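As a rough illustration of confirming an expected load level, the sketch below drives a fixed amount of concurrent traffic at an endpoint and reports throughput and error rate. It is only a sketch under stated assumptions: the URL, user count, and request total are hypothetical placeholders, the third-party requests package is assumed to be installed, and for anything beyond a quick check you should prefer one of the load testing frameworks listed later on this page.

```python
# Minimal load-check sketch. Assumptions: the `requests` package is installed and
# https://example.com/api/health is a stand-in for your own endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET_URL = "https://example.com/api/health"   # hypothetical endpoint
CONCURRENT_USERS = 20                           # expected concurrency from planning
TOTAL_REQUESTS = 500


def one_request(_: int) -> bool:
    """Return True if the request completed without a server error."""
    try:
        return requests.get(TARGET_URL, timeout=10).status_code < 500
    except requests.RequestException:
        return False


if __name__ == "__main__":
    start = time.time()
    with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
        results = list(pool.map(one_request, range(TOTAL_REQUESTS)))
    elapsed = time.time() - start
    errors = results.count(False)
    print(f"{TOTAL_REQUESTS / elapsed:.1f} requests/sec, error rate {errors / TOTAL_REQUESTS:.2%}")
```

Comparing the reported throughput and error rate against your agreed success criteria is what turns this from a traffic generator into a load test.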
Load Testing Design Blocks There are a number of basic components that are required to carry out a load test. In order to have meaningful results the system needs to be tested in a production-like environment with a network and hardware which closely resembles the expected deployment environment. The load test will consist of a module which simulates user activity. Of course the composition of this \"user activity\" will vary based on the type of application being tested. For example, an e-commerce website might simulate user browsing and purchasing items, but an IoT data ingestion pipeline would simulate a stream of device readings. Please ensure the simulation is as close to real activity as possible, and consider not just volume but also patterns and variability. For example, if the simulator data is too uniform or predictable, then cache/hit ratios may impact your results. The load test will be initiated from a component external to the target system which can control the amount of load applied. This can be a single agent, but may need to scaled to multiple agents in order to achieve higher levels of activity. Although not required to run a load test, it is advisable to have monitoring and/or logging in place to be able to measure the impact of the test and discover potential bottlenecks. Applying the Load Testing Planning Identify key scenarios to measure - Gather these scenarios from Product Owner, they should provide a representative sample of real world traffic. The key activity of this phase is to agree on and define the load test cases. Determine expected normal and peak load for the scenarios - Determine a load level such as concurrent users or requests per second to find the size of the load test you will run. Identify success criteria metrics - These may be on testing side such as response time and error rate, or they may be on the system side such as CPU and memory usage. Agree on test matrix - Which load test cases should be run for which combinations of input parameters. Select the right tool - Many frameworks exist for load testing so consider if features and limitations are suitable for your needs (Some popular tools are listed below). This may also include development of a custom load test client, see Preparation phase below. Observability - Determine which metrics need to gathered to gain insight into throughput, latency, resource utilization, etc. Scalability - Determine the amount of scale needed by load generator, workload application, CPU, Memory, and network components needed to achieve testing goals. The use of kubernetes on the cloud can be used to make testing infinitely scalable. Preparation The key activity is to replace the end user client with a test bench that simulates one or more instances of the original client. For standard 3rd party tools it may suffice to configure the existing test UI before initiating the load tests. If a custom client is used, code development will be required: Custom development - Design for minimal impact/overhead. Be sure to capture only those features of the production client that are relevant from a load perspective. Does it matter if the same test is duplicated, or must the workload be unique for each test? Can all tests be run under the same user context? Test environment - Create test environment that resembles production environment. This includes the platform as well as external systems, e.g., data sources. Security contexts - Be sure to have all requisite security contexts for the test environment. 
Automation like pipelines may require special setup, e.g., OAuth2 client credential flow instead of auth code flow, because interactive login is replaced by non-interactive. Allow planning leeway in case admin approval is required for new security contexts. Test data strategy - Make sure that output data format (ascii/binary/...) is compatible with whatever analysis tool is used in the analysis phase. This also includes storage areas (local/cloud/...), which may trigger new security contexts. Bear in mind that it may be necessary to collect data from sources external to the application to correlate potential performance issues with the application behavior. This includes platform and network metrics. Make sure to collect data that covers analysis needs (statistical measures, distributions, graphs, etc.). Automation - Repeatability is critical. It must be possible to re-run a given test multiple times to verify consistency and resilience of the application itself and the underlying platform. Pipelines are recommended whenever possible. Evaluate whether load tests should be run as part of the PR strategy. Test client debugging - All test modules should be carefully debugged to ensure that the execution phase progresses smoothly. Test client validation - All test modules should be validated for extreme values of the input parameters. This reduces the risk of running into unexpected difficulties when stepping through the full test matrix during the execution phase. Execution It is recommended to use an existing testing framework (see below). These tools will provide a method of both specifying the user activity scenarios and how to execute those at load. Depending on the situation, it may be advisable to coordinate testing activities with the platform operations team. It is common to slowly ramp up to your desired load to better replicate real world behavior. Once you have reached your defined workload, maintain this level long enough to see if your system stabilizes. To finish up the test you should also ramp to see record how the system slows down as well. You should also consider the origin of your load test traffic. Depending on the scope of the target system you may want to initiate from a different location to better replicate real world traffic such as from a different region. Note: Before starting please be aware of any restrictions on your network such as DDOS protection where you may need to notify a network administrator or apply for an exemption. Note: In general, the preferred approach to load testing would be the usage of a standard test framework such as the ones discussed below. There are cases, however, where a custom test client may be advantageous. Examples include batch oriented workloads that can be run under a single security context and the same test data can be re-used for multiple load tests. In such a scenario it may be beneficial to develop a custom script that can be used interactively as well as non-interactively. Analysis The analysis phase represents the work that brings all previous activities together: Set aside time to allow for collection of new test data based on the analysis of the load tests. Correlate application metrics and platform metrics to identify potential pitfalls and bottlenecks. Include business stakeholders early in the analysis phase to validate application findings. Include platform operations to validate platform findings. Report Writing Summarize your findings from the analysis phase. 
Be sure to include application and platform enhancement suggestions, if any. Further Testing After completing your load test you should be set up to continue on to additional related testing such as; Soak Testing - Also known as Endurance Testing . Performing a load test over an extended period of time to ensure long term stability. Stress Testing - Gradually increasing the load to find the limits of the system and identify the maximum capacity. Spike Testing - Introduce a sharp short-term increase into the load scenarios. Scalability Testing - Re-testing of a system as your expand horizontally or vertically to measure how it scales. Distributed Testing - Distributed testing allows you to leverage the power of multiple machines to perform larger or more in-depth tests faster. Is necessary when a fully optimized node cannot produce the load required by your extremely large test. Load Generation Testing Frameworks and Tools Here are a few popular load testing frameworks you may consider, and the languages used to define your scenarios. Azure Load Testing ( https://learn.microsoft.com/en-us/azure/load-testing/ ) - Managed platform for running load tests on Azure. It allows to run and monitor tests automatically, source secrets from the KeyVault, generate traffic at scale, and load test Azure private endpoints. In the simple case, it executes load tests with HTTP GET traffic to a given endpoint. For the more complex cases, you can upload your own JMeter scenarios . JMeter ( https://github.com/apache/jmeter ) - Has built in patterns to test without coding, but can be extended with Java. Artillery ( https://artillery.io/ ) - Write your scenarios in Javascript, executes a node application. Gatling ( https://gatling.io/ ) - Write your scenarios in Scala with their DSL. Locust ( https://locust.io/ ) - Write your scenarios in Python using the concept of concurrent user activity. K6 ( https://k6.io/ ) - Write your test scenarios in Javascript, available as open source kubernetes operator, open source Docker image, or as SaaS. Particularly useful for distributed load testing. Integrates easily with prometheus. NBomber ( https://nbomber.com/ ) - Write your test scenarios in C# or F#, available integration with test runners (NUnit/xUnit). WebValidate ( https://github.com/microsoft/webvalidate ) - Web request validation tool used to run end-to-end tests and long-running performance and availability tests. Sample Workload Applications In the case where a specific workload application is not being provided and the focus is instead on the system, here are a few popular sample workload applications you may consider. HttpBin ( Python , GoLang ) - Supports variety of endpoint types and language implementations. Can echo data used in request. NGSA ( Java , C# ) - Intended for Kubernetes Platform and Monitoring Testing. Built on top of IMDB data store with many CRUD endpoints available. Does not need to have a live database connection. MockBin ( https://github.com/Kong/mockbin ) - Allows you to generate custom endpoints to test, mock, and track HTTP requests & responses between libraries, sockets and APIs. Conclusion A load test is critical step to understand if a target system will be reliable under the expected real world traffic. Of course, it's only as good as your ability to predict the expected load, so it's important to follow up with other further testing to truly understand how your system behaves in different situations. 
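To make the framework list above more concrete, here is a minimal Locust scenario (Locust is one of the Python-based tools listed). This is a sketch, not a complete test plan: the host and endpoint paths are hypothetical, and the wait times, task weights, and user counts would come from your planning phase.

```python
# Minimal Locust scenario sketch. Assumptions: Locust is installed (pip install locust)
# and /api/items plus /api/items/1 are hypothetical endpoints on the system under test.
# Run with: locust -f locustfile.py --host https://example.com
from locust import HttpUser, task, between


class BrowsingUser(HttpUser):
    # Simulated users pause 1-5 seconds between tasks to mimic real activity.
    wait_time = between(1, 5)

    @task(3)
    def list_items(self):
        # Weighted 3x: listing is assumed to be the most common user action.
        self.client.get("/api/items")

    @task(1)
    def view_item(self):
        self.client.get("/api/items/1")
```

The task weights and wait times are where the "patterns and variability" guidance above comes in: tune them so the simulated traffic mirrors what you actually expect in production.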
Resources List additional readings about this test type for those that would like to dive deeper. Microsoft Azure Well-Architected Framework > Load Testing","title":"Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#load-testing","text":"\" Load testing is performed to determine a system's behavior under both normal and anticipated peak load conditions. \" - Load testing - Wikipedia A load test is designed to determine how a system behaves under expected normal and peak workloads. Specifically its main purpose is to confirm if a system can handle the expected load level. Depending on the target system this could be concurrent users, requests per second or data size.","title":"Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#why-load-testing","text":"The main objective is to prove the system can behave normally under the expected normal load before releasing it to production. The criteria that define \"behave normally\" will depend on your target, this may be as simple as \"the system remains available\", but it could also include meeting a response time SLA or error rate. Additionally, the results of a load test can also be used as data to help with capacity planning and calculating scalability.","title":"Why Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#load-testing-design-blocks","text":"There are a number of basic components that are required to carry out a load test. In order to have meaningful results the system needs to be tested in a production-like environment with a network and hardware which closely resembles the expected deployment environment. The load test will consist of a module which simulates user activity. Of course the composition of this \"user activity\" will vary based on the type of application being tested. For example, an e-commerce website might simulate user browsing and purchasing items, but an IoT data ingestion pipeline would simulate a stream of device readings. Please ensure the simulation is as close to real activity as possible, and consider not just volume but also patterns and variability. For example, if the simulator data is too uniform or predictable, then cache/hit ratios may impact your results. The load test will be initiated from a component external to the target system which can control the amount of load applied. This can be a single agent, but may need to scaled to multiple agents in order to achieve higher levels of activity. Although not required to run a load test, it is advisable to have monitoring and/or logging in place to be able to measure the impact of the test and discover potential bottlenecks.","title":"Load Testing Design Blocks"},{"location":"automated-testing/performance-testing/load-testing/#applying-the-load-testing","text":"","title":"Applying the Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#planning","text":"Identify key scenarios to measure - Gather these scenarios from Product Owner, they should provide a representative sample of real world traffic. The key activity of this phase is to agree on and define the load test cases. Determine expected normal and peak load for the scenarios - Determine a load level such as concurrent users or requests per second to find the size of the load test you will run. Identify success criteria metrics - These may be on testing side such as response time and error rate, or they may be on the system side such as CPU and memory usage. 
Agree on test matrix - Which load test cases should be run for which combinations of input parameters. Select the right tool - Many frameworks exist for load testing so consider if features and limitations are suitable for your needs (Some popular tools are listed below). This may also include development of a custom load test client, see Preparation phase below. Observability - Determine which metrics need to gathered to gain insight into throughput, latency, resource utilization, etc. Scalability - Determine the amount of scale needed by load generator, workload application, CPU, Memory, and network components needed to achieve testing goals. The use of kubernetes on the cloud can be used to make testing infinitely scalable.","title":"Planning"},{"location":"automated-testing/performance-testing/load-testing/#preparation","text":"The key activity is to replace the end user client with a test bench that simulates one or more instances of the original client. For standard 3rd party tools it may suffice to configure the existing test UI before initiating the load tests. If a custom client is used, code development will be required: Custom development - Design for minimal impact/overhead. Be sure to capture only those features of the production client that are relevant from a load perspective. Does it matter if the same test is duplicated, or must the workload be unique for each test? Can all tests be run under the same user context? Test environment - Create test environment that resembles production environment. This includes the platform as well as external systems, e.g., data sources. Security contexts - Be sure to have all requisite security contexts for the test environment. Automation like pipelines may require special setup, e.g., OAuth2 client credential flow instead of auth code flow, because interactive login is replaced by non-interactive. Allow planning leeway in case admin approval is required for new security contexts. Test data strategy - Make sure that output data format (ascii/binary/...) is compatible with whatever analysis tool is used in the analysis phase. This also includes storage areas (local/cloud/...), which may trigger new security contexts. Bear in mind that it may be necessary to collect data from sources external to the application to correlate potential performance issues with the application behavior. This includes platform and network metrics. Make sure to collect data that covers analysis needs (statistical measures, distributions, graphs, etc.). Automation - Repeatability is critical. It must be possible to re-run a given test multiple times to verify consistency and resilience of the application itself and the underlying platform. Pipelines are recommended whenever possible. Evaluate whether load tests should be run as part of the PR strategy. Test client debugging - All test modules should be carefully debugged to ensure that the execution phase progresses smoothly. Test client validation - All test modules should be validated for extreme values of the input parameters. This reduces the risk of running into unexpected difficulties when stepping through the full test matrix during the execution phase.","title":"Preparation"},{"location":"automated-testing/performance-testing/load-testing/#execution","text":"It is recommended to use an existing testing framework (see below). These tools will provide a method of both specifying the user activity scenarios and how to execute those at load. 
Depending on the situation, it may be advisable to coordinate testing activities with the platform operations team. It is common to slowly ramp up to your desired load to better replicate real world behavior. Once you have reached your defined workload, maintain this level long enough to see if your system stabilizes. To finish up the test you should also ramp to see record how the system slows down as well. You should also consider the origin of your load test traffic. Depending on the scope of the target system you may want to initiate from a different location to better replicate real world traffic such as from a different region. Note: Before starting please be aware of any restrictions on your network such as DDOS protection where you may need to notify a network administrator or apply for an exemption. Note: In general, the preferred approach to load testing would be the usage of a standard test framework such as the ones discussed below. There are cases, however, where a custom test client may be advantageous. Examples include batch oriented workloads that can be run under a single security context and the same test data can be re-used for multiple load tests. In such a scenario it may be beneficial to develop a custom script that can be used interactively as well as non-interactively.","title":"Execution"},{"location":"automated-testing/performance-testing/load-testing/#analysis","text":"The analysis phase represents the work that brings all previous activities together: Set aside time to allow for collection of new test data based on the analysis of the load tests. Correlate application metrics and platform metrics to identify potential pitfalls and bottlenecks. Include business stakeholders early in the analysis phase to validate application findings. Include platform operations to validate platform findings.","title":"Analysis"},{"location":"automated-testing/performance-testing/load-testing/#report-writing","text":"Summarize your findings from the analysis phase. Be sure to include application and platform enhancement suggestions, if any.","title":"Report Writing"},{"location":"automated-testing/performance-testing/load-testing/#further-testing","text":"After completing your load test you should be set up to continue on to additional related testing such as; Soak Testing - Also known as Endurance Testing . Performing a load test over an extended period of time to ensure long term stability. Stress Testing - Gradually increasing the load to find the limits of the system and identify the maximum capacity. Spike Testing - Introduce a sharp short-term increase into the load scenarios. Scalability Testing - Re-testing of a system as your expand horizontally or vertically to measure how it scales. Distributed Testing - Distributed testing allows you to leverage the power of multiple machines to perform larger or more in-depth tests faster. Is necessary when a fully optimized node cannot produce the load required by your extremely large test.","title":"Further Testing"},{"location":"automated-testing/performance-testing/load-testing/#load-generation-testing-frameworks-and-tools","text":"Here are a few popular load testing frameworks you may consider, and the languages used to define your scenarios. Azure Load Testing ( https://learn.microsoft.com/en-us/azure/load-testing/ ) - Managed platform for running load tests on Azure. It allows to run and monitor tests automatically, source secrets from the KeyVault, generate traffic at scale, and load test Azure private endpoints. 
In the simple case, it executes load tests with HTTP GET traffic to a given endpoint. For the more complex cases, you can upload your own JMeter scenarios . JMeter ( https://github.com/apache/jmeter ) - Has built-in patterns to test without coding, but can be extended with Java. Artillery ( https://artillery.io/ ) - Write your scenarios in Javascript, executes a node application. Gatling ( https://gatling.io/ ) - Write your scenarios in Scala with their DSL. Locust ( https://locust.io/ ) - Write your scenarios in Python using the concept of concurrent user activity. K6 ( https://k6.io/ ) - Write your test scenarios in Javascript, available as an open source Kubernetes operator, an open source Docker image, or as SaaS. Particularly useful for distributed load testing. Integrates easily with Prometheus. NBomber ( https://nbomber.com/ ) - Write your test scenarios in C# or F#, with available integration with test runners (NUnit/xUnit). WebValidate ( https://github.com/microsoft/webvalidate ) - Web request validation tool used to run end-to-end tests and long-running performance and availability tests.","title":"Load Generation Testing Frameworks and Tools"},{"location":"automated-testing/performance-testing/load-testing/#sample-workload-applications","text":"In the case where a specific workload application is not being provided and the focus is instead on the system, here are a few popular sample workload applications you may consider. HttpBin ( Python , GoLang ) - Supports a variety of endpoint types and language implementations. Can echo data used in the request. NGSA ( Java , C# ) - Intended for Kubernetes Platform and Monitoring Testing. Built on top of an IMDB data store with many CRUD endpoints available. Does not need to have a live database connection. MockBin ( https://github.com/Kong/mockbin ) - Allows you to generate custom endpoints to test, mock, and track HTTP requests & responses between libraries, sockets and APIs.","title":"Sample Workload Applications"},{"location":"automated-testing/performance-testing/load-testing/#conclusion","text":"A load test is a critical step to understand if a target system will be reliable under the expected real world traffic. Of course, it's only as good as your ability to predict the expected load, so it's important to follow up with further testing to truly understand how your system behaves in different situations.","title":"Conclusion"},{"location":"automated-testing/performance-testing/load-testing/#resources","text":"List additional readings about this test type for those that would like to dive deeper. Microsoft Azure Well-Architected Framework > Load Testing","title":"Resources"},{"location":"automated-testing/shadow-testing/","text":"Shadow Testing Shadow testing is one approach to reduce risks before going to production. Shadow testing is also known as \"Shadow Deployment\" or \"Shadowing Traffic\" and has similarities with \"Dark Launching\". When to Use Shadow Testing reduces risks when you consider replacing the current environment (V-Current) with a candidate environment that includes a new feature (V-Next). This approach monitors and captures the differences between the two environments, compares them, and reduces risks before you introduce a new feature/release. In our test cases, code coverage is very important; however, achieving coverage of real-life combinations and possibilities can be tricky.
In this approach, to test the V-Next environment we use a side-by-side deployment: we replicate the same traffic that the V-Current environment receives and direct it to the V-Next environment as well. The only difference is that we don't return any responses from the V-Next environment to users; instead, we collect those responses to compare them with the V-Current responses. Referencing back to the Principles of Chaos Engineering, one of the principles mentions the importance of sampling real traffic: Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic. With this Shadow Testing approach we leverage real customer behavior in the V-Next environment by sampling real traffic, mitigating the risks which users may face in production. At the same time we test the V-Next environment's infrastructure for scaling with real sampled traffic. V-Next should scale the same way V-Current does. We test the actual behavior of the product, and this causes zero impact to production while testing new features since traffic is replicated to the V-Next environment. There are some similarities with Dark Launching , which proposes to integrate a new feature into production code while users can't yet use the feature. On the backend you can test your feature and improve the performance until it's acceptable. It is also similar to Feature Toggles , which provide you with the ability to enable/disable your new feature in production at the UI level. With this approach your new feature will be visible to users, and you can collect feedback. Using Dark Launching with Feature Toggles can be very useful for introducing a new feature. Applicable to Production deployments : V-Next in Shadow testing always works separately and does not affect production. Users are not affected by this test. Infrastructure : Shadow testing replicates the same traffic, so in the test environment you can have the same traffic as in production. This helps produce real-life test scenarios. Handling Scale : All traffic is replicated, and you have a chance to see how your system scales. Shadow Testing Frameworks and Tools There are some tools to implement shadow testing. The main purpose of these tools is to compare the responses of V-Current and V-Next and then find the differences. Diffy Envoy McRouter Scientist Keploy One of the most popular tools is Diffy . It was created and used at Twitter. Now the original author, a former Twitter employee, maintains their own version of this project, called Opendiffy . Twitter announced this tool on their engineering blog as \" Testing services without writing tests \". As of today Diffy is used in production by Twitter, Airbnb, Baidu, and Bytedance. Diffy explains the shadow testing feature like this: Diffy finds potential bugs in your service using running instances of your new code, and your old code side by side. Diffy behaves as a proxy and multicasts whatever requests it receives to each of the running instances. It then compares the responses, and reports any regressions that may surface from those comparisons.
The premise for Diffy is that if two implementations of the service return \u201csimilar\u201d responses for a sufficiently large and diverse set of requests, then the two implementations can be treated as equivalent, and the newer implementation is regression-free. Diffy architecture Conclusion Shadow Testing is a useful approach to reduce risks when you consider replacing the current environment with candidate environment using new feature(s). Shadow testing replicates traffic of the production to candidate environment for testing, so you get same production use case scenarios in the test environment. You can compare differences on both environments and validate your candidate environment to be ready for releasing. Some advantages of shadow testing are: Zero impact to production environment No need to generate test scenarios and test data We can test real-life scenarios with real-life data. We can simulate scale with replicated production traffic. Resources Martin Fowler - Dark Launching Martin Fowler - Feature Toggle Traffic Shadowing/Mirroring","title":"Shadow Testing"},{"location":"automated-testing/shadow-testing/#shadow-testing","text":"Shadow testing is one approach to reduce risks before going to production. Shadow testing is also known as \"Shadow Deployment\" or \"Shadowing Traffic\" and similarities with \"Dark launching\".","title":"Shadow Testing"},{"location":"automated-testing/shadow-testing/#when-to-use","text":"Shadow Testing reduces risks when you consider replacing the current environment (V-Current) with candidate environment with new feature (V-Next). This approach is monitoring and capturing differences between two environments then compare and reduces all risks before you introduce a new feature/release. In our test cases, code coverage is very important however sometimes providing code coverage can be tricky to replicate real-life combinations and possibilities. In this approach, to test V-Next environment we have side by side deployment, we're replicating the same traffic with V-Current environment and directing same traffic to V-Next environment, the only difference is we don't return any response from V-Next environment to users, but we collect those responses to compare with V-Current responses. Referencing back to one of the Principles of Chaos Engineering, mentions importance of sampling real traffic like below: Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic. With this Shadow Testing approach we're leveraging real customer behavior in V-Next environment with sampling real traffic and mitigating the risks which users may face on production. At the same time we're testing V-Next environment infrastructure for scaling with real sampled traffic. V-Next should scale with the same way V-Current does. We're testing actual behavior of the product and this cause zero impact to production to test new features since traffic is replicated to V-next environment. There are some similarities with Dark Launching , Dark Launching proposes to integrate new feature into production code, but users can't use the feature. On the backend you can test your feature and improve the performance until it's acceptable. 
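To make the dark-launching idea concrete, here is a minimal sketch of a new code path that runs on the backend while users keep receiving the current behavior. The flag name and ranking functions are hypothetical and not tied to any specific feature-flag framework:

```python
# Minimal, illustrative sketch of a dark-launched code path. All names
# (flags, functions) are hypothetical and not part of a specific framework.
import logging

DARK_LAUNCH_FLAGS = {"new_ranking": True}  # run the new code in the background only


def legacy_ranking(items: list[str]) -> list[str]:
    # Current production behavior: simple alphabetical order.
    return sorted(items)


def new_ranking(items: list[str]) -> list[str]:
    # New implementation under evaluation: rank by length, then name.
    return sorted(items, key=lambda i: (len(i), i))


def rank_items(items: list[str]) -> list[str]:
    result = legacy_ranking(items)  # users always get the existing behavior

    if DARK_LAUNCH_FLAGS.get("new_ranking", False):
        try:
            candidate = new_ranking(items)
            # Compare silently; log the difference instead of returning it.
            logging.info("dark launch: results differ = %s", result != candidate)
        except Exception:
            logging.exception("dark-launched path failed; user response unaffected")

    return result


if __name__ == "__main__":
    print(rank_items(["pears", "figs", "apples"]))
```

The important property is that the candidate path can fail or behave differently without ever changing the user-facing response.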
Dark Launching is also similar to Feature Toggles , which provide the ability to enable or disable a new feature in production at the UI level. With this approach your new feature is visible to users, and you can collect feedback. Using Dark Launching together with Feature Toggles can be very useful for introducing a new feature.","title":"When to Use"},{"location":"automated-testing/shadow-testing/#applicable-to","text":"Production deployments : In shadow testing, V-Next always runs separately and never affects production, so users are not affected by this test. Infrastructure : Shadow testing replicates production traffic, so the test environment receives the same traffic as production, which helps produce real-life test scenarios. Handling Scale : All traffic is replicated, so you have a chance to see how your system scales.","title":"Applicable to"},{"location":"automated-testing/shadow-testing/#shadow-testing-frameworks-and-tools","text":"There are several tools for implementing shadow testing. Their main purpose is to compare the responses of V-Current and V-Next and report the differences. Diffy Envoy McRouter Scientist Keploy One of the most popular tools is Diffy . It was created and used at Twitter. The original author, a former Twitter employee, now maintains their own version of this project, called Opendiffy . Twitter announced this tool on their engineering blog as \" Testing services without writing tests \". As of today, Diffy is used in production by companies such as Twitter, Airbnb, Baidu and Bytedance. Diffy explains the shadow testing feature like this: Diffy finds potential bugs in your service using running instances of your new code, and your old code side by side. Diffy behaves as a proxy and multicasts whatever requests it receives to each of the running instances. It then compares the responses, and reports any regressions that may surface from those comparisons. The premise for Diffy is that if two implementations of the service return \u201csimilar\u201d responses for a sufficiently large and diverse set of requests, then the two implementations can be treated as equivalent, and the newer implementation is regression-free. Diffy architecture","title":"Shadow Testing Frameworks and Tools"},{"location":"automated-testing/shadow-testing/#conclusion","text":"Shadow Testing is a useful approach for reducing risk when you consider replacing the current environment with a candidate environment that contains new features. Shadow testing replicates production traffic to the candidate environment, so you get the same production use-case scenarios in the test environment. You can compare the behavior of both environments and validate that the candidate environment is ready for release. Some advantages of shadow testing are: Zero impact on the production environment No need to generate test scenarios and test data We can test real-life scenarios with real-life data. We can simulate scale with replicated production traffic.","title":"Conclusion"},{"location":"automated-testing/shadow-testing/#resources","text":"Martin Fowler - Dark Launching Martin Fowler - Feature Toggle Traffic Shadowing/Mirroring","title":"Resources"},{"location":"automated-testing/smoke-testing/","text":"Smoke Testing Smoke tests, sometimes named Sanity , Acceptance , or Build/Release Verification tests, are a sub-type of system/functional tests that are usually used as gates that verify the application's readiness as a preliminary step.
If an application passes the smoke tests, it is acceptable, or in a stable-enough state, for the next stages of testing or deployment. When To Use Problem Addressed Smoke tests are meant to find, as early as possible, if an application is working or not. The goal of smoke tests is to save time; if the current version of the application does not pass smoke tests, then the rest of the integration or deployment chain for it can be abandoned. Smoke tests do not aim to provide full functionality coverage but instead focus on a few quick acceptance invocations for which the application should, at all times, respond correctly to. ROI Tipping Point Smoke tests cover only the most critical application path, and should not be used to actually test the application's behavior, keeping execution time and complexity to minimum. The tests can be formed of a subset of the application's integration or e2e tests, and they cover as much of the functionality with as little depth as required. The golden rule of a good smoke test is that it saves time on validating that the application is acceptable to a stage where better, more thorough testing will begin. Applicable to Local dev desktop - Example: Applying manual smoke testing to verify that the application is OK. Build pipelines - Example: Running a small set of the integration test suite before running the full coverage of tests, which may take a long time. Non-production and Production deployments - Example: Running a curl command to the product's API and asserting the response is 200 before running load test which consume resources. PR Validation - Example: - Deploying the application chart to a test namespace and validating the release is successful and no immediate regressions are merged. Conclusion Smoke testing is a low-effort, high-impact step to ship more reliable software. It should be considered amongst the first stages to implement when planning continuously integrated and delivered systems. Resources Wikipedia - Smoke Testing Google SRE Book - System Tests","title":"Smoke Testing"},{"location":"automated-testing/smoke-testing/#smoke-testing","text":"Smoke tests, sometimes named Sanity , Acceptance , or Build/Release Verification tests, are a sub-type of system/functional tests that are usually used as gates that verify the application's readiness as a preliminary step. If an application passes the smoke tests, it is acceptable, or in a stable-enough state, for the next stages of testing or deployment.","title":"Smoke Testing"},{"location":"automated-testing/smoke-testing/#when-to-use","text":"","title":"When To Use"},{"location":"automated-testing/smoke-testing/#problem-addressed","text":"Smoke tests are meant to find, as early as possible, if an application is working or not. The goal of smoke tests is to save time; if the current version of the application does not pass smoke tests, then the rest of the integration or deployment chain for it can be abandoned. Smoke tests do not aim to provide full functionality coverage but instead focus on a few quick acceptance invocations for which the application should, at all times, respond correctly to.","title":"Problem Addressed"},{"location":"automated-testing/smoke-testing/#roi-tipping-point","text":"Smoke tests cover only the most critical application path, and should not be used to actually test the application's behavior, keeping execution time and complexity to minimum. 
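For illustration, a smoke check can be as small as a single request assertion. In this hypothetical sketch the base URL and the /health route are placeholders for the product's real critical path:

```python
# Minimal illustrative smoke test; SMOKE_BASE_URL and the /health route are placeholders.
import os

import requests

BASE_URL = os.environ.get("SMOKE_BASE_URL", "http://localhost:8080")


def test_service_is_up():
    # One fast, critical-path check: the service answers and reports healthy.
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200
```

Anything slower or more detailed than this belongs in the fuller test suites described next.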
The tests can be formed of a subset of the application's integration or e2e tests, and they cover as much of the functionality with as little depth as required. The golden rule of a good smoke test is that it saves time on validating that the application is acceptable to a stage where better, more thorough testing will begin.","title":"ROI Tipping Point"},{"location":"automated-testing/smoke-testing/#applicable-to","text":"Local dev desktop - Example: Applying manual smoke testing to verify that the application is OK. Build pipelines - Example: Running a small set of the integration test suite before running the full coverage of tests, which may take a long time. Non-production and Production deployments - Example: Running a curl command to the product's API and asserting the response is 200 before running load test which consume resources. PR Validation - Example: - Deploying the application chart to a test namespace and validating the release is successful and no immediate regressions are merged.","title":"Applicable to"},{"location":"automated-testing/smoke-testing/#conclusion","text":"Smoke testing is a low-effort, high-impact step to ship more reliable software. It should be considered amongst the first stages to implement when planning continuously integrated and delivered systems.","title":"Conclusion"},{"location":"automated-testing/smoke-testing/#resources","text":"Wikipedia - Smoke Testing Google SRE Book - System Tests","title":"Resources"},{"location":"automated-testing/synthetic-monitoring-tests/","text":"Synthetic Monitoring Tests Synthetic Monitoring Tests are a set of functional tests that target a live system in production. The focus of these tests, which are sometimes named \"watchdog\", \"active monitoring\" or \"synthetic transactions\", is to verify the product's health and resilience continuously. Why Synthetic Monitoring Tests Traditionally, software providers rely on testing through CI/CD stages in the well known testing pyramid (unit, integration, e2e) to validate that the product is healthy and without regressions. Such tests will run on the build agent or in the test/stage environment before being deployed to production and released to live user traffic. During the services' lifetime in the production environment, they are safeguarded by monitoring and alerting tools that rely on Real User Metrics/Monitoring ( RUM ). However, as more organizations today provide highly-available (99.9+ SLA) products, they find that the nature of long-lived distributed applications, which typically rely on several hardware and software components, is to fail. Frequent releases (sometimes multiple times per day) of various components of the system can create further instability. This rapid rate of change to the production environment tends to make testing during CI/CD stages not hermetic and actually not representative of the end user experience and how the production system actually behaves. For such systems, the ambition of service engineering teams is to reduce to a minimum the time it takes to fix errors, or the MTTR - Mean Time To Repair . It is a continuous effort, performed on the live/production system. Synthetic Monitors can be used to detect the following issues: Availability - Is the system or specific region available. Transactions and customer journeys - Known good requests should work, while known bad requests should error. Performance - How fast are actions and is that performance maintained through high loads and through version releases. 
3rd Party components - Cloud or software components used by the system may fail. Shift-Right Testing Synthetic Monitoring tests are a subset of tests that run in production, sometimes named Test-in-Production or Shift-Right tests. With Shift-Left paradigms that are so popular, the approach is to perform testing as early as possible in the application development lifecycle (i.e., moved left on the project timeline). Shift right compliments and adds on top of Shift-Left. It refers to running tests late in the cycle, during deployment, release, and post-release when the product is serving production traffic. They provide modern engineering teams a broader set of tools to assure high SLAs over time. Synthetic Monitoring Tests Design Blocks A synthetic monitoring test is a test that uses synthetic data and real testing accounts to inject user behaviors to the system and validates their effect, usually by passively relying on existing monitoring and alerting capabilities. Components of synthetic monitoring tests include Probes , test code/ accounts which generates data, and Monitoring tools placed to validate both the system's behavior under test and the health of the probes themselves. Probes Probes are the source of synthetic user actions that drive testing. They target the product's front-end or publicly-facing APIs and are running on their own production environment. A Synthetic Monitoring test is, in fact, very related to black-box tests and would usually focus on end-to-end scenarios from a user's perspective. It is not uncommon for the same code for e2e or integration tests to be used to implement the probe. Monitoring Given that Synthetic Monitoring tests are continuously running, at intervals, in a production environment, the assertion of system behavior through analysis relies on existing monitoring pillars used in live system (Logging, Metrics, Distributed Tracing). There would usually be a finite set of tests, and key metrics that are used to build monitors and alerts to assert against the known SLO , and verify that the OKR for that system are maintained. The monitoring tools are effectively capturing both RUMs and synthetic data generated by the probes. Applying Synthetic Monitoring Tests Asserting the System under Test Synthetic monitoring tests are usually statistical. Test metrics are compared against some historical or running average with a time dimension (Example: Over the last 30 days, for this time of day, the mean average response time is 250ms for AddToCart operation with a standard deviation from the mean of +/- 32ms) . So if an observed measurement is within a deviation of the norm at any time, the services are probably healthy. Building a Synthetic Monitoring Solution At a high level, building synthetic monitors usually consists of the following steps: Determine the metric to be validated (functional result, latency, etc.) Build a piece of automation that measures that metric against the system, and gathers telemetry into the system's existing monitoring infrastructure. Set up monitoring alarms/actions/responses that detect the failure of the system to meet the desired goal of the metric. Run the test case automation continuously at an appropriate interval. Monitoring the Health of Tests Probes runtime is a production environment on its own, and the health of tests is critical. Many providers offer cloud-based systems that host such runtimes, while some organizations use existing production environments to run these tests on. 
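Wherever the probes are hosted, each probe is typically a small script that exercises one user scenario, records the outcome, and compares it against the historical baseline described above. The following is a minimal sketch, assuming a hypothetical endpoint and reusing the earlier example's baseline figures as placeholders:

```python
# Illustrative synthetic probe; the endpoint, baseline numbers, and the print()
# telemetry are hypothetical stand-ins for a real monitoring integration.
import time

import requests

ENDPOINT = "https://example.com/api/cart/add"   # placeholder scenario under test
BASELINE_MEAN_MS = 250.0                        # e.g. rolling 30-day average
BASELINE_STD_MS = 32.0                          # standard deviation of that average
ALLOWED_DEVIATIONS = 3


def run_probe() -> None:
    start = time.perf_counter()
    response = requests.post(ENDPOINT, json={"item_id": "test-item"}, timeout=10)
    latency_ms = (time.perf_counter() - start) * 1000

    functional_ok = response.status_code == 200
    within_norm = abs(latency_ms - BASELINE_MEAN_MS) <= ALLOWED_DEVIATIONS * BASELINE_STD_MS

    # In a real probe this telemetry would flow into the existing monitoring pipeline.
    print(f"functional_ok={functional_ok} latency_ms={latency_ms:.1f} within_norm={within_norm}")


if __name__ == "__main__":
    run_probe()  # scheduled to run continuously at a fixed interval
```

Failures of the probe itself must also be surfaced, which is the monitor-the-monitor concern below.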
In either way, a monitor-the-monitor strategy should be a first-class citizen of the production environment's alerting systems. Synthetic Monitoring and Real User Monitoring Synthetic monitoring does not replace the need for RUM. Probes are predictable code that verifies specific scenarios, and they do not 100% completely and truly represent how a user session is handled. On the other hand, prefer not to use RUMs to test for site reliability because: As the name implies, RUM requires user traffic. The site may be down, but since no user visited the monitored path, no alerts were triggered yet. Inconsistent Traffic and usage patterns make it hard to gauge for benchmarks. Risks Testing in production, in general, has a risk factor attached to it, which does not exist tests executed during CI/CD stages. Specifically, in synthetic monitoring tests, the following may affect the production environment: Corrupted or invalid data - Tests inject test data which may be in some ways corrupt. Consider using a testing schema. Protected data leakage - Tests run in a production environment and emit logs or trace that may contain protected data. Overloaded systems - Synthetic tests may cause errors or overload the system. Unintended side effects or impacts on other production systems. Skewed analytics (traffic funnels, A/B test results, etc.) Auth/AuthZ - Tests are required to run in production where access to tokens and secrets may be restricted or more challenging to retrieve. Synthetic Monitoring Tests Frameworks and Tools Most key monitoring/APM players have an enterprise product that supports synthetic monitoring built into their systems (see list below). Such offerings make some of the risks raised above irrelevant as the integration and runtime aspects of the solution are OOTB. However, such solutions are typically pricey. Some organizations prefer running probes on existing infrastructure using known tools such as Postman , Wrk , JMeter , Selenium or even custom code to generate the synthetic data. Such solutions must account for isolating and decoupling the probe's production environment from the core product's as well as provide monitoring, geo-distribution, and maintaining test health. Application Insights availability - Simple availability tests that allow some customization using Multi-step web test DataDog Synthetics Dynatrace Synthetic Monitoring New Relic Synthetics Checkly Conclusion The value of production tests, in general, and specifically Synthetic monitoring, is only there for particular engagement types, and there is associated risk and cost to them. However, when applicable, they provide continuous assurance that there are no system failures from a user's perspective. When developing a PaaS/SaaS solution, Synthetic monitoring is key to the success of service reliability teams, and they are becoming an integral part of the quality assurance stack of highly available products. Resources Google SRE book - Testing Reliability Microsoft DevOps Architectures - Shift Right to Test in Production Martin Fowler - Synthetic Monitoring","title":"Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-tests","text":"Synthetic Monitoring Tests are a set of functional tests that target a live system in production. 
The focus of these tests, which are sometimes named \"watchdog\", \"active monitoring\" or \"synthetic transactions\", is to verify the product's health and resilience continuously.","title":"Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#why-synthetic-monitoring-tests","text":"Traditionally, software providers rely on testing through CI/CD stages in the well known testing pyramid (unit, integration, e2e) to validate that the product is healthy and without regressions. Such tests will run on the build agent or in the test/stage environment before being deployed to production and released to live user traffic. During the services' lifetime in the production environment, they are safeguarded by monitoring and alerting tools that rely on Real User Metrics/Monitoring ( RUM ). However, as more organizations today provide highly-available (99.9+ SLA) products, they find that the nature of long-lived distributed applications, which typically rely on several hardware and software components, is to fail. Frequent releases (sometimes multiple times per day) of various components of the system can create further instability. This rapid rate of change to the production environment tends to make testing during CI/CD stages not hermetic and actually not representative of the end user experience and how the production system actually behaves. For such systems, the ambition of service engineering teams is to reduce to a minimum the time it takes to fix errors, or the MTTR - Mean Time To Repair . It is a continuous effort, performed on the live/production system. Synthetic Monitors can be used to detect the following issues: Availability - Is the system or specific region available. Transactions and customer journeys - Known good requests should work, while known bad requests should error. Performance - How fast are actions and is that performance maintained through high loads and through version releases. 3rd Party components - Cloud or software components used by the system may fail.","title":"Why Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#shift-right-testing","text":"Synthetic Monitoring tests are a subset of tests that run in production, sometimes named Test-in-Production or Shift-Right tests. With Shift-Left paradigms that are so popular, the approach is to perform testing as early as possible in the application development lifecycle (i.e., moved left on the project timeline). Shift right compliments and adds on top of Shift-Left. It refers to running tests late in the cycle, during deployment, release, and post-release when the product is serving production traffic. They provide modern engineering teams a broader set of tools to assure high SLAs over time.","title":"Shift-Right Testing"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-tests-design-blocks","text":"A synthetic monitoring test is a test that uses synthetic data and real testing accounts to inject user behaviors to the system and validates their effect, usually by passively relying on existing monitoring and alerting capabilities. 
Components of synthetic monitoring tests include Probes , test code/ accounts which generates data, and Monitoring tools placed to validate both the system's behavior under test and the health of the probes themselves.","title":"Synthetic Monitoring Tests Design Blocks"},{"location":"automated-testing/synthetic-monitoring-tests/#probes","text":"Probes are the source of synthetic user actions that drive testing. They target the product's front-end or publicly-facing APIs and are running on their own production environment. A Synthetic Monitoring test is, in fact, very related to black-box tests and would usually focus on end-to-end scenarios from a user's perspective. It is not uncommon for the same code for e2e or integration tests to be used to implement the probe.","title":"Probes"},{"location":"automated-testing/synthetic-monitoring-tests/#monitoring","text":"Given that Synthetic Monitoring tests are continuously running, at intervals, in a production environment, the assertion of system behavior through analysis relies on existing monitoring pillars used in live system (Logging, Metrics, Distributed Tracing). There would usually be a finite set of tests, and key metrics that are used to build monitors and alerts to assert against the known SLO , and verify that the OKR for that system are maintained. The monitoring tools are effectively capturing both RUMs and synthetic data generated by the probes.","title":"Monitoring"},{"location":"automated-testing/synthetic-monitoring-tests/#applying-synthetic-monitoring-tests","text":"","title":"Applying Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#asserting-the-system-under-test","text":"Synthetic monitoring tests are usually statistical. Test metrics are compared against some historical or running average with a time dimension (Example: Over the last 30 days, for this time of day, the mean average response time is 250ms for AddToCart operation with a standard deviation from the mean of +/- 32ms) . So if an observed measurement is within a deviation of the norm at any time, the services are probably healthy.","title":"Asserting the System under Test"},{"location":"automated-testing/synthetic-monitoring-tests/#building-a-synthetic-monitoring-solution","text":"At a high level, building synthetic monitors usually consists of the following steps: Determine the metric to be validated (functional result, latency, etc.) Build a piece of automation that measures that metric against the system, and gathers telemetry into the system's existing monitoring infrastructure. Set up monitoring alarms/actions/responses that detect the failure of the system to meet the desired goal of the metric. Run the test case automation continuously at an appropriate interval.","title":"Building a Synthetic Monitoring Solution"},{"location":"automated-testing/synthetic-monitoring-tests/#monitoring-the-health-of-tests","text":"Probes runtime is a production environment on its own, and the health of tests is critical. Many providers offer cloud-based systems that host such runtimes, while some organizations use existing production environments to run these tests on. In either way, a monitor-the-monitor strategy should be a first-class citizen of the production environment's alerting systems.","title":"Monitoring the Health of Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-and-real-user-monitoring","text":"Synthetic monitoring does not replace the need for RUM. 
Probes are predictable code that verifies specific scenarios, and they cannot fully represent how a real user session is handled. On the other hand, prefer not to rely on RUM alone to test for site reliability because: As the name implies, RUM requires user traffic. The site may be down, but if no user has visited the monitored path, no alert is triggered. Inconsistent traffic and usage patterns make it hard to establish benchmarks.","title":"Synthetic Monitoring and Real User Monitoring"},{"location":"automated-testing/synthetic-monitoring-tests/#risks","text":"Testing in production, in general, carries a risk factor that does not exist in tests executed during CI/CD stages. Specifically, in synthetic monitoring tests, the following may affect the production environment: Corrupted or invalid data - Tests inject test data which may be corrupt in some way. Consider using a testing schema. Protected data leakage - Tests run in a production environment and emit logs or traces that may contain protected data. Overloaded systems - Synthetic tests may cause errors or overload the system. Unintended side effects or impacts on other production systems. Skewed analytics (traffic funnels, A/B test results, etc.) Auth/AuthZ - Tests are required to run in production, where access to tokens and secrets may be restricted or more challenging to retrieve.","title":"Risks"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-tests-frameworks-and-tools","text":"Most key monitoring/APM players have an enterprise product that supports synthetic monitoring built into their systems (see list below). Such offerings make some of the risks raised above irrelevant, as the integration and runtime aspects of the solution come out of the box. However, such solutions are typically pricey. Some organizations prefer running probes on existing infrastructure using known tools such as Postman , Wrk , JMeter , Selenium or even custom code to generate the synthetic data. Such solutions must account for isolating and decoupling the probe's production environment from the core product's, as well as providing monitoring, geo-distribution, and test-health maintenance. Application Insights availability - Simple availability tests that allow some customization using Multi-step web test DataDog Synthetics Dynatrace Synthetic Monitoring New Relic Synthetics Checkly","title":"Synthetic Monitoring Tests Frameworks and Tools"},{"location":"automated-testing/synthetic-monitoring-tests/#conclusion","text":"The value of production tests in general, and of synthetic monitoring specifically, applies only to particular engagement types, and there is associated risk and cost to them. However, when applicable, they provide continuous assurance that there are no system failures from a user's perspective.
When developing a PaaS/SaaS solution, Synthetic monitoring is key to the success of service reliability teams, and they are becoming an integral part of the quality assurance stack of highly available products.","title":"Conclusion"},{"location":"automated-testing/synthetic-monitoring-tests/#resources","text":"Google SRE book - Testing Reliability Microsoft DevOps Architectures - Shift Right to Test in Production Martin Fowler - Synthetic Monitoring","title":"Resources"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/","text":"Building Containers with Azure DevOps Using the DevTest Pattern In this documents, we highlight learnings from applying the DevTest pattern to container development in Azure DevOps through pipelines. The pattern enabled as to build container for development, testing and releasing the container for further reuse (production ready). We will dive into tools needed to build, test and push a container, our environment and go through each step separately. Follow this link to dive deeper or revisit the DevTest pattern . Build the Container The first step in container development, after creating the necessary Dockerfiles and source code, is building the container. Even the Dockerfile itself can include some basic testing. Code tests are performed when pushing the code to the repository origin, where it is then used to build the container. The first step in our pipeline is to run the docker build command with a temporary tag and the required build arguments: - task : Bash@3 name : BuildImage displayName : 'Build the image via docker' inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)${{ parameters.buildDirectory }}\" targetType : 'inline' script : | docker build -t ${{ parameters.imageName }} --build-arg YOUR_BUILD_ARG -f ${{ parameters.dockerfileName }} . env : PredefinedPassword : $(Password) NewVariable : \"newVariableValue\" This task includes the parameters buildDirectory , imageName and dockerfileName , which have to be set beforehand. This task can for example be used in a template for multiple containers to improve code reuse. It is also possible to pass environment variables directly to the Dockerfile through the env section of the task. If this task succeeds, the Dockerfile was build without errors and we can continue to testing the container itself. Test the Container To test the container, we are using the tox environment. For more details on tox please visit the tox section of this repository or visit the official tox documentation page . Before we test the container, we are checking for exposed credentials in the docker image history. If known passwords, used to access our internal resources, are exposed here, the build step will fail: - task: Bash@3 name: CheckIfPasswordInDockerHistory displayName: 'Check for password in docker history' inputs: workingDirectory: \"$(System.DefaultWorkingDirectory)\" targetType: 'inline' failOnStdErr: true script: | if docker image history --no-trunc ${{ parameters.imageName }} | grep -qF $PredefinedPassword; then exit 1; fi exit 0; env: PredefinedPassword: $(Password) After the credential test, the container is tested through the pytest extension testinfra . Testinfra is a Python-based tool which can be used to start a container, gather prerequisites, test the container and shut it down again, without any effort besides writing the tests. 
These tests can for example include: if files exist if environment variables are set correctly if certain processes are running if the correct host environment is used For a complete collection of capabilities and requirements, please visit the testinfra project on GitHub . A few methods of a Linux-based container test can look like this: def test_dependencies ( host ): ''' Check all files needed to run the container properly. ''' env_file = \"/app/environment.sh.env\" assert host . file ( env_file ) . exists activate_sh_path = \"/app/start.sh\" assert host . file ( activate_sh_path ) . exists def test_container_running ( host ): process = host . process . get ( comm = \"start.sh\" ) assert process . user == \"root\" def test_host_system ( host ): system_type = 'linux' distribution = 'ubuntu' release = '18.04' assert system_type == host . system_info . type assert distribution == host . system_info . distribution assert release == host . system_info . release def extract_env_var ( file_content ): import re regex = r \"ENV_VAR= \\\" (?P[^ \\\" ]*) \\\" \" match = re . match ( regex , file_content ) return match . group ( 's' ) def test_ports_exposed ( host ): port1 = \"9010\" st1 = f \"grep -q { port1 } /app/Dockerfile && echo 'true' || echo 'false'\" cmd1 = host . run ( st1 ) assert cmd1 . stdout def test_listening_simserver_sockets ( host ): assert host . socket ( \"tcp://0.0.0.0:32512\" ) . is_listening assert host . socket ( \"tcp://0.0.0.0:32513\" ) . is_listening To start the test, a pytest command is executed through tox. A task containing the tox command can look like this: - task : Bash@3 name : RunTestCommands displayName : \"Test - Run test commands\" inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)\" targetType : 'inline' script : | tox -e testinfra-${{ parameters.makeTarget }} -- ${{ parameters.imageName }} failOnStderr : true Which could trigger the following pytest code, which is contained in the tox.ini file: pytest -vv tests/ { env:CONTEXT } --container-image ={ posargs: { env:IMAGE_TAG }} --volume ={ env:VOLUME } As a last task of this pipeline to build and test the container, we set a variable called testsPassed which is only true , if the previous tasks succeeded: - task: Bash@3 name: UpdateTestResultVariable condition: succeeded() inputs: targetType: 'inline' script: | echo '##vso[task.setvariable variable=testsPassed]true' Push the Container After building and testing, if our container runs as expected, we want to release it to our Azure Container Registry (ACR) to be used by our larger application. Before that, we want to automate the push behavior and define a meaningful tag. As a developer it is often helpful to have containers pushed to ACR, even if they are failing. This can be done by checking for the testsPassed variable we introduced at the end of our testing. 
If the test failed, we want to add a failed suffix at the end of the tag: - task: Bash@3 name: SetFailedSuffixTag displayName: \"Set failed suffix, if the tests failed.\" condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> retag the image to add failedSuffix inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix) The condition checks, if the value of testsPassed is false and also if we are not on the main branch, as we don't want to push failed containers from main. This helps us to keep our production environment clean. The value for imageRepository was defined in another template, along with the failedSuffix and testsPassed : parameters: - name: component variables: testsPassed: false failedSuffix: \"-failed\" # the imageRepo will changed based on dev or release ${{ if eq( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'stable/${{ parameters.component }}' ${{ if ne( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'dev/${{ parameters.component }}' The imageTag is open to discussion, as it depends highly on how your team wants to use the container. We went for Build.SourceVersion which is the commit ID of the branch the container was developed in. This allows you to easily track the origin of the container and aids debugging. A link to Azure DevOps predefined variables can be found in the Azure Docs on Azure DevOps After a tag was added to the container, the image must be pushed. This can be done with the following task: - task: Docker@1 name: pushFailedDockerImage displayName: 'Pushes failed image via Docker' condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> push the image with the failed tag inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix)' Similarly, these are the steps to publish the container to the ACR, if the tests succeeded: - task: Bash@3 name: SetLatestSuffixTag displayName: \"Set latest suffix, if the tests succeed.\" condition: eq(variables['testsPassed'], true) inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:latest - task: Docker@1 name: pushSuccessfulDockerImageSha displayName: 'Pushes successful image via Docker' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}' - task: Docker@1 name: pushSuccessfulDockerImageLatest displayName: 'Pushes successful image as latest' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} 
command: 'Push an image' imageName: '${{ parameters.imageRepository }}:latest' If you don't want to include the latest tag, you can also remove the steps involving latest (SetLatestSuffixTag & pushSuccessfulDockerImageLatest). Resources DevTest pattern Azure Docs on Azure DevOps official tox documentation page Testinfra Testinfra project on GitHub pytest","title":"Building Containers with Azure DevOps Using the DevTest Pattern"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#building-containers-with-azure-devops-using-the-devtest-pattern","text":"In this documents, we highlight learnings from applying the DevTest pattern to container development in Azure DevOps through pipelines. The pattern enabled as to build container for development, testing and releasing the container for further reuse (production ready). We will dive into tools needed to build, test and push a container, our environment and go through each step separately. Follow this link to dive deeper or revisit the DevTest pattern .","title":"Building Containers with Azure DevOps Using the DevTest Pattern"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#build-the-container","text":"The first step in container development, after creating the necessary Dockerfiles and source code, is building the container. Even the Dockerfile itself can include some basic testing. Code tests are performed when pushing the code to the repository origin, where it is then used to build the container. The first step in our pipeline is to run the docker build command with a temporary tag and the required build arguments: - task : Bash@3 name : BuildImage displayName : 'Build the image via docker' inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)${{ parameters.buildDirectory }}\" targetType : 'inline' script : | docker build -t ${{ parameters.imageName }} --build-arg YOUR_BUILD_ARG -f ${{ parameters.dockerfileName }} . env : PredefinedPassword : $(Password) NewVariable : \"newVariableValue\" This task includes the parameters buildDirectory , imageName and dockerfileName , which have to be set beforehand. This task can for example be used in a template for multiple containers to improve code reuse. It is also possible to pass environment variables directly to the Dockerfile through the env section of the task. If this task succeeds, the Dockerfile was build without errors and we can continue to testing the container itself.","title":"Build the Container"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#test-the-container","text":"To test the container, we are using the tox environment. For more details on tox please visit the tox section of this repository or visit the official tox documentation page . Before we test the container, we are checking for exposed credentials in the docker image history. If known passwords, used to access our internal resources, are exposed here, the build step will fail: - task: Bash@3 name: CheckIfPasswordInDockerHistory displayName: 'Check for password in docker history' inputs: workingDirectory: \"$(System.DefaultWorkingDirectory)\" targetType: 'inline' failOnStdErr: true script: | if docker image history --no-trunc ${{ parameters.imageName }} | grep -qF $PredefinedPassword; then exit 1; fi exit 0; env: PredefinedPassword: $(Password) After the credential test, the container is tested through the pytest extension testinfra . 
Testinfra is a Python-based tool which can be used to start a container, gather prerequisites, test the container and shut it down again, without any effort besides writing the tests. These tests can for example include: if files exist if environment variables are set correctly if certain processes are running if the correct host environment is used For a complete collection of capabilities and requirements, please visit the testinfra project on GitHub . A few methods of a Linux-based container test can look like this: def test_dependencies ( host ): ''' Check all files needed to run the container properly. ''' env_file = \"/app/environment.sh.env\" assert host . file ( env_file ) . exists activate_sh_path = \"/app/start.sh\" assert host . file ( activate_sh_path ) . exists def test_container_running ( host ): process = host . process . get ( comm = \"start.sh\" ) assert process . user == \"root\" def test_host_system ( host ): system_type = 'linux' distribution = 'ubuntu' release = '18.04' assert system_type == host . system_info . type assert distribution == host . system_info . distribution assert release == host . system_info . release def extract_env_var ( file_content ): import re regex = r \"ENV_VAR= \\\" (?P[^ \\\" ]*) \\\" \" match = re . match ( regex , file_content ) return match . group ( 's' ) def test_ports_exposed ( host ): port1 = \"9010\" st1 = f \"grep -q { port1 } /app/Dockerfile && echo 'true' || echo 'false'\" cmd1 = host . run ( st1 ) assert cmd1 . stdout def test_listening_simserver_sockets ( host ): assert host . socket ( \"tcp://0.0.0.0:32512\" ) . is_listening assert host . socket ( \"tcp://0.0.0.0:32513\" ) . is_listening To start the test, a pytest command is executed through tox. A task containing the tox command can look like this: - task : Bash@3 name : RunTestCommands displayName : \"Test - Run test commands\" inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)\" targetType : 'inline' script : | tox -e testinfra-${{ parameters.makeTarget }} -- ${{ parameters.imageName }} failOnStderr : true Which could trigger the following pytest code, which is contained in the tox.ini file: pytest -vv tests/ { env:CONTEXT } --container-image ={ posargs: { env:IMAGE_TAG }} --volume ={ env:VOLUME } As a last task of this pipeline to build and test the container, we set a variable called testsPassed which is only true , if the previous tasks succeeded: - task: Bash@3 name: UpdateTestResultVariable condition: succeeded() inputs: targetType: 'inline' script: | echo '##vso[task.setvariable variable=testsPassed]true'","title":"Test the Container"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#push-the-container","text":"After building and testing, if our container runs as expected, we want to release it to our Azure Container Registry (ACR) to be used by our larger application. Before that, we want to automate the push behavior and define a meaningful tag. As a developer it is often helpful to have containers pushed to ACR, even if they are failing. This can be done by checking for the testsPassed variable we introduced at the end of our testing. 
If the test failed, we want to add a failed suffix at the end of the tag: - task: Bash@3 name: SetFailedSuffixTag displayName: \"Set failed suffix, if the tests failed.\" condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> retag the image to add failedSuffix inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix) The condition checks, if the value of testsPassed is false and also if we are not on the main branch, as we don't want to push failed containers from main. This helps us to keep our production environment clean. The value for imageRepository was defined in another template, along with the failedSuffix and testsPassed : parameters: - name: component variables: testsPassed: false failedSuffix: \"-failed\" # the imageRepo will changed based on dev or release ${{ if eq( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'stable/${{ parameters.component }}' ${{ if ne( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'dev/${{ parameters.component }}' The imageTag is open to discussion, as it depends highly on how your team wants to use the container. We went for Build.SourceVersion which is the commit ID of the branch the container was developed in. This allows you to easily track the origin of the container and aids debugging. A link to Azure DevOps predefined variables can be found in the Azure Docs on Azure DevOps After a tag was added to the container, the image must be pushed. This can be done with the following task: - task: Docker@1 name: pushFailedDockerImage displayName: 'Pushes failed image via Docker' condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> push the image with the failed tag inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix)' Similarly, these are the steps to publish the container to the ACR, if the tests succeeded: - task: Bash@3 name: SetLatestSuffixTag displayName: \"Set latest suffix, if the tests succeed.\" condition: eq(variables['testsPassed'], true) inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:latest - task: Docker@1 name: pushSuccessfulDockerImageSha displayName: 'Pushes successful image via Docker' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}' - task: Docker@1 name: pushSuccessfulDockerImageLatest displayName: 'Pushes successful image as latest' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} 
command: 'Push an image' imageName: '${{ parameters.imageRepository }}:latest' If you don't want to include the latest tag, you can also remove the steps involving latest (SetLatestSuffixTag & pushSuccessfulDockerImageLatest).","title":"Push the Container"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#resources","text":"DevTest pattern Azure Docs on Azure DevOps official tox documentation page Testinfra Testinfra project on GitHub pytest","title":"Resources"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/","text":"Using Azurite to Run Blob Storage Tests in a Pipeline This document determines the approach for writing automated tests with a short feedback loop (i.e. unit tests) against security considerations (private endpoints) for the Azure Blob Storage functionality. Once private endpoints are enabled for the Azure Storage accounts, the current tests will fail when executed locally or as part of a pipeline because this connection will be blocked. Utilize an Azure Storage Emulator - Azurite To emulate a local Azure Blob Storage, we can use Azure Storage Emulator . The Storage Emulator currently runs only on Windows. If you need a Storage Emulator for Linux, one option is the community maintained, open-source Storage Emulator Azurite . The Azure Storage Emulator is no longer being actively developed. Azurite is the Storage Emulator platform going forward. Azurite supersedes the Azure Storage Emulator. Azurite will continue to be updated to support the latest versions of Azure Storage APIs. For more information, see Use the Azurite emulator for local Azure Storage development . Some differences in functionality exist between the Storage Emulator and Azure storage services. For more information about these differences, see the Differences between the Storage Emulator and Azure Storage . There are several ways to install and run Azurite on your local system as listed here . In this document we will cover Install and run Azurite using NPM and Install and run the Azurite Docker image . 1. Install and Run Azurite a. Using NPM In order to run Azurite V3 you need Node.js >= 8.0 installed on your system. Azurite works cross-platform on Windows, Linux, and OS X. After the Node.js installation, you can install Azurite simply with npm which is the Node.js package management tool included with every Node.js installation. # Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log If you want to avoid any disk persistence and destroy the test data when the Azurite process terminates, you can pass the --inMemoryPersistence option, as of Azurite 3.28.0. The output will be: Azurite Blob service is starting at http://127.0.0.1:10000 Azurite Blob service is successfully listening at http://127.0.0.1:10000 Azurite Queue service is starting at http://127.0.0.1:10001 Azurite Queue service is successfully listening at http://127.0.0.1:10001 b. Using a Docker Image Another way to run Azurite is using docker, using default HTTP endpoint docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 Docker Compose is another option and can run the same docker image using the docker-compose.yml file below. 
version : '3.4' services : azurite : image : mcr.microsoft.com/azure-storage/azurite hostname : azurite volumes : - ./cert/azurite:/data command : \"azurite-blob --blobHost 0.0.0.0 -l /data --cert /data/127.0.0.1.pem --key /data/127.0.0.1-key.pem --oauth basic\" ports : - \"10000:10000\" - \"10001:10001\" 2. Run Tests on Your Local Machine Python 3.8.7 is used for this, but it should be fine on other 3.x versions as well. Install and run Azurite for local tests: Option 1: using npm: # Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log Option 2: using docker docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 In Azure Storage Explorer, select Attach to a local emulator Provide a Display name and port number, then your connection will be ready, and you can use Storage Explorer to manage your local blob storage. To test and see how these endpoints are running you can attach your local blob storage to the Azure Storage Explorer . Create a virtual python environment python -m venv .venv Container name and initialize env variables: Use conftest.py for test integration. from azure.storage.blob import BlobServiceClient import os def pytest_generate_tests ( metafunc ): os . environ [ 'STORAGE_CONNECTION_STRING' ] = 'DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;' os . environ [ 'STORAGE_CONTAINER' ] = 'test-container' # Crete container for Azurite for the first run blob_service_client = BlobServiceClient . from_connection_string ( os . environ . get ( \"STORAGE_CONNECTION_STRING\" )) try : blob_service_client . create_container ( os . environ . get ( \"STORAGE_CONTAINER\" )) except Exception as e : print ( e ) * Note: value for STORAGE_CONNECTION_STRING is default value for Azurite, it's not a private key Install the dependencies pip install -r requirements_tests.txt Run tests: python -m pytest ./tests After running tests, you can see the files in your local blob storage 3. Run Tests on Azure Pipelines After running tests locally we need to make sure these tests pass on Azure Pipelines too. We have 2 options here, we can use docker image as hosted agent on Azure or install an npm package in the Pipeline steps. 
trigger: - master steps: - task: UsePythonVersion@0 displayName: 'Use Python 3.7' inputs: versionSpec: 3 .7 - bash: | pip install -r requirements_tests.txt displayName: 'Setup requirements for tests' - bash: | sudo npm install -g azurite sudo mkdir azurite sudo azurite --silent --location azurite --debug azurite \\d ebug.log & displayName: 'Install and Run Azurite' - bash: | python -m pytest --junit-xml = unit_tests_report.xml --cov = tests --cov-report = html --cov-report = xml ./tests displayName: 'Run Tests' - task: PublishCodeCoverageResults@1 inputs: codeCoverageTool: Cobertura summaryFileLocation: '$(System.DefaultWorkingDirectory)/**/coverage.xml' reportDirectory: '$(System.DefaultWorkingDirectory)/**/htmlcov' - task: PublishTestResults@2 inputs: testResultsFormat: 'JUnit' testResultsFiles: '**/*_tests_report.xml' failTaskOnFailedTests: true Once we set up our pipeline in Azure Pipelines, result will be like below","title":"Using Azurite to Run Blob Storage Tests in a Pipeline"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#using-azurite-to-run-blob-storage-tests-in-a-pipeline","text":"This document determines the approach for writing automated tests with a short feedback loop (i.e. unit tests) against security considerations (private endpoints) for the Azure Blob Storage functionality. Once private endpoints are enabled for the Azure Storage accounts, the current tests will fail when executed locally or as part of a pipeline because this connection will be blocked.","title":"Using Azurite to Run Blob Storage Tests in a Pipeline"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#utilize-an-azure-storage-emulator-azurite","text":"To emulate a local Azure Blob Storage, we can use Azure Storage Emulator . The Storage Emulator currently runs only on Windows. If you need a Storage Emulator for Linux, one option is the community maintained, open-source Storage Emulator Azurite . The Azure Storage Emulator is no longer being actively developed. Azurite is the Storage Emulator platform going forward. Azurite supersedes the Azure Storage Emulator. Azurite will continue to be updated to support the latest versions of Azure Storage APIs. For more information, see Use the Azurite emulator for local Azure Storage development . Some differences in functionality exist between the Storage Emulator and Azure storage services. For more information about these differences, see the Differences between the Storage Emulator and Azure Storage . There are several ways to install and run Azurite on your local system as listed here . In this document we will cover Install and run Azurite using NPM and Install and run the Azurite Docker image .","title":"Utilize an Azure Storage Emulator - Azurite"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#1-install-and-run-azurite","text":"","title":"1. Install and Run Azurite"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#a-using-npm","text":"In order to run Azurite V3 you need Node.js >= 8.0 installed on your system. Azurite works cross-platform on Windows, Linux, and OS X. After the Node.js installation, you can install Azurite simply with npm which is the Node.js package management tool included with every Node.js installation. 
# Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log If you want to avoid any disk persistence and destroy the test data when the Azurite process terminates, you can pass the --inMemoryPersistence option, as of Azurite 3.28.0. The output will be: Azurite Blob service is starting at http://127.0.0.1:10000 Azurite Blob service is successfully listening at http://127.0.0.1:10000 Azurite Queue service is starting at http://127.0.0.1:10001 Azurite Queue service is successfully listening at http://127.0.0.1:10001","title":"a. Using NPM"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#b-using-a-docker-image","text":"Another way to run Azurite is using docker, using default HTTP endpoint docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 Docker Compose is another option and can run the same docker image using the docker-compose.yml file below. version : '3.4' services : azurite : image : mcr.microsoft.com/azure-storage/azurite hostname : azurite volumes : - ./cert/azurite:/data command : \"azurite-blob --blobHost 0.0.0.0 -l /data --cert /data/127.0.0.1.pem --key /data/127.0.0.1-key.pem --oauth basic\" ports : - \"10000:10000\" - \"10001:10001\"","title":"b. Using a Docker Image"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#2-run-tests-on-your-local-machine","text":"Python 3.8.7 is used for this, but it should be fine on other 3.x versions as well. Install and run Azurite for local tests: Option 1: using npm: # Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log Option 2: using docker docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 In Azure Storage Explorer, select Attach to a local emulator Provide a Display name and port number, then your connection will be ready, and you can use Storage Explorer to manage your local blob storage. To test and see how these endpoints are running you can attach your local blob storage to the Azure Storage Explorer . Create a virtual python environment python -m venv .venv Container name and initialize env variables: Use conftest.py for test integration. from azure.storage.blob import BlobServiceClient import os def pytest_generate_tests ( metafunc ): os . environ [ 'STORAGE_CONNECTION_STRING' ] = 'DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;' os . environ [ 'STORAGE_CONTAINER' ] = 'test-container' # Crete container for Azurite for the first run blob_service_client = BlobServiceClient . from_connection_string ( os . environ . get ( \"STORAGE_CONNECTION_STRING\" )) try : blob_service_client . create_container ( os . environ . get ( \"STORAGE_CONTAINER\" )) except Exception as e : print ( e ) * Note: value for STORAGE_CONNECTION_STRING is default value for Azurite, it's not a private key Install the dependencies pip install -r requirements_tests.txt Run tests: python -m pytest ./tests After running tests, you can see the files in your local blob storage","title":"2. 
Run Tests on Your Local Machine"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#3-run-tests-on-azure-pipelines","text":"After running tests locally we need to make sure these tests pass on Azure Pipelines too. We have 2 options here, we can use docker image as hosted agent on Azure or install an npm package in the Pipeline steps. trigger: - master steps: - task: UsePythonVersion@0 displayName: 'Use Python 3.7' inputs: versionSpec: 3 .7 - bash: | pip install -r requirements_tests.txt displayName: 'Setup requirements for tests' - bash: | sudo npm install -g azurite sudo mkdir azurite sudo azurite --silent --location azurite --debug azurite \\d ebug.log & displayName: 'Install and Run Azurite' - bash: | python -m pytest --junit-xml = unit_tests_report.xml --cov = tests --cov-report = html --cov-report = xml ./tests displayName: 'Run Tests' - task: PublishCodeCoverageResults@1 inputs: codeCoverageTool: Cobertura summaryFileLocation: '$(System.DefaultWorkingDirectory)/**/coverage.xml' reportDirectory: '$(System.DefaultWorkingDirectory)/**/htmlcov' - task: PublishTestResults@2 inputs: testResultsFormat: 'JUnit' testResultsFiles: '**/*_tests_report.xml' failTaskOnFailedTests: true Once we set up our pipeline in Azure Pipelines, result will be like below","title":"3. Run Tests on Azure Pipelines"},{"location":"automated-testing/templates/case-study-template/","text":"Case study template [Customer Project] Case Study Background Describe the customer and business requirements with the explicit problem statement. System Under Test (SUT) Include the system's conceptual architecture and highlight the architecture components that were included in the E2E testing. Problems and Limitations Describe about the problems of the overall SUT solution that prevented from testing specific (or any) part of the solution. Describe limitation of the testing tools and framework(s) used in this implementation E2E Testing Framework and Tools Describe what testing framework and/or tools were used to implement E2E testing in the SUT. Test Cases Describe the E2E test cases were created to E2E test the SUT Test Metrics Describe any architecture solution were used to monitor, observe and track the various service states that were used as the E2E testing metrics. Also, include the list of test cases were build to measure the progress of E2E testing. E2E Testing Architecture Describe any testing architecture were built to run E2E testing. E2E Testing Implementation (Code Samples) Include sample test cases and their implementation in the programming language of choice. Include any common reusable code implementation blocks that could be leveraged in the future project's E2E testing implementation. 
E2E Testing Reporting and Results Include sample of E2E testing reports and results obtained from the E2E testing runs in this project.","title":"Case study template"},{"location":"automated-testing/templates/case-study-template/#case-study-template","text":"[Customer Project] Case Study","title":"Case study template"},{"location":"automated-testing/templates/case-study-template/#background","text":"Describe the customer and business requirements with the explicit problem statement.","title":"Background"},{"location":"automated-testing/templates/case-study-template/#system-under-test-sut","text":"Include the system's conceptual architecture and highlight the architecture components that were included in the E2E testing.","title":"System Under Test (SUT)"},{"location":"automated-testing/templates/case-study-template/#problems-and-limitations","text":"Describe about the problems of the overall SUT solution that prevented from testing specific (or any) part of the solution. Describe limitation of the testing tools and framework(s) used in this implementation","title":"Problems and Limitations"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-framework-and-tools","text":"Describe what testing framework and/or tools were used to implement E2E testing in the SUT.","title":"E2E Testing Framework and Tools"},{"location":"automated-testing/templates/case-study-template/#test-cases","text":"Describe the E2E test cases were created to E2E test the SUT","title":"Test Cases"},{"location":"automated-testing/templates/case-study-template/#test-metrics","text":"Describe any architecture solution were used to monitor, observe and track the various service states that were used as the E2E testing metrics. Also, include the list of test cases were build to measure the progress of E2E testing.","title":"Test Metrics"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-architecture","text":"Describe any testing architecture were built to run E2E testing.","title":"E2E Testing Architecture"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-implementation-code-samples","text":"Include sample test cases and their implementation in the programming language of choice. Include any common reusable code implementation blocks that could be leveraged in the future project's E2E testing implementation.","title":"E2E Testing Implementation (Code Samples)"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-reporting-and-results","text":"Include sample of E2E testing reports and results obtained from the E2E testing runs in this project.","title":"E2E Testing Reporting and Results"},{"location":"automated-testing/templates/test-type-template/","text":"Test Type Template [Test Technique Name Here] Put a 2-3 sentence overview about the test technique here. When To Use Problem Addressed Describing the problem that this test type addresses, this should focus on the motivation behind the test type/technique to help the reader correlate this technique to their problem. When to Avoid Describe when NOT to use, if applicable. ROI Tipping Point How much is enough? For example, some opine that unit test ROI drops significantly at 80% block coverage and when the codebase is well-exercised by real traffic in production. 
Applicable to Local dev 'desktop' Build pipelines Non-production deployments Production deployments NOTE: If there is great (clear, succinct) documentation for the technique on the web, supply a pointer and skip the rest of this template. No need to re-type content How to Use Architecture Describe the components of the technique and how they interact with each other and the subject of the test technique. Add a simple diagram of how the technique's parts are organized, if helpful to illustrate. Pre-requisites Anything required in advance? High-level Step-by-Step 1. 1. 1. Best Practices and Advice Describe what good testing looks like for this technique, best practices, pitfalls. Anti patterns e.g. unit tests should never require off-box or even out-of-process dependencies. Are there similar things to avoid when applying this technique? Frameworks, Tools, Templates Describe known good (i.e. actually used and known to provide good results) frameworks, tools, templates, their pros and cons, with links. Resources Provide links to further readings about this technique to dive deeper.","title":"Test Type Template"},{"location":"automated-testing/templates/test-type-template/#test-type-template","text":"[Test Technique Name Here] Put a 2-3 sentence overview about the test technique here.","title":"Test Type Template"},{"location":"automated-testing/templates/test-type-template/#when-to-use","text":"","title":"When To Use"},{"location":"automated-testing/templates/test-type-template/#problem-addressed","text":"Describing the problem that this test type addresses, this should focus on the motivation behind the test type/technique to help the reader correlate this technique to their problem.","title":"Problem Addressed"},{"location":"automated-testing/templates/test-type-template/#when-to-avoid","text":"Describe when NOT to use, if applicable.","title":"When to Avoid"},{"location":"automated-testing/templates/test-type-template/#roi-tipping-point","text":"How much is enough? For example, some opine that unit test ROI drops significantly at 80% block coverage and when the codebase is well-exercised by real traffic in production.","title":"ROI Tipping Point"},{"location":"automated-testing/templates/test-type-template/#applicable-to","text":"Local dev 'desktop' Build pipelines Non-production deployments Production deployments","title":"Applicable to"},{"location":"automated-testing/templates/test-type-template/#note-if-there-is-great-clear-succinct-documentation-for-the-technique-on-the-web-supply-a-pointer-and-skip-the-rest-of-this-template-no-need-to-re-type-content","text":"","title":"NOTE: If there is great (clear, succinct) documentation for the technique on the web, supply a pointer and skip the rest of this template. No need to re-type content"},{"location":"automated-testing/templates/test-type-template/#how-to-use","text":"","title":"How to Use"},{"location":"automated-testing/templates/test-type-template/#architecture","text":"Describe the components of the technique and how they interact with each other and the subject of the test technique. Add a simple diagram of how the technique's parts are organized, if helpful to illustrate.","title":"Architecture"},{"location":"automated-testing/templates/test-type-template/#pre-requisites","text":"Anything required in advance?","title":"Pre-requisites"},{"location":"automated-testing/templates/test-type-template/#high-level-step-by-step","text":"1. 1. 
1.","title":"High-level Step-by-Step"},{"location":"automated-testing/templates/test-type-template/#best-practices-and-advice","text":"Describe what good testing looks like for this technique, best practices, pitfalls.","title":"Best Practices and Advice"},{"location":"automated-testing/templates/test-type-template/#anti-patterns","text":"e.g. unit tests should never require off-box or even out-of-process dependencies. Are there similar things to avoid when applying this technique?","title":"Anti patterns"},{"location":"automated-testing/templates/test-type-template/#frameworks-tools-templates","text":"Describe known good (i.e. actually used and known to provide good results) frameworks, tools, templates, their pros and cons, with links.","title":"Frameworks, Tools, Templates"},{"location":"automated-testing/templates/test-type-template/#resources","text":"Provide links to further readings about this technique to dive deeper.","title":"Resources"},{"location":"automated-testing/ui-testing/","text":"User Interface Testing This section is primarily geared towards web-based UIs, but the guidance is similar for mobile and OS based applications. Applicability UI Testing is not always going to be applicable, for example applications without a UI or parts of an application that require no human interaction. In those cases unit, functional and integration/e2e testing would be the primary means. UI Testing is going to be mainly applicable when dealing with a public facing UI that is used in a diverse environment or in a mission critical UI that requires higher fidelity. With something like an admin UI that is used by just a handful of people, UI Testing is still valuable but not as high priority. Goals UI testing provides the ability to ensure that users have a consistent visual user experience across a variety of means of access and that the user interaction is consistent with the function requirements. Ensure the UI appearance and interaction satisfy the functional and non-functional requirements Detect changes in the UI both across devices and delivery platforms and between code changes Provide confidence to designers and developers the user experience is consistent Support fast code evolution and refactoring while reducing the risk of regressions Evidence and Measures Integrating UI Tests in to your CI/CD is necessary but more challenging than unit tests. The increased challenge is that UI tests either need to run in headless mode with something like Puppeteer or there needs to be more extensive orchestration with Azure DevOps or GitHub that would handle the full testing integration for you like BrowserStack Integrations like BrowserStack are nice since they provide Azure DevOps reports as part of the test run. That said, Azure DevOps supports a variety of test adapters, so you can use any UI Testing framework that supports outputting the test results to one of the output formats listed at Publish Test Results task . If you're using an Azure DevOps pipeline to run UI tests, consider using a self hosted agent in order to manage framework versions and avoid unexpected updates. General Guidance The scope of UI testing should be strategic. UI tests can take a significant amount of time to both implement and run, and it's challenging to test every type of user interaction in a production application due to the large number of possible interactions. Designing the UI tests around the functional tests makes sense. 
For example, given an input form, a UI test would ensure that the visual representation is consistent across devices, is accessible and easy to interact with, and is consistent across code changes. UI Tests will catch 'runtime' bugs that unit and functional tests won't. For example if the submit button for an input form is rendered but not clickable due to a positioning bug in the UI, then this could be considered a runtime bug that would not have been caught by unit or functional tests. UI Tests can run on mock data or snapshots of production data, like in QA or staging. Writing Tests Good UI tests follow a few general principles: Choose a UI testing framework that enables quick feedback and is easy to use Design the UI to be easily testable. For example, add CSS selectors or set the id on elements in a web page to allow easier selecting. Test on all primary devices that the user uses, don't just test on a single device or OS. When a test mutates data ensure that data is created on demand and cleaned up after. The consequence of not doing this would be inconsistent testing. Common Issues UI Testing can get very challenging at the lower level, especially with a testing framework like Selenium. If you choose to go this route, then you'll likely encounter timeouts, missing elements, and you'll have significant friction with the testing framework itself. Due to many issues with UI testing there have been a number of free and paid solutions that help alleviate certain issues with frameworks like Selenium. This is why you'll find Cypress in the recommended frameworks as it solves many of the known issues with Selenium. This is an important point though. Depending on the UI testing framework you choose will result in either a smoother test creation experience, or a very frustrating and time-consuming one. If you were to choose just Selenium the development costs and time costs would likely be very high. It's better to use either a framework built on top of Selenium or one that attempts to solve many of the problems with something like Selenium. Note there that there are further considerations as when running in headless mode the UI can render differently than what you may see on your development machine, particularly with web applications. Furthermore, note that when rendering in different page dimensions elements may disappear on the page due to CSS rules, therefore not be selectable by certain frameworks with default options out of the box. All of these issues can be resolved and worked around, but the rendering demonstrates another particular challenge of UI testing. Specific Guidance Recommended testing frameworks: Web BrowserStack Cypress Jest Selenium Appium OS/Mobile Applications Coded UI tests (CUITs) Xamarin.UITest BrowserStack Appium Note that the framework listed above that is paid is BrowserStack, it's listed as it's an industry standard, the rest are open source and free.","title":"User Interface Testing"},{"location":"automated-testing/ui-testing/#user-interface-testing","text":"This section is primarily geared towards web-based UIs, but the guidance is similar for mobile and OS based applications.","title":"User Interface Testing"},{"location":"automated-testing/ui-testing/#applicability","text":"UI Testing is not always going to be applicable, for example applications without a UI or parts of an application that require no human interaction. In those cases unit, functional and integration/e2e testing would be the primary means. 
UI Testing is going to be mainly applicable when dealing with a public facing UI that is used in a diverse environment or in a mission critical UI that requires higher fidelity. With something like an admin UI that is used by just a handful of people, UI Testing is still valuable but not as high priority.","title":"Applicability"},{"location":"automated-testing/ui-testing/#goals","text":"UI testing provides the ability to ensure that users have a consistent visual user experience across a variety of means of access and that the user interaction is consistent with the function requirements. Ensure the UI appearance and interaction satisfy the functional and non-functional requirements Detect changes in the UI both across devices and delivery platforms and between code changes Provide confidence to designers and developers the user experience is consistent Support fast code evolution and refactoring while reducing the risk of regressions","title":"Goals"},{"location":"automated-testing/ui-testing/#evidence-and-measures","text":"Integrating UI Tests in to your CI/CD is necessary but more challenging than unit tests. The increased challenge is that UI tests either need to run in headless mode with something like Puppeteer or there needs to be more extensive orchestration with Azure DevOps or GitHub that would handle the full testing integration for you like BrowserStack Integrations like BrowserStack are nice since they provide Azure DevOps reports as part of the test run. That said, Azure DevOps supports a variety of test adapters, so you can use any UI Testing framework that supports outputting the test results to one of the output formats listed at Publish Test Results task . If you're using an Azure DevOps pipeline to run UI tests, consider using a self hosted agent in order to manage framework versions and avoid unexpected updates.","title":"Evidence and Measures"},{"location":"automated-testing/ui-testing/#general-guidance","text":"The scope of UI testing should be strategic. UI tests can take a significant amount of time to both implement and run, and it's challenging to test every type of user interaction in a production application due to the large number of possible interactions. Designing the UI tests around the functional tests makes sense. For example, given an input form, a UI test would ensure that the visual representation is consistent across devices, is accessible and easy to interact with, and is consistent across code changes. UI Tests will catch 'runtime' bugs that unit and functional tests won't. For example if the submit button for an input form is rendered but not clickable due to a positioning bug in the UI, then this could be considered a runtime bug that would not have been caught by unit or functional tests. UI Tests can run on mock data or snapshots of production data, like in QA or staging.","title":"General Guidance"},{"location":"automated-testing/ui-testing/#writing-tests","text":"Good UI tests follow a few general principles: Choose a UI testing framework that enables quick feedback and is easy to use Design the UI to be easily testable. For example, add CSS selectors or set the id on elements in a web page to allow easier selecting. Test on all primary devices that the user uses, don't just test on a single device or OS. When a test mutates data ensure that data is created on demand and cleaned up after. 
The consequence of not doing this would be inconsistent testing.","title":"Writing Tests"},{"location":"automated-testing/ui-testing/#common-issues","text":"UI Testing can get very challenging at the lower level, especially with a testing framework like Selenium. If you choose to go this route, then you'll likely encounter timeouts, missing elements, and you'll have significant friction with the testing framework itself. Due to many issues with UI testing there have been a number of free and paid solutions that help alleviate certain issues with frameworks like Selenium. This is why you'll find Cypress in the recommended frameworks as it solves many of the known issues with Selenium. This is an important point though. Depending on the UI testing framework you choose will result in either a smoother test creation experience, or a very frustrating and time-consuming one. If you were to choose just Selenium the development costs and time costs would likely be very high. It's better to use either a framework built on top of Selenium or one that attempts to solve many of the problems with something like Selenium. Note there that there are further considerations as when running in headless mode the UI can render differently than what you may see on your development machine, particularly with web applications. Furthermore, note that when rendering in different page dimensions elements may disappear on the page due to CSS rules, therefore not be selectable by certain frameworks with default options out of the box. All of these issues can be resolved and worked around, but the rendering demonstrates another particular challenge of UI testing.","title":"Common Issues"},{"location":"automated-testing/ui-testing/#specific-guidance","text":"Recommended testing frameworks: Web BrowserStack Cypress Jest Selenium Appium OS/Mobile Applications Coded UI tests (CUITs) Xamarin.UITest BrowserStack Appium Note that the framework listed above that is paid is BrowserStack, it's listed as it's an industry standard, the rest are open source and free.","title":"Specific Guidance"},{"location":"automated-testing/ui-testing/teams-tests/","text":"Automated UI Tests for a Teams Application Overview This is an overview on how you can implement UI tests for a custom Teams application. The insights provided can also be applied to automated end-to-end testing. General Observations Testing in a web browser is easier than on a native app. Testing a Teams app on a mobile device in an automated way is more challenging due to the fact that you are testing an app within an app: There is no Android Application Package (APK) / iOS App Store Package (IPA) publicly available for Microsoft Teams app itself. Mobile testing frameworks are designed with the assumption that you own the APK/IPA of the app under test. Workarounds need to be found to first automate the installation of Teams. Should you choose working with emulators, testing in a local Windows box is more stable than in a CI/CD. The latter involves a CI/CD agent and an emulator in a VM. When deciding whether to implement such tests, consider the project requirements as well as the advantages and disadvantages. Manual UI tests are often an acceptable solution due to their low effort requirements. The following are learnings from various engagements: Web Based UI Tests To implement web-based UI tests for your Teams application, follow the same approach as you would for testing any other web application with a UI. UI testing provides valuable guidance in this regard. 
Your starting point for the test would be to automatically launch a browser (using Selenium or similar frameworks) and navigate to https://teams.microsoft.com . If you want to test a Teams app that hasn\u2019t been published in the Teams store yet or if you\u2019d like to test the DEV/QA version of your app, you can use the Teams Toolkit and package your app based on the manifest.json . npx teamsfx package -- env dev -- manifest - path ... Once the app is installed, implement selectors to access your custom app and to perform various actions within the app. Pipeline If you are using Selenium and Edge as the browser, consider leveraging the selenium/standalone-edge Docker image which contains a standalone Selenium server with the Microsoft Edge browser installed. By default, it will run in headless mode, but by setting START_XVFB variable to True , you can control whether to start a virtual framebuffer server (Xvfb) that allows GUI applications to run without a display. Below is a code snippet which illustrates the usage of the image in a Gitlab pipeline: ... run-tests-dev: allow_failure: false image: ... environment: name: dev stage: tests services: - name: selenium/standalone-edge:latest alias: selenium variables: START_XVFB: \"true\" description: \"Start Xvfb server\" ... When running a test, you need to use the Selenium server URL for remote execution. With the definition from above, the URL is: http://selenium:4444/wd/hub . The code snippet below illustrates how you can initialize the Selenium driver to point to the remote Selenium server using JavaScript: var { Builder } = require ( \"selenium-webdriver\" ); const edge = require ( \"selenium-webdriver/edge\" ); var buildEdgeDriver = function () { let builder = new Builder (). forBrowser ( \"MicrosoftEdge\" ); builder = builder . usingServer ( \"http://selenium:4444/wd/hub\" ); builder . setEdgeOptions ( new edge . Options (). addArguments ( \"--inprivate\" )); return builder . build (); }; Mobile Based UI Tests Testing your custom Teams application on mobile devices is a bit more difficult than using the web-based approach as it requires usage of actual or simulated devices. Running such tests in a CI/CD pipeline can be more difficult and resource-intensive. One approach is to use real devices or cloud-based emulators from vendors such as BrowserStack which requires a license. Alternatively, you can use virtual devices hosted in Azure Virtual Machines. Option 1: Using Android Virtual Devices (AVD) This approach enables the creation of Android UI tests using virtual devices. It comes with the advantage of not requiring paid licenses to certain vendors. However, due to the nature of emulators, compared to real devices, it may prove to be less stable. Always choose the solution that best fits your project requirements and resources. Overall setup: AVD - Android Virtual Devices - which are virtual representation of physical Android devices. Appium is an open-source project designed to facilitate UI automation of many app platforms, including mobile. Appium is based on the W3C WebDriver specification . Note: If you look at these commands in the WebDriver specification, you will notice that they are not defined in terms of any particular programming language. They are not Java commands, or JavaScript commands, or Python commands. Instead, they form part of an HTTP API which can be accessed from within any programming language. 
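To make the "HTTP API" point concrete, here is a minimal sketch (not part of the original guidance) that creates a WebDriver session by sending the raw W3C new-session request straight to an Appium server. It assumes an Appium 2 server listening on its default port 4723 and default base path, and the capability values simply mirror the Teams values used elsewhere in this document; any WebDriver client library is ultimately issuing HTTP calls of this shape on your behalf.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class RawWebDriverSessionDemo
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // W3C "New Session" request body; the appium:* capabilities mirror
        // the Teams values used later in this document.
        var newSession = """
        {
          "capabilities": {
            "alwaysMatch": {
              "platformName": "Android",
              "appium:automationName": "UiAutomator2",
              "appium:appPackage": "com.microsoft.teams",
              "appium:appActivity": "com.microsoft.skype.teams.Launcher"
            }
          }
        }
        """;

        // Assumes an Appium 2 server on its default port (4723) and base path ("/").
        var response = await http.PostAsync(
            "http://127.0.0.1:4723/session",
            new StringContent(newSession, Encoding.UTF8, "application/json"));

        // The response contains the session id that all subsequent WebDriver
        // commands (also plain HTTP calls) are addressed to.
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```

In practice you would let a client library such as WebDriverIO issue these calls for you; the sketch only illustrates that the protocol itself is language-neutral.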
Appium implements a client-server architecture: The server (consisting of Appium itself along with any drivers or plugins you are using for automation) is connected to the devices under test, and is actually responsible for making automation happen on those devices. UiAutomator driver is compatible with Android platform. The client is responsible for sending commands to the server over the network, and receiving responses from the server as a result. You can choose the language of your choice to write the commands. For example, for Javascript WebDriverIO can be used as client. Here you can read more about Appium ecosystem The advantage of this architecture is that it opens the possibility of running the server in a VM, and the client in a pipeline, enabling the tests to be ran automatically on scheduled basis as part of CI/CD pipelines. How to Run Mobile Tests Locally on a Windows Machine Using AVD? This approach involves: An emulator ( AVD - Android Virtual Devices ), which will represent the physical device. Appium server , which will redirect the commands from the test to your virtual device. Creating an Android Virtual Device Install Android Studio from official link . Note: At the time of writing the documentation, the latest version available was Android Studio Giraffe, 2022.3.1 Patch 2 for Window. Set ANDROID_HOME environment variable to point to the installation path of Android SDK. i.e. C:Users\\\\AppData\\Local\\Android\\Sdk Install Java Development Kit (JDK) from official link . For the most recent devices JDK 9 is required, otherwise JDK 8 is required. Make sure you get the JDK and not the JRE. Set JAVA_HOME environment variable to the installation path, i.e. C:\\Program Files\\Java\\jdk-11 Create an AVD (Android Virtual Device): - Open Android Studio. From the Android Studio welcome screen, select More Action -> Virtual Device Manager , as instructed here - Click Create Device . - Choose a device definition with Play Store enabled . This is important, otherwise Teams cannot be installed on the device. - Choose a System image from the Recommended tab which includes access to Google Play services. You may need to install it before selecting it. - Start the emulator by clicking on the Run button from the Device Manage screen. - Manually install Microsoft Teams from Google Playstore on the device. Setting up Appium Install appium : Download NodeJs, if it is not already installed on your machine: Download | Node.js (nodejs.org) Install Appium globally: Install Appium - Appium Documentation Install the UiAutomator2 driver: Install the UiAutomator2 Driver - Appium Documentation . Go through the Set up Android automation requirements in the documentation, to make sure you have set up everything correctly. Read more about Appium Drivers here . - Start appium server by running appium command in a command prompt. Useful commands List emulators that you have previously created, without opening Android Studio: emulator -list-avds How to run Teams mobile tests in a pipeline using an Azure VM? This approach leverages the fact that Appium implements a client-server architecture. In this approach, the Appium server as well as the AVD run on an Azure VM, while the client operates within a pipeline and sends commands to be executed on the device. Configure the VM This approach involves hosting a virtual device within a virtual machine. To set up the emulator (Android Virtual Device) in an Azure VM, the VM must support nested virtualization . 
Azure VM configuration which, at the time of writing the documentation, worked successfully with AVD and appium: Operating system: Windows (Windows-10 Pro) VM generation: V1 Size: Standard D4ds v5 16 GiB memory Enable connection from outside to Appium server on the VM Note: By default appium server runs on port 4723. The rest of the steps will assume that this is the port where your appium server runs. In order to be able to reach appium server which runs on the VM from outside: Create an Inbound Rule for port 4723 from within the VM. Create an Inbound Security Rule in the NSG (Network Security Group) of the VM to be able to connect from that IP address to port 4723: - Find out the IP of the machine on which the tests will run on. - Replace the Source IP Address with the IP of your machine. Installing Android Studio and create AVD inside the VM Follow the instructions under the end to end tests on a Windows machine section to install Android Studio and create an Android Virtual Device. When you launch the emulator, it may show a warning as below and will eventually crash: Solution to fix it: 1. Enable Windows Hypervisor Platform 1. Enable Hyper-V (if not enabled by default) 1. Restart the VM. 1. Restart the AVD. How to inspect the Teams app in an Azure Virtual Device (AVD)? Inspecting the app is highly valuable when writing new tests, as it enables you to identify the unique IDs of various elements displayed on the screen. This process is similar to using DevTools, which allows you to navigate through the Document Object Model (DOM) of a web page. Appium Inspector is a very useful tool that allows you to inspect an app runing on an emulator. Note: This section assumes that you have already performed the prerequisites from How to run mobile test locally on a Windows machine using AVD? Steps Run the appium server with --alow-cors flag by running the following command in a terminal: appium --allow-cors Go to https://inspector.appiumpro.com and type in the following properties: { \"appium:deviceName\" : \"your-emulator-name\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"UiAutomator2\" , \"platformName\" : \"Android\" } \"appium:deviceName\" - is the name of your emulator. In Useful commands sections from above, you can see how to get the name of your AVD. \"appium:appPackage\" - is the name of the package, should be kept to \" com.microsoft.teams \". \"appium:appActivity\"- is the name of the activity in the app that you want to launch, should be kept to \" com.microsoft.skype.teams.Launcher \" \"appium:automationName\" - is the name of the driver you are using, in this case, \" UiAutomator2 \" If the appium server runs on your local machine at the default portal, then Remote Host and Remote Port can be kept to the default values. The configuration should look similar to the printscren below: Press on Start Session . - In the browser, you should see a similar view as below: You can do any action on the emulator, and if you press on the \"Refresh\" button in the browser, the left hand side of the Appium Inspector will reflect your app. In the App Source you will be able to see the IDs of the elements, so you can write relevant selectors in your tests. Connecting to Appium server Below it is outlined how this can be achieved with JavaScript. A similar approach can be followed for other languages. 
Assuming you are using webdriverio as the client, you would need to initialize the remote connection as follows: const opts = { port : 4723 , hostname : \"your-hostname\" , capabilities : { platformName : \"android\" , \"appium:deviceName\" : \"the-name-of-the-virtual-device\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"the-name-of-the-driver\" , }, }; // Create a new WebDriverIO instance with the Appium server URL and capabilities await wdio . remote ( opts ); \"port\": the port on which the Appium server runs on. By default, it is 4723. \"hostname\": the IP of the machine where the Appium sever runs on. If it is running locally, that is 127.0.0.1. If it runs in an Azure VM, it would be the public IP address of the VM. Note: ensure you have followed the steps from 2. Enable connection from outside to Appium server on the VM . \"platformName\": Appium can be used to connect to different platforms (Windows, iOS, Android). In our case, it would be \"android\". \"appium:deviceName\": the name of the Android Virtual Device. See Useful commands on how to find the name of the device. \"appium:appPackage\": the name of the app's package that you would like to launch. Teams' package name is \"com.microsoft.teams\". \"appium:appActivity\": the activity within Teams that you would like to launch on the device. In our case, we would like just to launch the app. The activity name for launching Teams is called \"com.microsoft.skype.teams.Launcher\". \"appium:automationName\": the name of the driver you are using. Note: Appium can communicate to different platforms. This is achieved by installing a dedicated driver, designed for each platform. In our case, it would be UiAutomator2 or Espresso , since they are both designed for Android platform. Option 2: Using BrowserStack BrowserStack serves as a cloud-based platform that enables developers to test both the web and mobile application across various browsers, operating systems, and real mobile devices. This can be seen as an alternative solution to the approach described earlier. The specific insights provided below relate to implementing such tests for a custom Microsoft Teams application: BrowserStack does not support out of the box the installation of Teams from the App Store or Play Store. However, there is a workaround, described in their documentation . Therefore, if you choose to go this way, you would first need to implement a step that installs Teams on the cloud-based device, by implementing the workaround described above. You may encounter issues with Google login, as it requires a newly created Google account, in order to log in to the store. To overcome this, make sure to disable 2FA from Google, further described in Troubleshooting Google login issues .","title":"Automated UI Tests for a Teams Application"},{"location":"automated-testing/ui-testing/teams-tests/#automated-ui-tests-for-a-teams-application","text":"","title":"Automated UI Tests for a Teams Application"},{"location":"automated-testing/ui-testing/teams-tests/#overview","text":"This is an overview on how you can implement UI tests for a custom Teams application. The insights provided can also be applied to automated end-to-end testing.","title":"Overview"},{"location":"automated-testing/ui-testing/teams-tests/#general-observations","text":"Testing in a web browser is easier than on a native app. 
Testing a Teams app on a mobile device in an automated way is more challenging due to the fact that you are testing an app within an app: There is no Android Application Package (APK) / iOS App Store Package (IPA) publicly available for Microsoft Teams app itself. Mobile testing frameworks are designed with the assumption that you own the APK/IPA of the app under test. Workarounds need to be found to first automate the installation of Teams. Should you choose working with emulators, testing in a local Windows box is more stable than in a CI/CD. The latter involves a CI/CD agent and an emulator in a VM. When deciding whether to implement such tests, consider the project requirements as well as the advantages and disadvantages. Manual UI tests are often an acceptable solution due to their low effort requirements. The following are learnings from various engagements:","title":"General Observations"},{"location":"automated-testing/ui-testing/teams-tests/#web-based-ui-tests","text":"To implement web-based UI tests for your Teams application, follow the same approach as you would for testing any other web application with a UI. UI testing provides valuable guidance in this regard. Your starting point for the test would be to automatically launch a browser (using Selenium or similar frameworks) and navigate to https://teams.microsoft.com . If you want to test a Teams app that hasn\u2019t been published in the Teams store yet or if you\u2019d like to test the DEV/QA version of your app, you can use the Teams Toolkit and package your app based on the manifest.json . npx teamsfx package -- env dev -- manifest - path ... Once the app is installed, implement selectors to access your custom app and to perform various actions within the app.","title":"Web Based UI Tests"},{"location":"automated-testing/ui-testing/teams-tests/#pipeline","text":"If you are using Selenium and Edge as the browser, consider leveraging the selenium/standalone-edge Docker image which contains a standalone Selenium server with the Microsoft Edge browser installed. By default, it will run in headless mode, but by setting START_XVFB variable to True , you can control whether to start a virtual framebuffer server (Xvfb) that allows GUI applications to run without a display. Below is a code snippet which illustrates the usage of the image in a Gitlab pipeline: ... run-tests-dev: allow_failure: false image: ... environment: name: dev stage: tests services: - name: selenium/standalone-edge:latest alias: selenium variables: START_XVFB: \"true\" description: \"Start Xvfb server\" ... When running a test, you need to use the Selenium server URL for remote execution. With the definition from above, the URL is: http://selenium:4444/wd/hub . The code snippet below illustrates how you can initialize the Selenium driver to point to the remote Selenium server using JavaScript: var { Builder } = require ( \"selenium-webdriver\" ); const edge = require ( \"selenium-webdriver/edge\" ); var buildEdgeDriver = function () { let builder = new Builder (). forBrowser ( \"MicrosoftEdge\" ); builder = builder . usingServer ( \"http://selenium:4444/wd/hub\" ); builder . setEdgeOptions ( new edge . Options (). addArguments ( \"--inprivate\" )); return builder . build (); };","title":"Pipeline"},{"location":"automated-testing/ui-testing/teams-tests/#mobile-based-ui-tests","text":"Testing your custom Teams application on mobile devices is a bit more difficult than using the web-based approach as it requires usage of actual or simulated devices. 
Running such tests in a CI/CD pipeline can be more difficult and resource-intensive. One approach is to use real devices or cloud-based emulators from vendors such as BrowserStack which requires a license. Alternatively, you can use virtual devices hosted in Azure Virtual Machines.","title":"Mobile Based UI Tests"},{"location":"automated-testing/ui-testing/teams-tests/#option-1-using-android-virtual-devices-avd","text":"This approach enables the creation of Android UI tests using virtual devices. It comes with the advantage of not requiring paid licenses to certain vendors. However, due to the nature of emulators, compared to real devices, it may prove to be less stable. Always choose the solution that best fits your project requirements and resources. Overall setup: AVD - Android Virtual Devices - which are virtual representation of physical Android devices. Appium is an open-source project designed to facilitate UI automation of many app platforms, including mobile. Appium is based on the W3C WebDriver specification . Note: If you look at these commands in the WebDriver specification, you will notice that they are not defined in terms of any particular programming language. They are not Java commands, or JavaScript commands, or Python commands. Instead, they form part of an HTTP API which can be accessed from within any programming language. Appium implements a client-server architecture: The server (consisting of Appium itself along with any drivers or plugins you are using for automation) is connected to the devices under test, and is actually responsible for making automation happen on those devices. UiAutomator driver is compatible with Android platform. The client is responsible for sending commands to the server over the network, and receiving responses from the server as a result. You can choose the language of your choice to write the commands. For example, for Javascript WebDriverIO can be used as client. Here you can read more about Appium ecosystem The advantage of this architecture is that it opens the possibility of running the server in a VM, and the client in a pipeline, enabling the tests to be ran automatically on scheduled basis as part of CI/CD pipelines.","title":"Option 1: Using Android Virtual Devices (AVD)"},{"location":"automated-testing/ui-testing/teams-tests/#how-to-run-mobile-tests-locally-on-a-windows-machine-using-avd","text":"This approach involves: An emulator ( AVD - Android Virtual Devices ), which will represent the physical device. Appium server , which will redirect the commands from the test to your virtual device.","title":"How to Run Mobile Tests Locally on a Windows Machine Using AVD?"},{"location":"automated-testing/ui-testing/teams-tests/#creating-an-android-virtual-device","text":"Install Android Studio from official link . Note: At the time of writing the documentation, the latest version available was Android Studio Giraffe, 2022.3.1 Patch 2 for Window. Set ANDROID_HOME environment variable to point to the installation path of Android SDK. i.e. C:Users\\\\AppData\\Local\\Android\\Sdk Install Java Development Kit (JDK) from official link . For the most recent devices JDK 9 is required, otherwise JDK 8 is required. Make sure you get the JDK and not the JRE. Set JAVA_HOME environment variable to the installation path, i.e. C:\\Program Files\\Java\\jdk-11 Create an AVD (Android Virtual Device): - Open Android Studio. From the Android Studio welcome screen, select More Action -> Virtual Device Manager , as instructed here - Click Create Device . 
- Choose a device definition with Play Store enabled . This is important, otherwise Teams cannot be installed on the device. - Choose a System image from the Recommended tab which includes access to Google Play services. You may need to install it before selecting it. - Start the emulator by clicking on the Run button from the Device Manage screen. - Manually install Microsoft Teams from Google Playstore on the device.","title":"Creating an Android Virtual Device"},{"location":"automated-testing/ui-testing/teams-tests/#setting-up-appium","text":"Install appium : Download NodeJs, if it is not already installed on your machine: Download | Node.js (nodejs.org) Install Appium globally: Install Appium - Appium Documentation Install the UiAutomator2 driver: Install the UiAutomator2 Driver - Appium Documentation . Go through the Set up Android automation requirements in the documentation, to make sure you have set up everything correctly. Read more about Appium Drivers here . - Start appium server by running appium command in a command prompt.","title":"Setting up Appium"},{"location":"automated-testing/ui-testing/teams-tests/#useful-commands","text":"List emulators that you have previously created, without opening Android Studio: emulator -list-avds","title":"Useful commands"},{"location":"automated-testing/ui-testing/teams-tests/#how-to-run-teams-mobile-tests-in-a-pipeline-using-an-azure-vm","text":"This approach leverages the fact that Appium implements a client-server architecture. In this approach, the Appium server as well as the AVD run on an Azure VM, while the client operates within a pipeline and sends commands to be executed on the device.","title":"How to run Teams mobile tests in a pipeline using an Azure VM?"},{"location":"automated-testing/ui-testing/teams-tests/#configure-the-vm","text":"This approach involves hosting a virtual device within a virtual machine. To set up the emulator (Android Virtual Device) in an Azure VM, the VM must support nested virtualization . Azure VM configuration which, at the time of writing the documentation, worked successfully with AVD and appium: Operating system: Windows (Windows-10 Pro) VM generation: V1 Size: Standard D4ds v5 16 GiB memory","title":"Configure the VM"},{"location":"automated-testing/ui-testing/teams-tests/#enable-connection-from-outside-to-appium-server-on-the-vm","text":"Note: By default appium server runs on port 4723. The rest of the steps will assume that this is the port where your appium server runs. In order to be able to reach appium server which runs on the VM from outside: Create an Inbound Rule for port 4723 from within the VM. Create an Inbound Security Rule in the NSG (Network Security Group) of the VM to be able to connect from that IP address to port 4723: - Find out the IP of the machine on which the tests will run on. - Replace the Source IP Address with the IP of your machine.","title":"Enable connection from outside to Appium server on the VM"},{"location":"automated-testing/ui-testing/teams-tests/#installing-android-studio-and-create-avd-inside-the-vm","text":"Follow the instructions under the end to end tests on a Windows machine section to install Android Studio and create an Android Virtual Device. When you launch the emulator, it may show a warning as below and will eventually crash: Solution to fix it: 1. Enable Windows Hypervisor Platform 1. Enable Hyper-V (if not enabled by default) 1. Restart the VM. 1. 
Restart the AVD.","title":"Installing Android Studio and create AVD inside the VM"},{"location":"automated-testing/ui-testing/teams-tests/#how-to-inspect-the-teams-app-in-an-azure-virtual-device-avd","text":"Inspecting the app is highly valuable when writing new tests, as it enables you to identify the unique IDs of various elements displayed on the screen. This process is similar to using DevTools, which allows you to navigate through the Document Object Model (DOM) of a web page. Appium Inspector is a very useful tool that allows you to inspect an app runing on an emulator. Note: This section assumes that you have already performed the prerequisites from How to run mobile test locally on a Windows machine using AVD?","title":"How to inspect the Teams app in an Azure Virtual Device (AVD)?"},{"location":"automated-testing/ui-testing/teams-tests/#steps","text":"Run the appium server with --alow-cors flag by running the following command in a terminal: appium --allow-cors Go to https://inspector.appiumpro.com and type in the following properties: { \"appium:deviceName\" : \"your-emulator-name\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"UiAutomator2\" , \"platformName\" : \"Android\" } \"appium:deviceName\" - is the name of your emulator. In Useful commands sections from above, you can see how to get the name of your AVD. \"appium:appPackage\" - is the name of the package, should be kept to \" com.microsoft.teams \". \"appium:appActivity\"- is the name of the activity in the app that you want to launch, should be kept to \" com.microsoft.skype.teams.Launcher \" \"appium:automationName\" - is the name of the driver you are using, in this case, \" UiAutomator2 \" If the appium server runs on your local machine at the default portal, then Remote Host and Remote Port can be kept to the default values. The configuration should look similar to the printscren below: Press on Start Session . - In the browser, you should see a similar view as below: You can do any action on the emulator, and if you press on the \"Refresh\" button in the browser, the left hand side of the Appium Inspector will reflect your app. In the App Source you will be able to see the IDs of the elements, so you can write relevant selectors in your tests. Connecting to Appium server Below it is outlined how this can be achieved with JavaScript. A similar approach can be followed for other languages. Assuming you are using webdriverio as the client, you would need to initialize the remote connection as follows: const opts = { port : 4723 , hostname : \"your-hostname\" , capabilities : { platformName : \"android\" , \"appium:deviceName\" : \"the-name-of-the-virtual-device\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"the-name-of-the-driver\" , }, }; // Create a new WebDriverIO instance with the Appium server URL and capabilities await wdio . remote ( opts ); \"port\": the port on which the Appium server runs on. By default, it is 4723. \"hostname\": the IP of the machine where the Appium sever runs on. If it is running locally, that is 127.0.0.1. If it runs in an Azure VM, it would be the public IP address of the VM. Note: ensure you have followed the steps from 2. Enable connection from outside to Appium server on the VM . \"platformName\": Appium can be used to connect to different platforms (Windows, iOS, Android). 
In our case, it would be \"android\". \"appium:deviceName\": the name of the Android Virtual Device. See Useful commands on how to find the name of the device. \"appium:appPackage\": the name of the app's package that you would like to launch. Teams' package name is \"com.microsoft.teams\". \"appium:appActivity\": the activity within Teams that you would like to launch on the device. In our case, we would like just to launch the app. The activity name for launching Teams is called \"com.microsoft.skype.teams.Launcher\". \"appium:automationName\": the name of the driver you are using. Note: Appium can communicate to different platforms. This is achieved by installing a dedicated driver, designed for each platform. In our case, it would be UiAutomator2 or Espresso , since they are both designed for Android platform.","title":"Steps"},{"location":"automated-testing/ui-testing/teams-tests/#option-2-using-browserstack","text":"BrowserStack serves as a cloud-based platform that enables developers to test both the web and mobile application across various browsers, operating systems, and real mobile devices. This can be seen as an alternative solution to the approach described earlier. The specific insights provided below relate to implementing such tests for a custom Microsoft Teams application: BrowserStack does not support out of the box the installation of Teams from the App Store or Play Store. However, there is a workaround, described in their documentation . Therefore, if you choose to go this way, you would first need to implement a step that installs Teams on the cloud-based device, by implementing the workaround described above. You may encounter issues with Google login, as it requires a newly created Google account, in order to log in to the store. To overcome this, make sure to disable 2FA from Google, further described in Troubleshooting Google login issues .","title":"Option 2: Using BrowserStack"},{"location":"automated-testing/unit-testing/","text":"Unit Testing Unit testing is a fundamental tool in every developer's toolbox. Unit tests not only help us test our code, they encourage good design practices, reduce the chances of bugs reaching production, and can even serve as examples or documentation on how code functions. Properly written unit tests can also improve developer efficiency. Unit testing also is one of the most commonly misunderstood forms of testing. Unit testing refers to a very specific type of testing; a unit test should be: Provably reliable - should be 100% reliable so failures indicate a bug in the code Fast - should run in milliseconds, a whole unit testing suite shouldn't take longer than a couple seconds Isolated - removing all external dependencies ensures reliability and speed Why Unit Testing It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we write them? Unit tests reduce costs by catching bugs earlier and preventing regressions increase developer confidence in changes speed up the developer inner loop act as documentation as code For more details, see all the detailed descriptions of the points above . Unit Testing Design Blocks Unit testing is the lowest level of testing and as such generally has few components and dependencies. The system under test (abbreviated SUT) is the \"unit\" we are testing. Generally these are methods or functions, but depending on the language these could be different. 
In general, you want the unit to be as small as possible though. Most languages also have a wide suite of unit testing frameworks and test runners. These test frameworks have a wide range of functionality, but the base functionality should be a way to organize your tests and run them quickly. Finally, there is your unit test code ; unit test code is generally short and simple, preferring repetition to adding layers and complexity to the code. Applying the Unit Testing Getting started with writing a unit test is much easier than some other test types since it should require next to no setup and is just code. Each test framework is different in how you organize and write your tests, but the general techniques and best practices of writing a unit test are universal. Techniques These are some commonly used techniques that will help when authoring unit tests. For some examples, see the pages on using abstraction and dependency injection to author a unit test , or how to do test-driven development . Note that some of these techniques are more specific to strongly typed, object-oriented languages. Functional languages and scripting languages have similar techniques that may look different, but these terms are commonly used in all unit testing examples. Abstraction Abstraction is when we take an exact implementation detail, and we generalize it into a concept instead. This technique can be used in creating testable design and is used often especially in object-oriented languages. For unit tests, abstraction is commonly used to break a hard dependency and replace it with an abstraction. That abstraction then allows for greater flexibility in the code and allows for a mock or simulator to be used in its place. One of the side effects of abstracting dependencies is that you may have an abstraction that has no test coverage. This is a case where unit testing is not well-suited; you cannot expect to unit test everything, and things like dependencies will always be an uncovered case. This is why even if you have a robust unit testing suite, integration or functional testing should still be used - without that, a change in the way the dependency functions would never be caught. When building wrappers around third-party dependencies, it is best to keep the implementations with as little logic as possible, using a very simple facade that calls the dependency. An example of using abstraction can be found here . Dependency Injection Dependency injection is a technique which allows us to extract dependencies from our code. In a normal use-case of a dependent class, the dependency is constructed and used within the system under test. This creates a hard dependency between the two classes, which can make it particularly hard to test in isolation. Dependencies could be things like classes wrapping a REST API, or even something as simple as file access. By injecting the dependencies into our system rather than constructing them, we have \"inverted control\" of the dependency. You may see \"Inversion of Control\" and \"Dependency Injection\" used as separate terms, but it is very hard to have one and not the other, with some arguing that Dependency Injection is a more specific way of saying inversion of control . In certain languages such as C#, not using dependency injection can lead to code that is not unit testable since there is no way to inject mocked objects. Keeping testability in mind from the beginning and evaluating using dependency injection can save you from a time-intensive refactor later.
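As a rough, hypothetical sketch of constructor injection in C# (the IClock, SystemClock, and Greeter types below are invented for this illustration and are not taken from the linked example), the dependency is passed in rather than constructed inside the class:

using System;

// A hypothetical dependency hidden behind an abstraction.
public interface IClock
{
    DateTime UtcNow { get; }
}

// Production implementation that wraps the real system clock.
public class SystemClock : IClock
{
    public DateTime UtcNow => DateTime.UtcNow;
}

// The system under test receives its dependency through the constructor,
// so a unit test can inject a fixed, predictable clock instead.
public class Greeter
{
    private readonly IClock clock;

    public Greeter(IClock clock)
    {
        this.clock = clock;
    }

    public bool IsMorning()
    {
        return this.clock.UtcNow.Hour < 12;
    }
}

In a unit test, Greeter can be constructed with a test double that always returns a fixed time, keeping the test fast and deterministic without touching the real clock.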
One of the downsides of dependency injection is that it can easily go overboard. While there are no longer hard dependencies, there is still coupling between the interfaces, and passing around every interface implementation into every class presents just as many downsides as not using Dependency Injection. Being intentional with what dependencies get injected to what classes, is key to developing a maintainable system. Many languages include special Dependency Injection frameworks that take care of the boilerplate code and construction of the objects. Examples of this are Spring in Java or built into ASP.NET Core An example of using dependency injection can be found here . Test-Driven Development Test-Driven Development (TDD) is less a technique in how your code is designed, but a technique for writing your code that will lead you to a testable design from the start. The basic premise of test-driven development is that you write your test code first and then write the system under test to match the test you just wrote. This way all the test design is done up front and by the time you finish writing your system code, you are already at 100% test pass rate and test coverage. It also guarantees testable design is built into the system since the test was written first! For more information on TDD and an example, see the page on Test-Driven Development Best Practices Arrange/Act/Assert One common form of organizing your unit test code is called Arrange/Act/Assert. This divides up your unit test into 3 different discrete sections: Arrange - Set up all the variables, mocks, interfaces, and state you will need to run the test Act - Run the system under test, passing in any of the above objects that were created Assert - Check that with the given state that the system acted appropriately. Using this pattern to write tests makes them very readable and also familiar to future developers who would need to read your unit tests. Example Let's assume we have a class MyObject with a method TrySomething that interacts with an array of strings, but if the array has no elements, it will return false. We want to write a test that checks the case where array has no elements: [Fact] public void TrySomething_NoElements_ReturnsFalse () { // Arrange var elements = Array . Empty < string > (); var myObject = new MyObject (); // Act var myReturn = myObject . TrySomething ( elements ); // Assert Assert . False ( myReturn ); } Keep Tests Small and Test Only One Thing Unit tests should be short and test only one thing. This makes it easy to diagnose when there was a failure without needing something like which line number the test failed at. When using Arrange/Act/Assert , think of it like testing just one thing in the \"Act\" phase. There is some disagreement on whether testing one thing means \"assert one thing\" or \"test one state, with multiple asserts if needed\". Both have their advantages and disadvantages, but as with most technical disagreements there is no \"right\" answer. Consistency when writing your tests one way or the other is more important! Using a Standard Naming Convention for All Unit Tests Without having a set standard convention for unit test names, unit test names end up being either not descriptive enough, or duplicated across multiple different test classes. Establishing a standard is not only important for keeping your code consistent, but a good standard also improves the readability and debug-ability of a test. 
In this article, the convention used for all unit tests has been UnitName_StateUnderTest_ExpectedResult , but there are lots of other possible conventions as well; the important thing is to be consistent and descriptive. Having descriptive names such as the one above makes it trivial to find the test when there is a failure, and also already explains what the expectation of the test was and what state caused it to fail. This can be especially helpful when looking at failures in a CI/CD system where all you know is the name of the test that failed - instead now you know the name of the test and exactly why it failed (especially coupled with a test framework that logs helpful output on failures). Things to Avoid Some common pitfalls when writing a unit test that are important to avoid: Sleeps - A sleep can be an indicator that perhaps something is making a request to a dependency that it should not be. In general, if your code is flaky without the sleep, consider why it is failing and if you can remove the flakiness by introducing a more reliable way to communicate potential state changes. Adding sleeps to your unit tests also breaks one of our original tenets of unit testing: tests should be fast, as in on the order of milliseconds. If tests are taking on the order of seconds, they become more cumbersome to run. Reading from disk - It can be really tempting to store the expected return value of a function in a file and read that file to compare the results. This creates a dependency on the system drive, and it breaks our tenet of keeping our unit tests isolated and 100% reliable. Any outside dependency such as file system access could potentially cause intermittent failures. Additionally, this could be a sign that perhaps the test or unit under test is too complex and should be simplified. Calling third-party APIs - When you do not control a third-party library that you are calling into, it's impossible to know for sure what that is doing, and it is best to abstract it out. Otherwise, you may be making REST calls or other potential areas of failure without directly writing the code for it. This is also generally a sign that the design of the system is not entirely testable. It is best to wrap third party API calls in interfaces or other structures so that they do not get invoked in unit tests. For more information see the page on mocking . Unit Testing Frameworks and Tools Test Frameworks Unit test frameworks are constantly changing. For a full list of every unit testing framework see the page on Wikipedia . Frameworks have many features and should be picked based on which feature-set fits best for the particular project. Mock Frameworks Many projects start with a unit test framework and also add a mock framework. While mocking frameworks have their uses and sometimes can be a requirement, it should not be something that is added without considering the broader implications and risks associated with heavy usage of mocks. To see if mocking is right for your project, or if a mock-free approach is more appropriate, see the page on mocking .
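As a rough sketch of what typical mock-framework usage looks like (this assumes the Moq library and xUnit; the IMessageSender and OrderProcessor types are hypothetical and exist only for this illustration):

using Moq;
using Xunit;

public interface IMessageSender
{
    void Send(string recipient, string message);
}

public class OrderProcessor
{
    private readonly IMessageSender sender;

    public OrderProcessor(IMessageSender sender)
    {
        this.sender = sender;
    }

    public void Complete(string customer)
    {
        // Real business logic would run here, then the customer is notified.
        this.sender.Send(customer, \"Your order is complete\");
    }
}

public class OrderProcessorTests
{
    [Fact]
    public void Complete_ValidCustomer_SendsNotification()
    {
        // Arrange
        var mockSender = new Mock<IMessageSender>();
        var processor = new OrderProcessor(mockSender.Object);

        // Act
        processor.Complete(\"customer@example.com\");

        // Assert - behavioral verification: the mock records the calls it received.
        mockSender.Verify(s => s.Send(\"customer@example.com\", It.IsAny<string>()), Times.Once());
    }
}

Note that the assertion verifies behavior (that Send was called) rather than state; this coupling to the implementation is one of the trade-offs discussed on the mocking page and is worth weighing before relying heavily on mocks.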
Tools These tools allow for constant running of your unit tests with in-line code coverage, making the dev inner loop extremely fast and allows for easy TDD: Visual Studio Live Unit Testing Wallaby.js Infinitest for Java PyCrunch for Python Things to Consider Transferring Responsibility to Integration Tests In some situations it is worth considering to include the integration tests in the inner development loop to provide a sufficient code coverage to ensure the system is working properly. The prerequisite for this approach to be successful is to have integration tests being able to execute at a speed comparable to that of unit tests both locally and in a CI environment. Modern application frameworks like .NET or Spring Boot combined with the right mocking or stubbing approach for external dependencies offer excellent capabilities to enable such scenarios for testing. Usually, integration tests only prove that independently developed modules connect together as designed. The test coverage of integration tests can be extended to verify the correct behavior of the system as well. The responsibility of providing a sufficient branch and line code coverage can be transferred from unit tests to integration tests. Instead of several unit tests needed to test a specific case of functionality of the system, one integration scenario is created that covers the entire flow. For example in case of an API, the received HTTP responses and their content are verified for each request in test. This covers both the integration between components of the API and the correctness of its business logic. With this approach efficient integration tests can be treated as an extension of unit testing, taking over the responsibility of validating happy/failure path scenarios. It has the advantage of testing the system as a black box without any knowledge of its internals. Code refactoring has no impact on tests. Common testing techniques as TDD can be applied at a higher level which results in a development process that is driven by acceptance tests. Depending on the project specifics unit tests still play an important role. They can be used to help dictate a testable design at a lower level or to test complex business logic and corner cases if necessary. Conclusion Unit testing is extremely important, but it is also not the silver bullet; having proper unit tests is just a part of a well-tested system. However, writing proper unit tests will help with the design of your system as well as help catch regressions, bugs, and increase developer velocity. Resources Unit Testing Best Practices","title":"Unit Testing"},{"location":"automated-testing/unit-testing/#unit-testing","text":"Unit testing is a fundamental tool in every developer's toolbox. Unit tests not only help us test our code, they encourage good design practices, reduce the chances of bugs reaching production, and can even serve as examples or documentation on how code functions. Properly written unit tests can also improve developer efficiency. Unit testing also is one of the most commonly misunderstood forms of testing. 
Unit testing refers to a very specific type of testing; a unit test should be: Provably reliable - should be 100% reliable so failures indicate a bug in the code Fast - should run in milliseconds, a whole unit testing suite shouldn't take longer than a couple seconds Isolated - removing all external dependencies ensures reliability and speed","title":"Unit Testing"},{"location":"automated-testing/unit-testing/#why-unit-testing","text":"It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we write them? Unit tests reduce costs by catching bugs earlier and preventing regressions increase developer confidence in changes speed up the developer inner loop act as documentation as code For more details, see all the detailed descriptions of the points above .","title":"Why Unit Testing"},{"location":"automated-testing/unit-testing/#unit-testing-design-blocks","text":"Unit testing is the lowest level of testing and as such generally has few components and dependencies. The system under test (abbreviated SUT) is the \"unit\" we are testing. Generally these are methods or functions, but depending on the language these could be different. In general, you want the unit to be as small as possible though. Most languages also have a wide suite of unit testing frameworks and test runners. These test frameworks have a wide range of functionality, but the base functionality should be a way to organize your tests and run them quickly. Finally, there is your unit test code ; unit test code is generally short and simple, preferring repetition to adding layers and complexity to the code.","title":"Unit Testing Design Blocks"},{"location":"automated-testing/unit-testing/#applying-the-unit-testing","text":"Getting started with writing a unit test is much easier than some other test types since it should require next to no setup and is just code. Each test framework is different in how you organize and write your tests, but the general techniques and best practices of writing a unit test are universal.","title":"Applying the Unit Testing"},{"location":"automated-testing/unit-testing/#techniques","text":"These are some commonly used techniques that will help when authoring unit tests. For some examples, see the pages on using abstraction and dependency injection to author a unit test , or how to do test-driven development . Note that some of these techniques are more specific to strongly typed, object-oriented languages. Functional languages and scripting languages have similar techniques that may look different, but these terms are commonly used in all unit testing examples.","title":"Techniques"},{"location":"automated-testing/unit-testing/#abstraction","text":"Abstraction is when we take an exact implementation detail, and we generalize it into a concept instead. This technique can be used in creating testable design and is used often especially in object-oriented languages. For unit tests, abstraction is commonly used to break a hard dependency and replace it with an abstraction. That abstraction then allows for greater flexibility in the code and allows for the a mock or simulator to be used in its place. One of the side effects of abstracting dependencies is that you may have an abstraction that has no test coverage. This is case where unit testing is not well-suited, you can not expect to unit test everything, things like dependencies will always be an uncovered case. 
This is why even if you have a robust unit testing suite, integration or functional testing should still be used - without that, a change in the way the dependency functions would never be caught. When building wrappers around third-party dependencies, it is best to keep the implementations with as little logic as possible, using a very simple facade that calls the dependency. An example of using abstraction can be found here .","title":"Abstraction"},{"location":"automated-testing/unit-testing/#dependency-injection","text":"Dependency injection is a technique which allows us to extract dependencies from our code. In a normal use-case of a dependant class, the dependency is constructed and used within the system under test. This creates a hard dependency between the two classes, which can make it particularly hard to test in isolation. Dependencies could be things like classes wrapping a REST API, or even something as simple as file access. By injecting the dependencies into our system rather than constructing them, we have \"inverted control\" of the dependency. You may see \"Inversion of Control\" and \"Dependency Injection\" used as separate terms, but it is very hard to have one and not the other, with some arguing that Dependency Injection is a more specific way of saying inversion of control . In certain languages such as C#, not using dependency injection can lead to code that is not unit testable since there is no way to inject mocked objects. Keeping testability in mind from the beginning and evaluating using dependency injection can save you from a time-intensive refactor later. One of the downsides of dependency injection is that it can easily go overboard. While there are no longer hard dependencies, there is still coupling between the interfaces, and passing around every interface implementation into every class presents just as many downsides as not using Dependency Injection. Being intentional with what dependencies get injected to what classes, is key to developing a maintainable system. Many languages include special Dependency Injection frameworks that take care of the boilerplate code and construction of the objects. Examples of this are Spring in Java or built into ASP.NET Core An example of using dependency injection can be found here .","title":"Dependency Injection"},{"location":"automated-testing/unit-testing/#test-driven-development","text":"Test-Driven Development (TDD) is less a technique in how your code is designed, but a technique for writing your code that will lead you to a testable design from the start. The basic premise of test-driven development is that you write your test code first and then write the system under test to match the test you just wrote. This way all the test design is done up front and by the time you finish writing your system code, you are already at 100% test pass rate and test coverage. It also guarantees testable design is built into the system since the test was written first! For more information on TDD and an example, see the page on Test-Driven Development","title":"Test-Driven Development"},{"location":"automated-testing/unit-testing/#best-practices","text":"","title":"Best Practices"},{"location":"automated-testing/unit-testing/#arrangeactassert","text":"One common form of organizing your unit test code is called Arrange/Act/Assert. 
This divides up your unit test into 3 different discrete sections: Arrange - Set up all the variables, mocks, interfaces, and state you will need to run the test Act - Run the system under test, passing in any of the above objects that were created Assert - Check that with the given state that the system acted appropriately. Using this pattern to write tests makes them very readable and also familiar to future developers who would need to read your unit tests.","title":"Arrange/Act/Assert"},{"location":"automated-testing/unit-testing/#example","text":"Let's assume we have a class MyObject with a method TrySomething that interacts with an array of strings, but if the array has no elements, it will return false. We want to write a test that checks the case where array has no elements: [Fact] public void TrySomething_NoElements_ReturnsFalse () { // Arrange var elements = Array . Empty < string > (); var myObject = new MyObject (); // Act var myReturn = myObject . TrySomething ( elements ); // Assert Assert . False ( myReturn ); }","title":"Example"},{"location":"automated-testing/unit-testing/#keep-tests-small-and-test-only-one-thing","text":"Unit tests should be short and test only one thing. This makes it easy to diagnose when there was a failure without needing something like which line number the test failed at. When using Arrange/Act/Assert , think of it like testing just one thing in the \"Act\" phase. There is some disagreement on whether testing one thing means \"assert one thing\" or \"test one state, with multiple asserts if needed\". Both have their advantages and disadvantages, but as with most technical disagreements there is no \"right\" answer. Consistency when writing your tests one way or the other is more important!","title":"Keep Tests Small and Test Only One Thing"},{"location":"automated-testing/unit-testing/#using-a-standard-naming-convention-for-all-unit-tests","text":"Without having a set standard convention for unit test names, unit test names end up being either not descriptive enough, or duplicated across multiple different test classes. Establishing a standard is not only important for keeping your code consistent, but a good standard also improves the readability and debug-ability of a test. In this article, the convention used for all unit tests has been UnitName_StateUnderTest_ExpectedResult , but there are lots of other possible conventions as well, the important thing is to be consistent and descriptive. Having descriptive names such as the one above makes it trivial to find the test when there is a failure, and also already explains what the expectation of the test was and what state caused it to fail. This can be especially helpful when looking at failures in a CI/CD system where all you know is the name of the test that failed - instead now you know the name of the test and exactly why it failed (especially coupled with a test framework that logs helpful output on failures).","title":"Using a Standard Naming Convention for All Unit Tests"},{"location":"automated-testing/unit-testing/#things-to-avoid","text":"Some common pitfalls when writing a unit test that are important to avoid: Sleeps - A sleep can be an indicator that perhaps something is making a request to a dependency that it should not be. In general, if your code is flaky without the sleep, consider why it is failing and if you can remove the flakiness by introducing a more reliable way to communicate potential state changes. 
Adding sleeps to your unit tests also breaks one of our original tenets of unit testing: tests should be fast, as in order of milliseconds. If tests are taking on the order of seconds, they become more cumbersome to run. Reading from disk - It can be really tempting to the expected value of a function return in a file and read that file to compare the results. This creates a dependency with the system drive, and it breaks our tenet of keeping our unit tests isolated and 100% reliable. Any outside dependency such as file system access could potentially cause intermittent failures. Additionally, this could be a sign that perhaps the test or unit under test is too complex and should be simplified. Calling third-party APIs - When you do not control a third-party library that you are calling into, it's impossible to know for sure what that is doing, and it is best to abstract it out. Otherwise, you may be making REST calls or other potential areas of failure without directly writing the code for it. This is also generally a sign that the design of the system is not entirely testable. It is best to wrap third party API calls in interfaces or other structures so that they do not get invoked in unit tests. For more information see the page on mocking .","title":"Things to Avoid"},{"location":"automated-testing/unit-testing/#unit-testing-frameworks-and-tools","text":"","title":"Unit Testing Frameworks and Tools"},{"location":"automated-testing/unit-testing/#test-frameworks","text":"Unit test frameworks are constantly changing. For a full list of every unit testing framework see the page on Wikipedia . Frameworks have many features and should be picked based on which feature-set fits best for the particular project.","title":"Test Frameworks"},{"location":"automated-testing/unit-testing/#mock-frameworks","text":"Many projects start with both a unit test framework, and also add a mock framework. While mocking frameworks have their uses and sometimes can be a requirement, it should not be something that is added without considering the broader implications and risks associated with heavy usage of mocks. To see if mocking is right for your project, or if a mock-free approach is more appropriate, see the page on mocking .","title":"Mock Frameworks"},{"location":"automated-testing/unit-testing/#tools","text":"These tools allow for constant running of your unit tests with in-line code coverage, making the dev inner loop extremely fast and allows for easy TDD: Visual Studio Live Unit Testing Wallaby.js Infinitest for Java PyCrunch for Python","title":"Tools"},{"location":"automated-testing/unit-testing/#things-to-consider","text":"","title":"Things to Consider"},{"location":"automated-testing/unit-testing/#transferring-responsibility-to-integration-tests","text":"In some situations it is worth considering to include the integration tests in the inner development loop to provide a sufficient code coverage to ensure the system is working properly. The prerequisite for this approach to be successful is to have integration tests being able to execute at a speed comparable to that of unit tests both locally and in a CI environment. Modern application frameworks like .NET or Spring Boot combined with the right mocking or stubbing approach for external dependencies offer excellent capabilities to enable such scenarios for testing. Usually, integration tests only prove that independently developed modules connect together as designed. 
The test coverage of integration tests can be extended to verify the correct behavior of the system as well. The responsibility of providing a sufficient branch and line code coverage can be transferred from unit tests to integration tests. Instead of several unit tests needed to test a specific case of functionality of the system, one integration scenario is created that covers the entire flow. For example in case of an API, the received HTTP responses and their content are verified for each request in test. This covers both the integration between components of the API and the correctness of its business logic. With this approach efficient integration tests can be treated as an extension of unit testing, taking over the responsibility of validating happy/failure path scenarios. It has the advantage of testing the system as a black box without any knowledge of its internals. Code refactoring has no impact on tests. Common testing techniques as TDD can be applied at a higher level which results in a development process that is driven by acceptance tests. Depending on the project specifics unit tests still play an important role. They can be used to help dictate a testable design at a lower level or to test complex business logic and corner cases if necessary.","title":"Transferring Responsibility to Integration Tests"},{"location":"automated-testing/unit-testing/#conclusion","text":"Unit testing is extremely important, but it is also not the silver bullet; having proper unit tests is just a part of a well-tested system. However, writing proper unit tests will help with the design of your system as well as help catch regressions, bugs, and increase developer velocity.","title":"Conclusion"},{"location":"automated-testing/unit-testing/#resources","text":"Unit Testing Best Practices","title":"Resources"},{"location":"automated-testing/unit-testing/authoring-example/","text":"Writing a Unit Test To illustrate some unit testing techniques for an object-oriented language, let's start with an example of some code we wish to add unit tests for. In this example, we have a configuration class that contains all the startup options for an app we are writing. Normally it reads from a .config file, but we are having three problems with the current implementation: There is a bug in the Configuration class, and we have no unit tests since it relies on reading a config file We can't unit test any of the code that relies on the Configuration class reading a config file In the future, we want to allow for configuration to be saved in the cloud and accessed via REST api. The bug we are trying to fix is that if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown. Our class currently looks like this: using System.IO ; using System.Linq ; public class Configuration { // Public getter properties from configuration object public string MyProperty { get ; private set ; } public void Initialize () { var configContents = File . ReadAllLines ( \".config\" ); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } } Abstraction In our example, we have a single dependency: the file system. Rather than just abstracting the file system entirely, let us think about why we need the file system and abstract the concept rather than the implementation. 
In this case, we are using the File class to read from the config file, and the config contents. The abstraction concept here is some form or configuration reader that returns each line of the configuration in a string array. We could call it ConfigurationReader , and it has a single method, Read , which returns the contents. When creating abstractions, it can be good practice creating an interface for that abstraction, in languages that support it. In the example with C#, we can create an IConfigurationReader interface, and instead of just having a ConfigurationReader class we can be more specific and name if FileConfigurationReader to indicate that it reads from the file system: // IConfigurationReader.cs public interface IConfigurationReader { string [] Read (); } // FileConfigurationReader.cs public class FileConfigurationReader : IConfigurationReader { public string [] Read () { return File . ReadAllLines ( \".config\" ); } } Now that the file dependency has been abstracted away, we need to update our Configuration class's Initialize method to use the new abstraction instead of calling File.ReadAllLines directly: public void Initialize () { var configContents = new FileConfigurationReader (). Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } As you can see, we still have a dependency on the file system, but that dependency has been abstracted out. We will need to use other techniques to break the dependency completely. Dependency Injection In the previous section, we abstracted the file access into a FileConfigurationReader but we still had a dependency on the file system in our function. We can use dependency injection to inject the right reader into our Configuration class: using System.IO ; using System.Linq ; public class Configuration { private readonly IConfigurationReader configReader ; // Public getter properties from configuration object public string MyProperty { get ; private set ; } public Configuration ( IConfigurationReader reader ) { this . configReader = reader ; } public void Initialize () { var configContents = configReader . Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } } Above, a technique was used called Constructor Injection . This uses the object's constructor to set what our dependencies will be, which means whichever object creates the Configuration object will control which reader needs to get passed in. This is an example of \"inversion of control\", previously the Configuration object controlled the dependency, but instead we pushed up the control to whatever component creates this object. Note that we injected the interface IConfigurationReader and not the concrete class. This is what allows us to break the dependency; whereas originally we had a hard-coded dependency on the File class, now we only depend on an object that implements IConfigurationReader . Writing our first unit tests We started down this venture because we have a bug in the Configuration class that was not caught because we do not have unit tests. 
Let us write some unit tests that gives us full coverage of the Configuration class, including a test that tests the scenario described by the bug (if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown). However, we still have one problem, we only have a single implementation of IConfigurationReader , and it uses the file system, meaning any unit tests we write will still have a dependency on the file system! Luckily since we used dependency injection, all we need to do is create an implementation of IConfigurationReader that does not depend on the file system. We could create a mock here, but instead let's create a concrete implementation of the interface which simply returns the passed in string[] - we can call it PassThroughConfigurationReader (for more details on why this approach may be better than mocking, see the page on mocking ) public class PassThroughConfigurationReader : IConfigurationReader { private readonly string [] contents ; public PassThroughConfigurationReader ( string [] contents ) { this . contents = contents ; } public string [] Read () { return this . contents ; } } This simple class will be used in our unit tests, so we can create different states without requiring lots of file access. Now that we have this in place, we can go ahead and write our unit tests, starting with the tests that describe the current behavior: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < KeyNotFoundException > (() => config . Initialize ()); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } } Fixing the Bug All our current tests pass, and give us 100% coverage, however as evidenced by the bug, we must not be covering all possible inputs and outputs. In the case of the bug, multiple empty lines would cause an issue. Additionally, KeyNotFoundException is not a very friendly exception and is an implementation detail, not something that makes sense when designing the Configuration API. Let's add some more tests and align the tests with how we think the Configuration class should behave: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MalformedLine_Throws () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty\" , }); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MultipleEqualSigns_PropertyContainsNoEquals () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myval1=myval2\" , }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myval1=myval2\" , config . MyProperty ); } [Fact] public void Initialize_WithBlankLines_Ignores () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" , string . Empty , }); var config = new Configuration ( reader ); config . 
Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } } Now we have 4 failing tests and 1 passing test, but we have firmly established through the use of these tests how we expect callers to user the Configuration class and what is and isn't allowed as inputs. Now we just need to fix the Configuration class so that our tests pass: public void Initialize () { var configContents = configReader . Read (); if ( configContents . Length == 0 ) { throw new InvalidOperationException ( \"Empty config\" ); } // Config is in the format: key=value var config = configContents . Where ( l => ! string . IsNullOrWhiteSpace ( l )) . Select ( l => { var splitLine = l . Split ( '=' , 2 ); if ( splitLine . Length < 2 ) { throw new InvalidOperationException ( \"Malformed line\" ); } return splitLine ; }) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } Now all our tests pass! We have fixed our bug, added unit tests to the Configuration class, and have much higher confidence in future changes. Untestable Code As described in the abstraction section , not all code can be properly unit tested. In our case we have a single class that has 0% test coverage: FileConfigurationReader . This is expected; in this case we kept FileConfigurationReader as light as possible with no additional logic other than calling into the third-party dependency. FileConfigurationReader is an example of the facade design pattern . Testable Design and Future Improvements One of our original problems described in this example is that in the future we expect to load the configuration from a web API. By doing all the work of abstracting the way we load the configuration text and breaking the dependency on the file system, we have already done all the hard work to enable this future scenario! All that needs to be done next is to create a WebApiConfigurationReader implementation and use that the construct the Configuration object, and it should just work. That is one of the benefits of testable design, in the process of writing our tests in a safe way, a side effect of that is that we already have our dependencies that might change abstracted, and will require minimal changes to implement. Another added benefit is we have multiple possibilities opened by this testable design. For example, we can have a cascading configuration set up now using all 3 IConfigurationReader implementations, including the one we wrote only for our tests! We can first check if internet access is available and if so use WebApiConfigurationReader . If no internet is available, we can fall back to the local config file on the current system using FileConfigurationReader . If for some reason the config file does not exist, we can use the PassThroughConfigurationReader as a hard-coded default configuration somewhere in the code. We have full flexibility to do whatever we may need to do in the future!","title":"Writing a Unit Test"},{"location":"automated-testing/unit-testing/authoring-example/#writing-a-unit-test","text":"To illustrate some unit testing techniques for an object-oriented language, let's start with an example of some code we wish to add unit tests for. 
In this example, we have a configuration class that contains all the startup options for an app we are writing. Normally it reads from a .config file, but we are having three problems with the current implementation: There is a bug in the Configuration class, and we have no unit tests since it relies on reading a config file We can't unit test any of the code that relies on the Configuration class reading a config file In the future, we want to allow for configuration to be saved in the cloud and accessed via REST api. The bug we are trying to fix is that if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown. Our class currently looks like this: using System.IO ; using System.Linq ; public class Configuration { // Public getter properties from configuration object public string MyProperty { get ; private set ; } public void Initialize () { var configContents = File . ReadAllLines ( \".config\" ); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } }","title":"Writing a Unit Test"},{"location":"automated-testing/unit-testing/authoring-example/#abstraction","text":"In our example, we have a single dependency: the file system. Rather than just abstracting the file system entirely, let us think about why we need the file system and abstract the concept rather than the implementation. In this case, we are using the File class to read from the config file, and the config contents. The abstraction concept here is some form or configuration reader that returns each line of the configuration in a string array. We could call it ConfigurationReader , and it has a single method, Read , which returns the contents. When creating abstractions, it can be good practice creating an interface for that abstraction, in languages that support it. In the example with C#, we can create an IConfigurationReader interface, and instead of just having a ConfigurationReader class we can be more specific and name if FileConfigurationReader to indicate that it reads from the file system: // IConfigurationReader.cs public interface IConfigurationReader { string [] Read (); } // FileConfigurationReader.cs public class FileConfigurationReader : IConfigurationReader { public string [] Read () { return File . ReadAllLines ( \".config\" ); } } Now that the file dependency has been abstracted away, we need to update our Configuration class's Initialize method to use the new abstraction instead of calling File.ReadAllLines directly: public void Initialize () { var configContents = new FileConfigurationReader (). Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } As you can see, we still have a dependency on the file system, but that dependency has been abstracted out. We will need to use other techniques to break the dependency completely.","title":"Abstraction"},{"location":"automated-testing/unit-testing/authoring-example/#dependency-injection","text":"In the previous section, we abstracted the file access into a FileConfigurationReader but we still had a dependency on the file system in our function. 
We can use dependency injection to inject the right reader into our Configuration class: using System.IO ; using System.Linq ; public class Configuration { private readonly IConfigurationReader configReader ; // Public getter properties from configuration object public string MyProperty { get ; private set ; } public Configuration ( IConfigurationReader reader ) { this . configReader = reader ; } public void Initialize () { var configContents = configReader . Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } } Above, a technique was used called Constructor Injection . This uses the object's constructor to set what our dependencies will be, which means whichever object creates the Configuration object will control which reader needs to get passed in. This is an example of \"inversion of control\", previously the Configuration object controlled the dependency, but instead we pushed up the control to whatever component creates this object. Note that we injected the interface IConfigurationReader and not the concrete class. This is what allows us to break the dependency; whereas originally we had a hard-coded dependency on the File class, now we only depend on an object that implements IConfigurationReader .","title":"Dependency Injection"},{"location":"automated-testing/unit-testing/authoring-example/#writing-our-first-unit-tests","text":"We started down this venture because we have a bug in the Configuration class that was not caught because we do not have unit tests. Let us write some unit tests that gives us full coverage of the Configuration class, including a test that tests the scenario described by the bug (if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown). However, we still have one problem, we only have a single implementation of IConfigurationReader , and it uses the file system, meaning any unit tests we write will still have a dependency on the file system! Luckily since we used dependency injection, all we need to do is create an implementation of IConfigurationReader that does not depend on the file system. We could create a mock here, but instead let's create a concrete implementation of the interface which simply returns the passed in string[] - we can call it PassThroughConfigurationReader (for more details on why this approach may be better than mocking, see the page on mocking ) public class PassThroughConfigurationReader : IConfigurationReader { private readonly string [] contents ; public PassThroughConfigurationReader ( string [] contents ) { this . contents = contents ; } public string [] Read () { return this . contents ; } } This simple class will be used in our unit tests, so we can create different states without requiring lots of file access. Now that we have this in place, we can go ahead and write our unit tests, starting with the tests that describe the current behavior: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < KeyNotFoundException > (() => config . 
Initialize ()); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } }","title":"Writing our first unit tests"},{"location":"automated-testing/unit-testing/authoring-example/#fixing-the-bug","text":"All our current tests pass, and give us 100% coverage, however as evidenced by the bug, we must not be covering all possible inputs and outputs. In the case of the bug, multiple empty lines would cause an issue. Additionally, KeyNotFoundException is not a very friendly exception and is an implementation detail, not something that makes sense when designing the Configuration API. Let's add some more tests and align the tests with how we think the Configuration class should behave: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MalformedLine_Throws () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty\" , }); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MultipleEqualSigns_PropertyContainsNoEquals () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myval1=myval2\" , }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myval1=myval2\" , config . MyProperty ); } [Fact] public void Initialize_WithBlankLines_Ignores () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" , string . Empty , }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } } Now we have 4 failing tests and 1 passing test, but we have firmly established through the use of these tests how we expect callers to user the Configuration class and what is and isn't allowed as inputs. Now we just need to fix the Configuration class so that our tests pass: public void Initialize () { var configContents = configReader . Read (); if ( configContents . Length == 0 ) { throw new InvalidOperationException ( \"Empty config\" ); } // Config is in the format: key=value var config = configContents . Where ( l => ! string . IsNullOrWhiteSpace ( l )) . Select ( l => { var splitLine = l . Split ( '=' , 2 ); if ( splitLine . Length < 2 ) { throw new InvalidOperationException ( \"Malformed line\" ); } return splitLine ; }) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } Now all our tests pass! We have fixed our bug, added unit tests to the Configuration class, and have much higher confidence in future changes.","title":"Fixing the Bug"},{"location":"automated-testing/unit-testing/authoring-example/#untestable-code","text":"As described in the abstraction section , not all code can be properly unit tested. 
In our case we have a single class that has 0% test coverage: FileConfigurationReader . This is expected; in this case we kept FileConfigurationReader as light as possible with no additional logic other than calling into the third-party dependency. FileConfigurationReader is an example of the facade design pattern .","title":"Untestable Code"},{"location":"automated-testing/unit-testing/authoring-example/#testable-design-and-future-improvements","text":"One of our original problems described in this example is that in the future we expect to load the configuration from a web API. By doing all the work of abstracting the way we load the configuration text and breaking the dependency on the file system, we have already done all the hard work to enable this future scenario! All that needs to be done next is to create a WebApiConfigurationReader implementation and use that the construct the Configuration object, and it should just work. That is one of the benefits of testable design, in the process of writing our tests in a safe way, a side effect of that is that we already have our dependencies that might change abstracted, and will require minimal changes to implement. Another added benefit is we have multiple possibilities opened by this testable design. For example, we can have a cascading configuration set up now using all 3 IConfigurationReader implementations, including the one we wrote only for our tests! We can first check if internet access is available and if so use WebApiConfigurationReader . If no internet is available, we can fall back to the local config file on the current system using FileConfigurationReader . If for some reason the config file does not exist, we can use the PassThroughConfigurationReader as a hard-coded default configuration somewhere in the code. We have full flexibility to do whatever we may need to do in the future!","title":"Testable Design and Future Improvements"},{"location":"automated-testing/unit-testing/custom-connector/","text":"Custom Connector Testing When developing Custom Connectors to put data into the Power Platform there are some strategies you can follow: Unit Testing There are several verifications one can do while developing custom connectors in order to be sure the code is working properly. There are two main ones: Validating the OpenAPI schema which the connector is defined. Validating if the schema also have all the information necessary for the certified connector process. (the later one is optional, but necessary in case you want to publish it as a certified connector). There are several tool to help validate the OpenAPI schema, a list of them are available in this link . A suggested tool would be swagger-cli . On the other hand, to validate if the custom connector you are building is correct to become a certified connector, use the paconn-cli , since it has a validate command that shows missing information from the custom connector definition.","title":"Custom Connector Testing"},{"location":"automated-testing/unit-testing/custom-connector/#custom-connector-testing","text":"When developing Custom Connectors to put data into the Power Platform there are some strategies you can follow:","title":"Custom Connector Testing"},{"location":"automated-testing/unit-testing/custom-connector/#unit-testing","text":"There are several verifications one can do while developing custom connectors in order to be sure the code is working properly. There are two main ones: Validating the OpenAPI schema which the connector is defined. 
Validating whether the schema also has all the information necessary for the certified connector process (the latter is optional, but necessary if you want to publish it as a certified connector). There are several tools to help validate the OpenAPI schema; a list of them is available in this link . A suggested tool would be swagger-cli . On the other hand, to validate if the custom connector you are building is correct to become a certified connector, use the paconn-cli , since it has a validate command that shows missing information from the custom connector definition.","title":"Unit Testing"},{"location":"automated-testing/unit-testing/mocking/","text":"Mocking in Unit Tests One of the key components of writing unit tests is to remove the dependencies your system has and replace them with an implementation you control. The most common method people use as the replacement for the dependency is a mock, and mocking frameworks exist to help make this process easier. Many frameworks and articles use different meanings for the different types of test doubles. A test double is a generic term for any \"pretend\" object used in place of a real one. This term, as well as others used in this page, follows the definitions provided by Martin Fowler . The most commonly used form of test double is the mock, but there are many cases where mocks are perhaps not the best choice and fakes should be considered instead. Stubs A stub allows you to have predetermined behavior that substitutes real behavior. The dependency (abstract class or interface) is implemented as a stub with logic as expected by the client. Stubs can be useful when the clients of the stubs all expect the same set of responses, e.g. when you use a third-party service. The key concept here is that stubs should never fail a unit or integration test where a mock can. Stubs do not require any sort of framework to run, but are usually supported by mocking frameworks to quickly build the stubs. Stubs are commonly used in combination with dependency injection frameworks or libraries, where the real object is replaced by a stub implementation. Stubs can be useful especially during early development of a system, but since nearly every test requires its own stubs (to test the different states), this quickly becomes repetitive and involves a lot of boilerplate code. Rarely will you find a codebase that uses only stubs for mocking; they are usually paired with other test doubles. # Python test example that creates an application # with a dependency injection framework and overrides # a service with a stub class StubTestCase ( TestBase ): def setUp ( self ) -> None : super ( StubTestCase , self ) . setUp () self . app . container . service_a . override ( StubService ()) def test_service ( self ): service = self . app . container . service_a () self . assertTrue ( isinstance ( service , StubService )) Upsides Do not require any framework, easy to set up. Downsides Can involve rewriting the same code many times, lots of boilerplate. Mocks Fowler describes mocks as pre-programmed objects with expectations which form a specification of the calls they are expected to receive.
In other words, mocks are a replacement object for the dependency that has certain expectations that are placed on it; those expectations might be things like validating a sub-method has been called a certain number of times or that arguments are passed down in a certain way. Mocking frameworks are abundant for every language, with some languages having mocks built into the unit test packages. They make writing unit tests easy and still encourage good unit testing practices. The main difference between a mock and most of the other test doubles is that mocks do behavioral verification , whereas other test doubles do state verification . With behavioral verification, you end up testing that the implementation of the system under test is as you expect, whereas with state verification the implementation is not tested, rather the inputs and the outputs to the system are validated. The major downside to behavioral verification is that it is tied to the implementation. One of the biggest advantages of writing unit tests is that when you make code changes you have confidence that if your unit tests continue to pass, that you are making a relatively safe change. If tests need to be updated every time because the behavior of the method has changed, then you lose that confidence because bugs could also be introduced into the test code. This also increases the development time and can be a source of frustration. For example, let's assume you have a method that you are testing that makes 5 web service calls. With mocks, one of your tests could be to check that those 5 web service calls were made. Sometime later the API is updated and only a single web service call needs to be made. Once the system code is changed, the unit test will fail because it expects 5 calls and not 1. The test needs to be updated, which results in lowered confidence in the change, as well as potentially introduces more areas for bugs to sneak in. Some would argue that in the example above, the unit test is not a good test anyway because it depends on the implementation, and that may be true; but one of the biggest problems with using mocks (and specifically mocking frameworks that allow these verifications), is that it encourages these types of tests to be written. By not using a mock framework that allows this, you never run the risk of writing tests that are validating the implementation. Upsides to Mocking Easy to write. Encourages testable design. Downsides to Mocking Behavioral testing can present problems with maintainability in unit test code. Usually requires a framework to be installed (or if no framework, lots of boilerplate code) Fakes Fake objects actually have working implementations, but usually take some shortcut which may make them not suitable for production. One of the common examples of using a Fake is an in-memory database - typically you want your database to be able to save data somewhere between application runs, but when writing unit tests if you have a fake implementation of your database APIs that are store all data in memory, you can use these for unit tests and not break abstraction as well as still keep your tests fast. Writing a fake does take more time than other test doubles, because they are full implementations, and can have their own suite of unit tests. In this sense though, they increase confidence in your code even more because your test double has been thoroughly tested for bugs before you even use it as a downstream dependency. 
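To make the idea concrete, here is a minimal sketch of such a fake in C# (the IItemRepository interface and the class names are hypothetical examples, not types from this playbook, and it assumes System.Collections.Generic is imported): // A fake repository with a real, working implementation // that stores items in a dictionary instead of a database public interface IItemRepository { void Save ( string id , string value ); string Get ( string id ); } public class InMemoryItemRepository : IItemRepository { private readonly Dictionary < string , string > _store = new Dictionary < string , string > (); public void Save ( string id , string value ) { _store [ id ] = value ; } public string Get ( string id ) { return _store . TryGetValue ( id , out var value ) ? value : null ; } } Because the fake behaves like a real repository, tests can construct the system under test with it and make plain state-based assertions on the values it returns.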
Similarly to mocks, fakes also promote testable design, but unlike mocks they do not require any frameworks to write. Writing a fake is as easy as writing any other implementation class. Fakes can be included in the test code only, but many times they end up being \"promoted\" to the product code, and in some cases can even start off in the product code since it is held to the same standard with full unit tests. Especially if writing a library or an API that other developers can use, providing a fake in the product code means those developers no longer need to write their own mock implementations, further increasing re-usability of code. Upsides to Fakes No framework needed, is just like any other implementation. Encourages testable design. Code can be \"promoted\" to product code, so it is not wasted effort. Downsides to Fakes Takes more time to implement. Best Practices To keep your mocking efficient, consider these best practices to make your code testable, save time and make your test assertions more meaningful. Dependency Injection If you don\u2019t keep testability in mind from the beginning, once you start writing your tests, you might realize you have to do a time-intensive refactor to make the code unit testable. A common problem that can lead to non-testable code in certain languages such as C# is not using dependency injection. Consider using dependency injection so that a mock can easily be injected into your Subject Under Test (SUT) during a unit test. More information on using dependency injection can be found here . Assertions When it comes to assertions in unit tests you want to make sure that you assert the right things, not necessarily lots of things. Some assertions can be inefficient and not give you the confidence you need in the test result. When you are mocking a client or configuration and your method passes the mock result directly as a return value without significant changes, consider not asserting on the return value. Because if you do, you are mainly asserting whether you set up the mock correctly. For a very simple example, look at this class: public class SearchController : ControllerBase { public ISearchClient SearchClient { get ; } public SearchController ( ISearchClient searchClient ) { SearchClient = searchClient ; } public String GetName ( string id ) { return this . SearchClient . GetName ( id ); } } When testing the GetName method, you can set up a mock search client to return a certain value. Then, it\u2019s easy to assert that the return value is, in fact, this value from the mock. mockSearchClient . Setup ( x => x . GetName ( id )) . ReturnsAsync ( \"myResult\" ); var result = searchController . GetName ( id ); Assert . Equal ( \"myResult\" , result . Value ); But now, your method could look like this, and the test would still pass: public String GetName ( string id ) { return \"myResult\" ; } Similarly, if you set up your mock wrong, the test would fail even though the logic inside the method is sound. For efficient assertions that will give you confidence in your SUT, make assertions on your logic, not mock return values. The simple example above doesn\u2019t have a lot of logic, but you want to make sure that it calls the search client to retrieve the result. For this, you can use the verify method to make sure the search client was called using the right parameters even though you don\u2019t care about the result. mockSearchClient . Verify ( mock => mock . GetName ( id ), Times . 
Once ()); This example is kept simple to visualize the principle of making meaningful assertions. In a real world application, your SUT will probably have more logic inside. Pieces of glue code that have as little logic as this example don't always have to be unit tested and might instead be covered by integration tests. If there is more logic and a unit test with mocking is required, you should apply this principle by verifying mock calls and making assertions on the part of the mock result that was modified by your SUT. Callbacks It can be time-consuming to set up mocks if you want to make sure they are being called with the right parameters, especially if the parameters are complex. To make your testing more efficient, consider using callbacks to make assertions on the parameters after a method was called. Often you don\u2019t care about all the parameters but only a few, or even only parts of them if the parameters are also objects. It\u2019s easy to make a small mistake in the creation of the parameter, like missing an attribute that the actual method sets, and then your mock won\u2019t be called, even though you might not care about this attribute at all. To avoid this, you can define only the most relevant parameters to differentiate between method calls and use an any -statement for the others. In this example, the method has a complex search options parameter which would take a lot of time to set up manually. Since you only care about 2 attributes in the search options, you use an any -statement and store the options in a callback for later assertions. var actualOptions = new SearchOptions (); mockSearchClient . Setup ( x => x . Search ( \"[This parameter is most relevant]\" , It . IsAny < SearchOptions > () ) ) . Returns ( mockResults ) . Callback < string , SearchOptions > (( query , searchOptions ) => { actualOptions = searchOptions ; } ); Since you want to test your method logic, you should care only about the parts of the parameter which are influenced by your SUT, in this example, let's say the search mode and the search query type. So, with the variable you stored in the callback, you can make assertions on only these two attributes. Assert . Equal ( SearchMode . All , actualOptions . SearchMode ); Assert . Equal ( SearchQueryType . Full , actualOptions . QueryType ); This makes the test more explicit since it shows which parts of the logic you care about. It\u2019s also more efficient since you don\u2019t have to spend a lot of time setting up the parameters for the mock. Conclusion Using test doubles in unit tests is an essential part of having a healthy test suite. When looking at mocking frameworks and using test doubles, it is important to consider the future implications of integrating with a mocking framework from the start. Sometimes certain features of mocking frameworks seem essential, but usually that is a sign that the code itself is not abstracted enough if it requires a framework. 
If possible, starting without a mocking framework and attempting to create fake implementations will lead to a more healthy code base, but when that is not possible the onus is on the technical leaders of the team to find cases where mocks may be overused, rely too much on implementation details, or end up not testing the right things.","title":"Mocking in Unit Tests"},{"location":"automated-testing/unit-testing/mocking/#mocking-in-unit-tests","text":"One of the key components of writing unit tests is to remove the dependencies your system has and replacing it with an implementation you control. The most common method people use as the replacement for the dependency is a mock, and mocking frameworks exist to help make this process easier. Many frameworks and articles use different meanings for the differences between test doubles. A test double is a generic term for any \"pretend\" object used in place of a real one. This term, as well as others used in this page are the definitions provided by Martin Fowler . The most commonly used form of test double is Mocks, but there are many cases where Mocks perhaps are not the best choice and Fakes should be considered instead.","title":"Mocking in Unit Tests"},{"location":"automated-testing/unit-testing/mocking/#stubs","text":"Stub allows you to have predetermined behavior that substitutes real behavior. The dependency (abstract class or interface) is implemented as a stub with a logic as expected by the client. Stubs can be useful when the clients of the stubs all expect the same set of responses, e.g. you use a third party service. The key concept here is that stubs should never fail a unit or integration test where a mock can. Stubs do not require any sort of framework to run, but are usually supported by mocking frameworks to quickly build the stubs. Stubs are commonly used in combination with a dependency injection frameworks or libraries, where the real object is replaced by a stub implementation. Stubs can be useful especially during early development of a system, but since nearly every test requires its own stubs (to test the different states), this quickly becomes repetitive and involves a lot of boilerplate code. Rarely will you find a codebase that uses only stubs for mocking, they are usually paired with other test doubles. Stubs do not require any sort of framework to run, but are usually supported by mocking frameworks to quickly build the stubs. # Python test example, that creates an application # with a dependency injection framework an overrides # a service with a stub class StubTestCase ( TestBase ): def setUp ( self ) -> None : super ( StubTestCase , self ) . setUp () self . app . container . service_a . override ( StubService ()) def test_service (): service = self . app . container . service_a () self . assertTrue ( isinstance ( service , StubService ))","title":"Stubs"},{"location":"automated-testing/unit-testing/mocking/#upsides","text":"Do not require any framework, easy to set up.","title":"Upsides"},{"location":"automated-testing/unit-testing/mocking/#downsides","text":"Can involve rewriting the same code many times, lots of boilerplate.","title":"Downsides"},{"location":"automated-testing/unit-testing/mocking/#mocks","text":"Fowler describes mocks as pre-programmed objects with expectations which form a specification of the calls they are expected to receive. 
In other words, mocks are a replacement object for the dependency that has certain expectations that are placed on it; those expectations might be things like validating a sub-method has been called a certain number of times or that arguments are passed down in a certain way. Mocking frameworks are abundant for every language, with some languages having mocks built into the unit test packages. They make writing unit tests easy and still encourage good unit testing practices. The main difference between a mock and most of the other test doubles is that mocks do behavioral verification , whereas other test doubles do state verification . With behavioral verification, you end up testing that the implementation of the system under test is as you expect, whereas with state verification the implementation is not tested, rather the inputs and the outputs to the system are validated. The major downside to behavioral verification is that it is tied to the implementation. One of the biggest advantages of writing unit tests is that when you make code changes you have confidence that if your unit tests continue to pass, that you are making a relatively safe change. If tests need to be updated every time because the behavior of the method has changed, then you lose that confidence because bugs could also be introduced into the test code. This also increases the development time and can be a source of frustration. For example, let's assume you have a method that you are testing that makes 5 web service calls. With mocks, one of your tests could be to check that those 5 web service calls were made. Sometime later the API is updated and only a single web service call needs to be made. Once the system code is changed, the unit test will fail because it expects 5 calls and not 1. The test needs to be updated, which results in lowered confidence in the change, as well as potentially introduces more areas for bugs to sneak in. Some would argue that in the example above, the unit test is not a good test anyway because it depends on the implementation, and that may be true; but one of the biggest problems with using mocks (and specifically mocking frameworks that allow these verifications), is that it encourages these types of tests to be written. By not using a mock framework that allows this, you never run the risk of writing tests that are validating the implementation.","title":"Mocks"},{"location":"automated-testing/unit-testing/mocking/#upsides-to-mocking","text":"Easy to write. Encourages testable design.","title":"Upsides to Mocking"},{"location":"automated-testing/unit-testing/mocking/#downsides-to-mocking","text":"Behavioral testing can present problems with maintainability in unit test code. Usually requires a framework to be installed (or if no framework, lots of boilerplate code)","title":"Downsides to Mocking"},{"location":"automated-testing/unit-testing/mocking/#fakes","text":"Fake objects actually have working implementations, but usually take some shortcut which may make them not suitable for production. One of the common examples of using a Fake is an in-memory database - typically you want your database to be able to save data somewhere between application runs, but when writing unit tests if you have a fake implementation of your database APIs that are store all data in memory, you can use these for unit tests and not break abstraction as well as still keep your tests fast. 
Writing a fake does take more time than other test doubles, because they are full implementations, and can have their own suite of unit tests. In this sense though, they increase confidence in your code even more because your test double has been thoroughly tested for bugs before you even use it as a downstream dependency. Similarly to mocks, fakes also promote testable design, but unlike mocks they do not require any frameworks to write. Writing a fake is as easy as writing any other implementation class. Fakes can be included in the test code only, but many times they end up being \"promoted\" to the product code, and in some cases can even start off in the product code since it is held to the same standard with full unit tests. Especially if writing a library or an API that other developers can use, providing a fake in the product code means those developers no longer need to write their own mock implementations, further increasing re-usability of code.","title":"Fakes"},{"location":"automated-testing/unit-testing/mocking/#upsides-to-fakes","text":"No framework needed, is just like any other implementation. Encourages testable design. Code can be \"promoted\" to product code, so it is not wasted effort.","title":"Upsides to Fakes"},{"location":"automated-testing/unit-testing/mocking/#downsides-to-fakes","text":"Takes more time to implement.","title":"Downsides to Fakes"},{"location":"automated-testing/unit-testing/mocking/#best-practices","text":"To keep your mocking efficient, consider these best practices to make your code testable, save time and make your test assertions more meaningful.","title":"Best Practices"},{"location":"automated-testing/unit-testing/mocking/#dependency-injection","text":"If you don\u2019t keep testability in mind from the beginning, once you start writing your tests, you might realize you have to do a time-intensive refactor to make the code unit testable. A common problem that can lead to non-testable code in certain languages such as C# is not using dependency injection. Consider using dependency injection so that a mock can easily be injected into your Subject Under Test (SUT) during a unit test. More information on using dependency injection can be found here .","title":"Dependency Injection"},{"location":"automated-testing/unit-testing/mocking/#assertions","text":"When it comes to assertions in unit tests you want to make sure that you assert the right things, not necessarily lots of things. Some assertions can be inefficient and not give you the confidence you need in the test result. When you are mocking a client or configuration and your method passes the mock result directly as a return value without significant changes, consider not asserting on the return value. Because if you do, you are mainly asserting whether you set up the mock correctly. For a very simple example, look at this class: public class SearchController : ControllerBase { public ISearchClient SearchClient { get ; } public SearchController ( ISearchClient searchClient ) { SearchClient = searchClient ; } public String GetName ( string id ) { return this . SearchClient . GetName ( id ); } } When testing the GetName method, you can set up a mock search client to return a certain value. Then, it\u2019s easy to assert that the return value is, in fact, this value from the mock. mockSearchClient . Setup ( x => x . GetName ( id )) . ReturnsAsync ( \"myResult\" ); var result = searchController . GetName ( id ); Assert . Equal ( \"myResult\" , result . 
Value ); But now, your method could look like this, and the test would still pass: public String GetName ( string id ) { return \"myResult\" ; } Similarly, if you set up your mock wrong, the test would fail even though the logic inside the method is sound. For efficient assertions that will give you confidence in your SUT, make assertions on your logic, not mock return values. The simple example above doesn\u2019t have a lot of logic, but you want to make sure that it calls the search client to retrieve the result. For this, you can use the verify method to make sure the search client was called using the right parameters even though you don\u2019t care about the result. mockSearchClient . Verify ( mock => mock . GetName ( id ), Times . Once ()); This example is kept simple to visualize the principle of making meaningful assertions. In a real world application, your SUT will probably have more logic inside. Pieces of glue code that have as little logic as this example don't always have to be unit tested and might instead be covered by integration tests. If there is more logic and a unit test with mocking is required, you should apply this principle by verifying mock calls and making assertions on the part of the mock result that was modified by your SUT.","title":"Assertions"},{"location":"automated-testing/unit-testing/mocking/#callbacks","text":"It can be time-consuming to set up mocks if you want to make sure they are being called with the right parameters, especially if the parameters are complex. To make your testing more efficient, consider using callbacks to make assertions on the parameters after a method was called. Often you don\u2019t care about all the parameters but only a few, or even only parts of them if the parameters are also objects. It\u2019s easy to make a small mistake in the creation of the parameter, like missing an attribute that the actual method sets, and then your mock won\u2019t be called, even though you might not care about this attribute at all. To avoid this, you can define only the most relevant parameters to differentiate between method calls and use an any -statement for the others. In this example, the method has a complex search options parameter which would take a lot of time to set up manually. Since you only care about 2 attributes in the search options, you use an any -statement and store the options in a callback for later assertions. var actualOptions = new SearchOptions (); mockSearchClient . Setup ( x => x . Search ( \"[This parameter is most relevant]\" , It . IsAny < SearchOptions > () ) ) . Returns ( mockResults ) . Callback < string , SearchOptions > (( query , searchOptions ) => { actualOptions = searchOptions ; } ); Since you want to test your method logic, you should care only about the parts of the parameter which are influenced by your SUT, in this example, let's say the search mode and the search query type. So, with the variable you stored in the callback, you can make assertions on only these two attributes. Assert . Equal ( SearchMode . All , actualOptions . SearchMode ); Assert . Equal ( SearchQueryType . Full , actualOptions . QueryType ); This makes the test more explicit since it shows which parts of the logic you care about. It\u2019s also more efficient since you don\u2019t have to spend a lot of time setting up the parameters for the mock.","title":"Callbacks"},{"location":"automated-testing/unit-testing/mocking/#conclusion","text":"Using test doubles in unit tests is an essential part of having a healthy test suite. 
When looking at mocking frameworks and using test doubles, it is important to consider the future implications of integrating with a mocking framework from the start. Sometimes certain features of mocking frameworks seem essential, but usually that is a sign that the code itself is not abstracted enough if it requires a framework. If possible, starting without a mocking framework and attempting to create fake implementations will lead to a more healthy code base, but when that is not possible the onus is on the technical leaders of the team to find cases where mocks may be overused, rely too much on implementation details, or end up not testing the right things.","title":"Conclusion"},{"location":"automated-testing/unit-testing/tdd-example/","text":"Test-Driven Development Example With this method, rather than writing all your tests up front, you write one test at a time and then switch to write the system code that would make that test pass. It's important to write the bare minimum of code necessary even if it is not actually \"correct\". Once the test passes you can refactor the code to make it maybe make more sense, but again the logic should be simple. As you write more tests, the logic gets more and more complex, but you can continue to make the minimal changes to the system code with confidence because all code that was written is covered. As an example, let's assume we are trying to write a new function that validates a string is a valid password format. The password format should be a string larger than 8 characters containing at least one number. We start with the simplest possible test; one of the easiest ways to do this is to first write tests that validate inputs into the function: // Tests.cs public class Tests { [Fact] public void ValidatePassword_NullInput_Throws () { var s = new MyClass (); Assert . Throws < ArgumentNullException > (() => s . ValidatePassword ( null )); } } // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { return false ; } } If we run this code, the test will fail as no exception was thrown since our code in ValidateString is just a stub. This is ok! This is the \"Red\" part of Red-Green-Refactor. Now we want to move onto the \"Green\" part - making the minimal change required to make this test pass: // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { throw new ArgumentNullException ( nameof ( input )); } } Our tests pass, but this function doesn't really work, it will always throw the exception. That's ok! As we continue to write tests we will slowly add the logic for this function, and it will build on itself, all while guaranteeing our tests continue to pass. We will skip the \"Refactor\" stage at this point because there isn't anything to refactor. Next let's add a test that checks that the function returns false if the password is less than size 8: [Fact] public void ValidatePassword_SmallSize_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abc\" )); } This test will pass as it still only throws an ArgumentNullException , but again, that is an expected failure. Fixing our function should see it pass: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } return false ; } Finally, some code that looks real! Note how it wasn't the test that checked for null that had us add the if statement for the null-check, but rather the subsequent test which unlocked a whole new branch. 
By adding that if statement, we made the bare minimum change necessary in order to get both tests to pass, but we still have work to do. In general, working in the order of adding a negative test first before adding a positive test will ensure that both cases get covered by the code in a way that can get tests. Red-Green-Refactor makes that process super easy by requiring the bare minimum change - since we only want to make the bare minimum changes, we just simply return false here, knowing full well that we will be adding logic later that will expand on this. Speaking of which, let's add the positive test now: [Fact] public void ValidatePassword_RightSize_ReturnsTrue () { var s = new MyClass (); Assert . True ( s . ValidatePassword ( \"abcdefgh1\" )); } Again, this test will fail at the start. One thing to note here if that its important that we try and make our tests resilient to future changes. When we write the code under test, we act very naively, only trying to make the current tests we have pass; when you write tests though, you want to ensure that everything you are doing is a valid case in the future. In this case, we could have written the input string as abcdefgh and when we eventually write the function it would pass, but later when we add tests that validate the function has the rest of the proper inputs it would fail incorrectly. Anyways, the next code change is: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length > 8 ) { return true ; } return false ; } Here we now have a passing test! However, the logic doesn't actually make much sense. We did the bare minimum change which was adding a new condition that passed for longer strings, but thinking forward we know this won't work as soon as we add additional validations. So let's use our first \"Refactor\" step in the Red-Green-Refactor flow! public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } return true ; } That looks better. Note how from a functional perspective, inverting the if-statement does not change what the function returns. This is an important part of the refactor flow, maintaining the logic by doing provably safe refactors, usually through the use of tooling and automated refactors from your IDE. Finally, we have one last requirement for our ValidatePassword method and that is that it needs to check that there is a number in the password. Let's again start with the negative test and validate that with a string with the valid length that the function returns false if we do not pass in a number: [Fact] public void ValidatePassword_ValidLength_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abcdefghij\" )); } Of course the test fails as it is only checking length requirements. Let's fix the method to check for numbers: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } if ( ! input . Any ( char . IsDigit )) { return false ; } return true ; } Here we use a handy LINQ method to check if any of the char s in the string are a digit, and if not, return false. Tests now pass, and we can refactor. 
For readability, why not combine the if statements: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if (( input . Length < 8 ) || ( ! input . Any ( char . IsDigit ))) { return false ; } return true ; } As we refactor this code, we feel 100% confident in the changes we made as we have 100% test coverage which tests both positive and negative scenarios. In this case we actually already have a method that tests the positive case, so our function is done! Now that our code is completely tested we can make all sorts of changes and still have confidence that it works. For example, if we wanted to change the implementation of the method to use regex, all of our tests would still pass and still be valid. That is it! We finished writing our function, we have 100% test coverage, and if we had done something a little more complex, we are guaranteed that whatever we designed is already testable since the tests were written first!","title":"Test-Driven Development Example"},{"location":"automated-testing/unit-testing/tdd-example/#test-driven-development-example","text":"With this method, rather than writing all your tests up front, you write one test at a time and then switch to write the system code that would make that test pass. It's important to write the bare minimum of code necessary even if it is not actually \"correct\". Once the test passes you can refactor the code to make it maybe make more sense, but again the logic should be simple. As you write more tests, the logic gets more and more complex, but you can continue to make the minimal changes to the system code with confidence because all code that was written is covered. As an example, let's assume we are trying to write a new function that validates a string is a valid password format. The password format should be a string larger than 8 characters containing at least one number. We start with the simplest possible test; one of the easiest ways to do this is to first write tests that validate inputs into the function: // Tests.cs public class Tests { [Fact] public void ValidatePassword_NullInput_Throws () { var s = new MyClass (); Assert . Throws < ArgumentNullException > (() => s . ValidatePassword ( null )); } } // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { return false ; } } If we run this code, the test will fail as no exception was thrown since our code in ValidateString is just a stub. This is ok! This is the \"Red\" part of Red-Green-Refactor. Now we want to move onto the \"Green\" part - making the minimal change required to make this test pass: // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { throw new ArgumentNullException ( nameof ( input )); } } Our tests pass, but this function doesn't really work, it will always throw the exception. That's ok! As we continue to write tests we will slowly add the logic for this function, and it will build on itself, all while guaranteeing our tests continue to pass. We will skip the \"Refactor\" stage at this point because there isn't anything to refactor. Next let's add a test that checks that the function returns false if the password is less than size 8: [Fact] public void ValidatePassword_SmallSize_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abc\" )); } This test will pass as it still only throws an ArgumentNullException , but again, that is an expected failure. 
Fixing our function should see it pass: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } return false ; } Finally, some code that looks real! Note how it wasn't the test that checked for null that had us add the if statement for the null-check, but rather the subsequent test which unlocked a whole new branch. By adding that if statement, we made the bare minimum change necessary in order to get both tests to pass, but we still have work to do. In general, working in the order of adding a negative test first before adding a positive test will ensure that both cases get covered by the code in a way that can get tests. Red-Green-Refactor makes that process super easy by requiring the bare minimum change - since we only want to make the bare minimum changes, we just simply return false here, knowing full well that we will be adding logic later that will expand on this. Speaking of which, let's add the positive test now: [Fact] public void ValidatePassword_RightSize_ReturnsTrue () { var s = new MyClass (); Assert . True ( s . ValidatePassword ( \"abcdefgh1\" )); } Again, this test will fail at the start. One thing to note here if that its important that we try and make our tests resilient to future changes. When we write the code under test, we act very naively, only trying to make the current tests we have pass; when you write tests though, you want to ensure that everything you are doing is a valid case in the future. In this case, we could have written the input string as abcdefgh and when we eventually write the function it would pass, but later when we add tests that validate the function has the rest of the proper inputs it would fail incorrectly. Anyways, the next code change is: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length > 8 ) { return true ; } return false ; } Here we now have a passing test! However, the logic doesn't actually make much sense. We did the bare minimum change which was adding a new condition that passed for longer strings, but thinking forward we know this won't work as soon as we add additional validations. So let's use our first \"Refactor\" step in the Red-Green-Refactor flow! public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } return true ; } That looks better. Note how from a functional perspective, inverting the if-statement does not change what the function returns. This is an important part of the refactor flow, maintaining the logic by doing provably safe refactors, usually through the use of tooling and automated refactors from your IDE. Finally, we have one last requirement for our ValidatePassword method and that is that it needs to check that there is a number in the password. Let's again start with the negative test and validate that with a string with the valid length that the function returns false if we do not pass in a number: [Fact] public void ValidatePassword_ValidLength_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abcdefghij\" )); } Of course the test fails as it is only checking length requirements. Let's fix the method to check for numbers: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } if ( ! 
input . Any ( char . IsDigit )) { return false ; } return true ; } Here we use a handy LINQ method to check if any of the char s in the string are a digit, and if not, return false. Tests now pass, and we can refactor. For readability, why not combine the if statements: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if (( input . Length < 8 ) || ( ! input . Any ( char . IsDigit ))) { return false ; } return true ; } As we refactor this code, we feel 100% confident in the changes we made as we have 100% test coverage which tests both positive and negative scenarios. In this case we actually already have a method that tests the positive case, so our function is done! Now that our code is completely tested we can make all sorts of changes and still have confidence that it works. For example, if we wanted to change the implementation of the method to use regex, all of our tests would still pass and still be valid. That is it! We finished writing our function, we have 100% test coverage, and if we had done something a little more complex, we are guaranteed that whatever we designed is already testable since the tests were written first!","title":"Test-Driven Development Example"},{"location":"automated-testing/unit-testing/why-unit-tests/","text":"Why Unit Tests It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we bother writing them? Reduce Costs There is no question that the later a bug is found, the more expensive it is to fix; especially so if the bug makes it into production. A 2008 research study by IBM estimates that a bug caught in production could cost 6 times as much as if it was caught during implementation. Increase Developer Confidence Many changes that developers make are not big features or something that requires an entire testing suite. A strong unit test suite helps increase the confidence of the developer that their change is not going to cause any downstream bugs. Having unit tests also helps with making safe, mechanical refactors that are provably safe; using things like refactoring tools to do mechanical refactoring and running unit tests that cover the refactored code should be enough to increase confidence in the commit. Speed Up Development Unit tests take time to write, but they also speed up development? While this may seem like an oxymoron, it is one of the strengths of a unit testing suite - over time it continues to grow and evolve until the tests become an essential part of the developer workflow. If the only testing available to a developer is a long-running system test, integration tests that require a deployment, or manual testing, it will increase the amount of time taken to write a feature. These types of tests should be a part of the \"Outer loop\"; tests that may take some time to run and validate more than just the code you are writing. Usually these types of outer loop tests get run at the PR stage or even later during merges into branches. The Developer Inner Loop is the process that developers go through as they are authoring code. This varies from developer to developer and language to language but typically is something like code -> build -> run -> repeat. When unit tests are inserted into the inner loop, developers can get early feedback and results from the code they are writing. 
Since unit tests execute really quickly, running tests shouldn't be seen as a barrier to entry for this loop. Tooling such as Visual Studio Live Unit Testing also help to shorten the inner loop even more. Documentation as Code Writing unit tests is a great way to show how the units of code you are writing are supposed to be used. In some ways, unit tests are better than any documentation or samples because they are (or at least should be) executed with every build so there is confidence that they are not out of date. Unit tests also should be so simple that they are easy to follow.","title":"Why Unit Tests"},{"location":"automated-testing/unit-testing/why-unit-tests/#why-unit-tests","text":"It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we bother writing them?","title":"Why Unit Tests"},{"location":"automated-testing/unit-testing/why-unit-tests/#reduce-costs","text":"There is no question that the later a bug is found, the more expensive it is to fix; especially so if the bug makes it into production. A 2008 research study by IBM estimates that a bug caught in production could cost 6 times as much as if it was caught during implementation.","title":"Reduce Costs"},{"location":"automated-testing/unit-testing/why-unit-tests/#increase-developer-confidence","text":"Many changes that developers make are not big features or something that requires an entire testing suite. A strong unit test suite helps increase the confidence of the developer that their change is not going to cause any downstream bugs. Having unit tests also helps with making safe, mechanical refactors that are provably safe; using things like refactoring tools to do mechanical refactoring and running unit tests that cover the refactored code should be enough to increase confidence in the commit.","title":"Increase Developer Confidence"},{"location":"automated-testing/unit-testing/why-unit-tests/#speed-up-development","text":"Unit tests take time to write, but they also speed up development? While this may seem like an oxymoron, it is one of the strengths of a unit testing suite - over time it continues to grow and evolve until the tests become an essential part of the developer workflow. If the only testing available to a developer is a long-running system test, integration tests that require a deployment, or manual testing, it will increase the amount of time taken to write a feature. These types of tests should be a part of the \"Outer loop\"; tests that may take some time to run and validate more than just the code you are writing. Usually these types of outer loop tests get run at the PR stage or even later during merges into branches. The Developer Inner Loop is the process that developers go through as they are authoring code. This varies from developer to developer and language to language but typically is something like code -> build -> run -> repeat. When unit tests are inserted into the inner loop, developers can get early feedback and results from the code they are writing. Since unit tests execute really quickly, running tests shouldn't be seen as a barrier to entry for this loop. Tooling such as Visual Studio Live Unit Testing also help to shorten the inner loop even more.","title":"Speed Up Development"},{"location":"automated-testing/unit-testing/why-unit-tests/#documentation-as-code","text":"Writing unit tests is a great way to show how the units of code you are writing are supposed to be used. 
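For instance, a single well-named xUnit-style test (an illustrative sketch that reuses the password-validation idea from the TDD example page; the PasswordValidator class name is hypothetical) reads almost like a usage sample: [Fact] public void ValidatePassword_MissingDigit_ReturnsFalse () { // Arrange the validator exactly the way a caller would var validator = new PasswordValidator (); // A long enough password with no digit should be rejected Assert . False ( validator . ValidatePassword ( \"abcdefghij\" )); } The test name states the scenario and the expected outcome, and the body shows exactly how the type is constructed and called.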
In some ways, unit tests are better than any documentation or samples because they are (or at least should be) executed with every build so there is confidence that they are not out of date. Unit tests also should be so simple that they are easy to follow.","title":"Documentation as Code"},{"location":"code-reviews/","text":"Code Reviews Developers working on projects should conduct peer code reviews on every pull request (or check-in to a shared branch). Goals Code review is a way to have a conversation about the code where participants will: Improve code quality by identifying and removing defects before they can be introduced into shared code branches. Learn and grow by having others review the code, we get exposed to unfamiliar design patterns or languages among other topics, and even break some bad habits. Shared understanding between the developers over the project's code. Resources Code review tools Google's Engineering Practices documentation: How to do a code review Best Kept Secrets of Peer Code Review","title":"Code Reviews"},{"location":"code-reviews/#code-reviews","text":"Developers working on projects should conduct peer code reviews on every pull request (or check-in to a shared branch).","title":"Code Reviews"},{"location":"code-reviews/#goals","text":"Code review is a way to have a conversation about the code where participants will: Improve code quality by identifying and removing defects before they can be introduced into shared code branches. Learn and grow by having others review the code, we get exposed to unfamiliar design patterns or languages among other topics, and even break some bad habits. Shared understanding between the developers over the project's code.","title":"Goals"},{"location":"code-reviews/#resources","text":"Code review tools Google's Engineering Practices documentation: How to do a code review Best Kept Secrets of Peer Code Review","title":"Resources"},{"location":"code-reviews/faq/","text":"FAQ This is a list of questions / frequently occurring issues when working with code reviews and answers how you can possibly tackle them. What Makes a Code Review Different from a PR? A pull request (PR) is a way to notify a task is finished and ready to be merged into the main working branch (source of truth). A code review is having someone go over the code in a PR and validate it before it is merged, but, in general, code reviews can take place outside PRs too. Code Review Pull Request Source code focused Intended to enhance and enable code reviews. Includes both source code but can have a broader scope (e.g., docs, integration tests, compiles) Intended for early feedback before submitting a PR Not intended for early feedback . Created when author is ready to merge Usually a synchronous review with faster feedback cycles (draft PRs as an exception). Examples: scheduled meetings, over-the-shoulder review, pair programming Usually a tool assisted asynchronous review but can be elevated to a synchronous meeting when needed Why do we Need Code Reviews? Our peer code reviews are structured around best practices, to find specific kinds of errors. Much like you would still run a linter over mobbed code, you would still ask someone to make the last pass to make sure the code conforms to expected standards and avoids common pitfalls. PRs are Too Large, How can we Fix This? Make sure you size the work items into small clear chunks, so the reviewer will be able to understand the code on their own. 
The team is instructed to commit early, before the full product backlog item / user story is complete, but rather when an individual item is done. If the work would result in an incomplete feature, make sure it can be turned off, until the full feature is delivered. More information can be found in Pull Requests - Size Guidance . How can we Expedite Code Reviews? Slow code reviews might cause delays in delivering features and cause frustration amongst team members. Possible Actions you can Take Add a rule for PR turnaround time to your work agreement. Set up a slot after the standup to go through pending PRs and assign the ones that are inactive. Dedicate a PR review manager who will be responsible to keep things flowing by assigning or notifying people when PR got stale. Use tools to better indicate stale reviews - Customize ADO - Task Boards . Which Tools can I use to Review a Complex PR? Checkout the Tools for help on how to perform reviews out of Visual Studio or Visual Studio Code. How can we Enforce the Code Review Policies? By configuring Branch Policies , you can easily enforce code reviews rules. We Pair or Mob. How Should This Reflect in our Code Reviews? There are two ways to perform a code review: Pair - Someone outside the pair should perform the code review. One of the other major benefits of code reviews is spreading knowledge about the code base to other members of the team that don't usually work in the part of the codebase under review. Mob - A member of the mob who spent less (or no) time at the keyboard should perform the code review.","title":"FAQ"},{"location":"code-reviews/faq/#faq","text":"This is a list of questions / frequently occurring issues when working with code reviews and answers how you can possibly tackle them.","title":"FAQ"},{"location":"code-reviews/faq/#what-makes-a-code-review-different-from-a-pr","text":"A pull request (PR) is a way to notify a task is finished and ready to be merged into the main working branch (source of truth). A code review is having someone go over the code in a PR and validate it before it is merged, but, in general, code reviews can take place outside PRs too. Code Review Pull Request Source code focused Intended to enhance and enable code reviews. Includes both source code but can have a broader scope (e.g., docs, integration tests, compiles) Intended for early feedback before submitting a PR Not intended for early feedback . Created when author is ready to merge Usually a synchronous review with faster feedback cycles (draft PRs as an exception). Examples: scheduled meetings, over-the-shoulder review, pair programming Usually a tool assisted asynchronous review but can be elevated to a synchronous meeting when needed","title":"What Makes a Code Review Different from a PR?"},{"location":"code-reviews/faq/#why-do-we-need-code-reviews","text":"Our peer code reviews are structured around best practices, to find specific kinds of errors. Much like you would still run a linter over mobbed code, you would still ask someone to make the last pass to make sure the code conforms to expected standards and avoids common pitfalls.","title":"Why do we Need Code Reviews?"},{"location":"code-reviews/faq/#prs-are-too-large-how-can-we-fix-this","text":"Make sure you size the work items into small clear chunks, so the reviewer will be able to understand the code on their own. The team is instructed to commit early, before the full product backlog item / user story is complete, but rather when an individual item is done. 
If the work would result in an incomplete feature, make sure it can be turned off, until the full feature is delivered. More information can be found in Pull Requests - Size Guidance .","title":"PRs are Too Large, How can we Fix This?"},{"location":"code-reviews/faq/#how-can-we-expedite-code-reviews","text":"Slow code reviews might cause delays in delivering features and cause frustration amongst team members.","title":"How can we Expedite Code Reviews?"},{"location":"code-reviews/faq/#possible-actions-you-can-take","text":"Add a rule for PR turnaround time to your work agreement. Set up a slot after the standup to go through pending PRs and assign the ones that are inactive. Dedicate a PR review manager who will be responsible to keep things flowing by assigning or notifying people when PR got stale. Use tools to better indicate stale reviews - Customize ADO - Task Boards .","title":"Possible Actions you can Take"},{"location":"code-reviews/faq/#which-tools-can-i-use-to-review-a-complex-pr","text":"Checkout the Tools for help on how to perform reviews out of Visual Studio or Visual Studio Code.","title":"Which Tools can I use to Review a Complex PR?"},{"location":"code-reviews/faq/#how-can-we-enforce-the-code-review-policies","text":"By configuring Branch Policies , you can easily enforce code reviews rules.","title":"How can we Enforce the Code Review Policies?"},{"location":"code-reviews/faq/#we-pair-or-mob-how-should-this-reflect-in-our-code-reviews","text":"There are two ways to perform a code review: Pair - Someone outside the pair should perform the code review. One of the other major benefits of code reviews is spreading knowledge about the code base to other members of the team that don't usually work in the part of the codebase under review. Mob - A member of the mob who spent less (or no) time at the keyboard should perform the code review.","title":"We Pair or Mob. How Should This Reflect in our Code Reviews?"},{"location":"code-reviews/inclusion-in-code-review/","text":"Inclusion in Code Review Below are some points which emphasize why inclusivity in code reviews is important: Code reviews are an important part of our job as software professionals. In ISE we work with cross cultural teams from across the globe. How we communicate affects team morale. Inclusive code reviews welcome new developers and make them comfortable with the team. Rude or personal attacks doing code reviews alienate - people can unknowingly make rude comments when reviewing pull requests (PRs). Types and Examples of Non-Inclusive Code Review Behavior Inequitable review assignments. Example: Assigning most reviews to few people and dismissing some members of the team altogether. Negative interpersonal interactions. Example: Long arguments over subjective topics such as code style. Biased decision making. Example: Comments about the developer and not the code. Assuming code from developer X will always be good and hence not reviewing it properly and vice versa. Examples of Inclusive Code Reviews Anyone and everyone in the team should be assigned PRs to review. Reviewer should be clear about what is an opinion, their personal preference, best practice or a fact. Arguments over personal preferences and opinions are mostly avoidable. Using inclusive language and tone in the code review comments. For example, being suggestive rather being prescriptive in the review comments is a good way to get the point across the table. 
It's a good practice for the author of a PR to thank the reviewer for the review, when they have contributed in improving the code or you have learnt something new. Using the sandwich method for recommending a code change to a new developer or a new customer: Sandwich the suggestion between 2 compliments. For example: \"Great work so far, but I would recommend a few changes here. Btw, I loved the use of XYZ here, nice job!\" Guidelines for the Author Aim to write a code that is easy to read, review and maintain. It\u2019s important to ensure that whoever is looking at the code, whether that be the reviewer or a future engineer, can understand the motivations and how your code achieves its goals. Proactively asking for targeted help or feedback. Respond clearly to questions asked by the reviewers. Avoid huge commits by submitting incremental changes. Commits which are large and contain changes to multiple files will lead to unfair review of the code. Biased behavior of reviewers may kick in while reviewing such PRs. For e.g. a huge commit from a senior developer may get approved without thorough review whereas a huge commit from a junior developer may never get reviewed and approved. Guidelines for the Reviewer Assume positive intent from the author. Write clear and elaborate comments. Identify subjectivity, choice of coding and best practice. It is good to discuss coding style and subjective coding choices in some other forum and not in the PR. A PR should not become a ground to discuss subjective coding choices and having long arguments over it. If you do not understand the code properly, refrain from commenting e.g., \"This code is incomprehensible\". It is better to have a call with the author and get a basic understanding of their work. Be suggestive and not prescriptive. A reviewer should suggest changes and not prescribe changes, let the author decide if they really want to accept the changes proposed. Culture and Code Reviews We in ISE, may come across situations in which code reviews are not ideal and often we are observing non inclusive code review behaviors. Its important to be aware of the fact that culture and communication style of a particular geography also influences how people interact over pull requests. In such cases, assuming positive intent of the author and reviewer is a good start to start analyzing quality of code reviews. Dealing with the Impostor Phenomenon Impostor phenomenon is a psychological pattern in which an individual doubts their skills, talents, or accomplishments and has a persistent internalized fear of being exposed as a \"fraud\" - Wikipedia . Someone experiencing impostor phenomenon may find submitting code for a review particularly stressful. It is important to realize that everybody can have meaningful contributions and not to let the perceived weaknesses prevent contributions. Some tips for overcoming the impostor phenomenon for authors: Review the guidelines highlighted above and make sure your code change adhere to them. Ask for help from a colleague - pair program with an experienced colleague that you can learn from. Some tips for overcoming the impostor phenomenon for reviewers: Anyone can have valuable insights. A fresh new pair of eyes are always welcome. Study the review until you have clearly understood it, check the corner cases and look for ways to improve it. If something is not clear, a simple specific question should be asked. If you have learnt something, you can always compliment the author. 
If possible, pair with someone to review the code so that you can establish a personal connection and have a more profound discussion about the code. Tools Below are some tools which may help in establishing inclusive code review culture within our teams. Anonymous GitHub Blind Code Reviews Gitmask inclusivelint","title":"Inclusion in Code Review"},{"location":"code-reviews/inclusion-in-code-review/#inclusion-in-code-review","text":"Below are some points which emphasize why inclusivity in code reviews is important: Code reviews are an important part of our job as software professionals. In ISE we work with cross cultural teams from across the globe. How we communicate affects team morale. Inclusive code reviews welcome new developers and make them comfortable with the team. Rude or personal attacks doing code reviews alienate - people can unknowingly make rude comments when reviewing pull requests (PRs).","title":"Inclusion in Code Review"},{"location":"code-reviews/inclusion-in-code-review/#types-and-examples-of-non-inclusive-code-review-behavior","text":"Inequitable review assignments. Example: Assigning most reviews to few people and dismissing some members of the team altogether. Negative interpersonal interactions. Example: Long arguments over subjective topics such as code style. Biased decision making. Example: Comments about the developer and not the code. Assuming code from developer X will always be good and hence not reviewing it properly and vice versa.","title":"Types and Examples of Non-Inclusive Code Review Behavior"},{"location":"code-reviews/inclusion-in-code-review/#examples-of-inclusive-code-reviews","text":"Anyone and everyone in the team should be assigned PRs to review. Reviewer should be clear about what is an opinion, their personal preference, best practice or a fact. Arguments over personal preferences and opinions are mostly avoidable. Using inclusive language and tone in the code review comments. For example, being suggestive rather being prescriptive in the review comments is a good way to get the point across the table. It's a good practice for the author of a PR to thank the reviewer for the review, when they have contributed in improving the code or you have learnt something new. Using the sandwich method for recommending a code change to a new developer or a new customer: Sandwich the suggestion between 2 compliments. For example: \"Great work so far, but I would recommend a few changes here. Btw, I loved the use of XYZ here, nice job!\"","title":"Examples of Inclusive Code Reviews"},{"location":"code-reviews/inclusion-in-code-review/#guidelines-for-the-author","text":"Aim to write a code that is easy to read, review and maintain. It\u2019s important to ensure that whoever is looking at the code, whether that be the reviewer or a future engineer, can understand the motivations and how your code achieves its goals. Proactively asking for targeted help or feedback. Respond clearly to questions asked by the reviewers. Avoid huge commits by submitting incremental changes. Commits which are large and contain changes to multiple files will lead to unfair review of the code. Biased behavior of reviewers may kick in while reviewing such PRs. For e.g. 
a huge commit from a senior developer may get approved without thorough review whereas a huge commit from a junior developer may never get reviewed and approved.","title":"Guidelines for the Author"},{"location":"code-reviews/inclusion-in-code-review/#guidelines-for-the-reviewer","text":"Assume positive intent from the author. Write clear and elaborate comments. Identify subjectivity, choice of coding and best practice. It is good to discuss coding style and subjective coding choices in some other forum and not in the PR. A PR should not become a ground to discuss subjective coding choices and having long arguments over it. If you do not understand the code properly, refrain from commenting e.g., \"This code is incomprehensible\". It is better to have a call with the author and get a basic understanding of their work. Be suggestive and not prescriptive. A reviewer should suggest changes and not prescribe changes, let the author decide if they really want to accept the changes proposed.","title":"Guidelines for the Reviewer"},{"location":"code-reviews/inclusion-in-code-review/#culture-and-code-reviews","text":"We in ISE, may come across situations in which code reviews are not ideal and often we are observing non inclusive code review behaviors. Its important to be aware of the fact that culture and communication style of a particular geography also influences how people interact over pull requests. In such cases, assuming positive intent of the author and reviewer is a good start to start analyzing quality of code reviews.","title":"Culture and Code Reviews"},{"location":"code-reviews/inclusion-in-code-review/#dealing-with-the-impostor-phenomenon","text":"Impostor phenomenon is a psychological pattern in which an individual doubts their skills, talents, or accomplishments and has a persistent internalized fear of being exposed as a \"fraud\" - Wikipedia . Someone experiencing impostor phenomenon may find submitting code for a review particularly stressful. It is important to realize that everybody can have meaningful contributions and not to let the perceived weaknesses prevent contributions. Some tips for overcoming the impostor phenomenon for authors: Review the guidelines highlighted above and make sure your code change adhere to them. Ask for help from a colleague - pair program with an experienced colleague that you can learn from. Some tips for overcoming the impostor phenomenon for reviewers: Anyone can have valuable insights. A fresh new pair of eyes are always welcome. Study the review until you have clearly understood it, check the corner cases and look for ways to improve it. If something is not clear, a simple specific question should be asked. If you have learnt something, you can always compliment the author. If possible, pair with someone to review the code so that you can establish a personal connection and have a more profound discussion about the code.","title":"Dealing with the Impostor Phenomenon"},{"location":"code-reviews/inclusion-in-code-review/#tools","text":"Below are some tools which may help in establishing inclusive code review culture within our teams. 
Anonymous GitHub Blind Code Reviews Gitmask inclusivelint","title":"Tools"},{"location":"code-reviews/pull-request-template/","text":"Pull Request Template # [Work Item ID](./link-to-the-work-item) For more information about how to contribute to this repo, visit this [ page ]( https://github.com/microsoft/code-with-engineering-playbook/blob/main/CONTRIBUTING.md ) ## Description --- > Should include a concise description of the changes (bug or feature), it's impact, along with a summary of the solution ## Steps to Reproduce Bug and Validate Solution --- > Only applicable if the work is to address a bug. Please remove this section if the work is for a feature or story > Provide details on the environment the bug is found, and detailed steps to recreate the bug. > This should be detailed enough for a team member to confirm that the bug no longer occurs ## PR Checklist --- > Use the check-list below to ensure your branch is ready for PR. If the item is not applicable, leave it blank. - [ ] I have updated the documentation accordingly. - [ ] I have added tests to cover my changes. - [ ] All new and existing tests passed. - [ ] My code follows the code style of this project. - [ ] I ran the lint checks which produced no new errors nor warnings for my changes. - [ ] I have checked to ensure there aren't other open Pull Requests for the same update/change. ## Does This Introduce a Breaking Change? --- - [ ] Yes - [ ] No > If this introduces a breaking change, please describe the impact and migration path for existing applications below. ## Testing --- > - Instructions for testing and validation of your code: > - What OS was used for testing. > - Which test sets were used. > - Description of test scenarios that you have tried. ## Any Relevant Logs or Outputs --- > - Use this section to attach pictures that demonstrates your changes working / healthy > - If you are printing something show a screenshot > - When you want to share long logs upload to: > `(StorageAccount)/pr-support/attachments/(PR Number)/(yourFiles) using [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/)` or [portal.azure.com](https://portal.azure.com) and insert the link here. ## Other Information or Known Dependencies --- > - Any other information or known dependencies that is important to this PR. > - TODO that are to be done after this PR.","title":"Pull Request Template"},{"location":"code-reviews/pull-request-template/#pull-request-template","text":"# [Work Item ID](./link-to-the-work-item) For more information about how to contribute to this repo, visit this [ page ]( https://github.com/microsoft/code-with-engineering-playbook/blob/main/CONTRIBUTING.md ) ## Description --- > Should include a concise description of the changes (bug or feature), it's impact, along with a summary of the solution ## Steps to Reproduce Bug and Validate Solution --- > Only applicable if the work is to address a bug. Please remove this section if the work is for a feature or story > Provide details on the environment the bug is found, and detailed steps to recreate the bug. > This should be detailed enough for a team member to confirm that the bug no longer occurs ## PR Checklist --- > Use the check-list below to ensure your branch is ready for PR. If the item is not applicable, leave it blank. - [ ] I have updated the documentation accordingly. - [ ] I have added tests to cover my changes. - [ ] All new and existing tests passed. - [ ] My code follows the code style of this project. 
- [ ] I ran the lint checks which produced no new errors nor warnings for my changes. - [ ] I have checked to ensure there aren't other open Pull Requests for the same update/change. ## Does This Introduce a Breaking Change? --- - [ ] Yes - [ ] No > If this introduces a breaking change, please describe the impact and migration path for existing applications below. ## Testing --- > - Instructions for testing and validation of your code: > - What OS was used for testing. > - Which test sets were used. > - Description of test scenarios that you have tried. ## Any Relevant Logs or Outputs --- > - Use this section to attach pictures that demonstrates your changes working / healthy > - If you are printing something show a screenshot > - When you want to share long logs upload to: > `(StorageAccount)/pr-support/attachments/(PR Number)/(yourFiles) using [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/)` or [portal.azure.com](https://portal.azure.com) and insert the link here. ## Other Information or Known Dependencies --- > - Any other information or known dependencies that is important to this PR. > - TODO that are to be done after this PR.","title":"Pull Request Template"},{"location":"code-reviews/pull-requests/","text":"Pull Requests Changes to any main codebase - main branch in Git repository, for example - must be done using pull requests (PR). Pull requests enable: Code inspection - see Code Reviews Running automated qualification of the code Linters Compilation Unit tests Integration tests etc. The requirements of pull requests can and should be enforced by policies, which can be set in the most modern version control and work item tracking systems. See Evidence and Measures section for more information. General Process Implement changes based on the well-defined description and acceptance criteria of the task at hand Then, before creating a new pull request: * Make sure the code conforms with the agreed coding conventions * This can be partially automated using linters * Ensure the code compiles and runs without errors or warnings * Write and/or update tests to cover the changes and make sure all new and existing tests pass * Write and/or update the documentation to match the changes Once convinced the criteria above are met, create and submit a new pull request adhering to the pull request template Follow the code review process to merge the changes to the main codebase The following diagram illustrates this approach. sequenceDiagram New branch->>+Pull request: New PR creation Pull request->>+Code review: Review process Code review->>+Pull request: Code updates Pull request->>+New branch: Merge Pull Request Pull request-->>-New branch: Delete branch Pull request ->>+ Main branch: Merge after completion New branch->>+Main branch: Goal of the Pull request Size Guidance We should always aim to keep pull requests small. Small PRs have multiple advantages: They are easier to review; a clear benefit for the reviewers. They are easier to deploy; this is aligned with the strategy of release fast and release often. Minimizes possible conflicts and stale PRs. However, we should keep PRs focused - for example around a functional feature, optimization or code readability and avoid having PRs that include code that is without context or loosely coupled. There is no right size, but keep in mind that a code review is a collaborative process, a big PRs could be difficult and therefore slower to review. 
We should always strive to make PRs as small as possible while still adding value. Best Practices Beyond the size, remember that every PR should: be consistent, not break the build, and include related tests as part of the PR. Being consistent means that all the changes included in the PR should aim to solve one goal (e.g., one user story) and be intrinsically related. Think of this as the Single-responsibility principle in terms of the whole project: the PR should have only one reason to change the project. Start small; it is easier to create a small PR from the start than to break up a bigger one. Depending on the \"cause\" of an unavoidably large change, there are several strategies to keep PRs small: break the PR into self-contained changes that still add value, release features that are hidden (see feature flags, feature toggling or canary releases), or break the PR into different layers (for example, using design patterns like MVC or Observer/Subject). No matter the strategy, keep each PR focused. Pull Request Description Well-written PR descriptions help maintain a clean, well-structured change history. While not every team needs to conform to the same specification, it is important that the convention is agreed upon at the start of the project. One popular specification for open-source projects and others is the Conventional Commits specification , which is structured as: <type>[optional scope]: <description> [optional body] [optional footer] The <type> in this message can be selected from a list of types defined by the team, but many projects use the list of commit types from the Angular open-source project . It should be clear that scope , body and footer elements are optional , but having a required type and short description enables the features mentioned above. See also Pull Request Template Resources Writing a great pull request description Review code-with pull requests (Azure DevOps) Collaborating with issues and pull requests (GitHub) Google approach to PR size Feature Flags Facebook approach to hidden features Conventional Commits specification Angular Commit types","title":"Pull Requests"},{"location":"code-reviews/pull-requests/#pull-requests","text":"Changes to any main codebase - the main branch in a Git repository, for example - must be made using pull requests (PRs). Pull requests enable: Code inspection - see Code Reviews Running automated qualification of the code Linters Compilation Unit tests Integration tests etc. The requirements of pull requests can and should be enforced by policies, which can be set in most modern version control and work item tracking systems. See the Evidence and Measures section for more information.","title":"Pull Requests"},{"location":"code-reviews/pull-requests/#general-process","text":"Implement changes based on the well-defined description and acceptance criteria of the task at hand Then, before creating a new pull request: * Make sure the code conforms with the agreed coding conventions * This can be partially automated using linters * Ensure the code compiles and runs without errors or warnings * Write and/or update tests to cover the changes and make sure all new and existing tests pass * Write and/or update the documentation to match the changes Once convinced the criteria above are met, create and submit a new pull request adhering to the pull request template Follow the code review process to merge the changes to the main codebase The following diagram illustrates this approach. 
sequenceDiagram New branch->>+Pull request: New PR creation Pull request->>+Code review: Review process Code review->>+Pull request: Code updates Pull request->>+New branch: Merge Pull Request Pull request-->>-New branch: Delete branch Pull request ->>+ Main branch: Merge after completion New branch->>+Main branch: Goal of the Pull request","title":"General Process"},{"location":"code-reviews/pull-requests/#size-guidance","text":"We should always aim to keep pull requests small. Small PRs have multiple advantages: They are easier to review; a clear benefit for the reviewers. They are easier to deploy; this is aligned with the strategy of release fast and release often. They minimize possible conflicts and stale PRs. However, we should keep PRs focused - for example, around a functional feature, an optimization or code readability - and avoid PRs that include code that is without context or loosely coupled. There is no right size, but keep in mind that a code review is a collaborative process; a big PR can be difficult and therefore slower to review. We should always strive to make PRs as small as possible while still adding value.","title":"Size Guidance"},{"location":"code-reviews/pull-requests/#best-practices","text":"Beyond the size, remember that every PR should: be consistent, not break the build, and include related tests as part of the PR. Being consistent means that all the changes included in the PR should aim to solve one goal (e.g., one user story) and be intrinsically related. Think of this as the Single-responsibility principle in terms of the whole project: the PR should have only one reason to change the project. Start small; it is easier to create a small PR from the start than to break up a bigger one. Depending on the \"cause\" of an unavoidably large change, there are several strategies to keep PRs small: break the PR into self-contained changes that still add value, release features that are hidden (see feature flags, feature toggling or canary releases), or break the PR into different layers (for example, using design patterns like MVC or Observer/Subject). No matter the strategy, keep each PR focused.","title":"Best Practices"},{"location":"code-reviews/pull-requests/#pull-request-description","text":"Well-written PR descriptions help maintain a clean, well-structured change history. While not every team needs to conform to the same specification, it is important that the convention is agreed upon at the start of the project. One popular specification for open-source projects and others is the Conventional Commits specification , which is structured as: <type>[optional scope]: <description> [optional body] [optional footer] The <type> in this message can be selected from a list of types defined by the team, but many projects use the list of commit types from the Angular open-source project . It should be clear that scope , body and footer elements are optional , but having a required type and short description enables the features mentioned above. 
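For illustration, a PR title and description following the Conventional Commits structure might look like the sketch below; the type, scope, description and footer are hypothetical values invented for this example, not values prescribed by the playbook.

```text
feat(parser): add support for nested pipeline templates

Reviewers could not trace template references across repositories.
This change resolves nested template paths before the pipeline is expanded.

Refs: <work item or issue id>
```

Only the first line (a type plus a short description) is required; the body and footer are optional but give reviewers and tooling extra context.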
See also Pull Request Template","title":"Pull Request Description"},{"location":"code-reviews/pull-requests/#resources","text":"Writing a great pull request description Review code-with pull requests (Azure DevOps) Collaborating with issues and pull requests (GitHub) Google approach to PR size Feature Flags Facebook approach to hidden features Conventional Commits specification Angular Commit types","title":"Resources"},{"location":"code-reviews/tools/","text":"Code Review Tools Customize ADO Task Boards AzDO: Customize cards AzDO: Add columns on task board Reviewer Policies Setting required reviewer group in AzDO - Automatically include code reviewers Configuring Branch Policies AzDO: Configure branch policies AzDO: Configuring branch policies with the CLI tool: Create a policy configuration file Approval count policy GitHub: Configuring protected branches VSCode GitHub: GitHub Pull Requests Supports processing GitHub pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience Azure DevOps: Azure DevOps Pull Requests Supports processing Azure DevOps pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience Visual Studio The following extensions can be used to create an integrated code review experience in Visual Studio working with either GitHub or Azure DevOps. GitHub: GitHub Extension for Visual Studio Provides extended functionality for working with pull requests on GitHub directly out of Visual Studio. View -> Other Windows -> GitHub Click on the Pull Requests icon in the task bar Double click on a pending pull request Azure DevOps: Pull Requests for Visual Studio Work with pull requests on Azure DevOps directly out of Visual Studio. Open Team Explorer Click on Pull Requests Double-click a pull request - the Pull Request Details open Click on Checkout if you want to have the full change locally and have a more integrated experience Go through the changes and make comments Web Reviewable: Seamless multi-round GitHub reviews Supports multi-round GitHub code reviews, with keyboard shortcuts and more. VS Code extension is in-progress. Visit the Review Dashboard to see reviews awaiting your action, that have new comments for you, and more. Select a Pull Request from that list. Open any file in your browser, in Visual Studio Code, or any editor you've configured by clicking on your profile photo in the top-right Select an editor under \"External editor link template\". VS Code is an option, but so is any editor that supports URI's. 
Review the diff on an overall or per-file basis, leaving comments, code suggestions, and more","title":"Code Review Tools"},{"location":"code-reviews/tools/#code-review-tools","text":"","title":"Code Review Tools"},{"location":"code-reviews/tools/#customize-ado","text":"","title":"Customize ADO"},{"location":"code-reviews/tools/#task-boards","text":"AzDO: Customize cards AzDO: Add columns on task board","title":"Task Boards"},{"location":"code-reviews/tools/#reviewer-policies","text":"Setting required reviewer group in AzDO - Automatically include code reviewers","title":"Reviewer Policies"},{"location":"code-reviews/tools/#configuring-branch-policies","text":"AzDO: Configure branch policies AzDO: Configuring branch policies with the CLI tool: Create a policy configuration file Approval count policy GitHub: Configuring protected branches","title":"Configuring Branch Policies"},{"location":"code-reviews/tools/#vscode","text":"","title":"VSCode"},{"location":"code-reviews/tools/#github-github-pull-requests","text":"Supports processing GitHub pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience","title":"GitHub: GitHub Pull Requests"},{"location":"code-reviews/tools/#azure-devops-azure-devops-pull-requests","text":"Supports processing Azure DevOps pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience","title":"Azure DevOps: Azure DevOps Pull Requests"},{"location":"code-reviews/tools/#visual-studio","text":"The following extensions can be used to create an integrated code review experience in Visual Studio working with either GitHub or Azure DevOps.","title":"Visual Studio"},{"location":"code-reviews/tools/#github-github-extension-for-visual-studio","text":"Provides extended functionality for working with pull requests on GitHub directly out of Visual Studio. View -> Other Windows -> GitHub Click on the Pull Requests icon in the task bar Double click on a pending pull request","title":"GitHub: GitHub Extension for Visual Studio"},{"location":"code-reviews/tools/#azure-devops-pull-requests-for-visual-studio","text":"Work with pull requests on Azure DevOps directly out of Visual Studio. Open Team Explorer Click on Pull Requests Double-click a pull request - the Pull Request Details open Click on Checkout if you want to have the full change locally and have a more integrated experience Go through the changes and make comments","title":"Azure DevOps: Pull Requests for Visual Studio"},{"location":"code-reviews/tools/#web","text":"","title":"Web"},{"location":"code-reviews/tools/#reviewable-seamless-multi-round-github-reviews","text":"Supports multi-round GitHub code reviews, with keyboard shortcuts and more. VS Code extension is in-progress. Visit the Review Dashboard to see reviews awaiting your action, that have new comments for you, and more. Select a Pull Request from that list. Open any file in your browser, in Visual Studio Code, or any editor you've configured by clicking on your profile photo in the top-right Select an editor under \"External editor link template\". VS Code is an option, but so is any editor that supports URI's. 
Review the diff on an overall or per-file basis, leaving comments, code suggestions, and more","title":"Reviewable: Seamless multi-round GitHub reviews"},{"location":"code-reviews/evidence-and-measures/","text":"Evidence and Measures Evidence Many of the code quality assurance items can be automated or enforced by policies in modern version control and work item tracking systems. Verification of the policies on the main branch in Azure DevOps (AzDO) or GitHub , for example, may be sufficient evidence that a project team is conducting code reviews. The main branches in all repositories have branch policies. - Configure branch policies All builds produced out of project repositories include appropriate linters, run unit tests. Every bug work item should include a link to the pull request that introduced it, once the error has been diagnosed. This helps with learning. Each bug work item should include a note on how the bug might (or might not have) been caught in a code review. The project team regularly updates their code review checklists to reflect common issues they have encountered. Dev Leads should review a sample of pull requests and/or be co-reviewers with other developers to help everyone improve their skills as code reviewers. Measures The team can collect metrics of code reviews to measure their efficiency. Some useful metrics include: Defect Removal Efficiency (DRE) - a measure of the development team's ability to remove defects prior to release Time metrics: Time used preparing for code inspection sessions Time used in review sessions Lines of code (LOC) inspected per time unit/meeting It is a perfectly reasonable solution to track these metrics manually e.g. in an Excel sheet. It is also possible to utilize the features of project management platforms - for example, AzDO enables dashboards for metrics including tracking bugs . You may find ready-made plugins for various platforms - see GitHub Marketplace for instance - or you can choose to implement these features yourself. Remember that since defects removed thanks to reviews is far less costly compared to finding them in production, the cost of doing code reviews is actually negative! Resources A Guide to Code Inspections","title":"Evidence and Measures"},{"location":"code-reviews/evidence-and-measures/#evidence-and-measures","text":"","title":"Evidence and Measures"},{"location":"code-reviews/evidence-and-measures/#evidence","text":"Many of the code quality assurance items can be automated or enforced by policies in modern version control and work item tracking systems. Verification of the policies on the main branch in Azure DevOps (AzDO) or GitHub , for example, may be sufficient evidence that a project team is conducting code reviews. The main branches in all repositories have branch policies. - Configure branch policies All builds produced out of project repositories include appropriate linters, run unit tests. Every bug work item should include a link to the pull request that introduced it, once the error has been diagnosed. This helps with learning. Each bug work item should include a note on how the bug might (or might not have) been caught in a code review. The project team regularly updates their code review checklists to reflect common issues they have encountered. 
Dev Leads should review a sample of pull requests and/or be co-reviewers with other developers to help everyone improve their skills as code reviewers.","title":"Evidence"},{"location":"code-reviews/evidence-and-measures/#measures","text":"The team can collect metrics of code reviews to measure their efficiency. Some useful metrics include: Defect Removal Efficiency (DRE) - a measure of the development team's ability to remove defects prior to release Time metrics: Time used preparing for code inspection sessions Time used in review sessions Lines of code (LOC) inspected per time unit/meeting It is a perfectly reasonable solution to track these metrics manually e.g. in an Excel sheet. It is also possible to utilize the features of project management platforms - for example, AzDO enables dashboards for metrics including tracking bugs . You may find ready-made plugins for various platforms - see GitHub Marketplace for instance - or you can choose to implement these features yourself. Remember that since defects removed thanks to reviews is far less costly compared to finding them in production, the cost of doing code reviews is actually negative!","title":"Measures"},{"location":"code-reviews/evidence-and-measures/#resources","text":"A Guide to Code Inspections","title":"Resources"},{"location":"code-reviews/process-guidance/","text":"Process Guidance General Guidance Code reviews should be part of the software engineering team process regardless of the development model. Furthermore, the team should learn to execute reviews in a timely manner. Pull requests (PRs) left hanging can cause additional merge problems and go stale resulting in lost work. Qualified PRs are expected to reflect well-defined, concise tasks, and thus be compact in content. Reviewing a single task should then take relatively little time to complete. To ensure that the code review process is healthy, inclusive and meets the goals stated above, consider following these guidelines: Establish a service-level agreement (SLA) for code reviews and add it to your teams working agreement. Although modern DevOps environments incorporate tools for managing PRs, it can be useful to label tasks pending for review or to have a dedicated place for them on the task board - Customize AzDO task boards In the daily standup meeting check tasks pending for review and make sure they have reviewers assigned. Junior teams and teams new to the process can consider creating separate tasks for reviews together with the tasks themselves. Utilize tools to streamline the review process - Code review tools Foster inclusive code reviews - Inclusion in Code Review Measuring Code Review Process If the team is finding that code reviews are taking a significant time to merge, and it is becoming a blocker, consider the following additional recommendations: Measure the average time it takes to merge a PR per sprint cycle. Review during retrospective how the time to merge can be improved and prioritized. Assess the time to merge across sprints to see if the process is improving. Ping required approvers directly as a reminder. Code Reviews Shouldn't Include too Many Lines of Code It's easy to say a developer can review few hundred lines of code, but when the code surpasses certain amount of lines, the effectiveness of defects discovery will decrease and there is a lesser chance of doing a good review. It's not a matter of setting a code line limit, but rather using common sense. 
More code there is to review, the higher chances there are letting a bug sneak through. See PR size guidance . Automate Whenever Reasonable Use automation (linting, code analysis etc.) to avoid the need for \" nits \" and allow the reviewer to focus more on the functional aspects of the PR. By configuring automated builds, tests and checks (something achievable in the CI process ), teams can save human reviewers some time and let them focus in areas like design and functionality for proper evaluation. This will ensure higher chances of success as the team is focusing on the things that matter. Role specific guidance Author Guidance Reviewer Guidance","title":"Process Guidance"},{"location":"code-reviews/process-guidance/#process-guidance","text":"","title":"Process Guidance"},{"location":"code-reviews/process-guidance/#general-guidance","text":"Code reviews should be part of the software engineering team process regardless of the development model. Furthermore, the team should learn to execute reviews in a timely manner. Pull requests (PRs) left hanging can cause additional merge problems and go stale resulting in lost work. Qualified PRs are expected to reflect well-defined, concise tasks, and thus be compact in content. Reviewing a single task should then take relatively little time to complete. To ensure that the code review process is healthy, inclusive and meets the goals stated above, consider following these guidelines: Establish a service-level agreement (SLA) for code reviews and add it to your teams working agreement. Although modern DevOps environments incorporate tools for managing PRs, it can be useful to label tasks pending for review or to have a dedicated place for them on the task board - Customize AzDO task boards In the daily standup meeting check tasks pending for review and make sure they have reviewers assigned. Junior teams and teams new to the process can consider creating separate tasks for reviews together with the tasks themselves. Utilize tools to streamline the review process - Code review tools Foster inclusive code reviews - Inclusion in Code Review","title":"General Guidance"},{"location":"code-reviews/process-guidance/#measuring-code-review-process","text":"If the team is finding that code reviews are taking a significant time to merge, and it is becoming a blocker, consider the following additional recommendations: Measure the average time it takes to merge a PR per sprint cycle. Review during retrospective how the time to merge can be improved and prioritized. Assess the time to merge across sprints to see if the process is improving. Ping required approvers directly as a reminder.","title":"Measuring Code Review Process"},{"location":"code-reviews/process-guidance/#code-reviews-shouldnt-include-too-many-lines-of-code","text":"It's easy to say a developer can review few hundred lines of code, but when the code surpasses certain amount of lines, the effectiveness of defects discovery will decrease and there is a lesser chance of doing a good review. It's not a matter of setting a code line limit, but rather using common sense. More code there is to review, the higher chances there are letting a bug sneak through. See PR size guidance .","title":"Code Reviews Shouldn't Include too Many Lines of Code"},{"location":"code-reviews/process-guidance/#automate-whenever-reasonable","text":"Use automation (linting, code analysis etc.) to avoid the need for \" nits \" and allow the reviewer to focus more on the functional aspects of the PR. 
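As a minimal sketch of what such automation can look like in Azure Pipelines YAML (the linter, test runner and paths below are illustrative assumptions, not required tools), a set of validation steps might be:

```yaml
steps:
  # Lint first so style feedback never has to come from a human reviewer
  - script: |
      pip install flake8 pytest
      flake8 src/
    displayName: 'Run linter'

  # Run unit tests and emit a JUnit report
  - script: pytest tests/ --junitxml=test-results.xml
    displayName: 'Run unit tests'

  # Publish results even when tests fail, so reviewers can see them in the PR checks
  - task: PublishTestResults@2
    displayName: 'Publish test results'
    condition: succeededOrFailed()
    inputs:
      testResultsFiles: '**/test-results.xml'
```

Wiring steps like these into the pipeline that validates pull requests keeps style and test failures out of the review conversation.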
By configuring automated builds, tests and checks (something achievable in the CI process ), teams can save human reviewers some time and let them focus in areas like design and functionality for proper evaluation. This will ensure higher chances of success as the team is focusing on the things that matter.","title":"Automate Whenever Reasonable"},{"location":"code-reviews/process-guidance/#role-specific-guidance","text":"Author Guidance Reviewer Guidance","title":"Role specific guidance"},{"location":"code-reviews/process-guidance/author-guidance/","text":"Author Guidance Properly Describe Your Pull Request (PR) Give the PR a descriptive title, so that other members can easily (in one short sentence) understand what a PR is about. Every PR should have a proper description, that shows the reviewer what has been changed and why. Add Relevant Reviewers Add one or more reviewers (depending on your project's guidelines) to the PR. Ideally, you would add at least someone who has expertise and is familiar with the project, or the language used Adding someone less familiar with the project or the language can aid in verifying the changes are understandable, easy to read, and increases the expertise within the team In ISE code-with projects with a customer team, it is important to include reviewers from both organizations for knowledge transfer - Customize Reviewers Policy Be Open to Receive Feedback Discuss design/code logic and address all comments as follows: Resolve a comment, if the requested change has been made. Mark the comment as \"won't fix\", if you are not going to make the requested changes and provide a clear reasoning If the requested change is within the scope of the task, \"I'll do it later\" is not an acceptable reason! If the requested change is out of scope, create a new work item (task or bug) for it If you don't understand a comment, ask questions in the review itself as opposed to a private chat If a thread gets bloated without a conclusion, have a meeting with the reviewer (call them or knock on door) Use Checklists When creating a PR, it is a good idea to add a checklist of objectives of the PR in the description. This helps the reviewers to focus on the key areas of the code changes. Link a Task to Your PR Link the corresponding work items/tasks to the PR. There is no need to duplicate information between the work item and the PR, but if some details are missing in either one, together they provide more context to the reviewer. Code Should Have Annotations Before the Review If you can't avoid large PRs, include explanations of the changes in order to make it easier for the reviewer to review the code, with clear comments the reviewer can identify the goal of every code block.","title":"Author Guidance"},{"location":"code-reviews/process-guidance/author-guidance/#author-guidance","text":"","title":"Author Guidance"},{"location":"code-reviews/process-guidance/author-guidance/#properly-describe-your-pull-request-pr","text":"Give the PR a descriptive title, so that other members can easily (in one short sentence) understand what a PR is about. Every PR should have a proper description, that shows the reviewer what has been changed and why.","title":"Properly Describe Your Pull Request (PR)"},{"location":"code-reviews/process-guidance/author-guidance/#add-relevant-reviewers","text":"Add one or more reviewers (depending on your project's guidelines) to the PR. 
Ideally, you would add at least someone who has expertise and is familiar with the project, or the language used Adding someone less familiar with the project or the language can aid in verifying the changes are understandable, easy to read, and increases the expertise within the team In ISE code-with projects with a customer team, it is important to include reviewers from both organizations for knowledge transfer - Customize Reviewers Policy","title":"Add Relevant Reviewers"},{"location":"code-reviews/process-guidance/author-guidance/#be-open-to-receive-feedback","text":"Discuss design/code logic and address all comments as follows: Resolve a comment, if the requested change has been made. Mark the comment as \"won't fix\", if you are not going to make the requested changes and provide a clear reasoning If the requested change is within the scope of the task, \"I'll do it later\" is not an acceptable reason! If the requested change is out of scope, create a new work item (task or bug) for it If you don't understand a comment, ask questions in the review itself as opposed to a private chat If a thread gets bloated without a conclusion, have a meeting with the reviewer (call them or knock on door)","title":"Be Open to Receive Feedback"},{"location":"code-reviews/process-guidance/author-guidance/#use-checklists","text":"When creating a PR, it is a good idea to add a checklist of objectives of the PR in the description. This helps the reviewers to focus on the key areas of the code changes.","title":"Use Checklists"},{"location":"code-reviews/process-guidance/author-guidance/#link-a-task-to-your-pr","text":"Link the corresponding work items/tasks to the PR. There is no need to duplicate information between the work item and the PR, but if some details are missing in either one, together they provide more context to the reviewer.","title":"Link a Task to Your PR"},{"location":"code-reviews/process-guidance/author-guidance/#code-should-have-annotations-before-the-review","text":"If you can't avoid large PRs, include explanations of the changes in order to make it easier for the reviewer to review the code, with clear comments the reviewer can identify the goal of every code block.","title":"Code Should Have Annotations Before the Review"},{"location":"code-reviews/process-guidance/reviewer-guidance/","text":"Reviewer Guidance Since parts of reviews can be automated via linters and such, human reviewers can focus on architectural and functional correctness. Human reviewers should focus on: The correctness of the business logic embodied in the code. The correctness of any new or changed tests. The \"readability\" and maintainability of the overall design decisions reflected in the code. The checklist of common errors that the team maintains for each programming language. Code reviews should use the below guidance and checklists to ensure positive and effective code reviews. General Guidance Understand the Code You are Reviewing Read every line changed. If we have a stakeholder review, it\u2019s not necessary to run the PR unless it aids your understanding of the code. AzDO orders the files for you, but you should read the code in some logical sequence to aid understanding. If you don\u2019t fully understand a change in a file because you don\u2019t have context, click to view the whole file and read through the surrounding code or checkout the changes and view them in IDE. Ask the author to clarify. 
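If you prefer to read a change in your IDE, you can check the pull request branch out locally. A minimal sketch, assuming the GitHub CLI (gh) or the Azure DevOps CLI with the azure-devops extension is available, and using a hypothetical PR id of 42:

```sh
# GitHub: fetch the PR's source branch and switch to it locally
gh pr checkout 42

# Azure DevOps: check out the PR's source branch
# (assumes your version of the azure-devops CLI extension provides this command)
az repos pr checkout --id 42
```

Reviewing in the IDE gives you the navigation and search that the web diff lacks, which helps when a change only makes sense alongside its surrounding code.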
Take Your Time and Keep Focus on Scope You shouldn't review code hastily but neither take too long in one sitting. If you have many pull requests (PRs) to review or if the complexity of code is demanding, the recommendation is to take a break between the reviews to recover and focus on the ones you are most experienced with. Always remember that a goal of a code review is to verify that the goals of the corresponding task have been achieved. If you have concerns about the related, adjacent code that isn't in the scope of the PR, address those as separate tasks (e.g., bugs, technical debt). Don't block the current PR due to issues that are out of scope. Foster a Positive Code Review Culture Code reviews play a critical role in product quality and it should not represent an arena for long discussions or even worse a battle of egos. What matters is a bug caught, not who made it, not who found it, not who fixed it. The only thing that matters is having the best possible product. Be Considerate Be positive \u2013 encouraging, appreciation for good practices. Prefix a \u201cpoint of polish\u201d with \u201cNit:\u201d. Avoid language that points fingers like \u201cyou\u201d but rather use \u201cwe\u201d or \u201cthis line\u201d -- code reviews are not personal and language matters. Prefer asking questions above making statements. There might be a good reason for the author to do something. If you make a direct comment, explain why the code needs to be changed, preferably with an example. Talking about changes, you can suggest changes to a PR by using the suggestion feature (available in GitHub and Azure DevOps) or by creating a PR to the author branch. If a few back-and-forth comments don't resolve a disagreement, have a quick talk with each other (in-person or call) or create a group discussion this can lead to an array of improvements for upcoming PRs. Don't forget to update the PR with what you agreed on and why. First Design Pass Pull Request Overview Does the PR description make sense? Do all the changes logically fit in this PR, or are there unrelated changes? If necessary, are the changes made reflected in updates to the README or other docs? Especially if the changes affect how the user builds code. User Facing Changes If the code involves a user-facing change, is there a GIF/photo that explains the functionality? If not, it might be key to validate the PR to ensure the change does what is expected. Ensure UI changes look good without unexpected behavior. Design Do the interactions of the various pieces of code in the PR make sense? Does the code recognize and incorporate architectures and coding patterns? Code Quality Pass Complexity Are functions too complex? Is the single responsibility principle followed? Function or class should do one \u2018thing\u2019. Should a function be broken into multiple functions? If a method has greater than 3 arguments, it is potentially overly complex. Does the code add functionality that isn\u2019t needed? Can the code be understood easily by code readers? Naming/Readability Did the developer pick good names for functions, variables, etc? Error Handling Are errors handled gracefully and explicitly where necessary? Functionality Is there parallel programming in this PR that could cause race conditions? Carefully read through this logic. Could the code be optimized? For example: are there more calls to the database than need be? How does the functionality fit in the bigger picture? Can it have negative effects to the overall system? 
Are there security flaws? Does a variable name reveal any customer specific information? Is PII and EUII treated correctly? Are we logging any PII information? Style Are there extraneous comments? If the code isn\u2019t clear enough to explain itself, then the code should be made simpler. Comments may be there to explain why some code exists. Does the code adhere to the style guide/conventions that we have agreed upon? We use automated styling like black and prettier. Tests Tests should always be committed in the same PR as the code itself (\u2018I\u2019ll add tests next\u2019 is not acceptable). Make sure tests are sensible and valid assumptions are made. Make sure edge cases are handled as well. Tests can be a great source to understand the changes. It can be a strategy to look at tests first to help you understand the changes better.","title":"Reviewer Guidance"},{"location":"code-reviews/process-guidance/reviewer-guidance/#reviewer-guidance","text":"Since parts of reviews can be automated via linters and such, human reviewers can focus on architectural and functional correctness. Human reviewers should focus on: The correctness of the business logic embodied in the code. The correctness of any new or changed tests. The \"readability\" and maintainability of the overall design decisions reflected in the code. The checklist of common errors that the team maintains for each programming language. Code reviews should use the below guidance and checklists to ensure positive and effective code reviews.","title":"Reviewer Guidance"},{"location":"code-reviews/process-guidance/reviewer-guidance/#general-guidance","text":"","title":"General Guidance"},{"location":"code-reviews/process-guidance/reviewer-guidance/#understand-the-code-you-are-reviewing","text":"Read every line changed. If we have a stakeholder review, it\u2019s not necessary to run the PR unless it aids your understanding of the code. AzDO orders the files for you, but you should read the code in some logical sequence to aid understanding. If you don\u2019t fully understand a change in a file because you don\u2019t have context, click to view the whole file and read through the surrounding code or checkout the changes and view them in IDE. Ask the author to clarify.","title":"Understand the Code You are Reviewing"},{"location":"code-reviews/process-guidance/reviewer-guidance/#take-your-time-and-keep-focus-on-scope","text":"You shouldn't review code hastily but neither take too long in one sitting. If you have many pull requests (PRs) to review or if the complexity of code is demanding, the recommendation is to take a break between the reviews to recover and focus on the ones you are most experienced with. Always remember that a goal of a code review is to verify that the goals of the corresponding task have been achieved. If you have concerns about the related, adjacent code that isn't in the scope of the PR, address those as separate tasks (e.g., bugs, technical debt). Don't block the current PR due to issues that are out of scope.","title":"Take Your Time and Keep Focus on Scope"},{"location":"code-reviews/process-guidance/reviewer-guidance/#foster-a-positive-code-review-culture","text":"Code reviews play a critical role in product quality and it should not represent an arena for long discussions or even worse a battle of egos. What matters is a bug caught, not who made it, not who found it, not who fixed it. 
The only thing that matters is having the best possible product.","title":"Foster a Positive Code Review Culture"},{"location":"code-reviews/process-guidance/reviewer-guidance/#be-considerate","text":"Be positive \u2013 encouraging, appreciation for good practices. Prefix a \u201cpoint of polish\u201d with \u201cNit:\u201d. Avoid language that points fingers like \u201cyou\u201d but rather use \u201cwe\u201d or \u201cthis line\u201d -- code reviews are not personal and language matters. Prefer asking questions above making statements. There might be a good reason for the author to do something. If you make a direct comment, explain why the code needs to be changed, preferably with an example. Talking about changes, you can suggest changes to a PR by using the suggestion feature (available in GitHub and Azure DevOps) or by creating a PR to the author branch. If a few back-and-forth comments don't resolve a disagreement, have a quick talk with each other (in-person or call) or create a group discussion this can lead to an array of improvements for upcoming PRs. Don't forget to update the PR with what you agreed on and why.","title":"Be Considerate"},{"location":"code-reviews/process-guidance/reviewer-guidance/#first-design-pass","text":"","title":"First Design Pass"},{"location":"code-reviews/process-guidance/reviewer-guidance/#pull-request-overview","text":"Does the PR description make sense? Do all the changes logically fit in this PR, or are there unrelated changes? If necessary, are the changes made reflected in updates to the README or other docs? Especially if the changes affect how the user builds code.","title":"Pull Request Overview"},{"location":"code-reviews/process-guidance/reviewer-guidance/#user-facing-changes","text":"If the code involves a user-facing change, is there a GIF/photo that explains the functionality? If not, it might be key to validate the PR to ensure the change does what is expected. Ensure UI changes look good without unexpected behavior.","title":"User Facing Changes"},{"location":"code-reviews/process-guidance/reviewer-guidance/#design","text":"Do the interactions of the various pieces of code in the PR make sense? Does the code recognize and incorporate architectures and coding patterns?","title":"Design"},{"location":"code-reviews/process-guidance/reviewer-guidance/#code-quality-pass","text":"","title":"Code Quality Pass"},{"location":"code-reviews/process-guidance/reviewer-guidance/#complexity","text":"Are functions too complex? Is the single responsibility principle followed? Function or class should do one \u2018thing\u2019. Should a function be broken into multiple functions? If a method has greater than 3 arguments, it is potentially overly complex. Does the code add functionality that isn\u2019t needed? Can the code be understood easily by code readers?","title":"Complexity"},{"location":"code-reviews/process-guidance/reviewer-guidance/#namingreadability","text":"Did the developer pick good names for functions, variables, etc?","title":"Naming/Readability"},{"location":"code-reviews/process-guidance/reviewer-guidance/#error-handling","text":"Are errors handled gracefully and explicitly where necessary?","title":"Error Handling"},{"location":"code-reviews/process-guidance/reviewer-guidance/#functionality","text":"Is there parallel programming in this PR that could cause race conditions? Carefully read through this logic. Could the code be optimized? For example: are there more calls to the database than need be? 
How does the functionality fit in the bigger picture? Can it have negative effects to the overall system? Are there security flaws? Does a variable name reveal any customer specific information? Is PII and EUII treated correctly? Are we logging any PII information?","title":"Functionality"},{"location":"code-reviews/process-guidance/reviewer-guidance/#style","text":"Are there extraneous comments? If the code isn\u2019t clear enough to explain itself, then the code should be made simpler. Comments may be there to explain why some code exists. Does the code adhere to the style guide/conventions that we have agreed upon? We use automated styling like black and prettier.","title":"Style"},{"location":"code-reviews/process-guidance/reviewer-guidance/#tests","text":"Tests should always be committed in the same PR as the code itself (\u2018I\u2019ll add tests next\u2019 is not acceptable). Make sure tests are sensible and valid assumptions are made. Make sure edge cases are handled as well. Tests can be a great source to understand the changes. It can be a strategy to look at tests first to help you understand the changes better.","title":"Tests"},{"location":"code-reviews/recipes/azure-pipelines-yaml/","text":"YAML(Azure Pipelines) Code Reviews Style Guide Developers should follow the YAML schema reference . Code Analysis / Linting The most popular YAML linter is YAML extension. This extension provides YAML validation, document outlining, auto-completion, hover support and formatter features. VS Code Extensions There is an Azure Pipelines for VS Code extension to add syntax highlighting and autocompletion for Azure Pipelines YAML to VS Code. It also helps you set up continuous build and deployment for Azure WebApps without leaving VS Code. YAML in Azure Pipelines Overview When the pipeline is triggered, before running the pipeline, there are a few phases such as Queue Time, Compile Time and Runtime where variables are interpreted by their runtime expression syntax . When the pipeline is triggered, all nested YAML files are expanded to run in Azure Pipelines. This checklist contains some tips and tricks for reviewing all nested YAML files. These documents may be useful when reviewing YAML files: Azure Pipelines YAML documentation . Pipeline run sequence Key concepts for new Azure Pipelines Key concepts overview A trigger tells a Pipeline to run. A pipeline is made up of one or more stages. A pipeline can deploy to one or more environments. A stage is a way of organizing jobs in a pipeline and each stage can have one or more jobs. Each job runs on one agent. A job can also be agentless. Each agent runs a job that contains one or more steps. A step can be a task or script and is the smallest building block of a pipeline. A task is a pre-packaged script that performs an action, such as invoking a REST API or publishing a build artifact. An artifact is a collection of files or packages published by a run. Code Review Checklist In addition to the Code Review Checklist you should also look for these Azure Pipelines YAML specific code review items. Pipeline Structure The steps are well understood and components are easily identifiable. Ensure that there is a proper description displayName: for every step in the pipeline. Steps/stages of the pipeline are checked in Azure Pipelines to have more understanding of components. In case you have complex nested YAML files, The pipeline in Azure Pipelines is edited to find trigger root file. 
All the template file references are visited to ensure a small change does not cause breaking changes, changing one file may affect multiple pipelines Long inline scripts in YAML file are moved into script files YAML Structure Re-usable components are split into separate YAML templates. Variables are separated per environment stored in templates or variable groups. Variable value changes in Queue Time , Compile Time and Runtime are considered. Variable syntax values used with Macro Syntax , Template Expression Syntax and Runtime Expression Syntax are considered. Variables can change during the pipeline, Parameters cannot. Unused variables/parameters are removed in pipeline. Does the pipeline meet with stage/job Conditions criteria? Permission Check & Security Secret values shouldn't be printed in pipeline, issecret is used for printing secrets for debugging If pipeline is using variable groups in Library, ensure pipeline has access to the variable groups created. If pipeline has a remote task in other repo/organization, does it have access? If pipeline is trying to access a secure file, does it have the permission? If pipeline requires approval for environment deployments, Who is the approver? Does it need to keep secrets and manage them, did you consider using Azure KeyVault? Troubleshooting Tips Consider Variable Syntax with Runtime Expressions in the pipeline. Here is a nice sample to understand Expansion of variables . When we assign variable like below it won't set during initialize time, it'll assign during runtime, then we can retrieve some errors based on when template runs. - task : AzureWebApp@1 displayName : 'Deploy Azure Web App : $(webAppName)' inputs : azureSubscription : '$(azureServiceConnectionId)' appName : '$(webAppName)' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Error: After passing these variables as parameter, it loads values properly. - template : steps-deployment.yaml parameters : azureServiceConnectionId : ${{ variables.azureServiceConnectionId }} webAppName : ${{ variables.webAppName }} - task : AzureWebApp@1 displayName : 'Deploy Azure Web App :${{ parameters.webAppName }}' inputs : azureSubscription : '${{ parameters.azureServiceConnectionId }}' appName : '${{ parameters.webAppName }}' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Use issecret for printing secrets for debugging echo \"##vso[task.setvariable variable=token;issecret=true] ${ token } \"","title":"YAML(Azure Pipelines) Code Reviews"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#yamlazure-pipelines-code-reviews","text":"","title":"YAML(Azure Pipelines) Code Reviews"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#style-guide","text":"Developers should follow the YAML schema reference .","title":"Style Guide"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#code-analysis-linting","text":"The most popular YAML linter is YAML extension. This extension provides YAML validation, document outlining, auto-completion, hover support and formatter features.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#vs-code-extensions","text":"There is an Azure Pipelines for VS Code extension to add syntax highlighting and autocompletion for Azure Pipelines YAML to VS Code. 
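The YAML Structure checklist above asks reviewers to consider when each variable syntax is evaluated. The sketch below uses a hypothetical environment variable to show the three syntaxes side by side; it illustrates the evaluation rules and is not a snippet from the original recipe.

```yaml
variables:
  environment: 'dev'
  # Runtime expression syntax $[ ... ] is evaluated at runtime and must make up
  # the entire right-hand side of a variable definition (or a condition).
  environmentCopy: $[ variables.environment ]

steps:
  # Macro syntax $(var) is expanded just before each task runs.
  - script: echo "Deploying to $(environment)"
    displayName: 'Macro syntax'

  # Template expression syntax ${{ ... }} is expanded at compile time,
  # when the YAML and any nested templates are processed.
  - script: echo "Compile-time value is ${{ variables.environment }}"
    displayName: 'Template expression syntax'

  - script: echo "Runtime copy is $(environmentCopy)"
    displayName: 'Runtime expression syntax'
```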
It also helps you set up continuous build and deployment for Azure WebApps without leaving VS Code.","title":"VS Code Extensions"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#yaml-in-azure-pipelines-overview","text":"When the pipeline is triggered, before running the pipeline, there are a few phases such as Queue Time, Compile Time and Runtime where variables are interpreted by their runtime expression syntax . When the pipeline is triggered, all nested YAML files are expanded to run in Azure Pipelines. This checklist contains some tips and tricks for reviewing all nested YAML files. These documents may be useful when reviewing YAML files: Azure Pipelines YAML documentation . Pipeline run sequence Key concepts for new Azure Pipelines Key concepts overview A trigger tells a Pipeline to run. A pipeline is made up of one or more stages. A pipeline can deploy to one or more environments. A stage is a way of organizing jobs in a pipeline and each stage can have one or more jobs. Each job runs on one agent. A job can also be agentless. Each agent runs a job that contains one or more steps. A step can be a task or script and is the smallest building block of a pipeline. A task is a pre-packaged script that performs an action, such as invoking a REST API or publishing a build artifact. An artifact is a collection of files or packages published by a run.","title":"YAML in Azure Pipelines Overview"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these Azure Pipelines YAML specific code review items.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#pipeline-structure","text":"The steps are well understood and components are easily identifiable. Ensure that there is a proper description displayName: for every step in the pipeline. Steps/stages of the pipeline are checked in Azure Pipelines to have more understanding of components. In case you have complex nested YAML files, The pipeline in Azure Pipelines is edited to find trigger root file. All the template file references are visited to ensure a small change does not cause breaking changes, changing one file may affect multiple pipelines Long inline scripts in YAML file are moved into script files","title":"Pipeline Structure"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#yaml-structure","text":"Re-usable components are split into separate YAML templates. Variables are separated per environment stored in templates or variable groups. Variable value changes in Queue Time , Compile Time and Runtime are considered. Variable syntax values used with Macro Syntax , Template Expression Syntax and Runtime Expression Syntax are considered. Variables can change during the pipeline, Parameters cannot. Unused variables/parameters are removed in pipeline. Does the pipeline meet with stage/job Conditions criteria?","title":"YAML Structure"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#permission-check-security","text":"Secret values shouldn't be printed in pipeline, issecret is used for printing secrets for debugging If pipeline is using variable groups in Library, ensure pipeline has access to the variable groups created. If pipeline has a remote task in other repo/organization, does it have access? If pipeline is trying to access a secure file, does it have the permission? If pipeline requires approval for environment deployments, Who is the approver? 
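For the Permission Check & Security items above, the sketch below shows one way the pieces can fit together: a variable group from the Library, secrets pulled from Azure Key Vault, and issecret used before a value is logged or propagated. The group, service connection, vault, and secret names are hypothetical, and the AzureKeyVault task inputs should be verified against the current task reference.

```yaml
variables:
  - group: my-variable-group       # hypothetical group; grant the pipeline access in Library first

steps:
  # Keep secrets in Key Vault rather than in the pipeline definition.
  - task: AzureKeyVault@2
    displayName: 'Fetch secrets from Key Vault'
    inputs:
      azureSubscription: 'my-service-connection'   # hypothetical service connection
      KeyVaultName: 'my-key-vault'                 # hypothetical vault
      SecretsFilter: 'apiToken'

  # Mark derived values as secret so they are masked in logs.
  - bash: |
      echo "##vso[task.setvariable variable=maskedToken;issecret=true]${RAW_TOKEN}"
    displayName: 'Register token as a secret variable'
    env:
      RAW_TOKEN: $(apiToken)
```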
Does it need to keep secrets and manage them, did you consider using Azure KeyVault?","title":"Permission Check & Security"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#troubleshooting-tips","text":"Consider Variable Syntax with Runtime Expressions in the pipeline. Here is a nice sample to understand Expansion of variables . When we assign variable like below it won't set during initialize time, it'll assign during runtime, then we can retrieve some errors based on when template runs. - task : AzureWebApp@1 displayName : 'Deploy Azure Web App : $(webAppName)' inputs : azureSubscription : '$(azureServiceConnectionId)' appName : '$(webAppName)' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Error: After passing these variables as parameter, it loads values properly. - template : steps-deployment.yaml parameters : azureServiceConnectionId : ${{ variables.azureServiceConnectionId }} webAppName : ${{ variables.webAppName }} - task : AzureWebApp@1 displayName : 'Deploy Azure Web App :${{ parameters.webAppName }}' inputs : azureSubscription : '${{ parameters.azureServiceConnectionId }}' appName : '${{ parameters.webAppName }}' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Use issecret for printing secrets for debugging echo \"##vso[task.setvariable variable=token;issecret=true] ${ token } \"","title":"Troubleshooting Tips"},{"location":"code-reviews/recipes/bash/","text":"Bash Code Reviews Style Guide Developers should follow Google's Bash Style Guide . Code Analysis / Linting Projects must check bash code with shellcheck as part of the CI process . Apart from linting, shfmt can be used to automatically format shell scripts. There are few vscode code extensions which are based on shfmt like shell-format which can be used to automatically format shell scripts. Project Setup vscode-shellcheck Shellcheck extension should be used in VS Code, it provides static code analysis capabilities and auto fixing linting issues. To use vscode-shellcheck in vscode do the following: Install shellcheck on Your Machine For macOS brew install shellcheck For Ubuntu: apt-get install shellcheck Install shellcheck on VSCode Find the vscode-shellcheck extension in vscode and install it. Automatic Code Formatting shell-format shell-format extension does automatic formatting of your bash scripts, docker files and several configuration files. It is dependent on shfmt which can enforce google style guide checks for bash. To use shell-format in vscode do the following: Install shfmt on Your Machine Requires Go 1.13 or Later GO111MODULE = on go get mvdan.cc/sh/v3/cmd/shfmt Install shell-format on VSCode Find the shell-format extension in vscode and install it. Build Validation To automate this process in Azure DevOps you can add the following snippet to you azure-pipelines.yaml file. This will lint any scripts in the ./scripts/ folder. - bash : | echo \"This checks for formatting and common bash errors. 
See wiki for error details and ignore options: https://github.com/koalaman/shellcheck/wiki/SC1000\" export scversion=\"stable\" wget -qO- \"https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz\" | tar -xJv sudo mv \"shellcheck-${scversion}/shellcheck\" /usr/bin/ rm -r \"shellcheck-${scversion}\" shellcheck ./scripts/*.sh displayName : \"Validate Scripts: Shellcheck\" Also, your shell scripts can be formatted in your build pipeline by using the shfmt tool. To integrate shfmt in your build pipeline do the following: - bash : | echo \"This step does auto formatting of shell scripts\" shfmt -l -w ./scripts/*.sh displayName : \"Format Scripts: shfmt\" Unit testing using shunit2 can also be added to the build pipeline, using the following block: - bash : | echo \"This step unit tests shell scripts by using shunit2\" ./shunit2 displayName : \"Format Scripts: shfmt\" Pre-Commit Hooks All developers should run shellcheck and shfmt as pre-commit hooks. Step 1- Install pre-commit Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew. Step 2- Add shellcheck and shfmt Add .pre-commit-config.yaml file to root of the go project. Run shfmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/pecigonzalo/pre-commit-fmt sha : master hooks : - id : shell-fmt args : - --indent=4 - repo : https://github.com/shellcheck-py/shellcheck-py rev : v0.7.1.1 hooks : - id : shellcheck Step 3 Run $ pre-commit install to set up the git hook scripts Dependencies Bash scripts are often used to 'glue together' other systems and tools. As such, Bash scripts can often have numerous and/or complicated dependencies. Consider using Docker containers to ensure that scripts are executed in a portable and reproducible environment that is guaranteed to contain all the correct dependencies. To ensure that dockerized scripts are nevertheless easy to execute, consider making the use of Docker transparent to the script's caller by wrapping the script in a 'bootstrap' which checks whether the script is running in Docker and re-executes itself in Docker if it's not the case. This provides the best of both worlds: easy script execution and consistent environments. if [[ \" ${ DOCKER } \" ! = \"true\" ]] ; then docker build -t my_script -f my_script.Dockerfile . > /dev/null docker run -e DOCKER = true my_script \" $@ \" exit $? fi # ... implementation of my_script here can assume that all of its dependencies exist since it's always running in Docker ... Code Review Checklist In addition to the Code Review Checklist you should also look for these bash specific code review items Does this code use Built-in Shell Options like set -o, set -e, set -u for execution control of shell scripts ? Is the code modularized? Shell scripts can be modularized like python modules. Portions of bash scripts should be sourced in complex bash projects. Are all exceptions handled correctly? Exceptions should be handled correctly using exit codes or trapping signals. Does the code pass all linting checks as per shellcheck and unit tests as per shunit2 ? Does the code uses relative paths or absolute paths? Relative paths should be avoided as they are prone to environment attacks. If relative path is needed, check that the PATH variable is set. Does the code take credentials as user input? Are the credentials masked or encrypted in the script? 
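As a consolidated view of the Build Validation guidance above, the sketch below runs shellcheck, shfmt, and shunit2 as separate steps with display names that describe what each step actually does (the shunit2 step in the snippet above reuses the shfmt label). It assumes the three tools are already installed on the agent and that scripts live under ./scripts/.

```yaml
steps:
  - bash: shellcheck ./scripts/*.sh
    displayName: 'Lint scripts: shellcheck'

  - bash: shfmt -d ./scripts/*.sh        # -d prints diffs instead of rewriting files in CI
    displayName: 'Check formatting: shfmt'

  - bash: ./shunit2
    displayName: 'Unit test scripts: shunit2'
```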
S","title":"Bash Code Reviews"},{"location":"code-reviews/recipes/bash/#bash-code-reviews","text":"","title":"Bash Code Reviews"},{"location":"code-reviews/recipes/bash/#style-guide","text":"Developers should follow Google's Bash Style Guide .","title":"Style Guide"},{"location":"code-reviews/recipes/bash/#code-analysis-linting","text":"Projects must check bash code with shellcheck as part of the CI process . Apart from linting, shfmt can be used to automatically format shell scripts. There are few vscode code extensions which are based on shfmt like shell-format which can be used to automatically format shell scripts.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/bash/#project-setup","text":"","title":"Project Setup"},{"location":"code-reviews/recipes/bash/#vscode-shellcheck","text":"Shellcheck extension should be used in VS Code, it provides static code analysis capabilities and auto fixing linting issues. To use vscode-shellcheck in vscode do the following:","title":"vscode-shellcheck"},{"location":"code-reviews/recipes/bash/#install-shellcheck-on-your-machine","text":"For macOS brew install shellcheck For Ubuntu: apt-get install shellcheck","title":"Install shellcheck on Your Machine"},{"location":"code-reviews/recipes/bash/#install-shellcheck-on-vscode","text":"Find the vscode-shellcheck extension in vscode and install it.","title":"Install shellcheck on VSCode"},{"location":"code-reviews/recipes/bash/#automatic-code-formatting","text":"","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/bash/#shell-format","text":"shell-format extension does automatic formatting of your bash scripts, docker files and several configuration files. It is dependent on shfmt which can enforce google style guide checks for bash. To use shell-format in vscode do the following:","title":"shell-format"},{"location":"code-reviews/recipes/bash/#install-shfmt-on-your-machine","text":"Requires Go 1.13 or Later GO111MODULE = on go get mvdan.cc/sh/v3/cmd/shfmt","title":"Install shfmt on Your Machine"},{"location":"code-reviews/recipes/bash/#install-shell-format-on-vscode","text":"Find the shell-format extension in vscode and install it.","title":"Install shell-format on VSCode"},{"location":"code-reviews/recipes/bash/#build-validation","text":"To automate this process in Azure DevOps you can add the following snippet to you azure-pipelines.yaml file. This will lint any scripts in the ./scripts/ folder. - bash : | echo \"This checks for formatting and common bash errors. See wiki for error details and ignore options: https://github.com/koalaman/shellcheck/wiki/SC1000\" export scversion=\"stable\" wget -qO- \"https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz\" | tar -xJv sudo mv \"shellcheck-${scversion}/shellcheck\" /usr/bin/ rm -r \"shellcheck-${scversion}\" shellcheck ./scripts/*.sh displayName : \"Validate Scripts: Shellcheck\" Also, your shell scripts can be formatted in your build pipeline by using the shfmt tool. 
To integrate shfmt in your build pipeline do the following: - bash : | echo \"This step does auto formatting of shell scripts\" shfmt -l -w ./scripts/*.sh displayName : \"Format Scripts: shfmt\" Unit testing using shunit2 can also be added to the build pipeline, using the following block: - bash : | echo \"This step unit tests shell scripts by using shunit2\" ./shunit2 displayName : \"Format Scripts: shfmt\"","title":"Build Validation"},{"location":"code-reviews/recipes/bash/#pre-commit-hooks","text":"All developers should run shellcheck and shfmt as pre-commit hooks.","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/bash/#step-1-install-pre-commit","text":"Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew.","title":"Step 1- Install pre-commit"},{"location":"code-reviews/recipes/bash/#step-2-add-shellcheck-and-shfmt","text":"Add .pre-commit-config.yaml file to root of the go project. Run shfmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/pecigonzalo/pre-commit-fmt sha : master hooks : - id : shell-fmt args : - --indent=4 - repo : https://github.com/shellcheck-py/shellcheck-py rev : v0.7.1.1 hooks : - id : shellcheck","title":"Step 2- Add shellcheck and shfmt"},{"location":"code-reviews/recipes/bash/#step-3","text":"Run $ pre-commit install to set up the git hook scripts","title":"Step 3"},{"location":"code-reviews/recipes/bash/#dependencies","text":"Bash scripts are often used to 'glue together' other systems and tools. As such, Bash scripts can often have numerous and/or complicated dependencies. Consider using Docker containers to ensure that scripts are executed in a portable and reproducible environment that is guaranteed to contain all the correct dependencies. To ensure that dockerized scripts are nevertheless easy to execute, consider making the use of Docker transparent to the script's caller by wrapping the script in a 'bootstrap' which checks whether the script is running in Docker and re-executes itself in Docker if it's not the case. This provides the best of both worlds: easy script execution and consistent environments. if [[ \" ${ DOCKER } \" ! = \"true\" ]] ; then docker build -t my_script -f my_script.Dockerfile . > /dev/null docker run -e DOCKER = true my_script \" $@ \" exit $? fi # ... implementation of my_script here can assume that all of its dependencies exist since it's always running in Docker ...","title":"Dependencies"},{"location":"code-reviews/recipes/bash/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these bash specific code review items Does this code use Built-in Shell Options like set -o, set -e, set -u for execution control of shell scripts ? Is the code modularized? Shell scripts can be modularized like python modules. Portions of bash scripts should be sourced in complex bash projects. Are all exceptions handled correctly? Exceptions should be handled correctly using exit codes or trapping signals. Does the code pass all linting checks as per shellcheck and unit tests as per shunit2 ? Does the code uses relative paths or absolute paths? Relative paths should be avoided as they are prone to environment attacks. If relative path is needed, check that the PATH variable is set. Does the code take credentials as user input? Are the credentials masked or encrypted in the script? 
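One caveat on the pre-commit configuration shown above: GitHub no longer serves unauthenticated git:// URLs, so the git://github.com/pecigonzalo/pre-commit-fmt reference is likely to fail on a fresh clone. A hedged variant using https:// clones and the current rev key is sketched below; the repository paths are taken from the snippet above and the pinned revisions are illustrative.

```yaml
repos:
  - repo: https://github.com/pecigonzalo/pre-commit-fmt
    rev: master                 # prefer pinning to a tag or commit SHA
    hooks:
      - id: shell-fmt
        args:
          - --indent=4

  - repo: https://github.com/shellcheck-py/shellcheck-py
    rev: v0.7.1.1
    hooks:
      - id: shellcheck
```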
S","title":"Code Review Checklist"},{"location":"code-reviews/recipes/csharp/","text":"C# Code Reviews Style Guide Developers should follow Microsoft's C# Coding Conventions and, where applicable, Microsoft's Secure Coding Guidelines . Code Analysis / Linting We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers / linters to enforce consistency and style rules. Project Setup We recommend using a common setup for your solution that you can refer to in all the projects that are part of the solution. Create a common.props file that contains the defaults for all of your projects: ... all runtime; build; native; contentfiles; analyzers; buildtransitive all runtime; build; native; contentfiles; analyzers; buildtransitive true ... You can then reference the common.props in your other project files to ensure a consistent setup. The .editorconfig allows for configuration and overrides of rules. You can have an .editorconfig file at project level to customize rules for different projects (test projects for example). Details about the configuration of different rules . .NET analyzers Microsoft's .NET analyzers has code quality rules and .NET API usage rules implemented as analyzers using the .NET Compiler Platform (Roslyn). This is the replacement for Microsoft's legacy FxCop analyzers. Enable or install first-party .NET analyzers . If you are currently using the legacy FxCop analyzers, migrate from FxCop analyzers to .NET analyzers . StyleCop Analyzer The StyleCop analyzer is a nuget package (StyleCop.Analyzers) that can be installed in any of your projects. It's mainly around code style rules and makes sure the team is following the same rules without having subjective discussions about braces and spaces. Detailed information can be found here: StyleCop Analyzers for the .NET Compiler Platform . The minimum rules set teams should adopt is the Managed Recommended Rules rule set. Automatic Code Formatting Use .editorconfig to configure code formatting rules in your project. Build Validation It's important that you enforce your code style and rules in the CI to avoid any team member merging code that does not comply with your standards into your git repo. If you are using FxCop analyzers and StyleCop analyzer, it's very simple to enable those in the CI. You have to make sure you are setting up the project using nuget and .editorconfig ( see Project setup ). Once you have this setup, you will have to configure the pipeline to build your code. That's pretty much it. The FxCop analyzers will run and report the result in your build pipeline. If there are rules that are violated, your build will be red. - task : DotNetCoreCLI@2 displayName : 'Style Check & Build' inputs : command : 'build' projects : '**/*.csproj' Enable Roslyn Support in VSCode The above steps also work in VS Code provided you enable Roslyn support for Omnisharp. The setting is omnisharp.enableRoslynAnalyzers and must be set to true . After enabling this setting you must \"Restart Omnisharp\" (this can be done from the Command Palette in VS Code or by restarting VS Code). Code Review Checklist In addition to the Code Review Checklist you should also look for these C# specific code review items Does this code make correct use of asynchronous programming constructs , including proper use of await and Task.WhenAll including CancellationTokens? Is the code subject to concurrency issues? Are shared objects properly protected? Is dependency injection (DI) used? 
Is it setup correctly? Are middleware included in this project configured correctly? Are resources released deterministically using the IDispose pattern? Are all disposable objects properly disposed ( using pattern )? Is the code creating a lot of short-lived objects. Could we optimize GC pressure? Is the code written in a way that causes boxing operations to happen? Does the code handle exceptions correctly ? Is package management being used (NuGet) instead of committing DLLs? Does this code use LINQ appropriately? Pulling LINQ into a project to replace a single short loop or in ways that do not perform well are usually not appropriate. Does this code properly validate arguments sanity (i.e. CA1062 )? Consider leveraging extensions such as Ensure.That Does this code include telemetry ( metrics, tracing and logging ) instrumentation? Does this code leverage the options design pattern by using classes to provide strongly typed access to groups of related settings? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why this is here. If the number is repetitive, is there a constant/enum or equivalent? Is proper exception handling set up? Catching the exception base class ( catch (Exception) ) is generally not the right pattern. Instead, catch the specific exceptions that can happen e.g., IOException . Is the use of #pragma fair? Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? If there is an asynchronous method, does the name of the method end with the Async suffix? If a method is asynchronous, is Task.Delay used instead of Thread.Sleep ? Task.Delay is not blocking the current thread and creates a task that will complete without blocking the thread, so in a multi-threaded, multi-task environment, this is the one to prefer. Is a cancellation token for asynchronous tasks needed rather than bool patterns? Is a minimum level of logging in place? Are the logging levels used sensible? Are internal vs private vs public classes and methods used the right way? Are auto property set and get used the right way? In a model without constructor and for deserialization, it is ok to have all accessible. For other classes usually a private set or internal set is better. Is the using pattern for streams and other disposable classes used? If not, better to have the Dispose method called explicitly. Are the classes that maintain collections in memory, thread safe? When used under concurrency, use lock pattern.","title":"C# Code Reviews"},{"location":"code-reviews/recipes/csharp/#c-code-reviews","text":"","title":"C# Code Reviews"},{"location":"code-reviews/recipes/csharp/#style-guide","text":"Developers should follow Microsoft's C# Coding Conventions and, where applicable, Microsoft's Secure Coding Guidelines .","title":"Style Guide"},{"location":"code-reviews/recipes/csharp/#code-analysis-linting","text":"We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers / linters to enforce consistency and style rules.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/csharp/#project-setup","text":"We recommend using a common setup for your solution that you can refer to in all the projects that are part of the solution. 
Create a common.props file that contains the defaults for all of your projects: ... all runtime; build; native; contentfiles; analyzers; buildtransitive all runtime; build; native; contentfiles; analyzers; buildtransitive true ... You can then reference the common.props in your other project files to ensure a consistent setup. The .editorconfig allows for configuration and overrides of rules. You can have an .editorconfig file at project level to customize rules for different projects (test projects for example). Details about the configuration of different rules .","title":"Project Setup"},{"location":"code-reviews/recipes/csharp/#net-analyzers","text":"Microsoft's .NET analyzers has code quality rules and .NET API usage rules implemented as analyzers using the .NET Compiler Platform (Roslyn). This is the replacement for Microsoft's legacy FxCop analyzers. Enable or install first-party .NET analyzers . If you are currently using the legacy FxCop analyzers, migrate from FxCop analyzers to .NET analyzers .","title":".NET analyzers"},{"location":"code-reviews/recipes/csharp/#stylecop-analyzer","text":"The StyleCop analyzer is a nuget package (StyleCop.Analyzers) that can be installed in any of your projects. It's mainly around code style rules and makes sure the team is following the same rules without having subjective discussions about braces and spaces. Detailed information can be found here: StyleCop Analyzers for the .NET Compiler Platform . The minimum rules set teams should adopt is the Managed Recommended Rules rule set.","title":"StyleCop Analyzer"},{"location":"code-reviews/recipes/csharp/#automatic-code-formatting","text":"Use .editorconfig to configure code formatting rules in your project.","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/csharp/#build-validation","text":"It's important that you enforce your code style and rules in the CI to avoid any team member merging code that does not comply with your standards into your git repo. If you are using FxCop analyzers and StyleCop analyzer, it's very simple to enable those in the CI. You have to make sure you are setting up the project using nuget and .editorconfig ( see Project setup ). Once you have this setup, you will have to configure the pipeline to build your code. That's pretty much it. The FxCop analyzers will run and report the result in your build pipeline. If there are rules that are violated, your build will be red. - task : DotNetCoreCLI@2 displayName : 'Style Check & Build' inputs : command : 'build' projects : '**/*.csproj'","title":"Build Validation"},{"location":"code-reviews/recipes/csharp/#enable-roslyn-support-in-vscode","text":"The above steps also work in VS Code provided you enable Roslyn support for Omnisharp. The setting is omnisharp.enableRoslynAnalyzers and must be set to true . After enabling this setting you must \"Restart Omnisharp\" (this can be done from the Command Palette in VS Code or by restarting VS Code).","title":"Enable Roslyn Support in VSCode"},{"location":"code-reviews/recipes/csharp/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these C# specific code review items Does this code make correct use of asynchronous programming constructs , including proper use of await and Task.WhenAll including CancellationTokens? Is the code subject to concurrency issues? Are shared objects properly protected? Is dependency injection (DI) used? Is it setup correctly? Are middleware included in this project configured correctly? 
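Building on the Build Validation step shown above, the sketch below also fails the build when analyzer warnings are raised and runs the test projects. The -warnaserror MSBuild switch and the *Tests.csproj naming convention are assumptions rather than part of the original recipe; if common.props already sets TreatWarningsAsErrors, the extra argument is unnecessary.

```yaml
- task: DotNetCoreCLI@2
  displayName: 'Style Check & Build'
  inputs:
    command: 'build'
    projects: '**/*.csproj'
    arguments: '-warnaserror'        # turn analyzer warnings into build errors

- task: DotNetCoreCLI@2
  displayName: 'Run unit tests'
  inputs:
    command: 'test'
    projects: '**/*Tests.csproj'     # hypothetical test project naming convention
```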
Are resources released deterministically using the IDispose pattern? Are all disposable objects properly disposed ( using pattern )? Is the code creating a lot of short-lived objects. Could we optimize GC pressure? Is the code written in a way that causes boxing operations to happen? Does the code handle exceptions correctly ? Is package management being used (NuGet) instead of committing DLLs? Does this code use LINQ appropriately? Pulling LINQ into a project to replace a single short loop or in ways that do not perform well are usually not appropriate. Does this code properly validate arguments sanity (i.e. CA1062 )? Consider leveraging extensions such as Ensure.That Does this code include telemetry ( metrics, tracing and logging ) instrumentation? Does this code leverage the options design pattern by using classes to provide strongly typed access to groups of related settings? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why this is here. If the number is repetitive, is there a constant/enum or equivalent? Is proper exception handling set up? Catching the exception base class ( catch (Exception) ) is generally not the right pattern. Instead, catch the specific exceptions that can happen e.g., IOException . Is the use of #pragma fair? Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? If there is an asynchronous method, does the name of the method end with the Async suffix? If a method is asynchronous, is Task.Delay used instead of Thread.Sleep ? Task.Delay is not blocking the current thread and creates a task that will complete without blocking the thread, so in a multi-threaded, multi-task environment, this is the one to prefer. Is a cancellation token for asynchronous tasks needed rather than bool patterns? Is a minimum level of logging in place? Are the logging levels used sensible? Are internal vs private vs public classes and methods used the right way? Are auto property set and get used the right way? In a model without constructor and for deserialization, it is ok to have all accessible. For other classes usually a private set or internal set is better. Is the using pattern for streams and other disposable classes used? If not, better to have the Dispose method called explicitly. Are the classes that maintain collections in memory, thread safe? When used under concurrency, use lock pattern.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/go/","text":"Go Code Reviews Style Guide Developers should follow the Effective Go Style Guide. Code Analysis / Linting Project Setup Below is the project setup that you would like to have in your VS Code. VSCode go Extension Using the Go extension for Visual Studio Code, you get language features like IntelliSense, code navigation, symbol search, bracket matching, snippets, etc. This extension includes rich language support for go in VS Code. go vet go vet is a static analysis tool that checks for common go errors, such as incorrect use of range loop variables or misaligned printf arguments. Go code should be able to build with no go vet errors. This will be part of vscode-go extension. golint Note: The golint library is deprecated and archived. The linter revive (below) might be a suitable replacement. 
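Since the note above flags golint as deprecated and suggests revive as a possible replacement, a minimal .golangci.yml that enables revive alongside the formatting and vet checks discussed later might look like the sketch below. The linter names follow golangci-lint's documented identifiers and should be verified against the installed version.

```yaml
# .golangci.yml at the repository root
run:
  timeout: 5m

linters:
  enable:
    - revive    # successor to the deprecated golint
    - gofmt     # flags files that are not gofmt-formatted
    - govet     # same checks as `go vet`
```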
golint can be an effective tool for finding many issues, but it errors on the side of false positives. It is best used by developers when working on code, not as part of an automated build process. This is the default linter which is set up as part of the vscode-go extension. revive Revive is a linter for go, it provides a framework for development of custom rules, and lets you define a strict preset for enhancing your development & code review processes. Automatic Code Formatting gofmt gofmt is the automated code format style guide for Go. This is part of the vs-code extension, and it is enabled by default to run on save of every file. Aggregator golangci-lint golangci-lint is the replacement for the now deprecated gometalinter . It is 2-7x faster than gometalinter along with a host of other benefits . golangci-lint is a powerful, customizable aggregator of linters. By default, several are enabled but not all. A full list of linters and their usages can be found here . It will allow you to configure each linter and choose which ones you would like to enable in your project. One awesome feature of golangci-lint is that is can be easily introduced to an existing large codebase using the --new-from-rev COMMITID . With this setting only newly introduced issues are flagged, allowing a team to improve new code without having to fix all historic issues in a large codebase. This provides a great path to improving code-reviews on existing solutions. golangci-lint can also be setup as the default linter in VS Code. Installation options for golangci-lint are present at golangci-lint . To use golangci-lint with VS Code, use the below recommended settings: \"go.lintTool\" : \"golangci-lint\" , \"go.lintFlags\" : [ \"--fast\" ] Pre-Commit Hooks All developers should run gofmt in a pre-commit hook to ensure standard formatting. Step 1- Install pre-commit Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew. Step 2- Add go-fmt in pre-commit Add .pre-commit-config.yaml file to root of the go project. Run go-fmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/dnephin/pre-commit-golang rev : master hooks : - id : go-fmt Step 3 Run $ pre-commit install to set up the git hook scripts Build Validation gofmt should be run as a part of every build to enforce the common standard. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will format any scripts in the ./scripts/ folder. - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" govet should be run as a part of every build to check code linting. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will check linting of any scripts in the ./scripts/ folder. - script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\" Alternatively you can use golangci-lint as a step in the pipeline to do multiple enabled validations(including go vet and go fmt) of golangci-lint. 
- script : golangci-lint run --enable gofmt --fix workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\" Sample Build Validation Pipeline in Azure DevOps trigger : master pool : vmImage : 'ubuntu-latest' steps : - task : GoTool@0 inputs : version : '1.13.5' - task : Go@0 inputs : command : 'get' arguments : '-d' workingDirectory : '$(System.DefaultWorkingDirectory)/scripts' - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" - script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : 'Run go vet' - task : Go@0 inputs : command : 'build' workingDirectory : '$(System.DefaultWorkingDirectory)' - task : CopyFiles@2 inputs : TargetFolder : '$(Build.ArtifactStagingDirectory)' - task : PublishBuildArtifacts@1 inputs : artifactName : drop Code Review Checklist The Go language team maintains a list of common Code Review Comments for go that form the basis for a solid checklist for a team working in Go that should be followed in addition to the ISE Code Review Checklist Does this code handle errors correctly? This includes not throwing away errors with _ assignments and returning errors, instead of in-band error values ? Does this code follow Go standards for method receiver types ? Does this code pass values when it should? Are interfaces in this code defined in the correct packages ? Do go-routines in this code have clear lifetimes ? Is parallelism in this code handled via go-routines and channels with synchronous methods ? Does this code have meaningful Doc Comments ? Does this code have meaningful Package Comments ? Does this code use Contexts correctly? Do unit tests fail with meaningful messages ?","title":"Go Code Reviews"},{"location":"code-reviews/recipes/go/#go-code-reviews","text":"","title":"Go Code Reviews"},{"location":"code-reviews/recipes/go/#style-guide","text":"Developers should follow the Effective Go Style Guide.","title":"Style Guide"},{"location":"code-reviews/recipes/go/#code-analysis-linting","text":"","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/go/#project-setup","text":"Below is the project setup that you would like to have in your VS Code.","title":"Project Setup"},{"location":"code-reviews/recipes/go/#vscode-go-extension","text":"Using the Go extension for Visual Studio Code, you get language features like IntelliSense, code navigation, symbol search, bracket matching, snippets, etc. This extension includes rich language support for go in VS Code.","title":"VSCode go Extension"},{"location":"code-reviews/recipes/go/#go-vet","text":"go vet is a static analysis tool that checks for common go errors, such as incorrect use of range loop variables or misaligned printf arguments. Go code should be able to build with no go vet errors. This will be part of vscode-go extension.","title":"go vet"},{"location":"code-reviews/recipes/go/#golint","text":"Note: The golint library is deprecated and archived. The linter revive (below) might be a suitable replacement. golint can be an effective tool for finding many issues, but it errors on the side of false positives. It is best used by developers when working on code, not as part of an automated build process. 
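The sample build validation pipeline above formats, vets, and builds the code but does not run tests. A test step that could be appended is sketched below; the -race and coverage flags are suggestions rather than part of the original pipeline.

```yaml
- script: go test ./... -race -coverprofile=coverage.out
  workingDirectory: $(System.DefaultWorkingDirectory)
  displayName: 'Run unit tests'
```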
This is the default linter which is set up as part of the vscode-go extension.","title":"golint"},{"location":"code-reviews/recipes/go/#revive","text":"Revive is a linter for go, it provides a framework for development of custom rules, and lets you define a strict preset for enhancing your development & code review processes.","title":"revive"},{"location":"code-reviews/recipes/go/#automatic-code-formatting","text":"","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/go/#gofmt","text":"gofmt is the automated code format style guide for Go. This is part of the vs-code extension, and it is enabled by default to run on save of every file.","title":"gofmt"},{"location":"code-reviews/recipes/go/#aggregator","text":"","title":"Aggregator"},{"location":"code-reviews/recipes/go/#golangci-lint","text":"golangci-lint is the replacement for the now deprecated gometalinter . It is 2-7x faster than gometalinter along with a host of other benefits . golangci-lint is a powerful, customizable aggregator of linters. By default, several are enabled but not all. A full list of linters and their usages can be found here . It will allow you to configure each linter and choose which ones you would like to enable in your project. One awesome feature of golangci-lint is that is can be easily introduced to an existing large codebase using the --new-from-rev COMMITID . With this setting only newly introduced issues are flagged, allowing a team to improve new code without having to fix all historic issues in a large codebase. This provides a great path to improving code-reviews on existing solutions. golangci-lint can also be setup as the default linter in VS Code. Installation options for golangci-lint are present at golangci-lint . To use golangci-lint with VS Code, use the below recommended settings: \"go.lintTool\" : \"golangci-lint\" , \"go.lintFlags\" : [ \"--fast\" ]","title":"golangci-lint"},{"location":"code-reviews/recipes/go/#pre-commit-hooks","text":"All developers should run gofmt in a pre-commit hook to ensure standard formatting.","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/go/#step-1-install-pre-commit","text":"Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew.","title":"Step 1- Install pre-commit"},{"location":"code-reviews/recipes/go/#step-2-add-go-fmt-in-pre-commit","text":"Add .pre-commit-config.yaml file to root of the go project. Run go-fmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/dnephin/pre-commit-golang rev : master hooks : - id : go-fmt","title":"Step 2- Add go-fmt in pre-commit"},{"location":"code-reviews/recipes/go/#step-3","text":"Run $ pre-commit install to set up the git hook scripts","title":"Step 3"},{"location":"code-reviews/recipes/go/#build-validation","text":"gofmt should be run as a part of every build to enforce the common standard. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will format any scripts in the ./scripts/ folder. - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" govet should be run as a part of every build to check code linting. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will check linting of any scripts in the ./scripts/ folder. 
- script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\" Alternatively you can use golangci-lint as a step in the pipeline to do multiple enabled validations(including go vet and go fmt) of golangci-lint. - script : golangci-lint run --enable gofmt --fix workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\"","title":"Build Validation"},{"location":"code-reviews/recipes/go/#sample-build-validation-pipeline-in-azure-devops","text":"trigger : master pool : vmImage : 'ubuntu-latest' steps : - task : GoTool@0 inputs : version : '1.13.5' - task : Go@0 inputs : command : 'get' arguments : '-d' workingDirectory : '$(System.DefaultWorkingDirectory)/scripts' - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" - script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : 'Run go vet' - task : Go@0 inputs : command : 'build' workingDirectory : '$(System.DefaultWorkingDirectory)' - task : CopyFiles@2 inputs : TargetFolder : '$(Build.ArtifactStagingDirectory)' - task : PublishBuildArtifacts@1 inputs : artifactName : drop","title":"Sample Build Validation Pipeline in Azure DevOps"},{"location":"code-reviews/recipes/go/#code-review-checklist","text":"The Go language team maintains a list of common Code Review Comments for go that form the basis for a solid checklist for a team working in Go that should be followed in addition to the ISE Code Review Checklist Does this code handle errors correctly? This includes not throwing away errors with _ assignments and returning errors, instead of in-band error values ? Does this code follow Go standards for method receiver types ? Does this code pass values when it should? Are interfaces in this code defined in the correct packages ? Do go-routines in this code have clear lifetimes ? Is parallelism in this code handled via go-routines and channels with synchronous methods ? Does this code have meaningful Doc Comments ? Does this code have meaningful Package Comments ? Does this code use Contexts correctly? Do unit tests fail with meaningful messages ?","title":"Code Review Checklist"},{"location":"code-reviews/recipes/java/","text":"Java Code Reviews Java Style Guide Developers should follow the Google Java Style Guide . Code Analysis / Linting We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers to enforce consistency and style rules. We make use of Checkstyle using the same configuration used in the Azure Java SDK . FindBugs and PMD are also commonly used. Automatic Code Formatting Eclipse, and other Java IDEs, support automatic code formatting. If using Maven, some developers also make use of the formatter-maven-plugin . Build Validation It's important to enforce your code style and rules in the CI to avoid any team members merging code that does not comply with standards into your git repo. If building using Azure DevOps, Azure DevOps support Maven and Gradle build tasks using PMD , Checkstyle , and FindBugs code analysis tools as part of every build. 
Here is an example yaml for a Maven build task with all three analysis tools enabled: - task : Maven@3 displayName : 'Maven pom.xml' inputs : mavenPomFile : '$(Parameters.mavenPOMFile)' checkStyleRunAnalysis : true pmdRunAnalysis : true findBugsRunAnalysis : true Here is an example yaml for a Gradle build task with all three analysis tools enabled: - task : Gradle@2 displayName : 'gradlew build' inputs : checkStyleRunAnalysis : true findBugsRunAnalysis : true pmdRunAnalysis : true Code Review Checklist In addition to the Code Review Checklist you should also look for these Java specific code review items Does the project use Lambda to make code cleaner? Is dependency injection (DI) used? Is it setup correctly? If the code uses Spring Boot, are you using @Inject instead of @Autowire? Does the code handle exceptions correctly? Is the Azul Zulu OpenJDK being used? Is a build automation and package management tool (Gradle or Maven) being used?","title":"Java Code Reviews"},{"location":"code-reviews/recipes/java/#java-code-reviews","text":"","title":"Java Code Reviews"},{"location":"code-reviews/recipes/java/#java-style-guide","text":"Developers should follow the Google Java Style Guide .","title":"Java Style Guide"},{"location":"code-reviews/recipes/java/#code-analysis-linting","text":"We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers to enforce consistency and style rules. We make use of Checkstyle using the same configuration used in the Azure Java SDK . FindBugs and PMD are also commonly used.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/java/#automatic-code-formatting","text":"Eclipse, and other Java IDEs, support automatic code formatting. If using Maven, some developers also make use of the formatter-maven-plugin .","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/java/#build-validation","text":"It's important to enforce your code style and rules in the CI to avoid any team members merging code that does not comply with standards into your git repo. If building using Azure DevOps, Azure DevOps support Maven and Gradle build tasks using PMD , Checkstyle , and FindBugs code analysis tools as part of every build. Here is an example yaml for a Maven build task with all three analysis tools enabled: - task : Maven@3 displayName : 'Maven pom.xml' inputs : mavenPomFile : '$(Parameters.mavenPOMFile)' checkStyleRunAnalysis : true pmdRunAnalysis : true findBugsRunAnalysis : true Here is an example yaml for a Gradle build task with all three analysis tools enabled: - task : Gradle@2 displayName : 'gradlew build' inputs : checkStyleRunAnalysis : true findBugsRunAnalysis : true pmdRunAnalysis : true","title":"Build Validation"},{"location":"code-reviews/recipes/java/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these Java specific code review items Does the project use Lambda to make code cleaner? Is dependency injection (DI) used? Is it setup correctly? If the code uses Spring Boot, are you using @Inject instead of @Autowire? Does the code handle exceptions correctly? Is the Azul Zulu OpenJDK being used? Is a build automation and package management tool (Gradle or Maven) being used?","title":"Code Review Checklist"},{"location":"code-reviews/recipes/javascript-and-typescript/","text":"JavaScript/TypeScript Code Reviews Style Guide Developers should use prettier to do code formatting for JavaScript. 
Using an automated code formatting tool like Prettier enforces a well-accepted style guide that was collaboratively built by a wide range of companies including Microsoft, Facebook, and AirBnB. For higher-level style guidance not covered by prettier, we follow the AirBnB Style Guide . Code Analysis / Linting eslint Per guidance outlined in Palantir's 2019 TSLint road map , TypeScript code should be linted with ESLint . See the typescript-eslint documentation for more information around linting TypeScript code with ESLint. To install and configure linting with ESLint , install the following packages as dev-dependencies: npm install -D eslint @typescript-eslint/parser @typescript-eslint/eslint-plugin Add a .eslintrc.js to the root of your project: module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , ], }; Add the following to the scripts of your package.json : \"scripts\" : { \"lint\" : \"eslint . --ext .js,.jsx,.ts,.tsx --ignore-path .gitignore\" } This will lint all .js , .jsx , .ts , .tsx files in your project and omit any files or directories specified in your .gitignore . You can run linting with: npm run lint Setting up Prettier Prettier is an opinionated code formatter. Getting started guide . Install with npm as a dev-dependency: npm install -D prettier eslint-config-prettier eslint-plugin-prettier Add prettier to your .eslintrc.js : module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , 'prettier/@typescript-eslint' , 'plugin:prettier/recommended' , ], }; This will apply the prettier rule set when linting with ESLint. Auto Formatting with VSCode VS Code can be configured to automatically perform eslint --fix on save. Create a .vscode folder in the root of your project and add the following to your .vscode/settings.json : { \"editor.codeActionsOnSave\" : { \"source.fixAll.eslint\" : true }, } The following overrides should be added to the VS Code configuration to standardize on single quotes, a four-space tab width, and ESLint integration: { \"prettier.singleQuote\" : true , \"prettier.eslintIntegration\" : true , \"prettier.tabWidth\" : 4 } Setting Up Testing Playwright is highly recommended to be set up within a project. It is an open-source testing suite created by Microsoft. To install it, use this command: npm install playwright Since Playwright runs the tests in a browser, you have to choose which browser you want it to run unless you are using Chromium, which is the default. You can do this by specifying the browser in the Playwright configuration or on the command line when running the tests. Build Validation To automate this process in Azure DevOps you can add the following snippet to your pipeline definition yaml file. This will lint any scripts in the ./scripts/ folder. - task : Npm@1 displayName : 'Lint' inputs : command : 'custom' customCommand : 'run lint' workingDir : './scripts/' Pre-Commit Hooks All developers should run eslint in a pre-commit hook to ensure standard formatting. We highly recommend using an editor integration like vscode-eslint to provide real-time feedback.
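The Setting Up Testing section above stops short of showing how to select a browser or run Playwright in CI. The sketch below assumes the @playwright/test runner is in use and that no custom projects are configured, in which case the --browser flag selects the browser (Chromium being the default); both the flag and the chosen browser are assumptions to adapt to your setup.

```yaml
- script: npm ci
  displayName: 'Install dependencies'

- script: npx playwright install --with-deps firefox
  displayName: 'Install Playwright browsers'

- script: npx playwright test --browser=firefox
  displayName: 'Run Playwright tests'
```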
Under .git/hooks rename pre-commit.sample to pre-commit Remove the existing sample code in that file There are many examples of scripts for this on gist, like pre-commit-eslint Modify accordingly to include TypeScript files (include ts extension and make sure typescript-eslint is set up) Make the file executable: chmod +x .git/hooks/pre-commit As an alternative husky can be considered to simplify pre-commit hooks. Code Review Checklist In addition to the Code Review Checklist you should also look for these JavaScript and TypeScript specific code review items. Javascript / Typescript Checklist Does the code stick to our formatting and code standards? Does running prettier and ESLint over the code should yield no warnings or errors respectively? Does the change re-implement code that would be better served by pulling in a well known module from the ecosystem? Is \"use strict\"; used to reduce errors with undeclared variables? Are unit tests used where possible, also for APIs? Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? Are best practices for error handling followed, as well as try catch finally statements? Are the doWork().then(doSomething).then(checkSomething) properly followed for async calls, including expect , done ? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why it is there. If the number is repetitive, is there a constant/enum or equivalent? If there is an asynchronous method, does the name of the method end with the Async suffix? Is a minimum level of logging in place? Are the logging levels used sensible? Is document fragment manipulation limited to when you need to manipulate multiple sub elements? Does TypeScript code compile without raising linting errors? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why it is there. If the number is repetitive, is there a constant/enum or equivalent? Is there a proper /* */ in the various classes and methods? Are heavy operations implemented in the backend, leaving the controller as thin as possible? Is event handling on the html efficiently done?","title":"JavaScript/TypeScript Code Reviews"},{"location":"code-reviews/recipes/javascript-and-typescript/#javascripttypescript-code-reviews","text":"","title":"JavaScript/TypeScript Code Reviews"},{"location":"code-reviews/recipes/javascript-and-typescript/#style-guide","text":"Developers should use prettier to do code formatting for JavaScript. Using an automated code formatting tool like Prettier enforces a well accepted style guide that was collaboratively built by a wide range of companies including Microsoft, Facebook, and AirBnB. For higher level style guidance not covered by prettier, we follow the AirBnB Style Guide .","title":"Style Guide"},{"location":"code-reviews/recipes/javascript-and-typescript/#code-analysis-linting","text":"","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/javascript-and-typescript/#eslint","text":"Per guidance outlined in Palantir's 2019 TSLint road map , TypeScript code should be linted with ESLint . 
See the typescript-eslint documentation for more information around linting TypeScript code with ESLint. To install and configure linting with ESLint , install the following packages as dev-dependencies: npm install -D eslint @typescript-eslint/parser @typescript-eslint/eslint-plugin Add a .eslintrc.js to the root of your project: module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , ], }; Add the following to the scripts of your package.json : \"scripts\" : { \"lint\" : \"eslint . --ext .js,.jsx,.ts,.tsx --ignore-path .gitignore\" } This will lint all .js , .jsx , .ts , .tsx files in your project and omit any files or directories specified in your .gitignore . You can run linting with: npm run lint","title":"eslint"},{"location":"code-reviews/recipes/javascript-and-typescript/#setting-up-prettier","text":"Prettier is an opinionated code formatter. Getting started guide . Install with npm as a dev-dependency: npm install -D prettier eslint-config-prettier eslint-plugin-prettier Add prettier to your .eslintrc.js : module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , 'prettier/@typescript-eslint' , 'plugin:prettier/recommended' , ], }; This will apply the prettier rule set when linting with ESLint.","title":"Setting up Prettier"},{"location":"code-reviews/recipes/javascript-and-typescript/#auto-formatting-with-vscode","text":"VS Code can be configured to automatically perform eslint --fix on save. Create a .vscode folder in the root of your project and add the following to your .vscode/settings.json : { \"editor.codeActionsOnSave\" : { \"source.fixAll.eslint\" : true }, } By default, we use the following overrides should be added to the VS Code configuration to standardize on single quotes, a four space drop, and to do ESLinting: { \"prettier.singleQuote\" : true , \"prettier.eslintIntegration\" : true , \"prettier.tabWidth\" : 4 }","title":"Auto Formatting with VSCode"},{"location":"code-reviews/recipes/javascript-and-typescript/#setting-up-testing","text":"Playwright is highly recommended to be set up within a project. its an open source testing suite created by Microsoft. To install it use this command: npm install playwright Since playwright shows the tests in the browser you have to choose which browser you want it to run if unless using chrome, which is the default. You can do this by","title":"Setting Up Testing"},{"location":"code-reviews/recipes/javascript-and-typescript/#build-validation","text":"To automate this process in Azure Devops you can add the following snippet to your pipeline definition yaml file. This will lint any scripts in the ./scripts/ folder. - task : Npm@1 displayName : 'Lint' inputs : command : 'custom' customCommand : 'run lint' workingDir : './scripts/'","title":"Build Validation"},{"location":"code-reviews/recipes/javascript-and-typescript/#pre-commit-hooks","text":"All developers should run eslint in a pre-commit hook to ensure standard formatting. We highly recommend using an editor integration like vscode-eslint to provide realtime feedback. 
Under .git/hooks rename pre-commit.sample to pre-commit Remove the existing sample code in that file There are many examples of scripts for this on gist, like pre-commit-eslint Modify accordingly to include TypeScript files (include ts extension and make sure typescript-eslint is set up) Make the file executable: chmod +x .git/hooks/pre-commit As an alternative husky can be considered to simplify pre-commit hooks.","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/javascript-and-typescript/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these JavaScript and TypeScript specific code review items.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/javascript-and-typescript/#javascript-typescript-checklist","text":"Does the code stick to our formatting and code standards? Does running prettier and ESLint over the code should yield no warnings or errors respectively? Does the change re-implement code that would be better served by pulling in a well known module from the ecosystem? Is \"use strict\"; used to reduce errors with undeclared variables? Are unit tests used where possible, also for APIs? Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? Are best practices for error handling followed, as well as try catch finally statements? Are the doWork().then(doSomething).then(checkSomething) properly followed for async calls, including expect , done ? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why it is there. If the number is repetitive, is there a constant/enum or equivalent? If there is an asynchronous method, does the name of the method end with the Async suffix? Is a minimum level of logging in place? Are the logging levels used sensible? Is document fragment manipulation limited to when you need to manipulate multiple sub elements? Does TypeScript code compile without raising linting errors? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why it is there. If the number is repetitive, is there a constant/enum or equivalent? Is there a proper /* */ in the various classes and methods? Are heavy operations implemented in the backend, leaving the controller as thin as possible? Is event handling on the html efficiently done?","title":"Javascript / Typescript Checklist"},{"location":"code-reviews/recipes/markdown/","text":"Markdown Code Reviews Style Guide Developers should treat documentation like other source code and follow the same rules and checklists when reviewing documentation as code. Documentation should both use good Markdown syntax to ensure it's properly parsed, and follow good writing style guidelines to ensure the document is easy to read and understand. Markdown Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world\u2019s most popular markup languages. Using Markdown is different from using a WYSIWYG editor. 
In an application like Microsoft Word, you click buttons to format words and phrases, and the changes are visible immediately. Markdown isn\u2019t like that. When you create a Markdown-formatted file, you add Markdown syntax to the text to indicate which words and phrases should look different. You can find more information and full documentation here . Linters Markdown has specific way of being formatted. It is important to respect this formatting, otherwise some interpreters which are strict won't properly display the document. Linters are often used to help developers properly create documents by both verifying proper Markdown syntax, grammar and proper English language. A good setup includes a markdown linter used during editing and PR build verification, and a grammar linter used while editing the document. The following are a list of linters that could be used in this setup. markdownlint markdownlint is a linter for markdown that verifies Markdown syntax, and also enforces rules that make the text more readable. Markdownlint-cli is an easy-to-use CLI based on Markdownlint. It's available as a ruby gem , an npm package , a Node.js CLI and a VS Code extension . The VS Code extension Prettier also catches all markdownlint errors. Installing the Node.js CLI npm install -g markdownlint-cli Running markdownlint on a Node.js project markdownlint **/*.md --ignore node_modules Fixing errors automatically markdownlint **/*.md --ignore node_modules --fix A comprehensive list of markdownlint rules is available here . write-good write-good is a linter for English text that helps writing better documentation. npm install -g write-good Run write-good write-good *.md Run write-good without installing it npx write-good *.md Write Good is also available as an extension for VS Code VSCode Extensions Write Good Linter The Write Good Linter Extension integrates with VS Code to give grammar and language advice while editing the document. markdownlint Extension The markdownlint extension examines the Markdown documents, showing warnings for rule violations while editing. Build Validation Linting To automate linting with markdownlint for PR validation in GitHub actions, you can either use linters aggregator as we do with MegaLinter in this repository or use the following YAML. name : Markdownlint on : push : paths : - \"**/*.md\" pull_request : paths : - \"**/*.md\" jobs : lint : runs-on : ubuntu-latest steps : - uses : actions/checkout@v2 - name : Use Node.js uses : actions/setup-node@v1 with : node-version : 12.x - name : Run Markdownlint run : | npm i -g markdownlint-cli markdownlint \"**/*.md\" --ignore node_modules Checking Links To automate link check in your markdown files add markdown-link-check action to your validation pipeline: markdown-link-check : runs-on : ubuntu-latest steps : - uses : actions/checkout@master - uses : gaurav-nelson/github-action-markdown-link-check@v1 More information about markdown-link-check action options can be found at markdown-link-check home page Code Review Checklist In addition to the Code Review Checklist you should also look for these documentation specific code review items Is the document easy to read and understand and does it follow good writing guidelines ? Is there a single source of truth or is content repeated in more than one document? Is the documentation up to date with the code? Is the documentation technically, and ethically correct? Writing Style Guidelines The following are some examples of writing style guidelines. 
Agree in your team which guidelines you should apply to your project documentation. Save your guidelines together with your documentation, so they are easy to refer back to. Wording Use inclusive language, and avoid jargon and uncommon words. The docs should be easy to understand Be clear and concise, stick to the goal of the document Use active voice Spell check and grammar check the text Always follow chronological order Visit Plain English for tips on how to write documentation that is easy to understand. Document Organization Organize documents by topic rather than type, this makes it easier to find the documentation Each folder should have a top-level README.md and any other documents within that folder should link directly or indirectly from that README.md Document names with more than one word should use underscores instead of spaces, for example machine_learning_pipeline_design.md . The same applies to images Headings Start with a H1 (single # in markdown) and respect the order H1 > H2 > H3 etc Follow each heading with text before proceeding with the next heading Avoid putting numbers in headings. Numbers shift, and can create outdated titles Avoid using symbols and special characters in headers, this causes problems with anchor links Avoid links in headers Resources Avoid duplication of content, instead link to the single source of truth Link but don't summarize. Summarizing content on another page leads to the content living in two places Use meaningful anchor texts, e.g. instead of writing Follow the instructions [here](../recipes/markdown.md) write Follow the [Markdown guidelines](../recipes/markdown.md) Make sure links to Microsoft docs do not contain the language marker /en-us/ or /fr-fr/ , as this is automatically determined by the site itself. Lists List items should start with capital letters if possible Use ordered lists when the items describe a sequence to follow, otherwise use unordered lists For ordered lists, prefix each item with 1. When rendered, the list items will appear with sequential numbering. This avoids number-gaps in list Do not add commas , or semicolons ; to the end of list items, and avoid periods . 
unless the list item represents a complete sentence Images Place images in a separate directory named img Name images appropriately, avoiding generic names like screenshot.png Avoid adding large images or videos to source control, link to an external location instead Emphasis and Special Sections Use bold or italic to emphasize For sections that everyone reading this document needs to be aware of, use blocks Use backticks for code, a single backtick for inline code like pip install flake8 and 3 backticks for code blocks followed by the language for syntax highlighting def add ( num1 : int , num2 : int ): return num1 + num2 Use check boxes for task lists Item 1 Item 2 Item 3 Add a References section to the end of the document with links to external references Prefer tables to lists for comparisons and reports to make research and results more readable Option Pros Cons Option 1 Some pros Some cons Option 2 Some pros Some cons General Always use Markdown syntax, don't mix with HTML Make sure the extension of the files is .md - if the extension is missing, a linter might ignore the files","title":"Markdown Code Reviews"},{"location":"code-reviews/recipes/markdown/#markdown-code-reviews","text":"","title":"Markdown Code Reviews"},{"location":"code-reviews/recipes/markdown/#style-guide","text":"Developers should treat documentation like other source code and follow the same rules and checklists when reviewing documentation as code. Documentation should both use good Markdown syntax to ensure it's properly parsed, and follow good writing style guidelines to ensure the document is easy to read and understand.","title":"Style Guide"},{"location":"code-reviews/recipes/markdown/#markdown","text":"Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world\u2019s most popular markup languages. Using Markdown is different from using a WYSIWYG editor. In an application like Microsoft Word, you click buttons to format words and phrases, and the changes are visible immediately. Markdown isn\u2019t like that. When you create a Markdown-formatted file, you add Markdown syntax to the text to indicate which words and phrases should look different. You can find more information and full documentation here .","title":"Markdown"},{"location":"code-reviews/recipes/markdown/#linters","text":"Markdown has specific way of being formatted. It is important to respect this formatting, otherwise some interpreters which are strict won't properly display the document. Linters are often used to help developers properly create documents by both verifying proper Markdown syntax, grammar and proper English language. A good setup includes a markdown linter used during editing and PR build verification, and a grammar linter used while editing the document. The following are a list of linters that could be used in this setup.","title":"Linters"},{"location":"code-reviews/recipes/markdown/#markdownlint","text":"markdownlint is a linter for markdown that verifies Markdown syntax, and also enforces rules that make the text more readable. Markdownlint-cli is an easy-to-use CLI based on Markdownlint. It's available as a ruby gem , an npm package , a Node.js CLI and a VS Code extension . The VS Code extension Prettier also catches all markdownlint errors. 
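If the default rule set is too strict or too loose for a repository, markdownlint reads rule overrides from a .markdownlint.json file at the repository root; the specific overrides below (line length and inline HTML) are illustrative, not a recommendation.

```json
{
  "default": true,
  "MD013": { "line_length": 120 },
  "MD033": false
}
```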
Installing the Node.js CLI npm install -g markdownlint-cli Running markdownlint on a Node.js project markdownlint **/*.md --ignore node_modules Fixing errors automatically markdownlint **/*.md --ignore node_modules --fix A comprehensive list of markdownlint rules is available here .","title":"markdownlint"},{"location":"code-reviews/recipes/markdown/#write-good","text":"write-good is a linter for English text that helps writing better documentation. npm install -g write-good Run write-good write-good *.md Run write-good without installing it npx write-good *.md Write Good is also available as an extension for VS Code","title":"write-good"},{"location":"code-reviews/recipes/markdown/#vscode-extensions","text":"","title":"VSCode Extensions"},{"location":"code-reviews/recipes/markdown/#write-good-linter","text":"The Write Good Linter Extension integrates with VS Code to give grammar and language advice while editing the document.","title":"Write Good Linter"},{"location":"code-reviews/recipes/markdown/#markdownlint-extension","text":"The markdownlint extension examines the Markdown documents, showing warnings for rule violations while editing.","title":"markdownlint Extension"},{"location":"code-reviews/recipes/markdown/#build-validation","text":"","title":"Build Validation"},{"location":"code-reviews/recipes/markdown/#linting","text":"To automate linting with markdownlint for PR validation in GitHub actions, you can either use linters aggregator as we do with MegaLinter in this repository or use the following YAML. name : Markdownlint on : push : paths : - \"**/*.md\" pull_request : paths : - \"**/*.md\" jobs : lint : runs-on : ubuntu-latest steps : - uses : actions/checkout@v2 - name : Use Node.js uses : actions/setup-node@v1 with : node-version : 12.x - name : Run Markdownlint run : | npm i -g markdownlint-cli markdownlint \"**/*.md\" --ignore node_modules","title":"Linting"},{"location":"code-reviews/recipes/markdown/#checking-links","text":"To automate link check in your markdown files add markdown-link-check action to your validation pipeline: markdown-link-check : runs-on : ubuntu-latest steps : - uses : actions/checkout@master - uses : gaurav-nelson/github-action-markdown-link-check@v1 More information about markdown-link-check action options can be found at markdown-link-check home page","title":"Checking Links"},{"location":"code-reviews/recipes/markdown/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these documentation specific code review items Is the document easy to read and understand and does it follow good writing guidelines ? Is there a single source of truth or is content repeated in more than one document? Is the documentation up to date with the code? Is the documentation technically, and ethically correct?","title":"Code Review Checklist"},{"location":"code-reviews/recipes/markdown/#writing-style-guidelines","text":"The following are some examples of writing style guidelines. Agree in your team which guidelines you should apply to your project documentation. Save your guidelines together with your documentation, so they are easy to refer back to.","title":"Writing Style Guidelines"},{"location":"code-reviews/recipes/markdown/#wording","text":"Use inclusive language, and avoid jargon and uncommon words. 
The docs should be easy to understand Be clear and concise, stick to the goal of the document Use active voice Spell check and grammar check the text Always follow chronological order Visit Plain English for tips on how to write documentation that is easy to understand.","title":"Wording"},{"location":"code-reviews/recipes/markdown/#document-organization","text":"Organize documents by topic rather than type, this makes it easier to find the documentation Each folder should have a top-level README.md and any other documents within that folder should link directly or indirectly from that README.md Document names with more than one word should use underscores instead of spaces, for example machine_learning_pipeline_design.md . The same applies to images","title":"Document Organization"},{"location":"code-reviews/recipes/markdown/#headings","text":"Start with a H1 (single # in markdown) and respect the order H1 > H2 > H3 etc Follow each heading with text before proceeding with the next heading Avoid putting numbers in headings. Numbers shift, and can create outdated titles Avoid using symbols and special characters in headers, this causes problems with anchor links Avoid links in headers","title":"Headings"},{"location":"code-reviews/recipes/markdown/#resources","text":"Avoid duplication of content, instead link to the single source of truth Link but don't summarize. Summarizing content on another page leads to the content living in two places Use meaningful anchor texts, e.g. instead of writing Follow the instructions [here](../recipes/markdown.md) write Follow the [Markdown guidelines](../recipes/markdown.md) Make sure links to Microsoft docs do not contain the language marker /en-us/ or /fr-fr/ , as this is automatically determined by the site itself.","title":"Resources"},{"location":"code-reviews/recipes/markdown/#lists","text":"List items should start with capital letters if possible Use ordered lists when the items describe a sequence to follow, otherwise use unordered lists For ordered lists, prefix each item with 1. When rendered, the list items will appear with sequential numbering. This avoids number-gaps in list Do not add commas , or semicolons ; to the end of list items, and avoid periods . 
unless the list item represents a complete sentence","title":"Lists"},{"location":"code-reviews/recipes/markdown/#images","text":"Place images in a separate directory named img Name images appropriately, avoiding generic names like screenshot.png Avoid adding large images or videos to source control, link to an external location instead","title":"Images"},{"location":"code-reviews/recipes/markdown/#emphasis-and-special-sections","text":"Use bold or italic to emphasize For sections that everyone reading this document needs to be aware of, use blocks Use backticks for code, a single backtick for inline code like pip install flake8 and 3 backticks for code blocks followed by the language for syntax highlighting def add ( num1 : int , num2 : int ): return num1 + num2 Use check boxes for task lists Item 1 Item 2 Item 3 Add a References section to the end of the document with links to external references Prefer tables to lists for comparisons and reports to make research and results more readable Option Pros Cons Option 1 Some pros Some cons Option 2 Some pros Some cons","title":"Emphasis and Special Sections"},{"location":"code-reviews/recipes/markdown/#general","text":"Always use Markdown syntax, don't mix with HTML Make sure the extension of the files is .md - if the extension is missing, a linter might ignore the files","title":"General"},{"location":"code-reviews/recipes/python/","text":"Python Code Reviews Style Guide Developers should follow the PEP8 style guide with type hints . The use of type hints throughout paired with linting and type hint checking avoids common errors that are tricky to debug. Projects should check Python code with automated tools. Linting should be added to build validation, and both linting and code formatting can be added to your pre-commit hooks and VS Code. Code Analysis / Linting The 2 most popular python linters are Pylint and Flake8 . Both check adherence to PEP8 but vary a bit in what other rules they check. In general Pylint tends to be a bit more stringent and give more false positives but both are good options for linting python code. Both Pylint and Flake8 can be configured in VS Code using the VS Code python extension . Flake8 Flake8 is a simple and fast wrapper around Pyflakes (for detecting coding errors) and pycodestyle (for pep8). Install Flake8 pip install flake8 Add an extension for the pydocstyle (for doc strings ) tool to flake8. pip install flake8-docstrings Add an extension for pep8-naming (for naming conventions in pep8) tool to flake8. pip install pep8-naming Run Flake8 flake8 . # lint the whole project Pylint Install Pylint pip install pylint Run Pylint pylint src # lint the source directory Automatic Code Formatting Black Black is an unapologetic code formatting tool. It removes all need from pycodestyle nagging about formatting, so the team can focus on content vs style. It's not possible to configure black for your own style needs. pip install black Format python code black [ file/folder ] autopep8 Autopep8 is more lenient and allows more configuration if you want less stringent formatting. pip install autopep8 Format python code autopep8 [ file/folder ] --in-place yapf yapf Yet Another Python Formatter is a python formatter from Google based on ideas from gofmt. This is also more configurable, and a good option for automatic code formatting. 
pip install yapf Format python code yapf [ file/folder ] --in-place Bandit Bandit is a tool designed by the Python Code Quality Authority (PyCQA) to perform static analysis of Python code, specifically targeting security issues. It scans for common security issues in Python codebase. Installation : Add Bandit to your development environment with: pip install bandit VSCode Extensions Python The Python language extension is the base extension you should have installed for python development with VS Code. It enables intellisense, debugging, linting (with the above linters), testing with pytest or unittest, and code formatting with the formatters mentioned above. Pyright The Pyright extension augments VS Code with static type checking when you use type hints def add ( first_value : int , second_value : int ) -> int : return first_value + second_value Build Validation To automate linting with flake8 and testing with pytest in Azure Devops you can add the following snippet to you azure-pipelines.yaml file. trigger : branches : include : - develop - master paths : include : - src/* pool : vmImage : 'ubuntu-latest' jobs : - job : LintAndTest displayName : Lint and Test steps : - checkout : self lfs : true - task : UsePythonVersion@0 displayName : 'Set Python version to 3.6' inputs : versionSpec : '3.6' - script : pip3 install --user -r requirements.txt displayName : 'Install dependencies' - script : | # Install Flake8 pip3 install --user flake8 # Install PyTest pip3 install --user pytest displayName : 'Install Flake8 and PyTest' - script : | python3 -m flake8 displayName : 'Run Flake8 linter' - script : | # Run PyTest tester python3 -m pytest --junitxml=./test-results.xml displayName : 'Run PyTest Tester' - task : PublishTestResults@2 displayName : 'Publish PyTest results' condition : succeededOrFailed() inputs : testResultsFiles : '**/test-*.xml' testRunTitle : 'Publish test results for Python $(python.version)' To perform a PR validation on GitHub you can use a similar YAML configuration with GitHub Actions Pre-Commit Hooks Pre-commit hooks allow you to format and lint code locally before submitting the pull request. Adding pre-commit hooks for your python repository is easy using the pre-commit package Install pre-commit and add to the requirements.txt pip install pre-commit Add a .pre-commit-config.yaml file in the root of the repository, with the desired pre-commit actions repos : - repo : https://github.com/ambv/black rev : stable hooks : - id : black language_version : python3.6 - repo : https://github.com/pre-commit/pre-commit-hooks rev : v1.2.3 hooks : - id : flake8 Each individual developer that wants to set up pre-commit hooks can then run pre-commit install At the next attempted commit any lint failures will block the commit. Note: Installing pre-commit hooks is voluntary and done by each developer individually. Thus, it's not a replacement for build validation on the server Code Review Checklist In addition to the Code Review Checklist you should also look for these python specific code review items Are all new packages used included in requirements.txt Does the code pass all lint checks? Do functions use type hints, and are there any type hint errors? Is the code readable and using pythonic constructs wherever possible.","title":"Python Code Reviews"},{"location":"code-reviews/recipes/python/#python-code-reviews","text":"","title":"Python Code Reviews"},{"location":"code-reviews/recipes/python/#style-guide","text":"Developers should follow the PEP8 style guide with type hints . 
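As a small illustration of that style, the hypothetical function below uses snake_case naming, a docstring, and type hints on its signature; it is a sketch, not project code.

```python
from typing import Sequence


def mean_latency_ms(samples: Sequence[float]) -> float:
    """Return the arithmetic mean of the latency samples, in milliseconds."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(samples) / len(samples)
```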
The use of type hints throughout paired with linting and type hint checking avoids common errors that are tricky to debug. Projects should check Python code with automated tools. Linting should be added to build validation, and both linting and code formatting can be added to your pre-commit hooks and VS Code.","title":"Style Guide"},{"location":"code-reviews/recipes/python/#code-analysis-linting","text":"The 2 most popular python linters are Pylint and Flake8 . Both check adherence to PEP8 but vary a bit in what other rules they check. In general Pylint tends to be a bit more stringent and give more false positives but both are good options for linting python code. Both Pylint and Flake8 can be configured in VS Code using the VS Code python extension .","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/python/#flake8","text":"Flake8 is a simple and fast wrapper around Pyflakes (for detecting coding errors) and pycodestyle (for pep8). Install Flake8 pip install flake8 Add an extension for the pydocstyle (for doc strings ) tool to flake8. pip install flake8-docstrings Add an extension for pep8-naming (for naming conventions in pep8) tool to flake8. pip install pep8-naming Run Flake8 flake8 . # lint the whole project","title":"Flake8"},{"location":"code-reviews/recipes/python/#pylint","text":"Install Pylint pip install pylint Run Pylint pylint src # lint the source directory","title":"Pylint"},{"location":"code-reviews/recipes/python/#automatic-code-formatting","text":"","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/python/#black","text":"Black is an unapologetic code formatting tool. It removes all need from pycodestyle nagging about formatting, so the team can focus on content vs style. It's not possible to configure black for your own style needs. pip install black Format python code black [ file/folder ]","title":"Black"},{"location":"code-reviews/recipes/python/#autopep8","text":"Autopep8 is more lenient and allows more configuration if you want less stringent formatting. pip install autopep8 Format python code autopep8 [ file/folder ] --in-place","title":"autopep8"},{"location":"code-reviews/recipes/python/#yapf","text":"yapf Yet Another Python Formatter is a python formatter from Google based on ideas from gofmt. This is also more configurable, and a good option for automatic code formatting. pip install yapf Format python code yapf [ file/folder ] --in-place","title":"yapf"},{"location":"code-reviews/recipes/python/#bandit","text":"Bandit is a tool designed by the Python Code Quality Authority (PyCQA) to perform static analysis of Python code, specifically targeting security issues. It scans for common security issues in Python codebase. Installation : Add Bandit to your development environment with: pip install bandit","title":"Bandit"},{"location":"code-reviews/recipes/python/#vscode-extensions","text":"","title":"VSCode Extensions"},{"location":"code-reviews/recipes/python/#python","text":"The Python language extension is the base extension you should have installed for python development with VS Code. 
It enables intellisense, debugging, linting (with the above linters), testing with pytest or unittest, and code formatting with the formatters mentioned above.","title":"Python"},{"location":"code-reviews/recipes/python/#pyright","text":"The Pyright extension augments VS Code with static type checking when you use type hints def add ( first_value : int , second_value : int ) -> int : return first_value + second_value","title":"Pyright"},{"location":"code-reviews/recipes/python/#build-validation","text":"To automate linting with flake8 and testing with pytest in Azure Devops you can add the following snippet to you azure-pipelines.yaml file. trigger : branches : include : - develop - master paths : include : - src/* pool : vmImage : 'ubuntu-latest' jobs : - job : LintAndTest displayName : Lint and Test steps : - checkout : self lfs : true - task : UsePythonVersion@0 displayName : 'Set Python version to 3.6' inputs : versionSpec : '3.6' - script : pip3 install --user -r requirements.txt displayName : 'Install dependencies' - script : | # Install Flake8 pip3 install --user flake8 # Install PyTest pip3 install --user pytest displayName : 'Install Flake8 and PyTest' - script : | python3 -m flake8 displayName : 'Run Flake8 linter' - script : | # Run PyTest tester python3 -m pytest --junitxml=./test-results.xml displayName : 'Run PyTest Tester' - task : PublishTestResults@2 displayName : 'Publish PyTest results' condition : succeededOrFailed() inputs : testResultsFiles : '**/test-*.xml' testRunTitle : 'Publish test results for Python $(python.version)' To perform a PR validation on GitHub you can use a similar YAML configuration with GitHub Actions","title":"Build Validation"},{"location":"code-reviews/recipes/python/#pre-commit-hooks","text":"Pre-commit hooks allow you to format and lint code locally before submitting the pull request. Adding pre-commit hooks for your python repository is easy using the pre-commit package Install pre-commit and add to the requirements.txt pip install pre-commit Add a .pre-commit-config.yaml file in the root of the repository, with the desired pre-commit actions repos : - repo : https://github.com/ambv/black rev : stable hooks : - id : black language_version : python3.6 - repo : https://github.com/pre-commit/pre-commit-hooks rev : v1.2.3 hooks : - id : flake8 Each individual developer that wants to set up pre-commit hooks can then run pre-commit install At the next attempted commit any lint failures will block the commit. Note: Installing pre-commit hooks is voluntary and done by each developer individually. Thus, it's not a replacement for build validation on the server","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/python/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these python specific code review items Are all new packages used included in requirements.txt Does the code pass all lint checks? Do functions use type hints, and are there any type hint errors? Is the code readable and using pythonic constructs wherever possible.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/terraform/","text":"Terraform Code Reviews Style Guide Developers should follow the terraform style guide . Projects should check Terraform scripts with automated tools. Code Analysis / Linting TFLint TFLint is a Terraform linter focused on possible errors, best practices, etc. Once TFLint installed in the environment, it can be invoked using the VS Code terraform extension . 
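TFLint behaviour can be tuned with a .tflint.hcl file in the repository root; the plugin source, version, and rule shown here are illustrative and should be adjusted to the provider and conventions your team has agreed on.

```hcl
# .tflint.hcl - illustrative configuration; pin the plugin version your team has validated.
plugin "azurerm" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-azurerm"
}

rule "terraform_naming_convention" {
  enabled = true
}
```

Checking this file into the repository keeps local runs and build validation on the same rule set.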
VSCode Extensions The following VS Code extensions are widely used. Terraform extension This extension provides syntax highlighting, linting, formatting and validation capabilities. Azure Terraform extension This extension provides Terraform command support, resource graph visualization and CloudShell integration inside VS Code. Build Validation Ensure you enforce the style guides during build. The following example script can be used to install terraform, and a linter that then checks for formatting and common errors. #! /bin/bash set -e SCRIPT_DIR = $( dirname \" $BASH_SOURCE \" ) cd \" $SCRIPT_DIR \" TF_VERSION = 0 .12.4 TF_LINT_VERSION = 0 .9.1 echo -e \"\\n\\n>>> Installing Terraform 0.12\" # Install terraform tooling for linting terraform wget -q https://releases.hashicorp.com/terraform/ ${ TF_VERSION } /terraform_ ${ TF_VERSION } _linux_amd64.zip -O /tmp/terraform.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/terraform.zip echo \"\" echo -e \"\\n\\n>>> Install tflint (3rd party)\" wget -q https://github.com/wata727/tflint/releases/download/v ${ TF_LINT_VERSION } /tflint_linux_amd64.zip -O /tmp/tflint.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/tflint.zip echo -e \"\\n\\n>>> Terraform version\" terraform -version echo -e \"\\n\\n>>> Terraform Format (if this fails use 'terraform fmt -recursive' command to resolve\" terraform fmt -recursive -diff -check echo -e \"\\n\\n>>> tflint\" tflint echo -e \"\\n\\n>>> Terraform init\" terraform init echo -e \"\\n\\n>>> Terraform validate\" terraform validate Code Review Checklist In addition to the Code Review Checklist you should also look for these Terraform specific code review items Providers Are all providers used in the terraform scripts versioned to prevent breaking changes in the future? Repository Organization The code split into reusable modules? Modules are split into separate .tf files where appropriate? The repository contains a README.md describing the architecture provisioned? If Terraform code is mixed with application source code, the Terraform code isolated into a dedicated folder? Terraform State The Terraform project configured using Azure Storage as remote state backend? The remote state backend storage account key stored a secure location (e.g. Azure Key Vault)? The project is configured to use state files based on the environment, and the deployment pipeline is configured to supply the state file name dynamically? Variables If the infrastructure will be different depending on the environment (e.g. Dev, UAT, Production), the environment specific parameters are supplied via a .tfvars file? All variables have type information. E.g. a list(string) or string . All variables have a description stating the purpose of the variable and its usage. default values are not supplied for variables which must be supplied by a user. Testing Unit and integration tests covering the Terraform code exist (e.g. Terratest , terratest-abstraction )? Naming and Code Structure Resource definitions and data sources are used correctly in the Terraform scripts? resource: Indicates to Terraform that the current configuration is in charge of managing the life cycle of the object data: Indicates to Terraform that you only want to get a reference to the existing object, but don\u2019t want to manage it as part of this configuration The resource names start with their containing provider's name followed by an underscore? e.g. resource from the provider postgresql might be named as postgresql_database ? 
The try function is only used with simple attribute references and type conversion functions? Overuse of the try function to suppress errors will lead to a configuration that is hard to understand and maintain. Explicit type conversion functions used to normalize types are only returned in module outputs? Explicit type conversions are rarely necessary in Terraform because it will convert types automatically where required. The Sensitive property on schema set to true for the fields that contains sensitive information? This will prevent the field's values from showing up in CLI output. General Recommendations Try avoiding nesting sub configuration within resources. Create a separate resource section for resources even though they can be declared as sub-element of a resource. For example, declaring subnets within virtual network vs declaring subnets as a separate resources compared to virtual network on Azure. Never hard-code any value in configuration. Declare them in locals section if a variable is needed multiple times as a static value and are internal to the configuration. The name s of the resources created on Azure should not be hard-coded or static. These names should be dynamic and user-provided using variable block. This is helpful especially in unit testing when multiple tests are running in parallel trying to create resources on Azure but need different names (few resources in Azure need to be named uniquely e.g. storage accounts). It is a good practice to output the ID of resources created on Azure from configuration. This is especially helpful when adding dynamic blocks for sub-elements/child elements to the parent resource. Use the required_providers block for establishing the dependency for providers along with pre-determined version. Use the terraform block to declare the provider dependency with exact version and also the terraform CLI version needed for the configuration. Validate the variable values supplied based on usage and type of variable. The validation can be done to variables by adding validation block. Validate that the component SKUs are the right ones, e.g. standard vs premium.","title":"Terraform Code Reviews"},{"location":"code-reviews/recipes/terraform/#terraform-code-reviews","text":"","title":"Terraform Code Reviews"},{"location":"code-reviews/recipes/terraform/#style-guide","text":"Developers should follow the terraform style guide . Projects should check Terraform scripts with automated tools.","title":"Style Guide"},{"location":"code-reviews/recipes/terraform/#code-analysis-linting","text":"","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/terraform/#tflint","text":"TFLint is a Terraform linter focused on possible errors, best practices, etc. 
Once TFLint installed in the environment, it can be invoked using the VS Code terraform extension .","title":"TFLint"},{"location":"code-reviews/recipes/terraform/#vscode-extensions","text":"The following VS Code extensions are widely used.","title":"VSCode Extensions"},{"location":"code-reviews/recipes/terraform/#terraform-extension","text":"This extension provides syntax highlighting, linting, formatting and validation capabilities.","title":"Terraform extension"},{"location":"code-reviews/recipes/terraform/#azure-terraform-extension","text":"This extension provides Terraform command support, resource graph visualization and CloudShell integration inside VS Code.","title":"Azure Terraform extension"},{"location":"code-reviews/recipes/terraform/#build-validation","text":"Ensure you enforce the style guides during build. The following example script can be used to install terraform, and a linter that then checks for formatting and common errors. #! /bin/bash set -e SCRIPT_DIR = $( dirname \" $BASH_SOURCE \" ) cd \" $SCRIPT_DIR \" TF_VERSION = 0 .12.4 TF_LINT_VERSION = 0 .9.1 echo -e \"\\n\\n>>> Installing Terraform 0.12\" # Install terraform tooling for linting terraform wget -q https://releases.hashicorp.com/terraform/ ${ TF_VERSION } /terraform_ ${ TF_VERSION } _linux_amd64.zip -O /tmp/terraform.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/terraform.zip echo \"\" echo -e \"\\n\\n>>> Install tflint (3rd party)\" wget -q https://github.com/wata727/tflint/releases/download/v ${ TF_LINT_VERSION } /tflint_linux_amd64.zip -O /tmp/tflint.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/tflint.zip echo -e \"\\n\\n>>> Terraform version\" terraform -version echo -e \"\\n\\n>>> Terraform Format (if this fails use 'terraform fmt -recursive' command to resolve\" terraform fmt -recursive -diff -check echo -e \"\\n\\n>>> tflint\" tflint echo -e \"\\n\\n>>> Terraform init\" terraform init echo -e \"\\n\\n>>> Terraform validate\" terraform validate","title":"Build Validation"},{"location":"code-reviews/recipes/terraform/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these Terraform specific code review items","title":"Code Review Checklist"},{"location":"code-reviews/recipes/terraform/#providers","text":"Are all providers used in the terraform scripts versioned to prevent breaking changes in the future?","title":"Providers"},{"location":"code-reviews/recipes/terraform/#repository-organization","text":"The code split into reusable modules? Modules are split into separate .tf files where appropriate? The repository contains a README.md describing the architecture provisioned? If Terraform code is mixed with application source code, the Terraform code isolated into a dedicated folder?","title":"Repository Organization"},{"location":"code-reviews/recipes/terraform/#terraform-state","text":"The Terraform project configured using Azure Storage as remote state backend? The remote state backend storage account key stored a secure location (e.g. Azure Key Vault)? The project is configured to use state files based on the environment, and the deployment pipeline is configured to supply the state file name dynamically?","title":"Terraform State"},{"location":"code-reviews/recipes/terraform/#variables","text":"If the infrastructure will be different depending on the environment (e.g. Dev, UAT, Production), the environment specific parameters are supplied via a .tfvars file? All variables have type information. E.g. a list(string) or string . 
All variables have a description stating the purpose of the variable and its usage. default values are not supplied for variables which must be supplied by a user.","title":"Variables"},{"location":"code-reviews/recipes/terraform/#testing","text":"Unit and integration tests covering the Terraform code exist (e.g. Terratest , terratest-abstraction )?","title":"Testing"},{"location":"code-reviews/recipes/terraform/#naming-and-code-structure","text":"Resource definitions and data sources are used correctly in the Terraform scripts? resource: Indicates to Terraform that the current configuration is in charge of managing the life cycle of the object data: Indicates to Terraform that you only want to get a reference to the existing object, but don\u2019t want to manage it as part of this configuration The resource names start with their containing provider's name followed by an underscore? e.g. resource from the provider postgresql might be named as postgresql_database ? The try function is only used with simple attribute references and type conversion functions? Overuse of the try function to suppress errors will lead to a configuration that is hard to understand and maintain. Explicit type conversion functions used to normalize types are only returned in module outputs? Explicit type conversions are rarely necessary in Terraform because it will convert types automatically where required. The Sensitive property on schema set to true for the fields that contains sensitive information? This will prevent the field's values from showing up in CLI output.","title":"Naming and Code Structure"},{"location":"code-reviews/recipes/terraform/#general-recommendations","text":"Try avoiding nesting sub configuration within resources. Create a separate resource section for resources even though they can be declared as sub-element of a resource. For example, declaring subnets within virtual network vs declaring subnets as a separate resources compared to virtual network on Azure. Never hard-code any value in configuration. Declare them in locals section if a variable is needed multiple times as a static value and are internal to the configuration. The name s of the resources created on Azure should not be hard-coded or static. These names should be dynamic and user-provided using variable block. This is helpful especially in unit testing when multiple tests are running in parallel trying to create resources on Azure but need different names (few resources in Azure need to be named uniquely e.g. storage accounts). It is a good practice to output the ID of resources created on Azure from configuration. This is especially helpful when adding dynamic blocks for sub-elements/child elements to the parent resource. Use the required_providers block for establishing the dependency for providers along with pre-determined version. Use the terraform block to declare the provider dependency with exact version and also the terraform CLI version needed for the configuration. Validate the variable values supplied based on usage and type of variable. The validation can be done to variables by adding validation block. Validate that the component SKUs are the right ones, e.g. standard vs premium.","title":"General Recommendations"},{"location":"design/exception-handling/","text":"Exception Handling Exception Constructs Almost all language platforms offer a construct of exception or equivalent to handle error scenarios. 
The underlying platform, used libraries or the authored code can \"throw\" exceptions to initiate an error flow. Some of the advantages of using exceptions are - Abstract different kind of errors Breaks the control flow from different code structures Navigate the call stack till the right catch block is identified Automatic collection of call stack Define different error handling flows thru multiple catch blocks Define finally block to cleanup resources Here is some guidance on exception handling in .Net C# Exception fundamentals Handling exceptions in .Net Custom Exceptions Although the platform offers numerous types of exceptions, often we need custom defined exceptions to arrive at an optimal low level design for error handling. The advantages of using custom exceptions are - Define exceptions specific to business domain of the requirement. E.g. InvalidCustomerException Wrap system/platform exceptions to define more generic system exception so that overall code base is more tech stack agnostic. E.g DatabaseWriteException which wraps MongoWriteException. Enrich the exception with more information about the code flow of the error. Enrich the exception with more information about the data context of the error. E.g. RecordId in property in DatabaseWriteException which carries the Id of the record failed to update. Define custom error message which is more business user friendly or support team friendly. Custom Exception Hierarchy Below diagram shows a sample hierarchy of custom exceptions. It defines a BaseException class which derives from System.Exception class and parent of all custom exceptions. BaseException also has additional properties for ActionCode and ResultCode. ActionCode represents the \"flow\" in which the error happened. ResultCode represents the exact error that happened. These additional properties help in defining different error handling flows in the catch blocks. Defines a number of System exceptions which derive from SystemException class. They will address all the errors generated by the technical aspects of the code. Like connectivity, read, write, buffer overflow etc Defines a number of Business exceptions which derive from BusinessException class. They will address all the errors generated by the business aspects of the code. Like data validations, duplicate rows. Error Details in API Response When an error occurs in an API, it has to rendered as response with all the necessary fields. There can be custom response schema drafted for these purposes. But one of the popular formats is the problem detail structure - Problem details There are inbuilt problem details middleware library built in ASP.Net core. For further details refer to below link Problem details service in ASP.Net core","title":"Exception Handling"},{"location":"design/exception-handling/#exception-handling","text":"","title":"Exception Handling"},{"location":"design/exception-handling/#exception-constructs","text":"Almost all language platforms offer a construct of exception or equivalent to handle error scenarios. The underlying platform, used libraries or the authored code can \"throw\" exceptions to initiate an error flow. 
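A minimal C# sketch of the custom exception hierarchy described above; class and property names follow that description, the intermediate technical-exception base class is collapsed for brevity, and the action/result code values are purely illustrative.

```csharp
using System;

// Parent of all custom exceptions. ActionCode captures the flow in which the error
// happened; ResultCode captures the exact error. Catch blocks can branch on these
// properties to choose an error-handling path.
public abstract class BaseException : Exception
{
    public string ActionCode { get; }
    public string ResultCode { get; }

    protected BaseException(string message, string actionCode, string resultCode,
                            Exception innerException = null)
        : base(message, innerException)
    {
        ActionCode = actionCode;
        ResultCode = resultCode;
    }
}

// Business errors (validation failures, duplicate rows, ...) share a common base.
public class BusinessException : BaseException
{
    public BusinessException(string message, string actionCode, string resultCode)
        : base(message, actionCode, resultCode) { }
}

public class InvalidCustomerException : BusinessException
{
    public InvalidCustomerException(string customerId)
        : base($"Customer '{customerId}' failed validation.", "CustomerOnboarding", "InvalidCustomer") { }
}

// Technical errors wrap platform exceptions and enrich them with data context,
// such as the id of the record that failed to persist.
public class DatabaseWriteException : BaseException
{
    public string RecordId { get; }

    public DatabaseWriteException(string recordId, Exception innerException)
        : base($"Failed to write record '{recordId}'.", "DataPersistence", "DatabaseWriteError", innerException)
    {
        RecordId = recordId;
    }
}
```

Catch blocks can then branch on ActionCode and ResultCode rather than matching on exception messages.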
Some of the advantages of using exceptions are - Abstract different kind of errors Breaks the control flow from different code structures Navigate the call stack till the right catch block is identified Automatic collection of call stack Define different error handling flows thru multiple catch blocks Define finally block to cleanup resources Here is some guidance on exception handling in .Net C# Exception fundamentals Handling exceptions in .Net","title":"Exception Constructs"},{"location":"design/exception-handling/#custom-exceptions","text":"Although the platform offers numerous types of exceptions, often we need custom defined exceptions to arrive at an optimal low level design for error handling. The advantages of using custom exceptions are - Define exceptions specific to business domain of the requirement. E.g. InvalidCustomerException Wrap system/platform exceptions to define more generic system exception so that overall code base is more tech stack agnostic. E.g DatabaseWriteException which wraps MongoWriteException. Enrich the exception with more information about the code flow of the error. Enrich the exception with more information about the data context of the error. E.g. RecordId in property in DatabaseWriteException which carries the Id of the record failed to update. Define custom error message which is more business user friendly or support team friendly.","title":"Custom Exceptions"},{"location":"design/exception-handling/#custom-exception-hierarchy","text":"Below diagram shows a sample hierarchy of custom exceptions. It defines a BaseException class which derives from System.Exception class and parent of all custom exceptions. BaseException also has additional properties for ActionCode and ResultCode. ActionCode represents the \"flow\" in which the error happened. ResultCode represents the exact error that happened. These additional properties help in defining different error handling flows in the catch blocks. Defines a number of System exceptions which derive from SystemException class. They will address all the errors generated by the technical aspects of the code. Like connectivity, read, write, buffer overflow etc Defines a number of Business exceptions which derive from BusinessException class. They will address all the errors generated by the business aspects of the code. Like data validations, duplicate rows.","title":"Custom Exception Hierarchy"},{"location":"design/exception-handling/#error-details-in-api-response","text":"When an error occurs in an API, it has to rendered as response with all the necessary fields. There can be custom response schema drafted for these purposes. But one of the popular formats is the problem detail structure - Problem details There are inbuilt problem details middleware library built in ASP.Net core. For further details refer to below link Problem details service in ASP.Net core","title":"Error Details in API Response"},{"location":"design/readme/","text":"Design Designing software well is hard. ISE has collected a number of practices which we find help in the design process. This covers not only technical design of software, but also architecture design and non-functional requirements gathering for new projects. Goals Provide recommendations for how to design software for maintainability, ease of extension, adherence to best practices, and sustainability. Reference or define process or checklists to help ensure well-designed software. 
Collate and point to reference sources (guides, repos, articles) that can help shortcut the learning process. Code Examples Folder Structure Folder Structure For Python Repository Project Templates Rust Actix Web, Diesel ORM, Test Containers, Onion Architecture Python Flask, SQLAlchemy ORM, Test Containers, Onion Architecture","title":"Design"},{"location":"design/readme/#design","text":"Designing software well is hard. ISE has collected a number of practices which we find help in the design process. This covers not only technical design of software, but also architecture design and non-functional requirements gathering for new projects.","title":"Design"},{"location":"design/readme/#goals","text":"Provide recommendations for how to design software for maintainability, ease of extension, adherence to best practices, and sustainability. Reference or define process or checklists to help ensure well-designed software. Collate and point to reference sources (guides, repos, articles) that can help shortcut the learning process.","title":"Goals"},{"location":"design/readme/#code-examples","text":"Folder Structure Folder Structure For Python Repository Project Templates Rust Actix Web, Diesel ORM, Test Containers, Onion Architecture Python Flask, SQLAlchemy ORM, Test Containers, Onion Architecture","title":"Code Examples"},{"location":"design/design-patterns/","text":"Design Patterns The design patterns section recommends patterns of software and architecture design. This section provides a curated list of commonly used patterns from trusted sources. Rather than duplicate or replace the cited sources, this section aims to compliment them with suggestions, guidance, and learnings based on firsthand experiences.","title":"Design Patterns"},{"location":"design/design-patterns/#design-patterns","text":"The design patterns section recommends patterns of software and architecture design. This section provides a curated list of commonly used patterns from trusted sources. Rather than duplicate or replace the cited sources, this section aims to compliment them with suggestions, guidance, and learnings based on firsthand experiences.","title":"Design Patterns"},{"location":"design/design-patterns/cloud-resource-design-guidance/","text":"Cloud Resource Design Guidance As cloud usage scales, considerations for subscription design, management groups, and resource naming/tagging conventions have an impact on governance, operations management, and adoption patterns. Note: Always work with the relevant stakeholders to ensure that introducing new patterns provides the intended value. When working in an existing cloud environment, it is important to understand any current patterns and how they are used before making a change to them. Resources The following references can be used to understand the latest best practices in organizing cloud resources: Organizing Subscriptions Resource Tagging Decision Guide Resource Naming Conventions Recommended Azure Resource Abbreviations Organizing Dev/Test/Production Workloads Tooling Azure Resource Naming Tool","title":"Cloud Resource Design Guidance"},{"location":"design/design-patterns/cloud-resource-design-guidance/#cloud-resource-design-guidance","text":"As cloud usage scales, considerations for subscription design, management groups, and resource naming/tagging conventions have an impact on governance, operations management, and adoption patterns. Note: Always work with the relevant stakeholders to ensure that introducing new patterns provides the intended value. 
When working in an existing cloud environment, it is important to understand any current patterns and how they are used before making a change to them.","title":"Cloud Resource Design Guidance"},{"location":"design/design-patterns/cloud-resource-design-guidance/#resources","text":"The following references can be used to understand the latest best practices in organizing cloud resources: Organizing Subscriptions Resource Tagging Decision Guide Resource Naming Conventions Recommended Azure Resource Abbreviations Organizing Dev/Test/Production Workloads","title":"Resources"},{"location":"design/design-patterns/cloud-resource-design-guidance/#tooling","text":"Azure Resource Naming Tool","title":"Tooling"},{"location":"design/design-patterns/data-heavy-design-guidance/","text":"Data and DataOps Fundamentals Most projects involve some type of data storage, data processing and data ops. For these projects, as with all projects, we follow the general guidelines laid out in other sections around security, testing, observability, CI/CD etc. Goal The goal of this section is to briefly describe how to apply the fundamentals to data heavy projects or portions of the project. Isolation Please be cautious of which isolation levels you are using. Even with a database that offers serializability, it is possible that within a transaction or connection you are leveraging a lower isolation level than the database offers. In particular, read uncommitted (or eventual consistency), can have a lot of unpredictable side effects and introduce bugs that are difficult to reason about. Eventually consistent systems should be treated as a last resort for achieving your scalability requirements; batching, sharding, and caching are all recommended solutions to increase your scalability. If none of these options are tenable, consider evaluating the \"New SQL\" databases like CockroachDB or TiDB, before leveraging an option that relies on eventual consistency. There are other levels of isolation, outside the isolation levels mentioned in the link above. Some of these have nuances different from the 4 main levels, and can be difficult to compare. Snapshot Isolation, strict serializability, \"read your own writes\", monotonic reads, bounded staleness, causal consistency, and linearizability are all other terms you can look into to learn more on the subject. Concurrency Control Your systems should (almost) always leverage some form of concurrency control, to ensure correctness amongst competing requests and to prevent data races. The 2 forms of concurrency control are pessimistic and optimistic . A pessimistic transaction involves a first request to \"lock the data\", and a second request to write the data. In between these requests, no other requests touching that data will succeed. See 2 Phase Locking (also often known as 2 Phase Commit) for more info. The (more) recommended approach is optimistic concurrency, where a user can read the object at a specific version, and update the object if and only if it hasn't changed. This is typically done via the Etag Header . A simple way to accomplish this on the database side is to increment a version number on each update. This can be done in a single executed statement as: WARNING: the below will not work when using an isolation level at or lower than read uncommitted (eventual consistency). -- Please treat this as pseudo code, and adjust as necessary. 
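-- For illustration only: a hypothetical, concrete instance of the generic statement below,
-- assuming an "orders" table with an integer version column (table and column names are not from this playbook).
-- The write succeeds only if no other writer has bumped the version since the row was read;
-- if zero rows are affected, re-read the current version and retry.
UPDATE orders SET status = 'shipped', version = 43 WHERE ID = 101 AND version = 42;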
UPDATE <table_name> SET field1 = value1, ..., fieldN = valueN, version = $new_version WHERE ID = $id AND version = $version Data Tiering (Data Quality) Develop a common understanding of the quality of your datasets so that everyone understands the quality of the data, and expected use cases and limitations. A common data quality model is Bronze, Silver, Gold Bronze: This is a landing area for your raw datasets with no or minimal data transformations applied, and therefore optimized for writes / ingestion. Treat these datasets as an immutable, append only store. Silver: These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by data scientists. Gold: These are highly processed, highly read-optimized datasets primarily for consumption by business users. Typically, these are structured in your standard fact and dimension tables. Divide your data lake into three major areas containing your Bronze, Silver and Gold datasets. Note: Additional storage areas for malformed data, intermediate (sandbox) data, and libraries/packages/binaries are also useful when designing your storage organization. Data Validation Validate data early in your pipeline Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all datasets conform to a specific schema and known data invariants. This can also potentially prevent data pipeline failures in case of unexpected changes to the input data. Data that does not pass this validation stage can be rerouted to a record store dedicated to malformed data for diagnostic purposes. It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close a copy of the source system data as possible. This can be used to replay the data pipeline for both testing (i.e. testing data validation logic) and data recovery purposes (i.e. data corruption is introduced due to a bug in the data transformation code and thus the pipeline needs to be replayed). Idempotent Data Pipelines Make your data pipelines re-playable and idempotent Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. By making data pipelines re-playable and idempotent, you can recover from this state through deployment of code fixes, and re-playing the data pipelines. Idempotency also ensures data-duplication is mitigated when replaying your data pipelines. Testing Ensure data transformation code is testable Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages. While it is possible to run tests against notebooks, by extracting the code into packages, you increase developer productivity by increasing the speed of the feedback cycle. CI/CD, Source Control and Code Reviews All artifacts needed to build the data pipeline from scratch should be in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures etc.), reference/application data, data pipeline definitions and data validation and transformation logic.
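To make the testing guidance above concrete, here is a minimal, hypothetical sketch (not taken from this playbook) of transformation logic kept in a versioned package, with a unit test that exercises it without any data access code; the record shape and cleansing rules are illustrative assumptions.

# transform.py - pure transformation logic, kept in a package under source control
def clean_customer_records(records):
    """Drop records missing an 'id' and normalize 'email' to lower case."""
    cleaned = []
    for record in records:
        if not record.get("id"):
            continue  # malformed records are handled elsewhere (e.g. a malformed record store)
        cleaned.append({**record, "email": record.get("email", "").strip().lower()})
    return cleaned

# test_transform.py - unit test that runs without touching storage or notebooks
def test_clean_customer_records_drops_missing_ids_and_normalizes_email():
    raw = [{"id": 1, "email": " User@Example.COM "}, {"email": "no-id@example.com"}]
    assert clean_customer_records(raw) == [{"id": 1, "email": "user@example.com"}]

Because the function takes and returns plain Python structures, the same logic can be called from a notebook or a pipeline task while remaining unit-testable in CI.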
Any new artifacts (code) introduced to the repository should be code reviewed, both automatically (linting, credential scanning etc.) and peer reviewed. There should be a safe, repeatable process (CI/CD) to move the changes through dev, test and finally production. Security and Configuration Maintain a central, secure location for sensitive configuration such as database connection strings that can be accessed by the appropriate services within the specific environment. On Azure this is typically solved through securing secrets in a Key Vault per environment, then having the relevant services query KeyVault for the configuration Observability Monitor infrastructure, pipelines and data A proper monitoring solution should be in-place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data should also be monitored. A common area that should have data monitoring is the malformed record store. End to End and Azure Technology Samples The DataOps for the Modern Data Warehouse repo contains both end-to-end and technology specific samples on how to implement DataOps on Azure. Image: CI/CD for Data pipelines on Azure - from DataOps for the Modern Data Warehouse repo","title":"Data and DataOps Fundamentals"},{"location":"design/design-patterns/data-heavy-design-guidance/#data-and-dataops-fundamentals","text":"Most projects involve some type of data storage, data processing and data ops. For these projects, as with all projects, we follow the general guidelines laid out in other sections around security, testing, observability, CI/CD etc.","title":"Data and DataOps Fundamentals"},{"location":"design/design-patterns/data-heavy-design-guidance/#goal","text":"The goal of this section is to briefly describe how to apply the fundamentals to data heavy projects or portions of the project.","title":"Goal"},{"location":"design/design-patterns/data-heavy-design-guidance/#isolation","text":"Please be cautious of which isolation levels you are using. Even with a database that offers serializability, it is possible that within a transaction or connection you are leveraging a lower isolation level than the database offers. In particular, read uncommitted (or eventual consistency), can have a lot of unpredictable side effects and introduce bugs that are difficult to reason about. Eventually consistent systems should be treated as a last resort for achieving your scalability requirements; batching, sharding, and caching are all recommended solutions to increase your scalability. If none of these options are tenable, consider evaluating the \"New SQL\" databases like CockroachDB or TiDB, before leveraging an option that relies on eventual consistency. There are other levels of isolation, outside the isolation levels mentioned in the link above. Some of these have nuances different from the 4 main levels, and can be difficult to compare. Snapshot Isolation, strict serializability, \"read your own writes\", monotonic reads, bounded staleness, causal consistency, and linearizability are all other terms you can look into to learn more on the subject.","title":"Isolation"},{"location":"design/design-patterns/data-heavy-design-guidance/#concurrency-control","text":"Your systems should (almost) always leverage some form of concurrency control, to ensure correctness amongst competing requests and to prevent data races. The 2 forms of concurrency control are pessimistic and optimistic . 
A pessimistic transaction involves a first request to \"lock the data\", and a second request to write the data. In between these requests, no other requests touching that data will succeed. See 2 Phase Locking (also often known as 2 Phase Commit) for more info. The (more) recommended approach is optimistic concurrency, where a user can read the object at a specific version, and update the object if and only if it hasn't changed. This is typically done via the Etag Header . A simple way to accomplish this on the database side is to increment a version number on each update. This can be done in a single executed statement as: WARNING: the below will not work when using an isolation level at or lower than read uncommitted (eventual consistency). -- Please treat this as pseudo code, and adjust as necessary. UPDATE < table_name > SET field1 = value1 , ..., fieldN = valueN , version = $ new_version WHERE ID = $ id AND version = $ version","title":"Concurrency Control"},{"location":"design/design-patterns/data-heavy-design-guidance/#data-tiering-data-quality","text":"Develop a common understanding of the quality of your datasets so that everyone understands the quality of the data, and expected use cases and limitations. A common data quality model is Bronze , Silver , Gold Bronze: This is a landing area for your raw datasets with none or minimal data transformations applied, and therefore are optimized for writes / ingestion. Treat these datasets as an immutable, append only store. Silver: These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by data scientists. Gold: These are highly processed, highly read-optimized datasets primarily for consumption of business users. Typically, these are structured in your standard fact and dimension tables. Divide your data lake into three major areas containing your Bronze, Silver and Gold datasets. Note: Additional storage areas for malformed data, intermediate (sandbox) data, and libraries/packages/binaries are also useful when designing your storage organization.","title":"Data Tiering (Data Quality)"},{"location":"design/design-patterns/data-heavy-design-guidance/#data-validation","text":"Validate data early in your pipeline Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all datasets conform to a specific schema and known data invariants. This can also potentially prevent data pipeline failures in case of unexpected changes to the input data. Data that does not pass this validation stage can be rerouted to a record store dedicated for malformed data for diagnostic purposes. It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close of a copy of the source system data. This can be used to replay the data pipeline for both testing (i.e. testing data validation logic) and data recovery purposes (i.e. data corruption is introduced due to a bug in the data transformation code and thus the pipeline needs to be replayed).","title":"Data Validation"},{"location":"design/design-patterns/data-heavy-design-guidance/#idempotent-data-pipelines","text":"Make your data pipelines re-playable and idempotent Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. 
By making data pipelines re-playable and idempotent, you can recover from this state through deployment of code fixes, and re-playing the data pipelines. Idempotency also ensures data-duplication is mitigated when replaying your data pipelines.","title":"Idempotent Data Pipelines"},{"location":"design/design-patterns/data-heavy-design-guidance/#testing","text":"Ensure data transformation code is testable Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages. While it is possible to run tests against notebooks, by extracting the code into packages, you increase the developer productivity by increasing the speed of the feedback cycle.","title":"Testing"},{"location":"design/design-patterns/data-heavy-design-guidance/#cicd-source-control-and-code-reviews","text":"All artifacts needed to build the data pipeline from scratch should be in source control. This included infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures etc.), reference/application data, data pipeline definitions and data validation and transformation logic. Any new artifacts (code) introduced to the repository should be code reviewed, both automatically (linting, credential scanning etc.) and peer reviewed. There should be a safe, repeatable process (CI/CD) to move the changes through dev, test and finally production.","title":"CI/CD, Source Control and Code Reviews"},{"location":"design/design-patterns/data-heavy-design-guidance/#security-and-configuration","text":"Maintain a central, secure location for sensitive configuration such as database connection strings that can be accessed by the appropriate services within the specific environment. On Azure this is typically solved through securing secrets in a Key Vault per environment, then having the relevant services query KeyVault for the configuration","title":"Security and Configuration"},{"location":"design/design-patterns/data-heavy-design-guidance/#observability","text":"Monitor infrastructure, pipelines and data A proper monitoring solution should be in-place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data should also be monitored. A common area that should have data monitoring is the malformed record store.","title":"Observability"},{"location":"design/design-patterns/data-heavy-design-guidance/#end-to-end-and-azure-technology-samples","text":"The DataOps for the Modern Data Warehouse repo contains both end-to-end and technology specific samples on how to implement DataOps on Azure. Image: CI/CD for Data pipelines on Azure - from DataOps for the Modern Data Warehouse repo","title":"End to End and Azure Technology Samples"},{"location":"design/design-patterns/distributed-system-design-reference/","text":"Distributed System Design Reference Distributed systems introduce new and interesting problems that need addressing. Software engineering as a field has dealt with these problems for years, and there are phenomenal resources available for reference when creating a new distributed system. 
Some that we recommend are as follows: Martin Fowler's Patterns of Distributed Systems microservices.io Azure's Cloud Design Patterns","title":"Distributed System Design Reference"},{"location":"design/design-patterns/distributed-system-design-reference/#distributed-system-design-reference","text":"Distributed systems introduce new and interesting problems that need addressing. Software engineering as a field has dealt with these problems for years, and there are phenomenal resources available for reference when creating a new distributed system. Some that we recommend are as follows: Martin Fowler's Patterns of Distributed Systems microservices.io Azure's Cloud Design Patterns","title":"Distributed System Design Reference"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/","text":"Network Architecture Guidance for Azure The following are some best practices when setting up and working with network resources in Azure Cloud environments. Note: When working in an existing cloud environment, it is important to understand any current patterns, and how they are used, before making a change to them. You should also work with the relevant stakeholders to make sure that any new patterns you introduce provide enough value to make the change. Networking and VNet Setup Hub-and-Spoke Topology A hub-and-spoke network topology is a common architecture pattern used in Azure for organizing and managing network resources. It is based on the concept of a central hub that connects to various spoke networks. This model is particularly useful for organizing resources, maintaining security, and simplifying network management. The hub-and-spoke model is implemented using Azure Virtual Networks (VNet) and VNet peering. The hub: The central VNet acts as a hub, providing shared services such as network security, monitoring, and connectivity to on-premises or other cloud environments. Common components in the hub include Network Virtual Appliances (NVAs), Azure Firewall, VPN Gateway, and ExpressRoute Gateway. The spokes: The spoke VNets represent separate units or applications within an organization, each with its own set of resources and services. They connect to the hub through VNet peering, which allows for communication between the hub and spoke VNets. Implementing a hub-and-spoke model in Azure offers several benefits: Isolation and segmentation: By dividing resources into separate spoke VNets, you can isolate and segment workloads, preventing any potential issues or security risks from affecting other parts of the network. Centralized management: The hub VNet acts as a single point of management for shared services, making it easier to maintain, monitor, and enforce policies across the network. Simplified connectivity: VNet peering enables seamless communication between the hub and spoke VNets without the need for complex routing or additional gateways, reducing latency and management overhead. Scalability: The hub-and-spoke model can easily scale to accommodate additional spokes as the organization grows or as new applications and services are introduced. Cost savings: By centralizing shared services in the hub, organizations can reduce the costs associated with deploying and managing multiple instances of the same services across different VNets. Read more about hub-and-spoke topology When deploying hub/spoke, it is recommended that you do so in connection with landing zones . 
This ensures consistency across all environments as well as guardrails to ensure a high level of security while giving developers freedom within development environments. Firewall and Security When using a hub-and-spoke topology it is possible to deploy a centralized firewall in the Hub through which all outgoing traffic, or traffic to/from certain VNets, can be routed; this allows for centralized threat protection while minimizing costs. DNS The best practices for handling DNS in Azure, and in cloud environments in general, include using managed DNS services. Some of the benefits of using managed DNS services are that the resources are designed to be secure, easy to deploy and configure. DNS forwarding: Set up DNS forwarding between your on-premises DNS servers and Azure DNS servers for name resolution across environments. Use Azure Private DNS zones for Azure resources: Configure Azure Private DNS zones for your Azure resources to ensure name resolution is kept within the virtual network. Read more about Hybrid/Multi-Cloud DNS infrastructure and Azure DNS infrastructure IP Allocation When allocating IP address spaces to Azure Virtual Networks (VNets), it's essential to follow best practices for proper management and scalability. Here are some recommendations for IP allocation to VNets: Reserve IP addresses: Reserve IP addresses in your address space for critical resources or services. Public IP allocation: Minimize the use of public IP addresses and use Azure Private Link when possible to access services over a private connection. IP address management (IPAM): Use IPAM solutions to manage and track IP address allocation across your hybrid environment. Plan your address space: Choose an appropriate private address space (from RFC 1918) for your VNets that is large enough to accommodate future growth. Avoid overlapping with on-premises or other cloud networks. Use CIDR notation: Use Classless Inter-Domain Routing (CIDR) notation to define the VNet address space, which allows more efficient allocation and prevents wasting IP addresses. Use subnets: Divide your VNets into smaller subnets based on security, application, or environment requirements. This allows for better network management and security. Consider leaving a buffer between VNets: While it's not strictly necessary, leaving a buffer between VNets can be beneficial in some cases, especially when you anticipate future growth or when you might need to merge VNets. This can help avoid re-addressing conflicts when expanding or merging networks. Reserve IP addresses: Reserve a range of IP addresses within your VNet address space for critical resources or services. This ensures that they have a static IP address, which is essential for specific services or applications. Plan for hybrid scenarios: If you're working in a hybrid environment with on-premises or multi-cloud networks, ensure that you plan for IP address allocation across all environments. This includes avoiding overlapping address spaces and reserving IP addresses for specific resources like VPN gateways or ExpressRoute circuits. Read more at azure-best-practices/plan-for-ip-addressing Resource Allocation For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Network Architecture Guidance for Azure"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#network-architecture-guidance-for-azure","text":"The following are some best practices when setting up and working with network resources in Azure Cloud environments.
Note: When working in an existing cloud environment, it is important to understand any current patterns, and how they are used, before making a change to them. You should also work with the relevant stakeholders to make sure that any new patterns you introduce provide enough value to make the change.","title":"Network Architecture Guidance for Azure"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#networking-and-vnet-setup","text":"","title":"Networking and VNet Setup"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#hub-and-spoke-topology","text":"A hub-and-spoke network topology is a common architecture pattern used in Azure for organizing and managing network resources. It is based on the concept of a central hub that connects to various spoke networks. This model is particularly useful for organizing resources, maintaining security, and simplifying network management. The hub-and-spoke model is implemented using Azure Virtual Networks (VNet) and VNet peering. The hub: The central VNet acts as a hub, providing shared services such as network security, monitoring, and connectivity to on-premises or other cloud environments. Common components in the hub include Network Virtual Appliances (NVAs), Azure Firewall, VPN Gateway, and ExpressRoute Gateway. The spokes: The spoke VNets represent separate units or applications within an organization, each with its own set of resources and services. They connect to the hub through VNet peering, which allows for communication between the hub and spoke VNets. Implementing a hub-and-spoke model in Azure offers several benefits: Isolation and segmentation: By dividing resources into separate spoke VNets, you can isolate and segment workloads, preventing any potential issues or security risks from affecting other parts of the network. Centralized management: The hub VNet acts as a single point of management for shared services, making it easier to maintain, monitor, and enforce policies across the network. Simplified connectivity: VNet peering enables seamless communication between the hub and spoke VNets without the need for complex routing or additional gateways, reducing latency and management overhead. Scalability: The hub-and-spoke model can easily scale to accommodate additional spokes as the organization grows or as new applications and services are introduced. Cost savings: By centralizing shared services in the hub, organizations can reduce the costs associated with deploying and managing multiple instances of the same services across different VNets. Read more about hub-and-spoke topology When deploying hub/spoke, it is recommended that you do so in connection with landing zones . This ensures consistency across all environments as well as guardrails to ensure a high level of security while giving developers freedom within development environments.","title":"Hub-and-Spoke Topology"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#firewall-and-security","text":"When using a hub-and-spoke topology it is possible to deploy a centralized firewall in the Hub that all outgoing traffic or traffic to/from certain VNets, this allows for centralized threat protection while minimizing costs.","title":"Firewall and Security"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#dns","text":"The best practices for handling DNS in Azure, and in cloud environments in general, include using managed DNS services. 
Some of the benefits of using managed DNS services is that the resources are designed to be secure, easy to deploy and configure. DNS forwarding: Set up DNS forwarding between your on-premises DNS servers and Azure DNS servers for name resolution across environments. Use Azure Private DNS zones for Azure resources: Configure Azure Private DNS zones for your Azure resources to ensure name resolution is kept within the virtual network. Read more about Hybrid/Multi-Cloud DNS infrastructure and Azure DNS infrastructure","title":"DNS"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#ip-allocation","text":"When allocating IP address spaces to Azure Virtual Networks (VNets), it's essential to follow best practices for proper management, and scalability. Here are some recommendations for IP allocation to VNets: Reserve IP addresses: Reserve IP addresses in your address space for critical resources or services. Public IP allocation: Minimize the use of public IP addresses and use Azure Private Link when possible to access services over a private connection. IP address management (IPAM): Use IPAM solutions to manage and track IP address allocation across your hybrid environment. Plan your address space: Choose an appropriate private address space (from RFC 1918) for your VNets that is large enough to accommodate future growth. Avoid overlapping with on-premises or other cloud networks. Use CIDR notation: Use Classless Inter-Domain Routing (CIDR) notation to define the VNet address space, which allows more efficient allocation and prevents wasting IP addresses. Use subnets: Divide your VNets into smaller subnets based on security, application, or environment requirements. This allows for better network management and security. Consider leaving a buffer between VNets: While it's not strictly necessary, leaving a buffer between VNets can be beneficial in some cases, especially when you anticipate future growth or when you might need to merge VNets. This can help avoid re-addressing conflicts when expanding or merging networks. Reserve IP addresses: Reserve a range of IP addresses within your VNet address space for critical resources or services. This ensures that they have a static IP address, which is essential for specific services or applications. Plan for hybrid scenarios: If you're working in a hybrid environment with on-premises or multi-cloud networks, ensure that you plan for IP address allocation across all environments. This includes avoiding overlapping address spaces and reserving IP addresses for specific resources like VPN gateways or ExpressRoute circuits. Read more at azure-best-practices/plan-for-ip-addressing","title":"IP Allocation"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#resource-allocation","text":"For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Resource Allocation"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/","text":"Network Architecture Guidance for Hybrid The following are best practices around how to design and configure resources, used for Hybrid and Multi-Cloud environments. Note: When working in an existing hybrid environment, it is important to understand any current patterns, and how they are used before making any changes. 
Hub-and-Spoke Topology The hub-and-spoke topology doesn't change much when using cloud/hybrid if configured correctly. The main difference is that the hub VNet is peered to the on-prem network via an ExpressRoute and that all traffic from Azure might exit via the ExpressRoute and the on-prem internet connection. The generalized best practices are in Network Architecture Guidance for Azure#Hub and Spoke topology IP Allocation When working with Hybrid deployments, take extra care when planning IP allocation as there is a much greater risk of overlapping network ranges. The general best practices are available in the Network Architecture Guidance for Azure#ip-allocation Read more about this in Azure Best Practices Plan for IP Addressing ExpressRoute Environments using ExpressRoute often tunnel all traffic from Azure via a private link (ExpressRoute) to an on-prem location. This poses a few problems when working with PaaS offerings, as not all of them connect via their respective private endpoints and instead use an external IP for outgoing connections, or some PaaS-to-PaaS traffic occurs internally in Azure and won't function with public networks disabled. Two notable examples here are data plane copies of storage accounts and the many services that do not support private endpoints. Choose the right ExpressRoute circuit: Select an appropriate SKU (Standard or Premium) and bandwidth based on your organization's requirements. Redundancy: Ensure redundancy by provisioning two ExpressRoute circuits in different peering locations. Monitoring: Use Azure Monitor and Network Performance Monitor (NPM) to monitor the health and performance of your ExpressRoute circuits. DNS General best practices are available in Network Architecture Guidance for Azure#dns When using Azure DNS in a hybrid or multi-cloud environment it is important to ensure a consistent DNS and forwarding configuration which ensures that records are automatically updated and that all DNS servers are aware of each other and know which server is authoritative for the different records. Read more about Hybrid/Multi-Cloud DNS infrastructure Resource Allocation For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Network Architecture Guidance for Hybrid"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#network-architecture-guidance-for-hybrid","text":"The following are best practices around how to design and configure resources, used for Hybrid and Multi-Cloud environments. Note: When working in an existing hybrid environment, it is important to understand any current patterns, and how they are used before making any changes.","title":"Network Architecture Guidance for Hybrid"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#hub-and-spoke-topology","text":"The hub-and-spoke topology doesn't change much when using cloud/hybrid if configured correctly. The main difference is that the hub VNet is peered to the on-prem network via an ExpressRoute and that all traffic from Azure might exit via the ExpressRoute and the on-prem internet connection. The generalized best practices are in Network Architecture Guidance for Azure#Hub and Spoke topology","title":"Hub-and-Spoke Topology"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#ip-allocation","text":"When working with Hybrid deployments, take extra care when planning IP allocation as there is a much greater risk of overlapping network ranges.
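As a minimal sketch of the overlap risk mentioned above (the address ranges are hypothetical examples, not recommendations), the Python standard library can be used to check that a proposed VNet range does not collide with known on-premises or cloud ranges before it is allocated.

import ipaddress

# Hypothetical, illustrative ranges - replace with your real on-prem and VNet address spaces.
existing_ranges = [ipaddress.ip_network("10.0.0.0/16"), ipaddress.ip_network("10.1.0.0/16")]
proposed_vnet = ipaddress.ip_network("10.1.128.0/20")

conflicts = [net for net in existing_ranges if proposed_vnet.overlaps(net)]
if conflicts:
    print(f"{proposed_vnet} overlaps with: {', '.join(str(net) for net in conflicts)}")
else:
    print(f"{proposed_vnet} does not overlap with any known range.")

The same module can also carve a VNet into subnets (for example, ipaddress.ip_network("10.2.0.0/16").subnets(new_prefix=24)), in line with the CIDR and subnet guidance referenced above.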
The general best practices are available in the Network Architecture Guidance for Azure#ip-allocation Read more about this in Azure Best Practices Plan for IP Addressing","title":"IP Allocation"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#expressroute","text":"Environments using ExpressRoute often tunnel all traffic from Azure via a private link (ExpressRoute) to an on-prem location. This poses a few problems when working with PaaS offerings, as not all of them connect via their respective private endpoints and instead use an external IP for outgoing connections, or some PaaS-to-PaaS traffic occurs internally in Azure and won't function with public networks disabled. Two notable examples here are data plane copies of storage accounts and the many services that do not support private endpoints. Choose the right ExpressRoute circuit: Select an appropriate SKU (Standard or Premium) and bandwidth based on your organization's requirements. Redundancy: Ensure redundancy by provisioning two ExpressRoute circuits in different peering locations. Monitoring: Use Azure Monitor and Network Performance Monitor (NPM) to monitor the health and performance of your ExpressRoute circuits.","title":"ExpressRoute"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#dns","text":"General best practices are available in Network Architecture Guidance for Azure#dns When using Azure DNS in a hybrid or multi-cloud environment it is important to ensure a consistent DNS and forwarding configuration which ensures that records are automatically updated and that all DNS servers are aware of each other and know which server is authoritative for the different records. Read more about Hybrid/Multi-Cloud DNS infrastructure","title":"DNS"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#resource-allocation","text":"For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Resource Allocation"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/","text":"Non-Functional Requirements Capture Goals In software engineering projects, non-functional requirements, also known as quality attributes, are specifications that define the operational attributes of a system rather than its specific behaviors. Unlike functional requirements, which outline what a system should do, non-functional requirements describe how the system performs certain functions under specific conditions. Non-functional requirements generally increase the cost as they require special efforts during the implementation, but by defining these requirements in detail early in the engagement, they can be properly evaluated when the cost of their impact on subsequent design decisions is comparatively low. Documenting Non-Functional Requirements - Best Practices Be specific: Avoid ambiguity and make sure the requirement is quantitative, measurable and testable. Relate requirements with business objectives and understand the real impact of the system's behavior. Break it down: Try to define requirements at the component or process scope instead of the whole solution. Understand trade-off: Non-functional requirements may be in conflict with each other and it can be difficult to balance them and prioritize which one to implement. Template This template can serve as a structured framework for capturing and documenting non-functional requirements effectively.
Adjustments can be made to tailor it to the specific needs and preferences of the project team. Requirement name: name or title Description: brief description. Describe the importance and impact of this requirement to the business. Priority: High/Medium/Low or Must-have/Nice-to-have, etc Measurement/Metric: metric or measurement criteria Verification Method: Automated test, benchmark, simulation, prototyping, etc. Constraints: Budget, Time, Resources, Infrastructure, etc. Owner/Responsible Party Dependencies: technical dependencies, data dependencies, regulatory dependencies, etc. Examples To support the process of capturing a project's comprehensive non-functional requirements, this document offers a taxonomy for non-functional requirements and provides a framework for their identification, exploration, assignment of customer stakeholders, and eventual codification into formal engineering requirements as input to subsequent solution design. Operational Requirements Quality Attribute Description Common Metrics Availability System's uptime and accessibility to users. - Uptime: Uptime measures the percentage of time that a system is operational and available for use. It is typically expressed as a percentage of total time (e.g., 99.9% uptime means the system is available 99.9% of the time). Common thresholds for uptime include: 99% uptime: The system is available 99% of the time, allowing for approximately 3.65 days of downtime per year. 99.9% uptime (three nines): The system is available 99.9% of the time, allowing for approximately 8.76 hours of downtime per year. 99.99% uptime (four nines): The system is available 99.99% of the time, allowing for approximately 52.56 minutes of downtime per year. 99.999% uptime (five nines): The system is available 99.999% of the time, allowing for approximately 5.26 minutes of downtime per year. Data Integrity Accuracy and consistency of data throughout its lifecycle. - Error Rate: The proportion of data entries that contain errors or inaccuracies. (\\text{Error Rate} = \\left( \\frac{\\text{Number of Errors}}{\\text{Total Number of Entries}} \\right) \\times 100) - Accuracy Rate: The percentage of data entries that are correct and match the source of truth. (\\text{Accuracy Rate} = \\left( \\frac{\\text{Number of Accurate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) - Duplicate Record Rate: The percentage of data entries that are duplicates. (\\text{Duplicate Record Rate} = \\left( \\frac{\\text{Number of Duplicate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) Disaster recovery and business continuity Determine the system's requirements for disaster recovery and business continuity, including backup and recovery procedures and disaster recovery testing. - Backup and Recovery: The application must have a Backup and Recovery plan in place that includes regular backups of all data and configurations, and a process for restoring data and functionality in the event of a disaster or disruption. - Redundancy: The application must have Redundancy built into its infrastructure, such as redundant servers, network devices, and power supplies, to ensure high availability and minimize downtime in the event of a failure. - Failover and high availability: The application must be designed to support Failover and high availability, such as by using load balancers or Failover clusters, to ensure that it can continue to operate in the event of a system failure or disruption. 
- Disaster Recovery plan: The application must have a comprehensive disaster Recovery plan that includes procedures for restoring data and functionality in the event of a major disaster, such as a natural disaster, cyber attack, or other catastrophic event. - Testing and Maintenance: The application must be regularly tested and maintained to ensure that it can withstand a disaster or disruption, and that all systems, processes, and data can be quickly restored and recovered. Reliability System's ability to maintain functionality under varying conditions and failure scenarios. - Mean Time Between Failures (MTBF): The system should achieve an MTBF of at least 1000 hours, indicating a high level of reliability with infrequent failures. - Mean Time to Recover (MTTR): The system should aim for an MTTR of less than 1 hour, ensuring quick recovery and minimal disruption in the event of a failure. - Redundancy Levels: The system should include redundancy mechanisms to achieve a redundancy level of N+1, ensuring high availability and fault tolerance. Performance Requirements Quality Attribute Description Common Metrics Capacity Maximum load or volume that the system can handle within specified performance criteria. - Maximum Load Capacity: The system should be capable of handling peak loads without exceeding predefined performance degradation thresholds. Maximum load capacity may be expressed in terms of concurrent users, transactions per second, or data volume. - Resource Utilization: Measures the percentage of system resources (CPU, memory, disk I/O, network bandwidth) consumed under normal operation. - Concurrency: Measures the number of simultaneous users or transactions the system can handle without degradation in performance. - Throughput: Measures the rate at which the system processes transactions, requests, or data. Thresholds may be defined in terms of transactions per second, requests per minute, or data throughput in bytes per second. Performance Define the expected response times, throughput, and resource usage of the solution. - Response time: The application must load and respond to user interactions within 500 ms for button clicks. - Throughput: The application must be able to handle 100 concurrent users or 500 transactions per second. - Resource utilization: The application must use less than 80% of CPU and 1 GB of memory. - Error rates: The application must have an error rate less than 1% of all requests, and be able to handle and recover from errors gracefully, without impacting user experience or data integrity. Scalability Determine how the system will handle increased user loads or larger datasets over time. - Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. - Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. - Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. 
- Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability. Security and Compliance Requirements Quality Attribute Description Common Metrics Compliance Adherence to legal, regulatory, and industry standards and requirements. See Microsoft Purview Compliance Manager Privacy Protection of sensitive information and compliance with privacy regulations. - Compliance with Privacy Regulations: Achieve full compliance with GDPR, CCPA and HIPAA. - Data Anonymization: Implement anonymization techniques in protecting individual privacy while still allowing for data analysis. - Data Encryption: Ensure that sensitive data is encrypted according to encryption standards and best practices. - User Privacy Preferences: The ability to respect and accommodate user privacy preferences regarding data collection, processing, and sharing. Security Establish the security requirements of the system, such as authentication, authorization, encryption, and compliance with industry or legal regulations. See Threat Modeling Tool Sustainability Ability to operate over an extended period while minimizing environmental impact and resource consumption. - Energy Efficiency: Kilowatt-hours/Transaction. - Carbon Footprint: Tons of CO2 emissions per year. System Maintainability Requirements Quality Attribute Description Common Metrics Interoperability Ability to interact and exchange data with other systems or components. - Data Format Compatibility: The system must be interoperable with various Electronic Health Records (HER) systems to exchange patient data securely. - Protocol Compatibility: The system should import and export banking information from the ERP using REST protocol. - API Compatibility: The solution must adhere to API standards, ensuring backward compatibility with previous API versions, and providing comprehensive documentation for developers. Maintainability Ease of modifying, updating, and extending the software over time. - Code Complexity: The level of complexity in the system's codebase, measured using metrics such as cyclomatic complexity or lines of code per function. Lower code complexity makes maintenance tasks easier and reduces the likelihood of introducing defects. A cyclomatic complexity score of less than 10 or a lines of code per function metric below 50 is often desirable. - Code Coverage: The percentage of code covered by automated tests. Higher code coverage indicates better testability and facilitates easier maintenance by enabling faster detection of defects. A code coverage threshold of 80% or higher is commonly targeted. - Documentation Quality: The comprehensiveness and clarity of documentation accompanying the system, including design documents, technical specifications, and user manuals. Well-written documentation reduces the time and effort required for maintenance tasks. Documentation should cover at least 80% of system functionality with clear explanations and examples. - Dependency Management: The management of external dependencies and libraries used in the system. Proper dependency management reduces the risk of compatibility issues and simplifies maintenance tasks such as updates and patches. - Code Churn: The frequency of code changes within a software system. High code churn may indicate instability or frequent updates, making maintenance more challenging. A code churn rate of less than 20% is generally considered acceptable. 
Observability The ability to measure a system's internal state and performance based on the outputs it generates, such as logs, metrics, and traces. -System Metrics: CPU usage, memory usage, disk I/O, network I/O, and other resource utilization metrics. - Application Metrics: Response times, request rates, error rates, and throughput. - Custom Metrics: Application-specific metrics, such as user sign-ups, or specific business logic indicators. Portability Ability to run the software on different platforms, environments, and devices. - Platform Compatibility: The ability of the software to run on different operating systems (e.g., Windows, macOS, Linux) or platforms (e.g., desktop, mobile, web). Portability requires the software to be compatible with multiple platforms, with a goal of supporting at least three major platforms. - Hardware Compatibility: The ability of the software to run on different hardware configurations, such as varying processor architectures (e.g., x86, ARM) or memory sizes. Portability involves ensuring compatibility with a wide range of hardware configurations, with a goal of supporting common hardware architectures. - File System Independence: The software's ability to operate independently of the underlying file system, ensuring compatibility with different file systems (e.g., NTFS, ext4, APFS). Portability involves using file system abstraction layers or APIs to abstract file system operations and ensure consistency across platforms. - Data Format Compatibility: The software's ability to read and write data in different formats, ensuring compatibility with common data interchange formats (e.g., JSON, XML, CSV). Portability involves supporting standard data formats and providing mechanisms for data conversion and interoperability. User Experience Requirements Quality Attribute Description Common Metrics Accessibility The solution must be usable by people with disabilities. Compliance with accessibility standards. Support for assistive technologies - Alternative Text for Images: All images and non-text content must have alternative text descriptions that can be read by screen readers. - Color contrast: The application must use color schemes that meet the recommended contrast ratio between foreground and background colors to ensure visibility for users with low vision. - Focus indicators: The application must provide visible focus indicators to highlight the currently focused element, which is especially important for users who rely on keyboard navigation. - Captions and Transcripts: All audio and video content must have captions and transcripts, to ensure that users with hearing impairments can access the content. - Language identification: The application must correctly identify the language of the content, to ensure that screen readers and other assistive technologies can read the content properly. Internationalization and Localization Adaptation of the software for use in different languages and cultures. Tailoring the software to meet the specific needs of different regions or locales. - Language and Locale Support: The software's support for different languages, character sets, and locales. Portability requires internationalization and localization efforts to ensure that the software can be used effectively in different regions and cultures, with support for at least five major languages. - Multi currency: The system's support for multiple currencies, allowing different symbols and conversion rates. 
Usability Intuitiveness, ease of learning, and user satisfaction with the software interface. - Task Completion Time: The average time it takes for users to complete specific tasks. A user must be able to complete an account settings in less than 2 minutes. - Ease of Navigation: The ease with which users can navigate through the system and find the information they need. This can be measured by observing user interactions or conducting usability tests. - User Satisfaction: User satisfaction can be measured using surveys, feedback forms, or satisfaction ratings. A satisfaction score of 70% or higher is typically considered satisfactory. - Learnability: The ease with which new users can learn to use the system. This can be measured by the time it takes for users to perform basic tasks or by conducting usability tests with novice users.","title":"Non-Functional Requirements Capture"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#non-functional-requirements-capture","text":"","title":"Non-Functional Requirements Capture"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#goals","text":"In software engineering projects, non-functional requirements, also known as quality attributes, are specifications that define the operational attributes of a system rather than its specific behaviors. Unlike functional requirements, which outline what a system should do, non-functional requirements describe how the system performs certain functions under specific conditions. Non-functional requirements generally increase the cost as they require special efforts during the implementation, but by defining these requirements in detail early in the engagement, they can be properly evaluated when the cost of their impact on subsequent design decisions is comparatively low.","title":"Goals"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#documenting-non-functional-requirements-best-practices","text":"Be specific: Avoid ambiguity and make sure the requirement is quantitative, measurable and testable. Relate requirements with business objectives and understand the real impact of the system's behavior. Break it down: Try to define requirements at the component or process scope instead of the whole solution. Understand trade-off: Non-functional requirements may be in conflict with each other and it can be difficult to balance them and prioritize which one to implement.","title":"Documenting Non-Functional Requirements - Best Practices"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#template","text":"This template can serve as a structured framework for capturing and documenting non-functional requirements effectively. Adjustments can be made to tailor it to the specific needs and preferences of the project team. Requirement name: name or title Description: brief description. Describe the importance and impact of this requirement to the business. Priority: High/Medium/Low or Must-have/Nice-to-have, etc Measurement/Metric: metric or measurement criteria Verification Method: Automated test, benchmark, simulation, prototyping, etc. Constraints: Budget, Time, Resources, Infrastructure, etc. 
Owner/Responsible Party Dependencies: technical dependencies, data dependencies, regulatory dependencies, etc.","title":"Template"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#examples","text":"To support the process of capturing a project's comprehensive non-functional requirements, this document offers a taxonomy for non-functional requirements and provides a framework for their identification, exploration, assignment of customer stakeholders, and eventual codification into formal engineering requirements as input to subsequent solution design.","title":"Examples"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#operational-requirements","text":"Quality Attribute Description Common Metrics Availability System's uptime and accessibility to users. - Uptime: Uptime measures the percentage of time that a system is operational and available for use. It is typically expressed as a percentage of total time (e.g., 99.9% uptime means the system is available 99.9% of the time). Common thresholds for uptime include: 99% uptime: The system is available 99% of the time, allowing for approximately 3.65 days of downtime per year. 99.9% uptime (three nines): The system is available 99.9% of the time, allowing for approximately 8.76 hours of downtime per year. 99.99% uptime (four nines): The system is available 99.99% of the time, allowing for approximately 52.56 minutes of downtime per year. 99.999% uptime (five nines): The system is available 99.999% of the time, allowing for approximately 5.26 minutes of downtime per year. Data Integrity Accuracy and consistency of data throughout its lifecycle. - Error Rate: The proportion of data entries that contain errors or inaccuracies. (\\text{Error Rate} = \\left( \\frac{\\text{Number of Errors}}{\\text{Total Number of Entries}} \\right) \\times 100) - Accuracy Rate: The percentage of data entries that are correct and match the source of truth. (\\text{Accuracy Rate} = \\left( \\frac{\\text{Number of Accurate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) - Duplicate Record Rate: The percentage of data entries that are duplicates. (\\text{Duplicate Record Rate} = \\left( \\frac{\\text{Number of Duplicate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) Disaster recovery and business continuity Determine the system's requirements for disaster recovery and business continuity, including backup and recovery procedures and disaster recovery testing. - Backup and Recovery: The application must have a Backup and Recovery plan in place that includes regular backups of all data and configurations, and a process for restoring data and functionality in the event of a disaster or disruption. - Redundancy: The application must have Redundancy built into its infrastructure, such as redundant servers, network devices, and power supplies, to ensure high availability and minimize downtime in the event of a failure. - Failover and high availability: The application must be designed to support Failover and high availability, such as by using load balancers or Failover clusters, to ensure that it can continue to operate in the event of a system failure or disruption. - Disaster Recovery plan: The application must have a comprehensive disaster Recovery plan that includes procedures for restoring data and functionality in the event of a major disaster, such as a natural disaster, cyber attack, or other catastrophic event. 
- Testing and Maintenance: The application must be regularly tested and maintained to ensure that it can withstand a disaster or disruption, and that all systems, processes, and data can be quickly restored and recovered. Reliability System's ability to maintain functionality under varying conditions and failure scenarios. - Mean Time Between Failures (MTBF): The system should achieve an MTBF of at least 1000 hours, indicating a high level of reliability with infrequent failures. - Mean Time to Recover (MTTR): The system should aim for an MTTR of less than 1 hour, ensuring quick recovery and minimal disruption in the event of a failure. - Redundancy Levels: The system should include redundancy mechanisms to achieve a redundancy level of N+1, ensuring high availability and fault tolerance.","title":"Operational Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#performance-requirements","text":"Quality Attribute Description Common Metrics Capacity Maximum load or volume that the system can handle within specified performance criteria. - Maximum Load Capacity: The system should be capable of handling peak loads without exceeding predefined performance degradation thresholds. Maximum load capacity may be expressed in terms of concurrent users, transactions per second, or data volume. - Resource Utilization: Measures the percentage of system resources (CPU, memory, disk I/O, network bandwidth) consumed under normal operation. - Concurrency: Measures the number of simultaneous users or transactions the system can handle without degradation in performance. - Throughput: Measures the rate at which the system processes transactions, requests, or data. Thresholds may be defined in terms of transactions per second, requests per minute, or data throughput in bytes per second. Performance Define the expected response times, throughput, and resource usage of the solution. - Response time: The application must load and respond to user interactions within 500 ms for button clicks. - Throughput: The application must be able to handle 100 concurrent users or 500 transactions per second. - Resource utilization: The application must use less than 80% of CPU and 1 GB of memory. - Error rates: The application must have an error rate less than 1% of all requests, and be able to handle and recover from errors gracefully, without impacting user experience or data integrity. Scalability Determine how the system will handle increased user loads or larger datasets over time. - Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. - Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. - Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. 
- Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability.","title":"Performance Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#security-and-compliance-requirements","text":"Quality Attribute Description Common Metrics Compliance Adherence to legal, regulatory, and industry standards and requirements. See Microsoft Purview Compliance Manager Privacy Protection of sensitive information and compliance with privacy regulations. - Compliance with Privacy Regulations: Achieve full compliance with GDPR, CCPA and HIPAA. - Data Anonymization: Implement anonymization techniques in protecting individual privacy while still allowing for data analysis. - Data Encryption: Ensure that sensitive data is encrypted according to encryption standards and best practices. - User Privacy Preferences: The ability to respect and accommodate user privacy preferences regarding data collection, processing, and sharing. Security Establish the security requirements of the system, such as authentication, authorization, encryption, and compliance with industry or legal regulations. See Threat Modeling Tool Sustainability Ability to operate over an extended period while minimizing environmental impact and resource consumption. - Energy Efficiency: Kilowatt-hours/Transaction. - Carbon Footprint: Tons of CO2 emissions per year.","title":"Security and Compliance Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#system-maintainability-requirements","text":"Quality Attribute Description Common Metrics Interoperability Ability to interact and exchange data with other systems or components. - Data Format Compatibility: The system must be interoperable with various Electronic Health Records (HER) systems to exchange patient data securely. - Protocol Compatibility: The system should import and export banking information from the ERP using REST protocol. - API Compatibility: The solution must adhere to API standards, ensuring backward compatibility with previous API versions, and providing comprehensive documentation for developers. Maintainability Ease of modifying, updating, and extending the software over time. - Code Complexity: The level of complexity in the system's codebase, measured using metrics such as cyclomatic complexity or lines of code per function. Lower code complexity makes maintenance tasks easier and reduces the likelihood of introducing defects. A cyclomatic complexity score of less than 10 or a lines of code per function metric below 50 is often desirable. - Code Coverage: The percentage of code covered by automated tests. Higher code coverage indicates better testability and facilitates easier maintenance by enabling faster detection of defects. A code coverage threshold of 80% or higher is commonly targeted. - Documentation Quality: The comprehensiveness and clarity of documentation accompanying the system, including design documents, technical specifications, and user manuals. Well-written documentation reduces the time and effort required for maintenance tasks. Documentation should cover at least 80% of system functionality with clear explanations and examples. - Dependency Management: The management of external dependencies and libraries used in the system. Proper dependency management reduces the risk of compatibility issues and simplifies maintenance tasks such as updates and patches. 
- Code Churn: The frequency of code changes within a software system. High code churn may indicate instability or frequent updates, making maintenance more challenging. A code churn rate of less than 20% is generally considered acceptable. Observability The ability to measure a system's internal state and performance based on the outputs it generates, such as logs, metrics, and traces. -System Metrics: CPU usage, memory usage, disk I/O, network I/O, and other resource utilization metrics. - Application Metrics: Response times, request rates, error rates, and throughput. - Custom Metrics: Application-specific metrics, such as user sign-ups, or specific business logic indicators. Portability Ability to run the software on different platforms, environments, and devices. - Platform Compatibility: The ability of the software to run on different operating systems (e.g., Windows, macOS, Linux) or platforms (e.g., desktop, mobile, web). Portability requires the software to be compatible with multiple platforms, with a goal of supporting at least three major platforms. - Hardware Compatibility: The ability of the software to run on different hardware configurations, such as varying processor architectures (e.g., x86, ARM) or memory sizes. Portability involves ensuring compatibility with a wide range of hardware configurations, with a goal of supporting common hardware architectures. - File System Independence: The software's ability to operate independently of the underlying file system, ensuring compatibility with different file systems (e.g., NTFS, ext4, APFS). Portability involves using file system abstraction layers or APIs to abstract file system operations and ensure consistency across platforms. - Data Format Compatibility: The software's ability to read and write data in different formats, ensuring compatibility with common data interchange formats (e.g., JSON, XML, CSV). Portability involves supporting standard data formats and providing mechanisms for data conversion and interoperability.","title":"System Maintainability Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#user-experience-requirements","text":"Quality Attribute Description Common Metrics Accessibility The solution must be usable by people with disabilities. Compliance with accessibility standards. Support for assistive technologies - Alternative Text for Images: All images and non-text content must have alternative text descriptions that can be read by screen readers. - Color contrast: The application must use color schemes that meet the recommended contrast ratio between foreground and background colors to ensure visibility for users with low vision. - Focus indicators: The application must provide visible focus indicators to highlight the currently focused element, which is especially important for users who rely on keyboard navigation. - Captions and Transcripts: All audio and video content must have captions and transcripts, to ensure that users with hearing impairments can access the content. - Language identification: The application must correctly identify the language of the content, to ensure that screen readers and other assistive technologies can read the content properly. Internationalization and Localization Adaptation of the software for use in different languages and cultures. Tailoring the software to meet the specific needs of different regions or locales. - Language and Locale Support: The software's support for different languages, character sets, and locales. 
Portability requires internationalization and localization efforts to ensure that the software can be used effectively in different regions and cultures, with support for at least five major languages. - Multi currency: The system's support for multiple currencies, allowing different symbols and conversion rates. Usability Intuitiveness, ease of learning, and user satisfaction with the software interface. - Task Completion Time: The average time it takes for users to complete specific tasks. A user must be able to complete an account settings in less than 2 minutes. - Ease of Navigation: The ease with which users can navigate through the system and find the information they need. This can be measured by observing user interactions or conducting usability tests. - User Satisfaction: User satisfaction can be measured using surveys, feedback forms, or satisfaction ratings. A satisfaction score of 70% or higher is typically considered satisfactory. - Learnability: The ease with which new users can learn to use the system. This can be measured by the time it takes for users to perform basic tasks or by conducting usability tests with novice users.","title":"User Experience Requirements"},{"location":"design/design-patterns/object-oriented-design-reference/","text":"Object-Oriented Design Reference When writing software for large projects, the hardest part is often communication and maintenance. Following proven design patterns can optimize for maintenance, readability, and ease of extension. In particular, object-oriented design patterns are well-established in the industry. Please refer to the following resources to create strong object-oriented designs: Design Patterns Wikipedia Object Oriented Design Website","title":"Object-Oriented Design Reference"},{"location":"design/design-patterns/object-oriented-design-reference/#object-oriented-design-reference","text":"When writing software for large projects, the hardest part is often communication and maintenance. Following proven design patterns can optimize for maintenance, readability, and ease of extension. In particular, object-oriented design patterns are well-established in the industry. Please refer to the following resources to create strong object-oriented designs: Design Patterns Wikipedia Object Oriented Design Website","title":"Object-Oriented Design Reference"},{"location":"design/design-patterns/rest-api-design-guidance/","text":"REST API Design Guidance Goals Elevate Microsoft's published REST API design guidelines . Highlight common design decisions and factors to consider when designing. Provide additional resources to inform API design in areas not directly addressed by the Microsoft guidelines. Common API Design Decisions The Microsoft REST API guidelines provide design guidance covering a multitude of use-cases. The following sections are a good place to start as they are likely required considerations by any REST API design: URL Structure HTTP Methods HTTP Status Codes Collections JSON Standardizations Versioning Naming Creating API Contracts As different development teams expose APIs to access various REST based services, it's important to have an API contract to share between the producer and consumers of APIs. Open API format is one of the most popular API description format. This Open API document can be produced in two ways: Design-First - Team starts developing APIs by first describing API designs as an Open API document and later generates server side boilerplate code with the help of this document. 
Code-First - Team starts writing the server side API interface code e.g. controllers, DTOs etc. and later generates an Open API document from it. Design-First Approach A Design-First approach means that APIs are treated as \"first-class citizens\" and everything about a project revolves around the idea that at the end these APIs will be consumed by clients. So, based on the business requirements, the API development team first starts describing API designs as an Open API document and collaborates with the stakeholders to gather feedback. This approach is quite useful if a project is about developing an externally exposed set of APIs which will be consumed by partners. In this approach, we first agree upon an API contract (Open API document), creating clear expectations on both the API producer and consumer sides so both teams can begin work in parallel as per the pre-agreed API design. Key Benefits of this approach: Early API design feedback. Clearly established expectations for both consumer & producer, as both have agreed upon an API contract. Development teams can work in parallel. The testing team can use API contracts to write early tests even before the business logic is in place. By looking at the different models, paths, attributes and other aspects of the API, testers can provide input which can be very valuable. During an agile development cycle, API definitions are not impacted by incremental dev changes. API design is not influenced by actual implementation limitations & code structure. Server side boilerplate code e.g. controllers, DTOs etc. can be auto generated from API contracts. May improve collaboration between the API producer & consumer teams. Planning a Design-First Development: Identify use cases & key services which the API should offer. Identify key stakeholders of the API and try to include them during the API design phase to get continuous feedback. Write API contract definitions. Maintain a consistent style for API status codes, versioning, error responses etc. Encourage peer reviews via pull requests. Generate server side boilerplate code & client SDKs from the API contract definitions. Important Points to consider: If API requirements change often during the initial development phase, then a Design-First approach may not be a good fit, as this will introduce additional overhead, requiring repeated updates & maintenance of the API contract definitions. It might be worthwhile to first try out your platform specific code generator and evaluate how much additional work will be required in order to meet your project requirements and coding guidelines, because it is possible that a particular platform specific code generator might not be able to generate a flexible & maintainable implementation of the actual code. For instance, if your web framework requires annotations to be present on your controller classes (e.g. for API versioning or authentication), make sure that the code generation tool you use fully supports them. Microsoft TypeSpec is a valuable tool for developers who are working on complex APIs. By providing reusable patterns it can streamline API development and promote best practices. We have put together some samples about how to enforce an API design-first approach in a GitHub CI/CD pipeline to help accelerate its adoption in Design-First Development. Code-First Approach A Code-First approach means that development teams first implement the server side API interface code, e.g. controllers, DTOs etc., and then generate API contract definitions from it.
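To make the contrast concrete, below is a minimal, hypothetical code-first sketch (FastAPI and Pydantic are used purely as an illustration; the guidance here is framework-agnostic). The Open API document is derived from the implemented routes and models rather than authored up front:

```python
# Hypothetical code-first example: the API contract is generated from the code below,
# not written by hand first.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Orders API", version="1.0.0")

class Order(BaseModel):
    id: int
    description: str

@app.get("/orders/{order_id}", response_model=Order)
def get_order(order_id: int) -> Order:
    # Placeholder implementation; a real service would query a data store.
    return Order(id=order_id, description="sample")

if __name__ == "__main__":
    # Export the generated Open API document so it can be reviewed or stored in version control.
    import json
    print(json.dumps(app.openapi(), indent=2))
```

In a design-first flow the direction is reversed: the contract file is authored and reviewed first, and server-side stubs and client SDKs are generated from it.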
Currently, this approach is more popular within the developer community than the Design-First approach. This approach has the advantages of allowing the team to quickly implement APIs and also providing the flexibility to react very quickly to any unexpected API requirement changes. Key Benefits of this approach: Rapid development of APIs, as the development team can start implementing APIs much faster, directly after understanding the key requirements & use cases. The development team has better control & flexibility to implement server side API interfaces in a way which is best suited to the project structure. More popular among development teams, so it's easier to get consensus on a related topic, and there are more ready-to-use code examples available on various blogs or developer forums regarding how to generate Open API definitions from actual code. During the initial phase of development, where both API producer & consumer requirements might change often, this approach is better as it provides the flexibility to react quickly to such changes. Important Points to consider: A generated Open API definition can become outdated, so it's important to have automated checks to avoid this; otherwise, generated client SDKs will be out of sync and may cause issues for API consumers. With Agile development, it is hard to ensure that definitions embedded in runtime code remain stable, especially across rounds of refactoring and when serving multiple concurrent API versions. It might be useful to regularly generate the Open API definition and store it in a version control system; otherwise, generating the OpenAPI definition only at runtime makes it more complex in scenarios where that definition is required at development/CI time. How to Interpret and Apply the Guidelines The API guidelines document includes a section on how to apply the guidelines depending on whether the API is new or existing. In particular, when working in an existing API ecosystem, be sure to align with stakeholders on a definition of what constitutes a breaking change to understand the impact of implementing certain best practices. We do not recommend making a breaking change to a service that predates these guidelines simply for the sake of compliance. Resources Microsoft's Recommended Reading List for REST APIs Documentation - Guidance - REST APIs Detailed HTTP status code definitions Semantic Versioning Other Public API Guidelines Microsoft TypeSpec Microsoft TypeSpec GitHub Workflow samples","title":"REST API Design Guidance"},{"location":"design/design-patterns/rest-api-design-guidance/#rest-api-design-guidance","text":"","title":"REST API Design Guidance"},{"location":"design/design-patterns/rest-api-design-guidance/#goals","text":"Elevate Microsoft's published REST API design guidelines . Highlight common design decisions and factors to consider when designing. Provide additional resources to inform API design in areas not directly addressed by the Microsoft guidelines.","title":"Goals"},{"location":"design/design-patterns/rest-api-design-guidance/#common-api-design-decisions","text":"The Microsoft REST API guidelines provide design guidance covering a multitude of use-cases.
The following sections are a good place to start as they are likely required considerations by any REST API design: URL Structure HTTP Methods HTTP Status Codes Collections JSON Standardizations Versioning Naming","title":"Common API Design Decisions"},{"location":"design/design-patterns/rest-api-design-guidance/#creating-api-contracts","text":"As different development teams expose APIs to access various REST based services, it's important to have an API contract to share between the producer and consumers of APIs. The Open API format is one of the most popular API description formats. This Open API document can be produced in two ways: Design-First - Team starts developing APIs by first describing API designs as an Open API document and later generates server side boilerplate code with the help of this document. Code-First - Team starts writing the server side API interface code e.g. controllers, DTOs etc. and later generates an Open API document from it.","title":"Creating API Contracts"},{"location":"design/design-patterns/rest-api-design-guidance/#design-first-approach","text":"A Design-First approach means that APIs are treated as \"first-class citizens\" and everything about a project revolves around the idea that at the end these APIs will be consumed by clients. So, based on the business requirements, the API development team first starts describing API designs as an Open API document and collaborates with the stakeholders to gather feedback. This approach is quite useful if a project is about developing an externally exposed set of APIs which will be consumed by partners. In this approach, we first agree upon an API contract (Open API document), creating clear expectations on both the API producer and consumer sides so both teams can begin work in parallel as per the pre-agreed API design. Key Benefits of this approach: Early API design feedback. Clearly established expectations for both consumer & producer, as both have agreed upon an API contract. Development teams can work in parallel. The testing team can use API contracts to write early tests even before the business logic is in place. By looking at the different models, paths, attributes and other aspects of the API, testers can provide input which can be very valuable. During an agile development cycle, API definitions are not impacted by incremental dev changes. API design is not influenced by actual implementation limitations & code structure. Server side boilerplate code e.g. controllers, DTOs etc. can be auto generated from API contracts. May improve collaboration between the API producer & consumer teams. Planning a Design-First Development: Identify use cases & key services which the API should offer. Identify key stakeholders of the API and try to include them during the API design phase to get continuous feedback. Write API contract definitions. Maintain a consistent style for API status codes, versioning, error responses etc. Encourage peer reviews via pull requests. Generate server side boilerplate code & client SDKs from the API contract definitions. Important Points to consider: If API requirements change often during the initial development phase, then a Design-First approach may not be a good fit, as this will introduce additional overhead, requiring repeated updates & maintenance of the API contract definitions. It might be worthwhile to first try out your platform specific code generator and evaluate how much additional work will be required in order to meet your project requirements and coding guidelines, because it is possible that a particular platform specific code generator might not be able to generate a flexible & maintainable implementation of the actual code. For instance, if your web framework requires annotations to be present on your controller classes (e.g. for API versioning or authentication), make sure that the code generation tool you use fully supports them. Microsoft TypeSpec is a valuable tool for developers who are working on complex APIs. By providing reusable patterns it can streamline API development and promote best practices. We have put together some samples about how to enforce an API design-first approach in a GitHub CI/CD pipeline to help accelerate its adoption in Design-First Development.","title":"Design-First Approach"},{"location":"design/design-patterns/rest-api-design-guidance/#code-first-approach","text":"A Code-First approach means that development teams first implement the server side API interface code, e.g. controllers, DTOs etc., and then generate API contract definitions from it. Currently, this approach is more popular within the developer community than the Design-First approach. This approach has the advantages of allowing the team to quickly implement APIs and also providing the flexibility to react very quickly to any unexpected API requirement changes. Key Benefits of this approach: Rapid development of APIs, as the development team can start implementing APIs much faster, directly after understanding the key requirements & use cases. The development team has better control & flexibility to implement server side API interfaces in a way which is best suited to the project structure. More popular among development teams, so it's easier to get consensus on a related topic, and there are more ready-to-use code examples available on various blogs or developer forums regarding how to generate Open API definitions from actual code. During the initial phase of development, where both API producer & consumer requirements might change often, this approach is better as it provides the flexibility to react quickly to such changes. Important Points to consider: A generated Open API definition can become outdated, so it's important to have automated checks to avoid this; otherwise, generated client SDKs will be out of sync and may cause issues for API consumers. With Agile development, it is hard to ensure that definitions embedded in runtime code remain stable, especially across rounds of refactoring and when serving multiple concurrent API versions. It might be useful to regularly generate the Open API definition and store it in a version control system; otherwise, generating the OpenAPI definition only at runtime makes it more complex in scenarios where that definition is required at development/CI time.","title":"Code-First Approach"},{"location":"design/design-patterns/rest-api-design-guidance/#how-to-interpret-and-apply-the-guidelines","text":"The API guidelines document includes a section on how to apply the guidelines depending on whether the API is new or existing. In particular, when working in an existing API ecosystem, be sure to align with stakeholders on a definition of what constitutes a breaking change to understand the impact of implementing certain best practices.
We do not recommend making a breaking change to a service that predates these guidelines simply for the sake of compliance.","title":"How to Interpret and Apply the Guidelines"},{"location":"design/design-patterns/rest-api-design-guidance/#resources","text":"Microsoft's Recommended Reading List for REST APIs Documentation - Guidance - REST APIs Detailed HTTP status code definitions Semantic Versioning Other Public API Guidelines Microsoft TypeSpec Microsoft TypeSpec GitHub Workflow samples","title":"Resources"},{"location":"design/design-reviews/","text":"Design Reviews Goals Reduce technical debt for our customers Continue to iterate on design after Game Plan review Generate useful technical artifacts that can be referenced by Microsoft and customers Measures Cost of Change When incorporating design reviews as part of the engineering process, decisions are front-loaded before implementation begins. Making a decision of using Azure Kubernetes Service instead of App Services at the design phase likely only requires updating documentation. However, making this pivot after implementation has started or after a solution is in use is much more costly. Are these changes occurring before or after implementation? How large of effort are they typically? Reviewer Participation How many individuals participate across the designs created? Cumulatively if this is a larger number this would indicate a wider contribution of ideas and perspectives. A lower number (i.e. same 2 individuals only on every review) might indicate a limited set of perspectives. Is anyone participating from outside the core development team? Time To Potential Solutions How long does it typically take to go from requirements to solution options (multiple)? There is a healthy balancing act between spending too much or too little time evaluating different potential solutions. Too little time puts higher risk of costly changes required after implementation. Too much time delays target value from being delivered; as well as subsequent features in queue. However, the faster the team can identify the most critical information necessary to make an informed decision , the faster value can be provided with lower risk of costly changes down the road. Time to Decisions How long does it take to make a decision on which solution to implement? There is also a healthy balancing act in supporting a healthy debate while not hindering the team's delivery. The ideal case is for a team to quickly digest the solution options presented, ask questions, and debate before finally reaching quorum on a particular approach. In cases where no quorum can be reached, the person with the most context on the problem (typically story owner) should make the final decision. Prioritize delivering value and learning. Disagree and commit! Impact Solutions can be quickly be operated into customer's production environment Easier for other dev crews to leverage your teams work Easier for engineers to ramp up on projects Increase team velocity by front-loading changes and decisions when they cost the least Increased team engagement and transparency by soliciting wide reviewer participation Participation Dev Crew The dev crew should always participate in all design review sessions Domain Experts Domain experts should participate in design review sessions as needed ISE Tech Domains Customer subject-matter experts (SME) Senior Leadership Facilitation Guidance Recipes Please see our Design Review Recipes for guidance on design process. 
Sync Design Reviews via In-Person / Virtual Meetings Joint meetings with dev crew, subject-matter experts (SMEs) and customer engineers Async Design Reviews via Pull-Requests See the async design review recipe for guidance on facilitating async design reviews. This can be useful for teams that are geographically distributed across different time-zones. Technical Spike A technical spike is most often used for evaluating the impact new technology has on the current implementation. Please read more here . Design Documentation Document and update the architecture design in the project design documentation Track and document design decisions in a decision log Document decision process in trade studies when multiple solutions exist for the given problem Early on in engagements, the team must decide where to land artifacts generated from design reviews. Typically, we meet the customer where they are at (for example, using their Confluence instance to land documentation if that is their preferred process). However, similar to storing decision logs, trade studies, etc. in the development repo, there are also large benefits to maintaining design review artifacts in the repo as well. Usually these artifacts can be further added to root level documentation directory or even at the root of the corresponding project if the repo is monolithic. In adding them to the project repo, these artifacts must similarly be reviewed in Pull Requests (typically preceding but sometimes accompanying implementation) which allows async review/discussion. Furthermore, artifacts can then easily link to other sections of the repo and source code files (via markdown links ).","title":"Design Reviews"},{"location":"design/design-reviews/#design-reviews","text":"","title":"Design Reviews"},{"location":"design/design-reviews/#goals","text":"Reduce technical debt for our customers Continue to iterate on design after Game Plan review Generate useful technical artifacts that can be referenced by Microsoft and customers","title":"Goals"},{"location":"design/design-reviews/#measures","text":"","title":"Measures"},{"location":"design/design-reviews/#cost-of-change","text":"When incorporating design reviews as part of the engineering process, decisions are front-loaded before implementation begins. Making a decision of using Azure Kubernetes Service instead of App Services at the design phase likely only requires updating documentation. However, making this pivot after implementation has started or after a solution is in use is much more costly. Are these changes occurring before or after implementation? How large of effort are they typically?","title":"Cost of Change"},{"location":"design/design-reviews/#reviewer-participation","text":"How many individuals participate across the designs created? Cumulatively if this is a larger number this would indicate a wider contribution of ideas and perspectives. A lower number (i.e. same 2 individuals only on every review) might indicate a limited set of perspectives. Is anyone participating from outside the core development team?","title":"Reviewer Participation"},{"location":"design/design-reviews/#time-to-potential-solutions","text":"How long does it typically take to go from requirements to solution options (multiple)? There is a healthy balancing act between spending too much or too little time evaluating different potential solutions. Too little time puts higher risk of costly changes required after implementation. 
Too much time delays target value from being delivered; as well as subsequent features in queue. However, the faster the team can identify the most critical information necessary to make an informed decision , the faster value can be provided with lower risk of costly changes down the road.","title":"Time To Potential Solutions"},{"location":"design/design-reviews/#time-to-decisions","text":"How long does it take to make a decision on which solution to implement? There is also a healthy balancing act in supporting a healthy debate while not hindering the team's delivery. The ideal case is for a team to quickly digest the solution options presented, ask questions, and debate before finally reaching quorum on a particular approach. In cases where no quorum can be reached, the person with the most context on the problem (typically story owner) should make the final decision. Prioritize delivering value and learning. Disagree and commit!","title":"Time to Decisions"},{"location":"design/design-reviews/#impact","text":"Solutions can be quickly be operated into customer's production environment Easier for other dev crews to leverage your teams work Easier for engineers to ramp up on projects Increase team velocity by front-loading changes and decisions when they cost the least Increased team engagement and transparency by soliciting wide reviewer participation","title":"Impact"},{"location":"design/design-reviews/#participation","text":"","title":"Participation"},{"location":"design/design-reviews/#dev-crew","text":"The dev crew should always participate in all design review sessions","title":"Dev Crew"},{"location":"design/design-reviews/#domain-experts","text":"Domain experts should participate in design review sessions as needed ISE Tech Domains Customer subject-matter experts (SME) Senior Leadership","title":"Domain Experts"},{"location":"design/design-reviews/#facilitation-guidance","text":"","title":"Facilitation Guidance"},{"location":"design/design-reviews/#recipes","text":"Please see our Design Review Recipes for guidance on design process.","title":"Recipes"},{"location":"design/design-reviews/#sync-design-reviews-via-in-person-virtual-meetings","text":"Joint meetings with dev crew, subject-matter experts (SMEs) and customer engineers","title":"Sync Design Reviews via In-Person / Virtual Meetings"},{"location":"design/design-reviews/#async-design-reviews-via-pull-requests","text":"See the async design review recipe for guidance on facilitating async design reviews. This can be useful for teams that are geographically distributed across different time-zones.","title":"Async Design Reviews via Pull-Requests"},{"location":"design/design-reviews/#technical-spike","text":"A technical spike is most often used for evaluating the impact new technology has on the current implementation. Please read more here .","title":"Technical Spike"},{"location":"design/design-reviews/#design-documentation","text":"Document and update the architecture design in the project design documentation Track and document design decisions in a decision log Document decision process in trade studies when multiple solutions exist for the given problem Early on in engagements, the team must decide where to land artifacts generated from design reviews. Typically, we meet the customer where they are at (for example, using their Confluence instance to land documentation if that is their preferred process). However, similar to storing decision logs, trade studies, etc. 
in the development repo, there are also large benefits to maintaining design review artifacts in the repo as well. Usually these artifacts can be further added to root level documentation directory or even at the root of the corresponding project if the repo is monolithic. In adding them to the project repo, these artifacts must similarly be reviewed in Pull Requests (typically preceding but sometimes accompanying implementation) which allows async review/discussion. Furthermore, artifacts can then easily link to other sections of the repo and source code files (via markdown links ).","title":"Design Documentation"},{"location":"design/design-reviews/decision-log/","text":"Design Decision Log Not all requirements can be captured in the beginning of an agile project during one or more design sessions. The initial architecture design can evolve or change during the project, especially if there are multiple possible technology choices that can be made. Tracking these changes within a large document is in most cases not ideal, as one can lose oversight over the design changes made at which point in time. Having to scan through a large document to find a specific content takes time, and in many cases the consequences of a decision is not documented. Why is it Important to Track Design Decisions Tracking an architecture design decision can have many advantages: Developers and project stakeholders can see the decision log and track the changes, even as the team composition changes over time. The log is kept up-to-date. The context of a decision including the consequences for the team are documented with the decision. It is easier to find the design decision in a log than having to read a large document. What is a Recommended Format for Tracking Decisions In addition to incorporating a design decision as an update of the overall design documentation of the project, the decisions could be tracked as Architecture Decision Records as Michael Nygard proposed in his blog. The effort invested in design reviews and discussions can be different throughout the course of a project. Sometimes decisions are made quickly without having to go into a detailed comparison of competing technologies. In some cases, it is necessary to have a more elaborate study of advantages and disadvantages, as is described in the documentation of Trade Studies . In other cases, it can be helpful to conduct Engineering Feasibility Spikes . An ADR can incorporate each of these different approaches. Architecture Decision Record (ADR) An architecture decision record has the structure [Ascending number]. [Title of decision] The title should give the reader the information on what was decided upon. Example: 001. App level logging with Serilog and Application Insights Hint: When several developers regularly start ADRs in parallel, it becomes difficult to deal with conflicting ascending numbers. An easy way to overcome this is to give ADRs the ID of the work item they relate to. Date: The date the decision was made. Status: [Proposed/Accepted/Deprecated/Superseded] A proposed design can be reviewed by the development team prior to accepting it. A previous decision can be superseded by a new one, or the ADR record marked as deprecated in case it is not valid anymore. Context: The text should provide the reader an understanding of the problem, or as Michael Nygard puts it, a value-neutral [an objective] description of the forces at play. 
Example: Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics. If the development team had a data-driven approach to back the decision, i.e., a study that evaluates the potential choices against a set of objective criteria by following the guidance in Trade Studies , the study should be referred to in this section. Decision: The decision made, it should begin with 'We will...' or 'We have agreed to ... Example: We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis. Consequences: The resulting context, after having applied the decision. Example: Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling. Where to Store ADRs ADRs can be stored and tracked in any version control system such as git. As a recommended practice, ADRs can be added as pull request in the proposed status to be discussed by the team until it is updated to accepted to be merged with the main branch. They are usually stored in a folder structure doc/adr or doc/arch . Additionally, it can be useful to track ADRs in a decision-log.md to provide useful metadata in an obvious format. Decision Logs A decision log is a Markdown file containing a table which provides executive summaries of the decisions contained in ADRs, as well as some other metadata. You can see a template table at doc/decision-log.md . When to Track ADRs Architecture design decisions are usually tracked whenever significant decisions are made that affect the structure and characteristics of the solution or framework we are building. ADRs can also be used to document results of spikes when evaluating different technology choices. Examples of ADRs The first ADR could be the decision to use ADRs to track design decisions, 0001-record-architecture-decisions.md , followed by actual decisions in the engagement as in the example used above, 0002-app-level-logging.md .","title":"Design Decision Log"},{"location":"design/design-reviews/decision-log/#design-decision-log","text":"Not all requirements can be captured in the beginning of an agile project during one or more design sessions. The initial architecture design can evolve or change during the project, especially if there are multiple possible technology choices that can be made. Tracking these changes within a large document is in most cases not ideal, as one can lose oversight over the design changes made at which point in time. 
Having to scan through a large document to find a specific content takes time, and in many cases the consequences of a decision is not documented.","title":"Design Decision Log"},{"location":"design/design-reviews/decision-log/#why-is-it-important-to-track-design-decisions","text":"Tracking an architecture design decision can have many advantages: Developers and project stakeholders can see the decision log and track the changes, even as the team composition changes over time. The log is kept up-to-date. The context of a decision including the consequences for the team are documented with the decision. It is easier to find the design decision in a log than having to read a large document.","title":"Why is it Important to Track Design Decisions"},{"location":"design/design-reviews/decision-log/#what-is-a-recommended-format-for-tracking-decisions","text":"In addition to incorporating a design decision as an update of the overall design documentation of the project, the decisions could be tracked as Architecture Decision Records as Michael Nygard proposed in his blog. The effort invested in design reviews and discussions can be different throughout the course of a project. Sometimes decisions are made quickly without having to go into a detailed comparison of competing technologies. In some cases, it is necessary to have a more elaborate study of advantages and disadvantages, as is described in the documentation of Trade Studies . In other cases, it can be helpful to conduct Engineering Feasibility Spikes . An ADR can incorporate each of these different approaches.","title":"What is a Recommended Format for Tracking Decisions"},{"location":"design/design-reviews/decision-log/#architecture-decision-record-adr","text":"An architecture decision record has the structure [Ascending number]. [Title of decision] The title should give the reader the information on what was decided upon. Example: 001. App level logging with Serilog and Application Insights Hint: When several developers regularly start ADRs in parallel, it becomes difficult to deal with conflicting ascending numbers. An easy way to overcome this is to give ADRs the ID of the work item they relate to. Date: The date the decision was made. Status: [Proposed/Accepted/Deprecated/Superseded] A proposed design can be reviewed by the development team prior to accepting it. A previous decision can be superseded by a new one, or the ADR record marked as deprecated in case it is not valid anymore. Context: The text should provide the reader an understanding of the problem, or as Michael Nygard puts it, a value-neutral [an objective] description of the forces at play. Example: Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics. If the development team had a data-driven approach to back the decision, i.e., a study that evaluates the potential choices against a set of objective criteria by following the guidance in Trade Studies , the study should be referred to in this section. Decision: The decision made, it should begin with 'We will...' or 'We have agreed to ... 
Example: We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis. Consequences: The resulting context, after having applied the decision. Example: Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling.","title":"Architecture Decision Record (ADR)"},{"location":"design/design-reviews/decision-log/#where-to-store-adrs","text":"ADRs can be stored and tracked in any version control system such as git. As a recommended practice, ADRs can be added as pull request in the proposed status to be discussed by the team until it is updated to accepted to be merged with the main branch. They are usually stored in a folder structure doc/adr or doc/arch . Additionally, it can be useful to track ADRs in a decision-log.md to provide useful metadata in an obvious format.","title":"Where to Store ADRs"},{"location":"design/design-reviews/decision-log/#decision-logs","text":"A decision log is a Markdown file containing a table which provides executive summaries of the decisions contained in ADRs, as well as some other metadata. You can see a template table at doc/decision-log.md .","title":"Decision Logs"},{"location":"design/design-reviews/decision-log/#when-to-track-adrs","text":"Architecture design decisions are usually tracked whenever significant decisions are made that affect the structure and characteristics of the solution or framework we are building. ADRs can also be used to document results of spikes when evaluating different technology choices.","title":"When to Track ADRs"},{"location":"design/design-reviews/decision-log/#examples-of-adrs","text":"The first ADR could be the decision to use ADRs to track design decisions, 0001-record-architecture-decisions.md , followed by actual decisions in the engagement as in the example used above, 0002-app-level-logging.md .","title":"Examples of ADRs"},{"location":"design/design-reviews/decision-log/doc/decision-log/","text":"Decision Log This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required A one-sentence summary of the decision made. Date the decision was made. A list of the other approaches considered. A two to three sentence summary of why the decision was made. A link to the ADR with the format [Title] DR. Who made this decision? A link to the work item for the linked ADR.","title":"Decision Log"},{"location":"design/design-reviews/decision-log/doc/decision-log/#decision-log","text":"This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required A one-sentence summary of the decision made. Date the decision was made. A list of the other approaches considered. A two to three sentence summary of why the decision was made. A link to the ADR with the format [Title] DR. Who made this decision? 
A link to the work item for the linked ADR.","title":"Decision Log"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/","text":"1. Record architecture decisions Date: 2020-03-20 Status Accepted Context We need to record the architectural decisions made on this project. Decision We will use Architecture Decision Records, as described by Michael Nygard . Consequences See Michael Nygard's article, linked above. For a lightweight ADR tool set, see Nat Pryce's adr-tools .","title":"1. Record architecture decisions"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#1-record-architecture-decisions","text":"Date: 2020-03-20","title":"1. Record architecture decisions"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#status","text":"Accepted","title":"Status"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#context","text":"We need to record the architectural decisions made on this project.","title":"Context"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#decision","text":"We will use Architecture Decision Records, as described by Michael Nygard .","title":"Decision"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#consequences","text":"See Michael Nygard's article, linked above. For a lightweight ADR tool set, see Nat Pryce's adr-tools .","title":"Consequences"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/","text":"2. App-level Logging with Serilog and Application Insights Date: 2020-04-08 Status Accepted Context Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics. Decision We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis. Consequences Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling.","title":"2. App-level Logging with Serilog and Application Insights"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#2-app-level-logging-with-serilog-and-application-insights","text":"Date: 2020-04-08","title":"2. App-level Logging with Serilog and Application Insights"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#status","text":"Accepted","title":"Status"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#context","text":"Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. 
A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics.","title":"Context"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#decision","text":"We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis.","title":"Decision"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#consequences","text":"Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling.","title":"Consequences"},{"location":"design/design-reviews/decision-log/examples/memory/","text":"Memory These examples were taken from the Memory project, an internal tool for tracking an individual's accomplishments. The main example here is the Decision Log . Since this log was used from the start, the decisions are mostly based on technology choices made in the start of the project. All line items have a link out to the trade studies done for each technology choice.","title":"Memory"},{"location":"design/design-reviews/decision-log/examples/memory/#memory","text":"These examples were taken from the Memory project, an internal tool for tracking an individual's accomplishments. The main example here is the Decision Log . Since this log was used from the start, the decisions are mostly based on technology choices made in the start of the project. All line items have a link out to the trade studies done for each technology choice.","title":"Memory"},{"location":"design/design-reviews/decision-log/examples/memory/decision-log/","text":"Decision Log This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required Use Architecture Decision Records 01/25/2021 Standard Design Docs An easy and low cost solution of tracking architecture decisions over the lifetime of a project Record Architecture Decisions Dev Team #21654 Use ArgoCD 01/26/2021 FluxCD ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. So we have decided at this point to go with ArgoCD GitOps Trade Study Dev Team #21672 Use Helm 01/28/2021 Kustomize, Kubes, Gitkube, Draft Platform maturity, templating, ArgoCD support K8s Package Manager Trade Study Dev Team #21674 Use CosmosDB 01/29/2021 Blob Storage, CosmosDB, SQL Server, Neo4j, JanusGraph, ArangoDB CosmosDB has better Azure integration, managed identity, and the Gremlin API is powerful. 
Graph Storage Trade Study and Decision Dev Team #21650 Use Azure Traffic Manager 02/02/2021 Azure Front Door A lightweight solution to route traffic between multiple k8s regional clusters Routing Trade Study Dev Team #21673 Use Linkerd + Contour 02/02/2021 Istio, Consul, Ambassador, Traefik A CNCF backed cloud native k8s stack to deliver service mesh, API gateway and ingress Routing Trade Study Dev Team #21673 Use ARM Templates 02/02/2021 Terraform, Pulumi, Az CLI Azure Native, Az Monitoring and incremental updates support Automated Deployment Trade Study Dev Team #21651 Use 99designs/gqlgen 02/04/2021 graphql, graphql-go, thunder Type safety, auto-registration and code generation GraphQL Golang Trade Study Dev Team #21775 Create normalized role data model 03/25/2021 Career Stage Profiles (CSP), Microsoft Role Library Requires a data model that supports the data requirements of both role systems Role Data Model Schema Dev Team #22035 Design for edges and vertices 03/25/2021 N/A N/A Data Model Dev Team #21976 Use grammes 03/29/2021 Gremlin, gremgo, gremcos Balance of documentation and maturity Gremlin API library Trade Study Dev Team #21870 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Expose 1:1 data model from API to DB 04/02/2021 Exposing a minified version of data model contract Team decided that there were no pieces of data that we can rule out as being useful. Will update if data model becomes too complex API README Dev Team #21658 Deprecate SonarCloud 04/05/2021 Checkstyle, PMD, FindBugs Requires paid plan to use in a private repo Code Quality & Security Dev Team #22090 Adopted Stable Tagging Strategy 04/08/2021 N/A Team aligned on the proposed docker container tagging strategy Tagging Strategy Dev Team #22005","title":"Decision Log"},{"location":"design/design-reviews/decision-log/examples/memory/decision-log/#decision-log","text":"This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required Use Architecture Decision Records 01/25/2021 Standard Design Docs An easy and low-cost solution for tracking architecture decisions over the lifetime of a project Record Architecture Decisions Dev Team #21654 Use ArgoCD 01/26/2021 FluxCD ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. So we have decided at this point to go with ArgoCD GitOps Trade Study Dev Team #21672 Use Helm 01/28/2021 Kustomize, Kubes, Gitkube, Draft Platform maturity, templating, ArgoCD support K8s Package Manager Trade Study Dev Team #21674 Use CosmosDB 01/29/2021 Blob Storage, CosmosDB, SQL Server, Neo4j, JanusGraph, ArangoDB CosmosDB has better Azure integration, managed identity, and the Gremlin API is powerful. 
Graph Storage Trade Study and Decision Dev Team #21650 Use Azure Traffic Manager 02/02/2021 Azure Front Door A lightweight solution to route traffic between multiple k8s regional clusters Routing Trade Study Dev Team #21673 Use Linkerd + Contour 02/02/2021 Istio, Consul, Ambassador, Traefik A CNCF backed cloud native k8s stack to deliver service mesh, API gateway and ingress Routing Trade Study Dev Team #21673 Use ARM Templates 02/02/2021 Terraform, Pulumi, Az CLI Azure Native, Az Monitoring and incremental updates support Automated Deployment Trade Study Dev Team #21651 Use 99designs/gqlgen 02/04/2021 graphql, graphql-go, thunder Type safety, auto-registration and code generation GraphQL Golang Trade Study Dev Team #21775 Create normalized role data model 03/25/2021 Career Stage Profiles (CSP), Microsoft Role Library Requires a data model that supports the data requirements of both role systems Role Data Model Schema Dev Team #22035 Design for edges and vertices 03/25/2021 N/A N/A Data Model Dev Team #21976 Use grammes 03/29/2021 Gremlin, gremgo, gremcos Balance of documentation and maturity Gremlin API library Trade Study Dev Team #21870 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Expose 1:1 data model from API to DB 04/02/2021 Exposing a minified version of data model contract Team decided that there were no pieces of data that we can rule out as being useful. Will update if data model becomes too complex API README Dev Team #21658 Deprecate SonarCloud 04/05/2021 Checkstyle, PMD, FindBugs Requires paid plan to use in a private repo Code Quality & Security Dev Team #22090 Adopted Stable Tagging Strategy 04/08/2021 N/A Team aligned on the proposed docker container tagging strategy Tagging Strategy Dev Team #22005","title":"Decision Log"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/","text":"Graph Model Graph Vertices and Edges The set of vertices (entities) and edges (relationships) of the graph model Vertex (Source) Edge Type Relationship Type Vertex (Target) Notes Required Profession Applies 1:many Discipline Top most level of categorization * Discipline Defines 1:many Role Groups of related roles within a profession * AppliedBy 1:1 Profession 1 Role Requires 1:many Responsibility Individual role mapped to an employee 1+ Requires 1:many Competency 1+ RequiredBy 1:1 Discipline 1 Succeeds 1:1 Role Supports career progression between roles 1 Precedes 1:1 Role Supports career progression between roles 1 AssignedTo 1:many User Profile * Responsibility Expects 1:many Key Result A group of expected outcomes and key results for employees within a role 1+ ExpectedBy 1:1 Role 1 Competency Describes 1:many Behavior A set of behaviors that contribute to success 1+ DescribedBy 1:1 Role 1 Key Result ExpectedBy 1:1 Responsibility The expected outcome of performing a responsibility 1 Behavior ContributesTo 1:1 Competency The way in which one acts or conducts oneself 1 User Profile Fulfills many:1 Role 1+ Authors 1:many Entry * Reads many:many Entry * Entry SharedWith many:many User Profile Business logic should add manager to this list by default. These users should only have read access. 
* Demonstrates many:many Competency * Demonstrates many:many Behavior * Demonstrates many:many Responsibility * Demonstrates many:many Result * AuthoredBy many:1 UserProfile 1+ DiscussedBy 1:many Commentary * References many:many Artifact * Competency DemonstratedBy many:many Entry * Behavior DemonstratedBy many:many Entry * Responsibility DemonstratedBy many:many Entry * Result DemonstratedBy many:many Entry * Commentary Discusses many:1 Entry * Artifact ReferencedBy many:many Entry 1+ Graph Properties The full set of data properties available on each vertex and edge Vertex/Edge Property Data Type Notes Required (Any) ID guid 1 Profession Title String 1 Description String 0 Discipline Title String 1 Description String 0 Role Title String 1 Description String 0 Level Band String SDE, SDE II, Senior, etc 1 Responsibility Title String 1 Description String 0 Competency Title String 1 Description String 0 Key Result Description String 1 Behavior Description String 1 User Profile Theme selection string there are only 2: dark, light 1 PersonaId guid[] there are only 2: User, Admin 1+ UserId guid Points to AAD object 1 DeploymentRing string[] Is used to deploy new versions 1 Project string[] list of user created projects * Entry Title string 1 DateCreated date 1 ReadyToShare boolean false if draft 1 AreaOfImpact string[] 3 options: self, contribute to others, leverage others * Commentary Data string 1 DateCreated date 1 Artifact Data string 1 DateCreated date 1 ArtifactType string describes the artifact type: markdown, blob link 1 Vertex Descriptions Profession Top most level of categorization { \"title\" : \"Software Engineering\" , \"description\" : \"Description of profession\" , \"disciplines\" : [] } Discipline Groups of related roles within a profession { \"title\" : \"Site Reliability Engineering\" , \"description\" : \"Description of discipline\" , \"roles\" : [] } Role Individual role mapped to an employee { \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [], \"competencies\" : [] } Responsibility A group of expected outcomes and key results for employees within a role { \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [] } Competency A set of behaviors that contribute to success { \"title\" : \"Adaptability\" , \"behaviors\" : [] } Key Result The expected outcome of performing a responsibility { \"description\" : \"Develops a foundational understanding of distributed systems design...\" } Behavior The way in which one acts or conducts oneself { \"description\" : \"Actively seeks information and tests assumptions.\" } User The user object refers to whom a person is. We do not store our own rather use Azure OIDs. User Profile The user profile contains any user settings and edges specific to Memory. Persona A user may hold multiple personas. Entry The same entry object can hold many kinds of data, and at this stage of the project we decide that we will not store external data, so it's up to the user to provide a link to the data for a reader to click into and get redirected to a new tab to open. Note: This means that in the web app, we will need to ensure links are opened in new tabs. Project Projects are just string fields to represent what a user wants to group their entries under. Area of Impact This refers to the 3 areas of impact in the venn-style diagram in the HR tool. The options are: self, contributing to impact of others, building on others' work. 
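As a purely illustrative aside, the sketch below shows how an Entry vertex and a Demonstrates edge from the model above could be created against a Cosmos DB Gremlin endpoint. Note that the Memory service itself uses the Go grammes library per the decision log; this Python snippet uses the gremlinpython driver only to make the graph shape concrete, and the account, database, graph, key, property names, and property values are placeholders rather than real project resources.
from gremlin_python.driver import client, serializer

# Placeholder connection values -- not the real Memory resources.
gremlin_client = client.Client(
    "wss://<cosmos-account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Create an Entry vertex with illustrative versions of the properties listed above.
add_entry = (
    "g.addV('entry')"
    ".property('title', 'Improved deployment reliability')"
    ".property('dateCreated', '2021-04-08')"
    ".property('readyToShare', false)"
    ".property('areaOfImpact', 'self')"
)

# Link the new Entry to an existing Competency via a Demonstrates edge,
# mirroring the Entry -> Demonstrates -> Competency relationship in the table above.
link_competency = (
    "g.V().has('entry', 'title', 'Improved deployment reliability')"
    ".addE('demonstrates')"
    ".to(g.V().has('competency', 'title', 'Adaptability'))"
)

for query in (add_entry, link_competency):
    print(gremlin_client.submit(query).all().result())

gremlin_client.close()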
Commentary A comment is essentially a piece of text. However, anyone that an entry is shared with can add commentary on an entry. Artifact The artifact object contains the relevant data as markdown, or a link to the relevant data. Full Role JSON Example { \"id\" : \"abc123\" , \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [ { \"id\" : \"abc123\" , \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [ { \"description\" : \"Develops a foundational understanding of distributed systems design...\" }, { \"description\" : \"Develops an understanding of the code, features, and operations of specific products...\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Contributions to Development and Design\" , \"results\" : [ { \"description\" : \"Develops and tests basic changes to optimize code...\" }, { \"description\" : \"Supports ongoing engagements with product engineering teams...\" } ] } ], \"competencies\" : [ { \"id\" : \"abc123\" , \"title\" : \"Adaptability\" , \"behaviors\" : [ { \"description\" : \"Actively seeks information and tests assumptions.\" }, { \"description\" : \"Shifts his or her approach in response to the demands of a changing situation.\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Collaboration\" , \"behaviors\" : [ { \"description\" : \"Removes barriers by working with others around a shared need or customer benefit.\" }, { \"description\" : \" Incorporates diverse perspectives to thoroughly address complex business issues.\" } ] } ] } API Data Model Because there is no internal edges or vertices that need to be hidden from API consumers, the API will expose a 1:1 mapping of the current data model for consumption. This is subject to change if our data model becomes too complex for downstream users.","title":"Graph Model"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#graph-model","text":"","title":"Graph Model"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#graph-vertices-and-edges","text":"The set of vertices (entities) and edges (relationships) of the graph model Vertex (Source) Edge Type Relationship Type Vertex (Target) Notes Required Profession Applies 1:many Discipline Top most level of categorization * Discipline Defines 1:many Role Groups of related roles within a profession * AppliedBy 1:1 Profession 1 Role Requires 1:many Responsibility Individual role mapped to an employee 1+ Requires 1:many Competency 1+ RequiredBy 1:1 Discipline 1 Succeeds 1:1 Role Supports career progression between roles 1 Precedes 1:1 Role Supports career progression between roles 1 AssignedTo 1:many User Profile * Responsibility Expects 1:many Key Result A group of expected outcomes and key results for employees within a role 1+ ExpectedBy 1:1 Role 1 Competency Describes 1:many Behavior A set of behaviors that contribute to success 1+ DescribedBy 1:1 Role 1 Key Result ExpectedBy 1:1 Responsibility The expected outcome of performing a responsibility 1 Behavior ContributesTo 1:1 Competency The way in which one acts or conducts oneself 1 User Profile Fulfills many:1 Role 1+ Authors 1:many Entry * Reads many:many Entry * Entry SharedWith many:many User Profile Business logic should add manager to this list by default. These users should only have read access. 
* Demonstrates many:many Competency * Demonstrates many:many Behavior * Demonstrates many:many Responsibility * Demonstrates many:many Result * AuthoredBy many:1 UserProfile 1+ DiscussedBy 1:many Commentary * References many:many Artifact * Competency DemonstratedBy many:many Entry * Behavior DemonstratedBy many:many Entry * Responsibility DemonstratedBy many:many Entry * Result DemonstratedBy many:many Entry * Commentary Discusses many:1 Entry * Artifact ReferencedBy many:many Entry 1+","title":"Graph Vertices and Edges"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#graph-properties","text":"The full set of data properties available on each vertex and edge Vertex/Edge Property Data Type Notes Required (Any) ID guid 1 Profession Title String 1 Description String 0 Discipline Title String 1 Description String 0 Role Title String 1 Description String 0 Level Band String SDE, SDE II, Senior, etc 1 Responsibility Title String 1 Description String 0 Competency Title String 1 Description String 0 Key Result Description String 1 Behavior Description String 1 User Profile Theme selection string there are only 2: dark, light 1 PersonaId guid[] there are only 2: User, Admin 1+ UserId guid Points to AAD object 1 DeploymentRing string[] Is used to deploy new versions 1 Project string[] list of user created projects * Entry Title string 1 DateCreated date 1 ReadyToShare boolean false if draft 1 AreaOfImpact string[] 3 options: self, contribute to others, leverage others * Commentary Data string 1 DateCreated date 1 Artifact Data string 1 DateCreated date 1 ArtifactType string describes the artifact type: markdown, blob link 1","title":"Graph Properties"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#vertex-descriptions","text":"","title":"Vertex Descriptions"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#profession","text":"Top most level of categorization { \"title\" : \"Software Engineering\" , \"description\" : \"Description of profession\" , \"disciplines\" : [] }","title":"Profession"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#discipline","text":"Groups of related roles within a profession { \"title\" : \"Site Reliability Engineering\" , \"description\" : \"Description of discipline\" , \"roles\" : [] }","title":"Discipline"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#role","text":"Individual role mapped to an employee { \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [], \"competencies\" : [] }","title":"Role"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#responsibility","text":"A group of expected outcomes and key results for employees within a role { \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [] }","title":"Responsibility"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#competency","text":"A set of behaviors that contribute to success { \"title\" : \"Adaptability\" , \"behaviors\" : [] }","title":"Competency"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#key-result","text":"The expected outcome of performing a responsibility { \"description\" : \"Develops a foundational understanding of distributed systems design...\" 
}","title":"Key Result"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#behavior","text":"The way in which one acts or conducts oneself { \"description\" : \"Actively seeks information and tests assumptions.\" }","title":"Behavior"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#user","text":"The user object refers to whom a person is. We do not store our own rather use Azure OIDs.","title":"User"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#user-profile","text":"The user profile contains any user settings and edges specific to Memory.","title":"User Profile"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#persona","text":"A user may hold multiple personas.","title":"Persona"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#entry","text":"The same entry object can hold many kinds of data, and at this stage of the project we decide that we will not store external data, so it's up to the user to provide a link to the data for a reader to click into and get redirected to a new tab to open. Note: This means that in the web app, we will need to ensure links are opened in new tabs.","title":"Entry"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#project","text":"Projects are just string fields to represent what a user wants to group their entries under.","title":"Project"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#area-of-impact","text":"This refers to the 3 areas of impact in the venn-style diagram in the HR tool. The options are: self, contributing to impact of others, building on others' work.","title":"Area of Impact"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#commentary","text":"A comment is essentially a piece of text. 
However, anyone that an entry is shared with can add commentary on an entry.","title":"Commentary"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#artifact","text":"The artifact object contains the relevant data as markdown, or a link to the relevant data.","title":"Artifact"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#full-role-json-example","text":"{ \"id\" : \"abc123\" , \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [ { \"id\" : \"abc123\" , \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [ { \"description\" : \"Develops a foundational understanding of distributed systems design...\" }, { \"description\" : \"Develops an understanding of the code, features, and operations of specific products...\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Contributions to Development and Design\" , \"results\" : [ { \"description\" : \"Develops and tests basic changes to optimize code...\" }, { \"description\" : \"Supports ongoing engagements with product engineering teams...\" } ] } ], \"competencies\" : [ { \"id\" : \"abc123\" , \"title\" : \"Adaptability\" , \"behaviors\" : [ { \"description\" : \"Actively seeks information and tests assumptions.\" }, { \"description\" : \"Shifts his or her approach in response to the demands of a changing situation.\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Collaboration\" , \"behaviors\" : [ { \"description\" : \"Removes barriers by working with others around a shared need or customer benefit.\" }, { \"description\" : \" Incorporates diverse perspectives to thoroughly address complex business issues.\" } ] } ] }","title":"Full Role JSON Example"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#api-data-model","text":"Because there is no internal edges or vertices that need to be hidden from API consumers, the API will expose a 1:1 mapping of the current data model for consumption. This is subject to change if our data model becomes too complex for downstream users.","title":"API Data Model"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/","text":"Application Deployment The Memory application leverages Azure DevOps for work item tracking as well as continuous integration (CI) and continuous deployment (CD). Environments The Memory project uses multiple environments to isolate and test changes before promoting releases to the global user base. New environment rollouts are automatically triggered based upon a successful deployment of the previous stage /environment. The development , staging and production environments leverage slot deployment during an environment rollout. After a new release is deployed to a staging slot, it is validated through a series of functional integration tests. Upon a 100% pass rate of all tests the staging & production slots are swapped effectively making updates to the environment available. Any errors or failed tests halt the deployment in the current stage and prevent changes to further environments. Each deployed environment is completely isolated and does not share any components. They each have unique resource instances of Azure Traffic Manager, Cosmos DB, etc. 
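To make the slot-swap gate described above more concrete, here is a minimal sketch of the kind of functional integration test that could run against the staging slot before the swap, written with pytest and requests; the STAGING_SLOT_URL value and the /health and /api/roles endpoints are assumptions for illustration, not the project's actual test suite.
import os

import requests

# Assumed staging-slot URL; the real value would come from pipeline configuration.
STAGING_SLOT_URL = os.environ.get(
    "STAGING_SLOT_URL", "https://memory-api-staging.azurewebsites.net"
)


def test_health_endpoint_returns_ok():
    # The staging/production slot swap should only proceed when checks like this pass.
    response = requests.get(f"{STAGING_SLOT_URL}/health", timeout=10)
    assert response.status_code == 200


def test_roles_endpoint_returns_a_list():
    # Exercise a representative read path against the staging slot before swapping.
    response = requests.get(f"{STAGING_SLOT_URL}/api/roles", timeout=10)
    assert response.status_code == 200
    assert isinstance(response.json(), list)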
Deployment Dependencies Development Staging Production CI Quality Gates Development Staging Manual Approval Local The local environment is used by individual software engineers during the development of new features and components. Engineers leverage some components from the deployed development environment that are not available on certain platforms or are unable to run locally. CosmosDB (Emulator only exists for Windows) The local environment also does not use Azure Traffic Manager. The frontend web app directly communicates to the backend REST API typically running on a separate localhost port mapping. Development The development environment is used as the first quality gate. All code that is checked into the main branch is automatically deployed to this environment after all CI quality gates have passed. Dev Regions West US (westus) Staging The staging environment is used to validate new features, components and other changes prior to production rollout. This environment is primarily used by developers, QA and other company stakeholders. Staging Regions West US (westus) East US (eastus) Production The production environment is used by the worldwide user base. Changes to this environment are gated by manual approval by your product's leadership team in addition to other automatic quality gates. Production Regions West US (westus) Central US (centralus) East US (eastus) Environment Variable Group Infrastructure Setup (memory-common) appName businessUnit serviceConnection subscriptionId Development Setup (memory-dev) environmentName (placeholder) Staging Setup (memory-staging) environmentName (placeholder) Production Setup (memory-prod) environmentName (placeholder)","title":"Application Deployment"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#application-deployment","text":"The Memory application leverages Azure DevOps for work item tracking as well as continuous integration (CI) and continuous deployment (CD).","title":"Application Deployment"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#environments","text":"The Memory project uses multiple environments to isolate and test changes before promoting releases to the global user base. New environment rollouts are automatically triggered based upon a successful deployment of the previous stage /environment. The development , staging and production environments leverage slot deployment during an environment rollout. After a new release is deployed to a staging slot, it is validated through a series of functional integration tests. Upon a 100% pass rate of all tests the staging & production slots are swapped effectively making updates to the environment available. Any errors or failed tests halt the deployment in the current stage and prevent changes to further environments. Each deployed environment is completely isolated and does not share any components. They each have unique resource instances of Azure Traffic Manager, Cosmos DB, etc.","title":"Environments"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#deployment-dependencies","text":"Development Staging Production CI Quality Gates Development Staging Manual Approval","title":"Deployment Dependencies"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#local","text":"The local environment is used by individual software engineers during the development of new features and components. 
Engineers leverage some components from the deployed development environment that are not available on certain platforms or are unable to run locally. CosmosDB (Emulator only exists for Windows) The local environment also does not use Azure Traffic Manager. The frontend web app directly communicates to the backend REST API typically running on a separate localhost port mapping.","title":"Local"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#development","text":"The development environment is used as the first quality gate. All code that is checked into the main branch is automatically deployed to this environment after all CI quality gates have passed.","title":"Development"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#dev-regions","text":"West US (westus)","title":"Dev Regions"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#staging","text":"The staging environment is used to validate new features, components and other changes prior to production rollout. This environment is primarily used by developers, QA and other company stakeholders.","title":"Staging"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#staging-regions","text":"West US (westus) East US (eastus)","title":"Staging Regions"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#production","text":"The production environment is used by the worldwide user base. Changes to this environment are gated by manual approval by your product's leadership team in addition to other automatic quality gates.","title":"Production"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#production-regions","text":"West US (westus) Central US (centralus) East US (eastus)","title":"Production Regions"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#environment-variable-group","text":"","title":"Environment Variable Group"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#infrastructure-setup-memory-common","text":"appName businessUnit serviceConnection subscriptionId","title":"Infrastructure Setup (memory-common)"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#development-setup-memory-dev","text":"environmentName (placeholder)","title":"Development Setup (memory-dev)"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#staging-setup-memory-staging","text":"environmentName (placeholder)","title":"Staging Setup (memory-staging)"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#production-setup-memory-prod","text":"environmentName (placeholder)","title":"Production Setup (memory-prod)"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/","text":"Trade Study: GitOps Conducted by: Tess and Jeff Backlog Work Item: #21672 Decision Makers: Wallace, whole team Overview For Memory, we will be creating a cloud native application with infrastructure as code. We will use GitOps for Continuous Deployment, with infrastructure changes reflected through pull requests. Overall, between our two options, one is simpler and more targeted in a way that we believe would meet the requirements for this project. 
The other does the same, with additional features that may or may not be worth the extra configuration and setup. Evaluation Criteria Repo style: mono versus multi Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Documentation availability Maintainability Maturity User Interface Solutions Flux Flux is a tool created by Weaveworks and is built on top of Kubernetes' API extension system, supports multi-tenancy, and integrates seamlessly with popular tools like Prometheus. Flux Acceptance Criteria Evaluation Repo style: mono versus multi Flux supports both as of v2 Policy Enforcement Azure Policy is in Preview Deployment Methods Define a Helm release using Helm Controllers Kustomization describes deployments Deployment Monitoring Flux works with Prometheus for deployment monitoring as well as Grafana dashboards Admission Control Flux uses RBAC from Kubernetes to lock down sync permissions. Uses the service account to access image pull secrets Azure Documentation availability Great, better when using Helm Operators Maintainability Manage via YAML files in git repo Maturity v2 is published under Apache license in GitHub , it works with Helm v3, and has PR commits from as recently as today 945 stars, 94 forks User Interface CLI, the simplest lightweight option Other features to call out (see more on website) Flux only supports Pull-based deployments which means it must be paired with an operator Flux can send notifications and receive webhooks for syncing Health assessments Dependency management Automatic deployment Garbage collection Deploy on commit Variations Controllers Both Controller options are optional. The Helm Controller additionally fetches helm artifacts to publish, see below diagram. The Kustomize Controller manages state and continuous deployment. We will not decide which controller to use here, as that's a separate trade study, however we will note that Helm is more widely documented within Flux documentation. Flux v1 Flux v1 is only in maintenance mode and should not be used anymore. So this section does not consider the v1 option a valid option. GitOps Toolkit Flux v2 is built on top of the GitOps Toolkit , however we do not evaluate using the GitOps Toolkit alone as that is for when you want to make your own CD system, which is not what we want. ArgoCD with Helm Charts ArgoCD is a declarative, GitOps-based Continuous Delivery (CD) tool for Kubernetes. ArgoCD with Helm Acceptance Criteria Evaluation Repo style: mono versus multi ArgoCD supports both Policy Enforcement Azure Policy is in Preview Deployment Methods Deploy with Helm Chart Use Kustomize to apply some post-rendering to the Helm release templates Deployment Monitoring Argo CD exposes two sets of Prometheus metrics (application metrics and API server metrics) for deployment monitoring. Admission Control ArgoCD uses RBAC. RBAC requires SSO configuration or one or more local users to be set up. Once SSO or local users are configured, additional RBAC roles can be defined Argo CD does not have its own user management system and has only one built-in user admin. 
The admin user is a superuser, and it has unrestricted access to the system Authorization is handled via JWT tokens and checking group claims in them Azure Documentation availability Argo has documentation on Azure AD Maturity Has PR commits from as recently as today 5,000 stars, 1,100 forks Maintainability Can use GitOps to manage it User Interface ArgoCD has a GUI and can be used across clusters Other features to call out (see more on website) ArgoCD supports both pull and push models for continuous delivery Argo can send notifications, but you need a separate tool for it Argo can receive webhooks Health assessments Potentially much more useful multi-tenancy tools. Manages multiple projects, maps them to teams, etc. SSO Integration Garbage collection Results This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Repo style Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Doc Maintainability Maturity UI Flux mono, multi Azure Policy, preview Helm, Kustomize Prometheus, Grafana RBAC Yes on Azure YAML in git repo 945 stars, 94 forks, currently maintained CLI ArgoCD mono, multi Azure Policy, preview Helm, Kustomize, KSonnet, ... Prometheus, Grafana RBAC Only in their own docs manifests in git repo 5,000 stars, 1,100 forks GUI, multiple clusters in same GUI Decision ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. So we have decided at this point to go with ArgoCD. Resources GitOps Enforcement Monitoring Policies Deployment Push with ArgoCD in Azure DevOps","title":"Trade Study: GitOps"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#trade-study-gitops","text":"Conducted by: Tess and Jeff Backlog Work Item: #21672 Decision Makers: Wallace, whole team","title":"Trade Study: GitOps"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#overview","text":"For Memory, we will be creating a cloud native application with infrastructure as code. We will use GitOps for Continuous Deployment, with infrastructure changes reflected through pull requests. Overall, between our two options, one is simpler and more targeted in a way that we believe would meet the requirements for this project. 
The other does the same, with additional features that may or may not be worth the extra configuration and setup.","title":"Overview"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#evaluation-criteria","text":"Repo style: mono versus multi Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Documentation availability Maintainability Maturity User Interface","title":"Evaluation Criteria"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#solutions","text":"","title":"Solutions"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#flux","text":"Flux is a tool created by Weaveworks and is built on top of Kubernetes' API extension system, supports multi-tenancy, and integrates seamlessly with popular tools like Prometheus.","title":"Flux"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#flux-acceptance-criteria-evaluation","text":"Repo style: mono versus multi Flux supports both as of v2 Policy Enforcement Azure Policy is in Preview Deployment Methods Define a Helm release using Helm Controllers Kustomization describes deployments Deployment Monitoring Flux works with Prometheus for deployment monitoring as well as Grafana dashboards Admission Control Flux uses RBAC from Kubernetes to lock down sync permissions. Uses the service account to access image pull secrets Azure Documentation availability Great, better when using Helm Operators Maintainability Manage via YAML files in git repo Maturity v2 is published under Apache license in GitHub , it works with Helm v3, and has PR commits from as recently as today 945 stars, 94 forks User Interface CLI, the simplest lightweight option Other features to call out (see more on website) Flux only supports Pull-based deployments which means it must be paired with an operator Flux can send notifications and receive webhooks for syncing Health assessments Dependency management Automatic deployment Garbage collection Deploy on commit","title":"Flux Acceptance Criteria Evaluation"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#variations","text":"","title":"Variations"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#controllers","text":"Both Controller options are optional. The Helm Controller additionally fetches helm artifacts to publish, see below diagram. The Kustomize Controller manages state and continuous deployment. We will not decide which controller to use here, as that's a separate trade study, however we will note that Helm is more widely documented within Flux documentation.","title":"Controllers"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#flux-v1","text":"Flux v1 is only in maintenance mode and should not be used anymore. 
So this section does not consider the v1 option a valid option.","title":"Flux v1"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#gitops-toolkit","text":"Flux v2 is built on top of the GitOps Toolkit , however we do not evaluate using the GitOps Toolkit alone as that is for when you want to make your own CD system, which is not what we want.","title":"GitOps Toolkit"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#argocd-with-helm-charts","text":"ArgoCD is a declarative, GitOps-based Continuous Delivery (CD) tool for Kubernetes.","title":"ArgoCD with Helm Charts"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#argocd-with-helm-acceptance-criteria-evaluation","text":"Repo style: mono versus multi ArgoCD supports both Policy Enforcement Azure Policy is in Preview Deployment Methods Deploy with Helm Chart Use Kustomize to apply some post-rendering to the Helm release templates Deployment Monitoring Argo CD exposes two sets of Prometheus metrics (application metrics and API server metrics) for deployment monitoring. Admission Control ArgoCD uses RBAC. RBAC requires SSO configuration or one or more local users to be set up. Once SSO or local users are configured, additional RBAC roles can be defined Argo CD does not have its own user management system and has only one built-in user admin. The admin user is a superuser, and it has unrestricted access to the system Authorization is handled via JWT tokens and checking group claims in them Azure Documentation availability Argo has documentation on Azure AD Maturity Has PR commits from as recently as today 5,000 stars, 1,100 forks Maintainability Can use GitOps to manage it User Interface ArgoCD has a GUI and can be used across clusters Other features to call out (see more on website) ArgoCD supports both pull and push models for continuous delivery Argo can send notifications, but you need a separate tool for it Argo can receive webhooks Health assessments Potentially much more useful multi-tenancy tools. Manages multiple projects, maps them to teams, etc. SSO Integration Garbage collection","title":"ArgoCD with Helm Acceptance Criteria Evaluation"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#results","text":"This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Repo style Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Doc Maintainability Maturity UI Flux mono, multi Azure Policy, preview Helm, Kustomize Prometheus, Grafana RBAC Yes on Azure YAML in git repo 945 stars, 94 forks, currently maintained CLI ArgoCD mono, multi Azure Policy, preview Helm, Kustomize, KSonnet, ... Prometheus, Grafana RBAC Only in their own docs manifests in git repo 5,000 stars, 1,100 forks GUI, multiple clusters in same GUI","title":"Results"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#decision","text":"ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. 
So we have decided at this point to go with ArgoCD.","title":"Decision"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#resources","text":"GitOps Enforcement Monitoring Policies Deployment Push with ArgoCD in Azure DevOps","title":"Resources"},{"location":"design/design-reviews/recipes/","text":"Design Review Recipes Design reviews come in all shapes and sizes. There are also different items to consider when creating a design at different stages during an engagement Design Review Process Incorporate design reviews throughout the lifetime of an engagement Design Review Templates Game Plan The same template already in use today High level architecture and design Includes technologies, languages & products to complete engagement objective Milestone / Epic Design Review Should be considered when an engagement contains multiple milestones or epics Design should be more detailed than game plan May require unique deployment, security and/or privacy characteristics from other milestones Feature / Story Design Review Design for complex features or stories Will reuse deployment, security and other characteristics defined within game plan or milestone May require new libraries, OSS or patterns to accomplish goals Task Design Review Highly detailed design for a complex tasks with many unknowns Will integrate into higher level feature/component designs","title":"Design Review Recipes"},{"location":"design/design-reviews/recipes/#design-review-recipes","text":"Design reviews come in all shapes and sizes. There are also different items to consider when creating a design at different stages during an engagement","title":"Design Review Recipes"},{"location":"design/design-reviews/recipes/#design-review-process","text":"Incorporate design reviews throughout the lifetime of an engagement","title":"Design Review Process"},{"location":"design/design-reviews/recipes/#design-review-templates","text":"","title":"Design Review Templates"},{"location":"design/design-reviews/recipes/#game-plan","text":"The same template already in use today High level architecture and design Includes technologies, languages & products to complete engagement objective","title":"Game Plan"},{"location":"design/design-reviews/recipes/#milestone-epic-design-review","text":"Should be considered when an engagement contains multiple milestones or epics Design should be more detailed than game plan May require unique deployment, security and/or privacy characteristics from other milestones","title":"Milestone / Epic Design Review"},{"location":"design/design-reviews/recipes/#feature-story-design-review","text":"Design for complex features or stories Will reuse deployment, security and other characteristics defined within game plan or milestone May require new libraries, OSS or patterns to accomplish goals","title":"Feature / Story Design Review"},{"location":"design/design-reviews/recipes/#task-design-review","text":"Highly detailed design for a complex tasks with many unknowns Will integrate into higher level feature/component designs","title":"Task Design Review"},{"location":"design/design-reviews/recipes/async-design-reviews/","text":"Async Design Reviews Goals Allow team members to review designs as their work schedule allows. Impact This in turn results in the following benefits: Higher Participation & Accessibility . They do not need to be online and available at the same time as others to review. Reduced Time Constraint . 
Reviewers can spend longer than the duration of a single meeting to think through the approach and provide feedback. Measures The metrics and/or KPIs used for design reviews overall would still apply. See design reviews for measures guidance. Participation The participation should be the same as any design review. See design reviews for participation guidance. Facilitation Guidance The concept is to have the design follow the same workflow as any code change that implements a story or task. Rather than code however, the artifacts being added or changed are Markdown documents as well as any other supporting artifacts (prototypes, code samples, diagrams, etc). Prerequisites Source Controlled Design Docs Design documentation must live in a source control repository that supports pull requests (i.e. git). The following guidelines can be used to determine what repository houses the docs Keeping docs in the same repo as the affected code allows for the docs to be updated atomically alongside code within the same pull request. If the documentation represents code that lives in many different repositories, it may make more sense to keep the docs in their own repository. Place the docs so that they do not trigger CI builds for the affected code (assuming the documentation was the only change). This can be done by placing them in an isolated directory should they live alongside the code they represent. See directory structure example below. -root --src --docs <-- exclude from ci build trigger --design Workflow The designer branches the repo with the documentation. The designer works on adding or updating documentation relevant to the design. The designer submits a pull request and requests specific team members to review. Reviewers provide feedback to the designer, who incorporates the feedback. (OPTIONAL) A design review meeting might be held to give deeper explanation of design to reviewers. Design is approved/accepted and merged to main branch. Tips for Faster Review Cycles To make sure a design is reviewed in a timely manner, it's important to directly request reviews from team members. If team members are assigned without asking, or if no one is assigned, it's likely the design will sit for longer without review. Try the following actions: Make it the designer's responsibility to find reviewers for their design The designer should ask a team member directly (face-to-face conversation, async messaging, etc) if they are available to review. Only if they agree, then assign them as a reviewer. Indicate if the design is ready to be merged once approved. Indicate Design Completeness It helps the reviewer to understand if the design is ready to be accepted or if it's still a work-in-progress. The level and type of feedback the reviewer provides will likely be different depending on its state. Try the following actions to indicate the design state Mark the PR as a Draft. Some ALM tools, such as Azure DevOps, support opening a pull request as a Draft. Prefix the title with \"DRAFT\", \"WIP\", or \"work-in-progress\". Set the pull request to automatically merge after approvals and checks have passed. This can indicate to the reviewer the design is complete from the designer's perspective. Practice Inclusive Behaviors The designated reviewers are not the only team members that can provide feedback on the design. If other team members voluntarily committed time to providing feedback or asking questions, be sure to respond. 
Utilize face-to-face conversation (in person or virtual) to resolve feedback or questions from others as needed. This aids in building team cohesiveness in ensuring everyone understands and is willing to commit to a given design. This practice demonstrates inclusive behavior, which will promote trust and respect within the team. Respond to all PR comments objectively and respectfully, irrespective of the author's level, position, or title. After two round trips of question/response, resort to synchronous communication for resolution (i.e. virtual or physical face-to-face conversation).","title":"Async Design Reviews"},{"location":"design/design-reviews/recipes/async-design-reviews/#async-design-reviews","text":"","title":"Async Design Reviews"},{"location":"design/design-reviews/recipes/async-design-reviews/#goals","text":"Allow team members to review designs as their work schedule allows.","title":"Goals"},{"location":"design/design-reviews/recipes/async-design-reviews/#impact","text":"This in turn results in the following benefits: Higher Participation & Accessibility . They do not need to be online and available at the same time as others to review. Reduced Time Constraint . Reviewers can spend longer than the duration of a single meeting to think through the approach and provide feedback.","title":"Impact"},{"location":"design/design-reviews/recipes/async-design-reviews/#measures","text":"The metrics and/or KPIs used for design reviews overall would still apply. See design reviews for measures guidance.","title":"Measures"},{"location":"design/design-reviews/recipes/async-design-reviews/#participation","text":"The participation should be the same as any design review. See design reviews for participation guidance.","title":"Participation"},{"location":"design/design-reviews/recipes/async-design-reviews/#facilitation-guidance","text":"The concept is to have the design follow the same workflow as any code change that implements a story or task. Rather than code however, the artifacts being added or changed are Markdown documents as well as any other supporting artifacts (prototypes, code samples, diagrams, etc).","title":"Facilitation Guidance"},{"location":"design/design-reviews/recipes/async-design-reviews/#prerequisites","text":"","title":"Prerequisites"},{"location":"design/design-reviews/recipes/async-design-reviews/#source-controlled-design-docs","text":"Design documentation must live in a source control repository that supports pull requests (i.e. git). The following guidelines can be used to determine what repository houses the docs Keeping docs in the same repo as the affected code allows for the docs to be updated atomically alongside code within the same pull request. If the documentation represents code that lives in many different repositories, it may make more sense to keep the docs in their own repository. Place the docs so that they do not trigger CI builds for the affected code (assuming the documentation was the only change). This can be done by placing them in an isolated directory should they live alongside the code they represent. See directory structure example below. -root --src --docs <-- exclude from ci build trigger --design","title":"Source Controlled Design Docs"},{"location":"design/design-reviews/recipes/async-design-reviews/#workflow","text":"The designer branches the repo with the documentation. The designer works on adding or updating documentation relevant to the design. The designer submits a pull request and requests specific team members to review. 
Reviewers provide feedback to the designer, who incorporates the feedback. (OPTIONAL) A design review meeting might be held to give deeper explanation of design to reviewers. Design is approved/accepted and merged to main branch.","title":"Workflow"},{"location":"design/design-reviews/recipes/async-design-reviews/#tips-for-faster-review-cycles","text":"To make sure a design is reviewed in a timely manner, it's important to directly request reviews from team members. If team members are assigned without asking, or if no one is assigned, it's likely the design will sit for longer without review. Try the following actions: Make it the designer's responsibility to find reviewers for their design The designer should ask a team member directly (face-to-face conversation, async messaging, etc) if they are available to review. Only if they agree, then assign them as a reviewer. Indicate if the design is ready to be merged once approved.","title":"Tips for Faster Review Cycles"},{"location":"design/design-reviews/recipes/async-design-reviews/#indicate-design-completeness","text":"It helps the reviewer to understand if the design is ready to be accepted or if it's still a work-in-progress. The level and type of feedback the reviewer provides will likely be different depending on its state. Try the following actions to indicate the design state Mark the PR as a Draft. Some ALM tools, such as Azure DevOps, support opening a pull request as a Draft. Prefix the title with \"DRAFT\", \"WIP\", or \"work-in-progress\". Set the pull request to automatically merge after approvals and checks have passed. This can indicate to the reviewer the design is complete from the designer's perspective.","title":"Indicate Design Completeness"},{"location":"design/design-reviews/recipes/async-design-reviews/#practice-inclusive-behaviors","text":"The designated reviewers are not the only team members that can provide feedback on the design. If other team members voluntarily committed time to providing feedback or asking questions, be sure to respond. Utilize face-to-face conversation (in person or virtual) to resolve feedback or questions from others as needed. This aids in building team cohesiveness in ensuring everyone understands and is willing to commit to a given design. This practice demonstrates inclusive behavior, which will promote trust and respect within the team. Respond to all PR comments objectively and respectfully, irrespective of the author's level, position, or title. After two round trips of question/response, resort to synchronous communication for resolution (i.e. virtual or physical face-to-face conversation).","title":"Practice Inclusive Behaviors"},{"location":"design/design-reviews/recipes/engagement-process/","text":"Incorporating Design Reviews into an Engagement Introduction Design reviews should not feel like a burden. Design reviews can be easily incorporated into the dev crew process with minimal overhead. Only create design reviews when needed. Not every story or task requires a complete design review. Leverage this guidance to make changes that best fit in with the team. Every team works differently. Leverage Microsoft subject-matter experts (SME) as needed during design reviews. Not every story needs SME or leadership sign-off. Most design reviews can be fully executed within a dev crew. Use diagrams to visualize concepts and architecture. The following guidelines outline how Microsoft and the customer together can incorporate design reviews into their day-to-day agile processes. 
Envisioning / Architecture Design Session (ADS) Early in an engagement Microsoft works with customers to understand their unique goals and objectives and establish a definition of done. Microsoft dives deep into existing customer infrastructure and architecture to understand potential constraints. Additionally, we seek to understand and uncover specific non-functional requirements that influence the solution. During this time the team uncovers many unknowns, leveraging all new-found information, in order to help generate an impactful design that meets customer goals. After ADS it can be helpful to conduct Engineering Feasibility Spikes to further de-risk technologies being considered for the engagement. Tip: Not all unknowns have been addressed at this point. Sprint Planning In many engagements Microsoft works with customers using a SCRUM agile development process which begins with sprint planning. Sprint planning is a great opportunity to dive deep into the next set of high priority work. Some key points to address are the following: Identify stories that require design reviews Separate design from implementation for complex stories Assign an owner to each design story Stories that will benefit from design reviews have one or more of the following in common: There are many unknown or unclear requirements There is a wide distribution of anticipated workload, or story pointing, across the dev crew The developer cannot clearly illustrate all tasks required for the story Tip: After sprint planning is complete the team should consider hosting an initial design review discussion to dive deep into the design requirements of the stories that were identified. This will provide more clarity so that the team can move forward with a design review, synchronously or asynchronously, and complete tasks. Sprint Backlog Refinement If your team is not already hosting a Sprint Backlog Refinement session at least once per week you should consider it. It is a great opportunity to: Keep the backlog clean Re-prioritize work based on shifting business priorities Fill in missing descriptions and acceptance criteria Identify stories that require design reviews The team can follow the same steps from sprint planning to help identify which stories require design reviews. This can often save much time during the actual sprint planning meetings to focus on the task at hand. Sprint Retrospectives Sprint retrospectives are a great time to check in with the dev team, identify what is working or not working, and propose changes to keep improving. It is also a great time to check in on design reviews Did any of the designs change from last sprint? How have design changes impacted the engagement? Have previous design artifacts been updated to reflect new changes? All design artifacts should be treated as a living document. As requirements change or more unknowns are uncovered, the dev crew should retroactively update all design artifacts. Missing this critical step may cause the customer to incur future technical debt. Artifacts that are not up to date are bugs in the design. Tip: Keep your artifacts up to date by adding it to your team's definition of done for all user stories. Sync Design Reviews It is often helpful to schedule 1-2 design sessions per sprint as part of the normal aforementioned meeting cadence. Throughout the sprint, folks can add design topics to the meeting agenda and if there is nothing to discuss for a particular meeting occurrence, it can simply be cancelled. 
While these sessions may not always be used, they help project members align on timing and purpose early on and establish precedence, often encouraging participation so design topics don't slip through the cracks. Oftentimes, it is helpful for those project members intending to present their design to the wider group to distribute documentation on their design prior to the session so that other participants can come prepared with context heading into the session. It should be noted that the necessity of these sessions certainly evolves over the course of the engagement. Early on, or in other times of more ambiguity, these meetings are typically used more often and more fully. Lastly, while it is suggested that sync design reviews are scheduled during the normal sprint cadence, scheduling ad-hoc sessions should not be discouraged - even if these reviews are limited to the participants of a specific workstream. Wrap-up Sprints Wrap-up sprints are a great time to tie up loose ends with the customer and hand-off solution. Customer hand-off becomes a lot easier when there are design artifacts to reference and deliver alongside the completed solution. During your wrap-up sprints the dev crew should consider the following: Are the design artifacts up to date? Are the design artifacts stored in an accessible location?","title":"Incorporating Design Reviews into an Engagement"},{"location":"design/design-reviews/recipes/engagement-process/#incorporating-design-reviews-into-an-engagement","text":"","title":"Incorporating Design Reviews into an Engagement"},{"location":"design/design-reviews/recipes/engagement-process/#introduction","text":"Design reviews should not feel like a burden. Design reviews can be easily incorporated into the dev crew process with minimal overhead. Only create design reviews when needed. Not every story or task requires a complete design review. Leverage this guidance to make changes that best fit in with the team. Every team works differently. Leverage Microsoft subject-matter experts (SME) as needed during design reviews. Not every story needs SME or leadership sign-off. Most design reviews can be fully executed within a dev crew. Use diagrams to visualize concepts and architecture. The following guidelines outline how Microsoft and the customer together can incorporate design reviews into their day-to-day agile processes.","title":"Introduction"},{"location":"design/design-reviews/recipes/engagement-process/#envisioning-architecture-design-session-ads","text":"Early in an engagement Microsoft works with customers to understand their unique goals and objectives and establish a definition of done. Microsoft dives deep into existing customer infrastructure and architecture to understand potential constraints. Additionally, we seek to understand and uncover specific non-functional requirements that influence the solution. During this time the team uncovers many unknowns, leveraging all new-found information, in order to help generate an impactful design that meets customer goals. After ADS it can be helpful to conduct Engineering Feasibility Spikes to further de-risk technologies being considered for the engagement. Tip : All unknowns have not been addressed at this point.","title":"Envisioning / Architecture Design Session (ADS)"},{"location":"design/design-reviews/recipes/engagement-process/#sprint-planning","text":"In many engagements Microsoft works with customers using a SCRUM agile development process which begins with sprint planning. 
Sprint planning is a great opportunity to dive deep into the next set of high priority work. Some key points to address are the following: Identify stories that require design reviews Separate design from implementation for complex stories Assign an owner to each design story Stories that will benefit from design reviews have one or more of the following in common: There are many unknown or unclear requirements There is a wide distribution of anticipated workload, or story pointing, across the dev crew The developer cannot clearly illustrate all tasks required for the story Tip: After sprint planning is complete the team should consider hosting an initial design review discussion to dive deep in the design requirement of the stories that were identified. This will provide more clarity so that the team can move forward with a design review, synchronously or asynchronously, and complete tasks.","title":"Sprint Planning"},{"location":"design/design-reviews/recipes/engagement-process/#sprint-backlog-refinement","text":"If your team is not already hosting a Sprint Backlog Refinement session at least once per week you should consider it. It is a great opportunity to: Keep the backlog clean Re-prioritize work based on shifting business priorities Fill in missing descriptions and acceptance criteria Identify stories that require design reviews The team can follow the same steps from sprint planning to help identify which stories require design reviews. This can often save much time during the actual sprint planning meetings to focus on the task at hand.","title":"Sprint Backlog Refinement"},{"location":"design/design-reviews/recipes/engagement-process/#sprint-retrospectives","text":"Sprint retrospectives are a great time to check in with the dev team, identify what is working or not working, and propose changes to keep improving. It is also a great time to check in on design reviews Did any of the designs change from last sprint? How have design changes impacted the engagement? Have previous design artifacts been updated to reflect new changes? All design artifacts should be treated as a living document. As requirements change or uncover more unknowns the dev crew should retroactively update all design artifacts. Missing this critical step may cause the customer to incur future technical debt. Artifacts that are not up to date are bugs in the design. Tip: Keep your artifacts up to date by adding it to your teams definition of done for all user stories.","title":"Sprint Retrospectives"},{"location":"design/design-reviews/recipes/engagement-process/#sync-design-reviews","text":"It is often helpful to schedule 1-2 design sessions per sprint as part of the normal aforementioned meeting cadence. Throughout the sprint, folks can add design topics to the meeting agenda and if there is nothing to discuss for a particular meeting occurrence, it can simply be cancelled. While these sessions may not always be used, they help project members align on timing and purpose early on and establish precedence, often encouraging participation so design topics don't slip through the cracks. Oftentimes, it is helpful for those project members intending to present their design to the wider group to distribute documentation on their design prior to the session so that other participants can come prepared with context heading into the session. It should be noted that the necessity of these sessions certainly evolves over the course of the engagement. 
Early on, or in other times of more ambiguity, these meetings are typically used more often and more fully. Lastly, while it is suggested that sync design reviews are scheduled during the normal sprint cadence, scheduling ad-hoc sessions should not be discouraged - even if these reviews are limited to the participants of a specific workstream.","title":"Sync Design Reviews"},{"location":"design/design-reviews/recipes/engagement-process/#wrap-up-sprints","text":"Wrap-up sprints are a great time to tie up loose ends with the customer and hand-off solution. Customer hand-off becomes a lot easier when there are design artifacts to reference and deliver alongside the completed solution. During your wrap-up sprints the dev crew should consider the following: Are the design artifacts up to date? Are the design artifacts stored in an accessible location?","title":"Wrap-up Sprints"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/","text":"Engineering Feasibility Spikes: Identifying and Mitigating Risk Introduction Some engagements require more de-risking than others. Even after Architectural Design Sessions (ADS) an engagement may still have substantial technical unknowns. These types of engagements warrant an exploratory/validation phase where Engineering Feasibility Spikes can be conducted immediately after envisioning/ADS and before engineering sprints. Engineering Feasibility Spikes Are regimented yet collaborative time-boxed investigatory activities conducted in a feedback loop to capitalize on individual learnings to inform the team. Increase the team\u2019s knowledge and understanding while minimizing engagement risks. The following guidelines outline how Microsoft and the customer can incorporate engineering feasibility spikes into the day-to-day agile processes. Pre-Mortem A good way to gauge what engineering spikes to conduct is to do a pre-mortem. What is a Pre-Mortem? A 90-minute meeting after envisioning/ADS that includes the entire team (and can also include the customer) which answers \"Imagine the project has failed. What problems and challenges caused this failure?\" Allows the entire team to initially raise concerns and risks early in the engagement. This input is used to decide which risks to pursue as engineering spikes. Sharing Learnings & Current Progress Feedback Loop The key element from conducting the engineering feasibility spikes is sharing the outcomes in-flight. The team gets together and shares learning on a weekly basis (or more frequently if needed). The sharing is done via a 30-minute call. Everyone on the Dev Crew joins the call (even if not everyone is assigned an engineering spike story or even if the spike work was underway and not fully completed). The feedback loop is significantly tighter/shorter than in sprint-based agile process. Instead of using the Sprint as the forcing function to adjust/pivot/re-prioritize, the interim sharing sessions were the trigger. Re-Prioritizing the Next Spikes After the team shares current progress, another round of planning is done. This allows the team to Establish a very tight feedback loop. Re-prioritize the next spike(s) because of the outcome from the current engineering feasibility spikes. Adjusting Based on Context During the sharing call, and when the team believes it has enough information, the team sometimes comes to the realization that the original spike acceptance criteria is no longer valid. The team pivots into another area that provides more value. 
A decision log can be used to track outcomes. Engineering Feasibility Sprints Diagram The process is depicted in the diagram below. Benefits Creating Code Samples to Prove Out Ideas It is important to be intentional about the spikes: they are not aiming to produce production-level code. The team sometimes must write code to arrive at the technical learning. The team must be cognizant that the code written for the spikes is not going to serve as the code for the final solution. The code written is just enough to drive the investigation forward with greater confidence. For example, suppose the team was exploring the API choreography of creating a Graph client with various Azure Active Directory (AAD) authentication flows and permissions. The code to demonstrate this is implemented in a console app, but it could have been done via an Express server, etc. The fact that it was a console app was not important; the main learning goal was the ability of the Graph client to perform operations against the Graph API endpoint with the minimal number of permissions. Targeted Conversations By sharing the progress of the spike, the team\u2019s collective knowledge increases. The spikes allow the team to drive succinct conversations with various Product Groups (PGs) and other subject matter experts (SMEs). Rather than speaking at a hypothetical level, the team plays back project/architecture concerns and concretely points out why something is a showstopper or not a viable way forward. Increased Customer Trust This process leads to increased customer trust. Using this process, the team Brings the customer along in the decision-making process and guides them on how to go forward. Provides answers with confidence and suggests sound architectural designs. Conducting engineering feasibility spikes sets the team and the customer up for success, especially if it highlights technology learnings that help the customer fully understand the feasibility/viability of an engineering solution. Summary of Key Points A pre-mortem can involve the whole team in surfacing business and technical risks. The key purpose of the engineering feasibility spike is learning. Learning comes from both conducting and sharing insights from spikes. Use new spike-infused learnings to revise, refine, re-prioritize, or create the next set of spikes. When spikes are completed, look for new weekly rhythms like adding a \u2018risk\u2019 column to the retro board or raising topics at daily standup to identify emerging risks.","title":"Engineering Feasibility Spikes: Identifying and Mitigating Risk"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#engineering-feasibility-spikes-identifying-and-mitigating-risk","text":"","title":"Engineering Feasibility Spikes: Identifying and Mitigating Risk"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#introduction","text":"Some engagements require more de-risking than others. Even after Architectural Design Sessions (ADS) an engagement may still have substantial technical unknowns.
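To make the kind of spike code described in the Graph example above concrete, here is a minimal, hedged sketch of what such a console-app check might look like. It is not the playbook's code: it assumes the azure-identity and requests Python packages are available, and the tenant ID, scope, and endpoint are illustrative placeholders chosen only to show a single delegated permission exercised against one Graph endpoint.

```python
"""Minimal spike sketch: verify a Graph call works with one delegated permission.
Illustrative only -- not production code, and not the engagement's actual spike."""
import requests
from azure.identity import DeviceCodeCredential

# Device-code flow fits a console-app spike: it prints a code and a sign-in URL.
# The tenant ID below is a placeholder.
credential = DeviceCodeCredential(tenant_id="<tenant-id>")

# Request only the minimal delegated permission under investigation.
token = credential.get_token("https://graph.microsoft.com/User.Read")

response = requests.get(
    "https://graph.microsoft.com/v1.0/me",
    headers={"Authorization": f"Bearer {token.token}"},
    timeout=30,
)
response.raise_for_status()
print(response.json().get("displayName"))
```

The value of such a snippet during a spike is the learning (which permission is actually required for the call), not the host shape; the same check could live in an Express server or a notebook.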
These types of engagements warrant an exploratory/validation phase where Engineering Feasibility Spikes can be conducted immediately after envisioning/ADS and before engineering sprints.","title":"Introduction"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#engineering-feasibility-spikes","text":"Are regimented yet collaborative time-boxed investigatory activities conducted in a feedback loop to capitalize on individual learnings to inform the team. Increase the team\u2019s knowledge and understanding while minimizing engagement risks. The following guidelines outline how Microsoft and the customer can incorporate engineering feasibility spikes into the day-to-day agile processes.","title":"Engineering Feasibility Spikes"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#pre-mortem","text":"A good way to gauge what engineering spikes to conduct is to do a pre-mortem.","title":"Pre-Mortem"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#what-is-a-pre-mortem","text":"A 90-minute meeting after envisioning/ADS that includes the entire team (and can also include the customer) which answers \"Imagine the project has failed. What problems and challenges caused this failure?\" Allows the entire team to initially raise concerns and risks early in the engagement. This input is used to decide which risks to pursue as engineering spikes.","title":"What is a Pre-Mortem?"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#sharing-learnings-current-progress","text":"","title":"Sharing Learnings & Current Progress"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#feedback-loop","text":"The key element from conducting the engineering feasibility spikes is sharing the outcomes in-flight. The team gets together and shares learning on a weekly basis (or more frequently if needed). The sharing is done via a 30-minute call. Everyone on the Dev Crew joins the call (even if not everyone is assigned an engineering spike story or even if the spike work was underway and not fully completed). The feedback loop is significantly tighter/shorter than in sprint-based agile process. Instead of using the Sprint as the forcing function to adjust/pivot/re-prioritize, the interim sharing sessions were the trigger.","title":"Feedback Loop"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#re-prioritizing-the-next-spikes","text":"After the team shares current progress, another round of planning is done. This allows the team to Establish a very tight feedback loop. Re-prioritize the next spike(s) because of the outcome from the current engineering feasibility spikes.","title":"Re-Prioritizing the Next Spikes"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#adjusting-based-on-context","text":"During the sharing call, and when the team believes it has enough information, the team sometimes comes to the realization that the original spike acceptance criteria is no longer valid. The team pivots into another area that provides more value. 
A decision log can be used to track outcomes.","title":"Adjusting Based on Context"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#engineering-feasibility-sprints-diagram","text":"The process is depicted in the diagram below.","title":"Engineering Feasibility Sprints Diagram"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#benefits","text":"","title":"Benefits"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#creating-code-samples-to-prove-out-ideas","text":"It is important to note to be intentional about the spikes not aiming to produce production-level code. The team sometimes must write code to arrive at the technical learning. The team must be cognizant that the code written for the spikes is not going to serve as the code for the final solution. The code written is just enough to drive the investigation forward with greater confidence. For example, supposed the team was exploring the API choreography of creating a Graph client with various Azure Active Directory (AAD) authentication flows and permissions. The code to demonstrate this is implemented in a console app, but it could have been done via an Express server, etc. The fact that it was a console app was not important, but rather the ability of the Graph client to be able to do operations against the Graph API endpoint with the minimal number of permissions is the main learning goal.","title":"Creating Code Samples to Prove Out Ideas"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#targeted-conversations","text":"By sharing the progress of the spike, the team\u2019s collective knowledge increases. The spikes allow the team to drive succinct conversations with various Product Groups (PGs) and other subject matter experts (SMEs). Rather than speaking at a hypothetical level, the team playbacks project/architecture concerns and concretely points out why something is a showstopper or not a viable way forward.","title":"Targeted Conversations"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#increased-customer-trust","text":"This process leads to increased customer trust. Using this process, the team Brings the customer along in the decision-making process and guides them how to go forward. Provides answers with confidence and suggests sound architectural designs. Conducting engineering feasibility spikes sets the team and the customer up for success, especially if it highlights technology learnings that help the customer fully understand the feasibility/viability of an engineering solution.","title":"Increased Customer Trust"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#summary-of-key-points","text":"A pre-mortem can involve the whole team in surfacing business and technical risks. The key purpose of the engineering feasibility spike is learning. Learning comes from both conducting and sharing insights from spikes. Use new spike infused learnings to revise, refine, re-prioritize, or create the next set of spikes. When spikes are completed, look for new weekly rhythms like adding a \u2018risk\u2019 column to the retro board or raising topics at daily standup to identify emerging risks.","title":"Summary of Key Points"},{"location":"design/design-reviews/recipes/high-level-design-recipe/","text":"High Level / Game Plan Design Recipe Why is this Valuable? 
Design at macroscopic level shows the interactions between systems and services that will be used to accomplish the project. It is intended to ensure there is high level understanding of the plan for what to build, which off-the-shelf components will be used, and which external components will need to interact with the deliverable. Things to Keep in Mind As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Attempt to illustrate different personas involved in the use cases and how/which boxes are their entry points. Prefer pictures over paragraphs. The diagrams aren't intended to generate code, so they should be fairly high level. Artifacts should indicate the direction of calls (are they outbound, inbound, or bidirectional?) and call out system boundaries where ports might need to be opened or additional infrastructure work may be needed to allow calls to be made. Sequence diagrams are helpful to show the flow of calls among components + systems. Generic box diagrams depicting data flow or call origination/destination are useful. However, the title should clearly define what the arrows show indicate. In most cases, a diagram will show either data flow or call directions but not both. Visualize the contrasting aspects of the system/diagram for ease of communication. e.g. differing technologies employed, modified vs. untouched components, or internet vs. local cloud components. Colors, grouping boxes, and iconography can be used for differentiating. Prefer ease-of-understanding for communicating ideas over strict UML correctness. Design reviews should be lightweight and should not feel like an additional process overhead. Examples","title":"High Level / Game Plan Design Recipe"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#high-level-game-plan-design-recipe","text":"","title":"High Level / Game Plan Design Recipe"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#why-is-this-valuable","text":"Design at macroscopic level shows the interactions between systems and services that will be used to accomplish the project. It is intended to ensure there is high level understanding of the plan for what to build, which off-the-shelf components will be used, and which external components will need to interact with the deliverable.","title":"Why is this Valuable?"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#things-to-keep-in-mind","text":"As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Attempt to illustrate different personas involved in the use cases and how/which boxes are their entry points. Prefer pictures over paragraphs. The diagrams aren't intended to generate code, so they should be fairly high level. Artifacts should indicate the direction of calls (are they outbound, inbound, or bidirectional?) and call out system boundaries where ports might need to be opened or additional infrastructure work may be needed to allow calls to be made. 
Sequence diagrams are helpful to show the flow of calls among components + systems. Generic box diagrams depicting data flow or call origination/destination are useful. However, the title should clearly define what the arrows indicate. In most cases, a diagram will show either data flow or call directions but not both. Visualize the contrasting aspects of the system/diagram for ease of communication, e.g. differing technologies employed, modified vs. untouched components, or internet vs. local cloud components. Colors, grouping boxes, and iconography can be used for differentiating. Prefer ease-of-understanding for communicating ideas over strict UML correctness. Design reviews should be lightweight and should not feel like an additional process overhead.","title":"Things to Keep in Mind"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#examples","text":"","title":"Examples"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/","text":"Milestone / Epic Design Review Recipe Why is this Valuable? Design at epic/milestone level can help the team make better decisions about prioritization by summarizing the value, effort, complexity, risks, and dependencies. This brief document can help the team align on the selected approach and briefly explain the rationale for other teams, subject-matter experts, project advisors, and new team members. Things to Keep in Mind As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Design reviews should be lightweight and should not feel like an additional process overhead. The Dev Lead can usually provide guidance on whether a given epic/milestone needs a design review and can help other team members in preparation. This is not a strict template that must be followed, and teams should not be bogged down with polished \"design presentations\". Think of the recipe below as a \"menu of options\" for potential questions to think through in designing this epic. Not all sections are required for every epic. Focus on sections and questions that are most relevant for making the decision and rationalizing the trade-offs. Milestone/epic design is considered high-level design; it is usually more detailed than the design included in the Game Plan, but will likely re-use some technologies, non-functional requirements, and constraints mentioned in the Game Plan. As the team learns more about the project and further refines the scope of the epic, they may specifically call out notable changes to the overall approach and, in particular, highlight any unique deployment, security, privacy, scalability, etc. characteristics of this milestone.
Template You can download the Milestone/Epic Design Review Template , copy it into your project, and use it as described in the async design review recipe .","title":"Milestone / Epic Design Review Recipe"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#milestone-epic-design-review-recipe","text":"","title":"Milestone / Epic Design Review Recipe"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#why-is-this-valuable","text":"Design at epic/milestone level can help the team make better decisions about prioritization by summarizing the value, effort, complexity, risks, and dependencies. This brief document can help the team align on the selected approach and briefly explain the rationale for other teams, subject-matter experts, project advisors, and new team members.","title":"Why is this Valuable?"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#things-to-keep-in-mind","text":"As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Design reviews should be lightweight and should not feel like an additional process overhead. Dev Lead can usually provide guidance on whether a given epic/milestone needs a design review and can help other team members in preparation. This is not a strict template that must be followed and teams should not be bogged down with polished \"design presentations\". Think of the recipe below as a \"menu of options\" for potential questions to think through in designing this epic. Not all sections are required for every epic. Focus on sections and questions that are most relevant for making the decision and rationalizing the trade-offs. Milestone/epic design is considered high-level design but is usually more detailed than the design included in the Game Plan, but will likely re-use some technologies, non-functional requirements, and constraints mentioned in the Game Plan. As the team learned more about the project and further refined the scope of the epic, they may specifically call out notable changes to the overall approach and, in particular, highlight any unique deployment, security, private, scalability, etc. characteristics of this milestone.","title":"Things to Keep in Mind"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#template","text":"You can download the Milestone/Epic Design Review Template , copy it into your project, and use it as described in the async design review recipe .","title":"Template"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/","text":"Preferred Diagram Tooling At each stage in the engagement process, diagrams are a key part of the design review. The preferred tooling for creating and maintaining diagrams is to choose one of the following: Microsoft Visio Microsoft PowerPoint The .drawio.png (or .drawio ) format from diagrams.net (formerly draw.io ) In all cases, we recommend storing the exported PNG images from these diagrams in the repo along with the source files so they can easily be referenced in documentation and more easily reviewed during PRs. The .drawio.png format stores both at once. 
Microsoft Visio It contains a lot of shapes out of the box, including Azure icons, the desktop app exists on PC, and there's a great Web app. Most diagrams in the Azure Architecture Center are Visio diagrams. Microsoft PowerPoint Diagrams can be easily reused in presentations, a PowerPoint license is pretty common, the desktop app exists on PC and on the Mac, and there's a great Web app. .drawio.png There are different desktop, web apps and VS Code extensions. This tooling can be used like Visio or LucidChart, without the licensing/remote storage concerns. Furthermore, Diagrams.net has a collection of Azure/Office/Microsoft icons, as well as other well-known tech, so it is not only useful for swimlanes and flow diagrams, but also for architecture diagrams. .drawio.png should be preferred over the .drawio format. The .drawio.png format uses the metadata layer within the PNG file-format to hide SVG vector graphics representation, then renders the .png when saving. This clever use of both the meta layer and image layer allows anyone to further edit the PNG file. It also renders like a normal PNG in browsers and other viewers, making it easy to transfer and embed. Furthermore, it can be edited within VSCode very easily using the Draw.io Integration VSCode Extension .","title":"Preferred Diagram Tooling"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#preferred-diagram-tooling","text":"At each stage in the engagement process, diagrams are a key part of the design review. The preferred tooling for creating and maintaining diagrams is to choose one of the following: Microsoft Visio Microsoft PowerPoint The .drawio.png (or .drawio ) format from diagrams.net (formerly draw.io ) In all cases, we recommend storing the exported PNG images from these diagrams in the repo along with the source files so they can easily be referenced in documentation and more easily reviewed during PRs. The .drawio.png format stores both at once.","title":"Preferred Diagram Tooling"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#microsoft-visio","text":"It contains a lot of shapes out of the box, including Azure icons, the desktop app exists on PC, and there's a great Web app. Most diagrams in the Azure Architecture Center are Visio diagrams.","title":"Microsoft Visio"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#microsoft-powerpoint","text":"Diagrams can be easily reused in presentations, a PowerPoint license is pretty common, the desktop app exists on PC and on the Mac, and there's a great Web app.","title":"Microsoft PowerPoint"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#drawiopng","text":"There are different desktop, web apps and VS Code extensions. This tooling can be used like Visio or LucidChart, without the licensing/remote storage concerns. Furthermore, Diagrams.net has a collection of Azure/Office/Microsoft icons, as well as other well-known tech, so it is not only useful for swimlanes and flow diagrams, but also for architecture diagrams. .drawio.png should be preferred over the .drawio format. The .drawio.png format uses the metadata layer within the PNG file-format to hide SVG vector graphics representation, then renders the .png when saving. This clever use of both the meta layer and image layer allows anyone to further edit the PNG file. It also renders like a normal PNG in browsers and other viewers, making it easy to transfer and embed. 
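To ground the .drawio.png description above, here is a minimal sketch showing how one might confirm that a .drawio.png file really does carry editable diagram data in its PNG text chunks. It assumes the Pillow package is installed and uses a hypothetical file name; the exact chunk key diagrams.net uses is not asserted here, so the script simply lists whatever text chunks the file contains.

```python
"""Minimal sketch: list the PNG text chunks embedded in a .drawio.png file.
Assumes Pillow is installed; the file name and chunk keys are illustrative."""
from PIL import Image

with Image.open("architecture.drawio.png") as im:  # hypothetical file name
    im.load()  # make sure the text chunks have been parsed
    chunks = getattr(im, "text", {})
    if not chunks:
        print("No text chunks found - this may be a plain PNG export.")
    for key, value in chunks.items():
        # Print a short preview of each embedded chunk rather than the full payload.
        print(f"{key}: {str(value)[:80]}...")
```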
Furthermore, it can be edited within VSCode very easily using the Draw.io Integration VSCode Extension .","title":".drawio.png"},{"location":"design/design-reviews/recipes/technical-spike/","text":"Technical Spike From Wikipedia ... A spike in a sprint can be used in a number of ways: As a way to familiarize the team with new hardware or software To analyze a problem thoroughly and assist in properly dividing work among separate team members. Spike tests can also be used to mitigate future risk, and may uncover additional issues that have escaped notice. A distinction can be made between technical spikes and functional spikes. The technical spike is used more often for evaluating the impact new technology has on the current implementation. A functional spike is used to determine the interaction with a new feature or implementation. Engineering feasibility spikes can also be conducted to de-risk an engagement and increase the team's understanding. Deliverable Generally the deliverable from a Technical Spike should be a document detailing what was evaluated and the outcome of that evaluation. The specifics contained in the document will vary, but there are some general principles that might be helpful. Problem Statement/Goals: Be sure to include a section that clearly details why an evaluation is being done and what the outcome of this evaluation should be. This is helpful to ensure that the technical spike was productive and advanced the overall project in some way. Make sure it is repeatable: Detail the components used, installation instructions, configuration, etc. required to build the environment that was used for evaluation and testing. If any testing is performed, make sure to include the scripts, links to the applications, configuration options, etc. so that testing could be performed again. There are many reasons that the evaluation environment may need to be rebuilt. For example: Another scenario needs to be tested. A new version of the technology has been released. The technology needs to be tested on a new platform. Fact-Finding: The goal of a spike should be fact-finding, not decision-making or recommendation. Ideally, the technology spike digs into a number of technical questions and gets answers so that the broader project team can then come back together and agree on an appropriate course forward. Evidence: Generally you will use sections to summarize the results of testing which do not include the potentially hundreds of detailed results, however, you should include all detailed testing results in an appendix or an attachment. Having full results detailed somewhere will help the team trust the results. In addition, data can be interpreted lots of different ways, and it may be necessary to go back to the original data for a new interpretation. Organization: The technical documentation can be lengthy. It is generally a good idea to organize sections with headers and include a table of contents. Generally sections towards the beginning of the document should summarize data and use one or more appendices for more details.","title":"Technical Spike"},{"location":"design/design-reviews/recipes/technical-spike/#technical-spike","text":"From Wikipedia ... A spike in a sprint can be used in a number of ways: As a way to familiarize the team with new hardware or software To analyze a problem thoroughly and assist in properly dividing work among separate team members. Spike tests can also be used to mitigate future risk, and may uncover additional issues that have escaped notice. 
A distinction can be made between technical spikes and functional spikes. The technical spike is used more often for evaluating the impact new technology has on the current implementation. A functional spike is used to determine the interaction with a new feature or implementation. Engineering feasibility spikes can also be conducted to de-risk an engagement and increase the team's understanding.","title":"Technical Spike"},{"location":"design/design-reviews/recipes/technical-spike/#deliverable","text":"Generally the deliverable from a Technical Spike should be a document detailing what was evaluated and the outcome of that evaluation. The specifics contained in the document will vary, but there are some general principles that might be helpful. Problem Statement/Goals: Be sure to include a section that clearly details why an evaluation is being done and what the outcome of this evaluation should be. This is helpful to ensure that the technical spike was productive and advanced the overall project in some way. Make sure it is repeatable: Detail the components used, installation instructions, configuration, etc. required to build the environment that was used for evaluation and testing. If any testing is performed, make sure to include the scripts, links to the applications, configuration options, etc. so that testing could be performed again. There are many reasons that the evaluation environment may need to be rebuilt. For example: Another scenario needs to be tested. A new version of the technology has been released. The technology needs to be tested on a new platform. Fact-Finding: The goal of a spike should be fact-finding, not decision-making or recommendation. Ideally, the technology spike digs into a number of technical questions and gets answers so that the broader project team can then come back together and agree on an appropriate course forward. Evidence: Generally you will use sections to summarize the results of testing which do not include the potentially hundreds of detailed results, however, you should include all detailed testing results in an appendix or an attachment. Having full results detailed somewhere will help the team trust the results. In addition, data can be interpreted lots of different ways, and it may be necessary to go back to the original data for a new interpretation. Organization: The technical documentation can be lengthy. It is generally a good idea to organize sections with headers and include a table of contents. Generally sections towards the beginning of the document should summarize data and use one or more appendices for more details.","title":"Deliverable"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/","text":"Template: Feature / Story Design Review [DRAFT/WIP] [Feature or Story Design Title] Does the feature re-use or extend existing patterns / interfaces that have already been established for the project? Does the feature expose new patterns or interfaces that will establish a new standard for new future development? Feature/Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.] Overview/Problem Statement It can also be a link to the work item . Describe the feature/story with a high-level summary. Consider additional background and justification, for posterity and historical context. List any assumptions that were made for this design. Goals/In-Scope List the goals that the feature/story will help us achieve that are most relevant for the design review discussion. 
This should include acceptance criteria required to meet definition of done . Non-Goals / Out-of-Scope List the non-goals for the feature/story. This contains work that is beyond the scope of what the feature/component/service is intended for. Proposed Design Briefly describe the high-level architecture for the feature/story. Relevant diagrams (e.g. sequence, component, context, deployment) should be included here. Technology Describe the relevant OS, Web server, presentation layer, persistence layer, caching, eventing/messaging/jobs, etc. \u2013 whatever is applicable to the overall technology solution and how are they going to be used. Describe the usage of any libraries of OSS components. Briefly list the languages(s) and platform(s) that comprise the stack. Non-Functional Requirements What are the primary performance and scalability concerns for this feature/story? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user) Dependencies Does this feature/story need to be sequenced after another feature/story assigned to the same team and why? Is the feature/story dependent on another team completing other work? Will the team need to wait for that work to be completed or could the work proceed in parallel? Risks & Mitigation Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information and secrets treated in a safe and secure manner? Open Questions List any open questions/concerns here. Resources List any additional resources here including links to backlog items, work items or other documents.","title":"Template: Feature / Story Design Review"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#template-feature-story-design-review","text":"","title":"Template: Feature / Story Design Review"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#draftwip-feature-or-story-design-title","text":"Does the feature re-use or extend existing patterns / interfaces that have already been established for the project? Does the feature expose new patterns or interfaces that will establish a new standard for new future development? Feature/Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.]","title":"[DRAFT/WIP] [Feature or Story Design Title]"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#overviewproblem-statement","text":"It can also be a link to the work item . Describe the feature/story with a high-level summary. Consider additional background and justification, for posterity and historical context. List any assumptions that were made for this design.","title":"Overview/Problem Statement"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#goalsin-scope","text":"List the goals that the feature/story will help us achieve that are most relevant for the design review discussion. 
This should include acceptance criteria required to meet definition of done .","title":"Goals/In-Scope"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#non-goals-out-of-scope","text":"List the non-goals for the feature/story. This contains work that is beyond the scope of what the feature/component/service is intended for.","title":"Non-Goals / Out-of-Scope"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#proposed-design","text":"Briefly describe the high-level architecture for the feature/story. Relevant diagrams (e.g. sequence, component, context, deployment) should be included here.","title":"Proposed Design"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#technology","text":"Describe the relevant OS, Web server, presentation layer, persistence layer, caching, eventing/messaging/jobs, etc. \u2013 whatever is applicable to the overall technology solution and how are they going to be used. Describe the usage of any libraries of OSS components. Briefly list the languages(s) and platform(s) that comprise the stack.","title":"Technology"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#non-functional-requirements","text":"What are the primary performance and scalability concerns for this feature/story? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user)","title":"Non-Functional Requirements"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#dependencies","text":"Does this feature/story need to be sequenced after another feature/story assigned to the same team and why? Is the feature/story dependent on another team completing other work? Will the team need to wait for that work to be completed or could the work proceed in parallel?","title":"Dependencies"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#risks-mitigation","text":"Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information and secrets treated in a safe and secure manner?","title":"Risks & Mitigation"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#open-questions","text":"List any open questions/concerns here.","title":"Open Questions"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#resources","text":"List any additional resources here including links to backlog items, work items or other documents.","title":"Resources"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/","text":"Template: Milestone / Epic Design Review [DRAFT/WIP] [Milestone/Epic Design Title] Please refer to the milestone/epic design review recipe for things to keep in mind when using this template. Milestone / Epic: Name Project / Engagement: [Project Engagement] Authors: [Author1, Author2, etc.] Overview / Problem Statement Describe the milestone/epic with a high-level summary and a problem statement. Consider including or linking to any additional background (e.g. 
Game Plan or Checkpoint docs) if it is useful for historical context. Goals / In-Scope List a few bullet points of goals that this milestone/epic will achieve and that are most relevant for the design review discussion. You may include acceptance criteria required to meet the Definition of Done . Non-goals / Out-of-Scope List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this milestone/epic. Proposed Design / Suggested Approach To optimize the time investment, this should be brief since it is likely that details will change as the epic/milestone is further decomposed into features and stories. The goal is to convey the vision and complexity in something that can be understood in a few minutes and can help guide a discussion (either asynchronously via comments or in a meeting). A paragraph to describe the proposed design / suggested approach for this milestone/epic. A diagram (e.g. architecture, sequence, component, deployment, etc.) or pseudo-code snippet to make it easier to talk through the approach. List a few of the alternative approaches that were considered and include the brief key Pros and Cons used to help rationalize the decision. For example: Pros Cons Simple to implement Creates secondary identity system Repeatable pattern/code artifact Deployment requires admin credentials Technology Briefly list the language(s) and platform(s) that comprise the stack. This may include anything that is needed to understand the overall solution: OS, web server, presentation layer, persistence layer, caching, eventing, etc. Non-Functional Requirements What are the primary performance and scalability concerns for this milestone/epic? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user) Operationalization Are there any specific considerations for the CI/CD setup of the milestone/epic? Is there a process (manual or automated) to promote builds from lower environments to higher ones? Does this milestone/epic require zero-downtime deployments, and if so, how are they achieved? Are there mechanisms in place to roll back a deployment? What is the process for monitoring the functionality provided by this milestone/epic? Dependencies Does this milestone/epic need to be sequenced after another epic assigned to the same team, and why? Is the milestone/epic dependent on another team completing other work? Will the team need to wait for that work to be completed, or could the work proceed in parallel? Risks & Mitigations Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information, including secrets, treated in a safe and secure manner? Open Questions Include any open questions and concerns.
Resources Include any additional resources including links to work items or other documents.","title":"Template: Milestone / Epic Design Review"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#template-milestone-epic-design-review","text":"","title":"Template: Milestone / Epic Design Review"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#draftwip-milestoneepic-design-title","text":"Please refer to the milestone/epic design review recipe for things to keep in mind when using this template. Milestone / Epic: Name Project / Engagement: [Project Engagement] Authors: [Author1, Author2, etc.]","title":"[DRAFT/WIP] [Milestone/Epic Design Title]"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#overview-problem-statement","text":"Describe the milestone/epic with a high-level summary and a problem statement. Consider including or linking to any additional background (e.g. Game Plan or Checkpoint docs) if it is useful for historical context.","title":"Overview / Problem Statement"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#goals-in-scope","text":"List a few bullet points of goals that this milestone/epic will achieve and that are most relevant for the design review discussion. You may include acceptable criteria required to meet the Definition of Done .","title":"Goals / In-Scope"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#non-goals-out-of-scope","text":"List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this milestone/epic.","title":"Non-goals / Out-of-Scope"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#proposed-design-suggested-approach","text":"To optimize the time investment, this should be brief since it is likely that details will change as the epic/milestone is further decomposed into features and stories. The goal being to convey the vision and complexity in something that can be understood in a few minutes and can help guide a discussion (either asynchronously via comments or in a meeting). A paragraph to describe the proposed design / suggested approach for this milestone/epic. A diagram (e.g. architecture, sequence, component, deployment, etc.) or pseudo-code snippet to make it easier to talk through the approach. List a few of the alternative approaches that were considered and include the brief key Pros and Cons used to help rationalize the decision. For example: Pros Cons Simple to implement Creates secondary identity system Repeatable pattern/code artifact Deployment requires admin credentials","title":"Proposed Design / Suggested Approach"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#technology","text":"Briefly list the languages(s) and platform(s) that comprise the stack. This may include anything that is needed to understand the overall solution: OS, web server, presentation layer, persistence layer, caching, eventing, etc.","title":"Technology"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#non-functional-requirements","text":"What are the primary performance and scalability concerns for this milestone/epic? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? 
How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user)","title":"Non-Functional Requirements"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#operationalization","text":"Are there any specific considerations for the CI/CD setup of milestone/epic? Is there a process (manual or automated) to promote builds from lower environments to higher ones? Does this milestone/epic require zero-downtime deployments, and if so, how are they achieved? Are there mechanisms in place to rollback a deployment? What is the process for monitoring the functionality provided by this milestone/epic?","title":"Operationalization"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#dependencies","text":"Does this milestone/epic need to be sequenced after another epic assigned to the same team and why? Is the milestone/epic dependent on another team completing other work? Will the team need to wait for that work to be completed or could the work proceed in parallel?","title":"Dependencies"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#risks-mitigations","text":"Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information and secrets treated in a safe and secure manner?","title":"Risks & Mitigations"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#open-questions","text":"Include any open questions and concerns.","title":"Open Questions"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#resources","text":"Include any additional resources including links to work items or other documents.","title":"Resources"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/","text":"Template: Task Design Review [DRAFT/WIP] [Task Design Title] When developing a design document for a new task, it should contain a detailed design proposal demonstrating how it will solve the goals outlined below. Not all tasks require a design review, but when they do it is likely that there many unknowns, or the solution may be more complex. The design should include diagrams, pseudocode, interface contracts as needed to provide a detailed understanding of the proposal. Task Name Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.] Overview/Problem Statement It can also be a link to the work item . Describe the task with a high-level summary. Consider additional background and justification, for posterity and historical context. Goals/In-Scope List a few bullet points of what this task will achieve and that are most relevant for the design review discussion. This should include acceptance criteria required to meet the definition of done . Non-goals / Out-of-Scope List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this task. Proposed Options Describe the detailed design to accomplish the proposed task. What patterns & practices will be used and why were they chosen. Were any alternate proposals considered? What new components are required to be developed? Are there any existing components that require updates? Relevant diagrams (e.g. 
sequence, component, context, deployment) should be included here. Technology Choices Describe any libraries and OSS components that will be used to complete the task. Briefly list the language(s) and platform(s) that comprise the stack. Open Questions List any open questions/concerns here. Resources List any additional resources here including links to backlog items, work items or other documents.","title":"Template: Task Design Review"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#template-task-design-review","text":"","title":"Template: Task Design Review"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#draftwip-task-design-title","text":"When developing a design document for a new task, it should contain a detailed design proposal demonstrating how it will solve the goals outlined below. Not all tasks require a design review, but when they do, it is likely that there are many unknowns, or the solution may be more complex. The design should include diagrams, pseudocode, and interface contracts as needed to provide a detailed understanding of the proposal. Task Name Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.]","title":"[DRAFT/WIP] [Task Design Title]"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#overviewproblem-statement","text":"It can also be a link to the work item . Describe the task with a high-level summary. Consider additional background and justification, for posterity and historical context.","title":"Overview/Problem Statement"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#goalsin-scope","text":"List a few bullet points of what this task will achieve and that are most relevant for the design review discussion. This should include acceptance criteria required to meet the definition of done .","title":"Goals/In-Scope"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#non-goals-out-of-scope","text":"List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this task.","title":"Non-goals / Out-of-Scope"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#proposed-options","text":"Describe the detailed design to accomplish the proposed task. What patterns & practices will be used, and why were they chosen? Were any alternate proposals considered? What new components are required to be developed? Are there any existing components that require updates? Relevant diagrams (e.g. sequence, component, context, deployment) should be included here.","title":"Proposed Options"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#technology-choices","text":"Describe any libraries and OSS components that will be used to complete the task. 
Briefly list the language(s) and platform(s) that comprise the stack.","title":"Technology Choices"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#open-questions","text":"List any open questions/concerns here.","title":"Open Questions"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#resources","text":"List any additional resources here including links to backlog items, work items or other documents.","title":"Resources"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/","text":"Template: Technical Spike Spike: [Spike Name] Conducted by: {Names and at least one email address for follow-up questions} Backlog Work Item: {Link to the work item to provide more context} Sprint: {Which sprint did the study take place? Include sprint start date} Goal Describe what question(s) the spike intends to answer and why. Method Describe how the team will uncover the answer to the question(s) the spike intends to answer. For example: Build prototype to test. Research existing documents and samples. Discuss with subject matter experts. Evidence Document the evidence collected that informed the conclusions below. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provide the desired capabilities Conclusions What was the answer to the question(s) outlined at the start of the spike? Capture what was learned that will inform future work. Next Steps What work is expected as an outcome of the learning within this spike. Was there work that was blocked or dependent on the learning within this spike?","title":"Template: Technical Spike"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#template-technical-spike","text":"","title":"Template: Technical Spike"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#spike-spike-name","text":"Conducted by: {Names and at least one email address for follow-up questions} Backlog Work Item: {Link to the work item to provide more context} Sprint: {Which sprint did the study take place? Include sprint start date}","title":"Spike: [Spike Name]"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#goal","text":"Describe what question(s) the spike intends to answer and why.","title":"Goal"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#method","text":"Describe how the team will uncover the answer to the question(s) the spike intends to answer. For example: Build prototype to test. Research existing documents and samples. Discuss with subject matter experts.","title":"Method"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#evidence","text":"Document the evidence collected that informed the conclusions below. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provide the desired capabilities","title":"Evidence"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#conclusions","text":"What was the answer to the question(s) outlined at the start of the spike? 
Capture what was learned that will inform future work.","title":"Conclusions"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#next-steps","text":"What work is expected as an outcome of the learning within this spike. Was there work that was blocked or dependent on the learning within this spike?","title":"Next Steps"},{"location":"design/design-reviews/trade-studies/","text":"Trade Studies Trade studies are a tool for selecting the best option out of several possible options for a given problem (for example: compute, storage). They evaluate potential choices against a set of objective criteria/requirements to clearly lay out the benefits and limitations of each solution. Trade studies are a concept from systems engineering that we adapted for software projects. Trade studies have proved to be a critical tool to drive alignment with the stakeholders, earn credibility while doing so and ensure our decisions were backed by data and not bias. When to Use Trade studies go hand in hand with high level architecture design. This usually occurs as project requirements are solidifying, before coding begins. Trade studies continue to be useful throughout the project any time there are multiple options that need to be selected from. New decision point could occur from changing requirements, getting results of a research spike, or identifying challenges that were not originally seen. Trade studies should be avoided if there is a clear solution choice. Because they require each solution to be fully thought out, they have the potential to take a lot of time to complete. When there is a clear design, the trade study should be omitted, and an entry should be made in the Decision Log documenting the decision. Why Trade Studies Trade studies are a way of formalizing the design process and leaving a documentation record for why the decision was made. This gives a few advantages: The trade study template guides a user through the design process. This provides structure to the design stage. Having a uniform design process aids splitting work amongst team members. We have had success with engineers pairing to define requirements, evaluation criteria, and brainstorming possible solutions. Then they can each split to review solutions in parallel, before rejoining to make the final decision. The completed trade study document helps drive alignment across the team and decision makers. For presenting results of the study, the document itself can be used to highlight the main points. Alternatively, we have extracted requirements, diagrams for each solution, and the results table into a slide deck to give high level overviews of the results. The completed trade study gets checked into the code repository, providing documentation of the decision process. This leaves a history of the requirements at the time that lead to each decision. Also, the results table gives a quick reference for how the decision would be impacted if requirements change as the project proceeds. Flow of a Trade Study Trade studies can vary widely in scope; however, they follow the common pattern below: Solidify the requirements \u2013 Work with the stakeholders to agree on the requirements for the functionality that you are trying to build. Create evaluation criteria \u2013 This is a set of qualitative and quantitative assessment points that represent the requirements. Taken together, they become an easy to measure stand-in for the potentially abstract requirements. 
Brainstorm solutions \u2013 Gather a list of possible solutions to the problem. Then, use your best judgement to pick the 2-4 solutions that seem most promising. For assistance narrowing solutions, remember to reach out to subject-matter experts and other teams who may have gone through a similar decision. Evaluate shortlisted solutions \u2013 Dive deep into each solution and measure it against the evaluation criteria. In this stage, time box your research to avoid overly investing in any given area. Compare results and choose solution - Align the decision with the team. If you are unable to decide, then a clear list of action items and owners to drive the final decision must be produced. Template See template.md for an example of how to structure the above information. This template was created to guide a user through conducting a trade study. Once the decision has been made we recommend adding an entry to the Decision Log that has references back to the full text of the trade study.","title":"Trade Studies"},{"location":"design/design-reviews/trade-studies/#trade-studies","text":"Trade studies are a tool for selecting the best option out of several possible options for a given problem (for example: compute, storage). They evaluate potential choices against a set of objective criteria/requirements to clearly lay out the benefits and limitations of each solution. Trade studies are a concept from systems engineering that we adapted for software projects. Trade studies have proved to be a critical tool to drive alignment with the stakeholders, earn credibility while doing so and ensure our decisions were backed by data and not bias.","title":"Trade Studies"},{"location":"design/design-reviews/trade-studies/#when-to-use","text":"Trade studies go hand in hand with high level architecture design. This usually occurs as project requirements are solidifying, before coding begins. Trade studies continue to be useful throughout the project any time there are multiple options that need to be selected from. New decision point could occur from changing requirements, getting results of a research spike, or identifying challenges that were not originally seen. Trade studies should be avoided if there is a clear solution choice. Because they require each solution to be fully thought out, they have the potential to take a lot of time to complete. When there is a clear design, the trade study should be omitted, and an entry should be made in the Decision Log documenting the decision.","title":"When to Use"},{"location":"design/design-reviews/trade-studies/#why-trade-studies","text":"Trade studies are a way of formalizing the design process and leaving a documentation record for why the decision was made. This gives a few advantages: The trade study template guides a user through the design process. This provides structure to the design stage. Having a uniform design process aids splitting work amongst team members. We have had success with engineers pairing to define requirements, evaluation criteria, and brainstorming possible solutions. Then they can each split to review solutions in parallel, before rejoining to make the final decision. The completed trade study document helps drive alignment across the team and decision makers. For presenting results of the study, the document itself can be used to highlight the main points. Alternatively, we have extracted requirements, diagrams for each solution, and the results table into a slide deck to give high level overviews of the results. 
The completed trade study gets checked into the code repository, providing documentation of the decision process. This leaves a history of the requirements at the time that lead to each decision. Also, the results table gives a quick reference for how the decision would be impacted if requirements change as the project proceeds.","title":"Why Trade Studies"},{"location":"design/design-reviews/trade-studies/#flow-of-a-trade-study","text":"Trade studies can vary widely in scope; however, they follow the common pattern below: Solidify the requirements \u2013 Work with the stakeholders to agree on the requirements for the functionality that you are trying to build. Create evaluation criteria \u2013 This is a set of qualitative and quantitative assessment points that represent the requirements. Taken together, they become an easy to measure stand-in for the potentially abstract requirements. Brainstorm solutions \u2013 Gather a list of possible solutions to the problem. Then, use your best judgement to pick the 2-4 solutions that seem most promising. For assistance narrowing solutions, remember to reach out to subject-matter experts and other teams who may have gone through a similar decision. Evaluate shortlisted solutions \u2013 Dive deep into each solution and measure it against the evaluation criteria. In this stage, time box your research to avoid overly investing in any given area. Compare results and choose solution - Align the decision with the team. If you are unable to decide, then a clear list of action items and owners to drive the final decision must be produced.","title":"Flow of a Trade Study"},{"location":"design/design-reviews/trade-studies/#template","text":"See template.md for an example of how to structure the above information. This template was created to guide a user through conducting a trade study. Once the decision has been made we recommend adding an entry to the Decision Log that has references back to the full text of the trade study.","title":"Template"},{"location":"design/design-reviews/trade-studies/template/","text":"Trade Study Template This generic template can be used for any situation where we have a set of requirements that can be satisfied by multiple solutions. They can range in scope from choice of which open source package to use, to full architecture designs. Trade Study/Design: [Trade Study Name] Conducted by: {Names of those that can answer follow-up questions and at least one email address} Backlog Work Item: {Link to the work item to provide more context} Sprint: {Which sprint did the study take place? Include sprint start date} Decision: {Solution chosen to proceed with} Decision Makers: IMPORTANT Designs should be completed within a sprint. Most designs will benefit from brevity. To accomplish this: Narrow the scope of the design. Narrow evaluation to 2 to 3 solutions. Design experiments to collect evidence as fast as possible. Overview Description of the problem we are solving. This should include: Assumptions about the rest of the system Constraints that apply to the system, both business and technical Requirements for the functionality that needs to be implemented, including possible inputs and outputs (optional) A diagram showing the different pieces Desired Outcomes The following section should establish the desired capabilities of the solution for it to be successful. This can be done by answering the following questions either directly or via link to related artifact (i.e. PBI or Feature description). 
Acceptance: What capabilities should be demonstrable for a stakeholder to accept the solution? Justification: How does this contribute to the broader project objectives? IMPORTANT This is not intended to define outcomes for the design activity itself. It is intended to define the outcomes for the solution being designed. As mentioned in the User Interface section, if the trade study is analyzing an application development solution, make use of the persona stories to derive desired outcomes. For example, if a persona story exemplifies a certain accessibility requirement, the parallel desired outcome may be \"The application must be accessible for people with vision-based disabilities\". Evaluation Criteria The former should be condensed down to a set of \"evaluation criteria\" that we can rate any potential solutions against. Examples of evaluation criteria: Runs on Windows and Linux - Binary response Compute Usage - Could be categories that effectively rank different options: High, Medium, Low Cost of the solution \u2013 An estimated numeric field The results section contains a table evaluating each solution against the evaluation criteria. Key Metrics (Optional) If available, describe any measurable metrics that are important to the success of the solution. Examples include, but are not limited to: Performance & Scale targets such as, Requests/Second, Latency, and Response time (at a given percentile). Azure consumption cost budget. For example, given certain usage, solution expected to cost X dollars per month. Availability uptime of XX% over X time period. Consistency. Writes available for read within X milliseconds. Recovery point objective (RPO) & Recovery time objective (RTO). Constraints (Optional) If applicable, describe the boundaries from which we have to design the solution. This could be thought of as the \"box\" the team has to work within. This box may be defined as: Technologies, services, and languages an organization is comfortable operating/managing. Devices, operating systems, and/or browsers that must be supported. Backward Compatibility. For example, public interfaces consumed by client or third party apps cannot introduce breaking changes. Integrations or dependencies with other systems. For example, push notifications to client apps must be done via existing websockets channel. Accessibility Accessibility is never optional . Microsoft has made a public commitment to always produce accessible applications. For more information visit the official Microsoft accessibility site and read the Accessibility page. Consider the following prompts when determining application accessibility requirements: Does the application meet industry accessibility standards? Are training, support, and documentation resources accessible? Is the application designed to be inclusive for people with a broad range of abilities, languages, and cultures? Solution Hypotheses Enumerate the solutions that are believed to deliver the outcomes defined above. Note: Limiting the evaluated solutions to 2 or 3 potential candidates can help manage the time spent on the evaluation. If there are more than 3 candidates, prioritize what the team feels are the top 3. If appropriate, the eliminated candidates can be mentioned to capture why they were eliminated. Additionally, there should be at least two options compared, otherwise you didn't need a trade study. [Solution 1] Add a brief description of the solution and how it's expected to produce the desired outcomes. 
If appropriate, illustrations/diagrams can be used to reduce the amount of text explanation required to describe the solution. NOTE: Using present tense language to describe the solution can help avoid confusion between current state and future state. For example, use \"This solution works by doing...\" vs. \"This solution would work by doing...\". Each solution section should contain the following: Description of the solution (optional) A diagram to quickly reference the solution Possible variations - things that are small variations on the main solution can be grouped together Evaluation of the idea based on the evaluation criteria above The depth, detail, and contents of these sections will vary based on the complexity of the functionality being developed. Experiment(s) Describe how the solution will be evaluated to prove or dis-prove that it will produce the desired outcomes. This could take many forms such as building a prototype and researching existing documentation and sample solutions. Additionally, document any assumptions made as part of the experiment. NOTE: Time boxing these experiments can be beneficial to make sure the team is making the best use of the time by focusing on collecting key evidence in the simplest/fastest way possible. Evidence Present the evidence collected during experimentation that supports the hypothesis that this solution will meet the desired outcomes. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provide the desired capabilities NOTE: Evidence is not required for every capability, metric, or constraint for the design to be considered done. Instead, focus on presenting evidence that is most relevant and impactful towards supporting or eliminating the hypothesis. [Solution 2] ... [Solution N] ... Results This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Evaluation Criteria 1 Evaluation Criteria 2 ... Evaluation Criteria N Solution 1 Solution 2 ... Solution M Note: The formatting of the table can change. In the past, we have had success with qualitative descriptions in the table entries and color coding the cells to represent good, fair, bad. Decision The chosen solution, or a list of questions that need to be answered before the decision can be made. In the latter case, each question needs an action item and an assigned person for answering the question. Once those questions are answered, the document must be updated to reflect the answers, and the final decision. In the first case, describe which solution was chosen and why. Summarize what evidence informed the decision and how that evidence mapped to the desired outcomes. Note: Decisions should be made with the understanding that they can change as the team learns more. It's a starting point, not a contract. Next Steps What work is expected once a decision has been reached? Examples include but are not limited to: Creating new PBI's or modifying existing ones Follow up spikes Creating specification for public interfaces and integrations between other work streams. Decision Log Entry","title":"Trade Study Template"},{"location":"design/design-reviews/trade-studies/template/#trade-study-template","text":"This generic template can be used for any situation where we have a set of requirements that can be satisfied by multiple solutions. 
They can range in scope from choice of which open source package to use, to full architecture designs.","title":"Trade Study Template"},{"location":"design/design-reviews/trade-studies/template/#trade-studydesign-trade-study-name","text":"Conducted by: {Names of those that can answer follow-up questions and at least one email address} Backlog Work Item: {Link to the work item to provide more context} Sprint: {Which sprint did the study take place? Include sprint start date} Decision: {Solution chosen to proceed with} Decision Makers: IMPORTANT Designs should be completed within a sprint. Most designs will benefit from brevity. To accomplish this: Narrow the scope of the design. Narrow evaluation to 2 to 3 solutions. Design experiments to collect evidence as fast as possible.","title":"Trade Study/Design: [Trade Study Name]"},{"location":"design/design-reviews/trade-studies/template/#overview","text":"Description of the problem we are solving. This should include: Assumptions about the rest of the system Constraints that apply to the system, both business and technical Requirements for the functionality that needs to be implemented, including possible inputs and outputs (optional) A diagram showing the different pieces","title":"Overview"},{"location":"design/design-reviews/trade-studies/template/#desired-outcomes","text":"The following section should establish the desired capabilities of the solution for it to be successful. This can be done by answering the following questions either directly or via link to related artifact (i.e. PBI or Feature description). Acceptance: What capabilities should be demonstrable for a stakeholder to accept the solution? Justification: How does this contribute to the broader project objectives? IMPORTANT This is not intended to define outcomes for the design activity itself. It is intended to define the outcomes for the solution being designed. As mentioned in the User Interface section, if the trade study is analyzing an application development solution, make use of the persona stories to derive desired outcomes. For example, if a persona story exemplifies a certain accessibility requirement, the parallel desired outcome may be \"The application must be accessible for people with vision-based disabilities\".","title":"Desired Outcomes"},{"location":"design/design-reviews/trade-studies/template/#evaluation-criteria","text":"The former should be condensed down to a set of \"evaluation criteria\" that we can rate any potential solutions against. Examples of evaluation criteria: Runs on Windows and Linux - Binary response Compute Usage - Could be categories that effectively rank different options: High, Medium, Low Cost of the solution \u2013 An estimated numeric field The results section contains a table evaluating each solution against the evaluation criteria.","title":"Evaluation Criteria"},{"location":"design/design-reviews/trade-studies/template/#key-metrics-optional","text":"If available, describe any measurable metrics that are important to the success of the solution. Examples include, but are not limited to: Performance & Scale targets such as, Requests/Second, Latency, and Response time (at a given percentile). Azure consumption cost budget. For example, given certain usage, solution expected to cost X dollars per month. Availability uptime of XX% over X time period. Consistency. Writes available for read within X milliseconds. 
Recovery point objective (RPO) & Recovery time objective (RTO).","title":"Key Metrics (Optional)"},{"location":"design/design-reviews/trade-studies/template/#constraints-optional","text":"If applicable, describe the boundaries from which we have to design the solution. This could be thought of as the \"box\" the team has to work within. This box may be defined as: Technologies, services, and languages an organization is comfortable operating/managing. Devices, operating systems, and/or browsers that must be supported. Backward Compatibility. For example, public interfaces consumed by client or third party apps cannot introduce breaking changes. Integrations or dependencies with other systems. For example, push notifications to client apps must be done via existing websockets channel.","title":"Constraints (Optional)"},{"location":"design/design-reviews/trade-studies/template/#accessibility","text":"Accessibility is never optional . Microsoft has made a public commitment to always produce accessible applications. For more information visit the official Microsoft accessibility site and read the Accessibility page. Consider the following prompts when determining application accessibility requirements: Does the application meet industry accessibility standards? Are training, support, and documentation resources accessible? Is the application designed to be inclusive for people will a broad range of abilities, languages, and cultures?","title":"Accessibility"},{"location":"design/design-reviews/trade-studies/template/#solution-hypotheses","text":"Enumerate the solutions that are believed to deliver the outcomes defined above. Note: Limiting the evaluated solutions to 2 or 3 potential candidates can help manage the time spent on the evaluation. If there are more than 3 candidates, prioritize what the team feels are the top 3. If appropriate, the eliminated candidates can be mentioned to capture why they were eliminated. Additionally, there should be at least two options compared, otherwise you didn't need a trade study.","title":"Solution Hypotheses"},{"location":"design/design-reviews/trade-studies/template/#solution-1","text":"Add a brief description of the solution and how its expected to produce the desired outcomes. If appropriate, illustrations/diagrams can be used to reduce the amount of text explanation required to describe the solution. NOTE: Using present tense language to describe the solution can help avoid confusion between current state and future state. For example, use \"This solution works by doing...\" vs. \"This solution would work by doing...\". Each solution section should contain the following: Description of the solution (optional) A diagram to quickly reference the solution Possible variations - things that are small variations on the main solution can be grouped together Evaluation of the idea based on the evaluation criteria above The depth, detail, and contents of these sections will vary based on the complexity of the functionality being developed.","title":"[Solution 1]"},{"location":"design/design-reviews/trade-studies/template/#experiments","text":"Describe how the solution will be evaluated to prove or dis-prove that it will produce the desired outcomes. This could take many forms such as building a prototype and researching existing documentation and sample solutions. Additionally, document any assumptions made as part of the experiment. 
NOTE: Time boxing these experiments can be beneficial to make sure the team is making the best use of the time by focusing on collecting key evidence in the simplest/fastest way possible.","title":"Experiment(s)"},{"location":"design/design-reviews/trade-studies/template/#evidence","text":"Present the evidence collected during experimentation that supports the hypothesis that this solution will meet the desired outcomes. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provide the desired capabilities NOTE: Evidence is not required for every capability, metric, or constraint for the design to be considered done. Instead, focus on presenting evidence that is most relevant and impactful towards supporting or eliminating the hypothesis.","title":"Evidence"},{"location":"design/design-reviews/trade-studies/template/#solution-2","text":"...","title":"[Solution 2]"},{"location":"design/design-reviews/trade-studies/template/#solution-n","text":"...","title":"[Solution N]"},{"location":"design/design-reviews/trade-studies/template/#results","text":"This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Evaluation Criteria 1 Evaluation Criteria 2 ... Evaluation Criteria N Solution 1 Solution 2 ... Solution M Note: The formatting of the table can change. In the past, we have had success with qualitative descriptions in the table entries and color coding the cells to represent good, fair, bad.","title":"Results"},{"location":"design/design-reviews/trade-studies/template/#decision","text":"The chosen solution, or a list of questions that need to be answered before the decision can be made. In the latter case, each question needs an action item and an assigned person for answering the question. Once those questions are answered, the document must be updated to reflect the answers, and the final decision. In the first case, describe which solution was chosen and why. Summarize what evidence informed the decision and how that evidence mapped to the desired outcomes. Note: Decisions should be made with the understanding that they can change as the team learns more. It's a starting point, not a contract.","title":"Decision"},{"location":"design/design-reviews/trade-studies/template/#next-steps","text":"What work is expected once a decision has been reached? Examples include but are not limited to: Creating new PBI's or modifying existing ones Follow up spikes Creating specification for public interfaces and integrations between other work streams. Decision Log Entry","title":"Next Steps"},{"location":"design/diagram-types/","text":"Diagram Types Creating and maintaining diagrams is a challenge for any team. Common reasons across these challenges include: Not leveraging tools to assist in generating diagrams Uncertainty on what to include in a diagram and when to create one Overcoming these challenges and effectively using design diagrams can amplify a team's ability to execute throughout the entire Software Development Lifecycle, from the design phase when proposing various designs to leveraging it as documentation as part of the maintenance phase. This section will share sample tools for diagram generation, provide a high level overview of the different types of diagrams and provide examples of some of these types. 
There are two primary classes of diagrams: Structural Behavior Within each of these classes, there are many types of diagrams, each intended to convey specific types of information. When different types of diagrams are effectively used in a solution, system, or repository, one can deliver a cohesive and incrementally detailed design. Sample Design Diagrams This section contains educational material and examples for the following design diagrams: Class Diagrams - Useful to document the structural design of a codebase's relationship between classes, and their corresponding methods Component Diagrams - Useful to document a high level structural overview of all the components and their direct \"touch points\" with other Components Sequence Diagrams - Useful to document a behavior overview of the system, capturing the various \"use cases\" or \"actions\" that triggers the system to perform some business logic Deployment Diagram - Useful in order to document the networking and hosting environments where the system will operate in Supplemental Resources Each of the above types of diagrams will provide specific resources related to its type. Below are the generic resources: Visual Paradigm UML Structural vs Behavior Diagrams PlantUML - requires a generator from code to PlantUML syntax to generate diagrams C# to PlantUML Drawing manually","title":"Diagram Types"},{"location":"design/diagram-types/#diagram-types","text":"Creating and maintaining diagrams is a challenge for any team. Common reasons across these challenges include: Not leveraging tools to assist in generating diagrams Uncertainty on what to include in a diagram and when to create one Overcoming these challenges and effectively using design diagrams can amplify a team's ability to execute throughout the entire Software Development Lifecycle, from the design phase when proposing various designs to leveraging it as documentation as part of the maintenance phase. This section will share sample tools for diagram generation, provide a high level overview of the different types of diagrams and provide examples of some of these types. There are two primary classes of diagrams: Structural Behavior Within each of these classes, there are many types of diagrams, each intended to convey specific types of information. When different types of diagrams are effectively used in a solution, system, or repository, one can deliver a cohesive and incrementally detailed design.","title":"Diagram Types"},{"location":"design/diagram-types/#sample-design-diagrams","text":"This section contains educational material and examples for the following design diagrams: Class Diagrams - Useful to document the structural design of a codebase's relationship between classes, and their corresponding methods Component Diagrams - Useful to document a high level structural overview of all the components and their direct \"touch points\" with other Components Sequence Diagrams - Useful to document a behavior overview of the system, capturing the various \"use cases\" or \"actions\" that triggers the system to perform some business logic Deployment Diagram - Useful in order to document the networking and hosting environments where the system will operate in","title":"Sample Design Diagrams"},{"location":"design/diagram-types/#supplemental-resources","text":"Each of the above types of diagrams will provide specific resources related to its type. 
Below are the generic resources: Visual Paradigm UML Structural vs Behavior Diagrams PlantUML - requires a generator from code to PlantUML syntax to generate diagrams C# to PlantUML Drawing manually","title":"Supplemental Resources"},{"location":"design/diagram-types/class-diagrams/","text":"Class Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Class Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to automate as much as possible when generating Class Diagrams through VSCode. Wikipedia defines UML Class Diagrams as: a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among objects. The key terms to make a note of here are: static structure showing the system's classes, attributes, operations, and relationships Class Diagrams are a type of a static structure because it focuses on the properties, and relationships of classes. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. Essential Takeaways Each \"Component\" (Stand alone piece of software - think datastores, microservices, serverless functions, user interfaces, etc...) of a Product or System will have it's own Class Diagram. Class Diagrams should tell a \"story\", where each Diagram will require Engineers to really think about: The responsibility / operations of each class. What can (should) the class perform? The class' attributes and properties. What can be set by an implementor of this class? What are all (if any) universally static properties? The visibility or accessibility that a class' operation may have to other classes The relationship between each class or the various instances When to Create? Because Class Diagrams represent one of the more granular depiction of what a \"product\" or \"system\" is composed of, it is recommended to begin the creation of these diagrams at the beginning and throughout the engineering portions of an engagement. This does mean that any code change (new feature, enhancement, code refactor) might involve updating one or many Class Diagrams. Although this might seem like a downside of Class Diagrams, it actually can become a very strong benefit. Because Class Diagrams tell a \"story\" for each Component of a product (see the previous section), it requires a substantial amount of upfront thought and design considerations. This amount of upfront thought ultimately results in making more effective code changes, and may even minimize the level of refactors in future stages of the engagement. Class Diagrams also provides quick \"alert indicators\" when a refactor might be necessary. Reasons could be due to seeing that a particular class might be doing too much, have too many dependencies, or when the codebase might produce a very \"messy\" or \"chaotic\" Class Diagram. If the Class Diagram is unreadable, the code will probably be unreadable Examples One can find many examples online such as at UML Diagrams . Below are some basic examples: Versioning Because Class Diagrams will be changing rapidly, essentially anytime a class is changed in the code, and because it might be very large in size, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. 
The below approach can be used to assist the team on how often to update the published version of the diagram: Wait until the engagement progresses (maybe 10-20% completion) before publishing a Class Diagram. It is not worth publishing a Class Diagram from the beginning as it will be changing daily Once the most crucial classes are developed, update the published diagram periodically. Ideally whenever a large refactor or net new class is introduced. If the team uses an IDE plugin to automatically generate the diagram from their development environment, this becomes more of a documentation task rather than a necessity As the engagement approaches its end (90-100% completion), update the published diagram whenever a change to an existing class as part of a feature or story acceptance criteria Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches. Resources Wikipedia Visual Paradigm VS Code Plugins: C#, Visual Basic, C++ using Class Designer Component TypeScript classdiagram-ts PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax C# to PlantUML Drawing manually","title":"Class Diagrams"},{"location":"design/diagram-types/class-diagrams/#class-diagrams","text":"","title":"Class Diagrams"},{"location":"design/diagram-types/class-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Class Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to automate as much as possible when generating Class Diagrams through VSCode. Wikipedia defines UML Class Diagrams as: a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among objects. The key terms to make a note of here are: static structure showing the system's classes, attributes, operations, and relationships Class Diagrams are a type of a static structure because it focuses on the properties, and relationships of classes. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics.","title":"Purpose"},{"location":"design/diagram-types/class-diagrams/#essential-takeaways","text":"Each \"Component\" (Stand alone piece of software - think datastores, microservices, serverless functions, user interfaces, etc...) of a Product or System will have it's own Class Diagram. Class Diagrams should tell a \"story\", where each Diagram will require Engineers to really think about: The responsibility / operations of each class. What can (should) the class perform? The class' attributes and properties. What can be set by an implementor of this class? What are all (if any) universally static properties? 
The visibility or accessibility that a class' operation may have to other classes The relationship between each class or the various instances","title":"Essential Takeaways"},{"location":"design/diagram-types/class-diagrams/#when-to-create","text":"Because Class Diagrams represent one of the more granular depiction of what a \"product\" or \"system\" is composed of, it is recommended to begin the creation of these diagrams at the beginning and throughout the engineering portions of an engagement. This does mean that any code change (new feature, enhancement, code refactor) might involve updating one or many Class Diagrams. Although this might seem like a downside of Class Diagrams, it actually can become a very strong benefit. Because Class Diagrams tell a \"story\" for each Component of a product (see the previous section), it requires a substantial amount of upfront thought and design considerations. This amount of upfront thought ultimately results in making more effective code changes, and may even minimize the level of refactors in future stages of the engagement. Class Diagrams also provides quick \"alert indicators\" when a refactor might be necessary. Reasons could be due to seeing that a particular class might be doing too much, have too many dependencies, or when the codebase might produce a very \"messy\" or \"chaotic\" Class Diagram. If the Class Diagram is unreadable, the code will probably be unreadable","title":"When to Create?"},{"location":"design/diagram-types/class-diagrams/#examples","text":"One can find many examples online such as at UML Diagrams . Below are some basic examples:","title":"Examples"},{"location":"design/diagram-types/class-diagrams/#versioning","text":"Because Class Diagrams will be changing rapidly, essentially anytime a class is changed in the code, and because it might be very large in size, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: Wait until the engagement progresses (maybe 10-20% completion) before publishing a Class Diagram. It is not worth publishing a Class Diagram from the beginning as it will be changing daily Once the most crucial classes are developed, update the published diagram periodically. Ideally whenever a large refactor or net new class is introduced. If the team uses an IDE plugin to automatically generate the diagram from their development environment, this becomes more of a documentation task rather than a necessity As the engagement approaches its end (90-100% completion), update the published diagram whenever a change to an existing class as part of a feature or story acceptance criteria Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. 
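To make the idea of a publishable class diagram concrete, here is a rough, hypothetical PlantUML sketch; the class and member names are invented for illustration only and are not taken from any particular project. A diagram like this can be committed alongside the code, or regenerated with one of the code-to-PlantUML generators mentioned in the resources, and re-published as classes evolve:

```plantuml
@startuml
' Hypothetical classes, for illustration only
interface IOrderRepository {
  +save(order: Order): void
}
class OrderService {
  -repository: IOrderRepository
  +placeOrder(order: Order): OrderResult
}
class Order {
  +id: string
  +total: decimal
}
OrderService --> IOrderRepository : uses
OrderService ..> Order : creates
@enduml
```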
The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches.","title":"Versioning"},{"location":"design/diagram-types/class-diagrams/#resources","text":"Wikipedia Visual Paradigm VS Code Plugins: C#, Visual Basic, C++ using Class Designer Component TypeScript classdiagram-ts PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax C# to PlantUML Drawing manually","title":"Resources"},{"location":"design/diagram-types/component-diagrams/","text":"Component Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Component Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Component Diagrams through VSCode. Wikipedia defines UML Component Diagrams as: a component diagram depicts how components are wired together to form larger components or software systems. Component Diagrams are a type of a static structure because it focuses on the responsibility and relationships between components as part of the overall system or solution. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. ...Hold on a second... what is a Component? A Component is a runnable solution that performs a set of operations and can possibly be interfaced through a particular API. One can think of Components as a \"stand alone\" piece of software - think datastores, microservices, serverless functions, user interfaces, etc... Essential Takeaways The primary two takeaways from a Component Diagram should be: A quick view of all the various components (User Interface, Service, Data Storage) involved in the system The immediate \"touch points\" that a particular Component has with other Components, including how that \"touch point\" is accomplished (HTTP, FTP, etc...) Depending on the complexity of the system, a team might decide to create several Component Diagrams. Where there is one diagram per Component (depicting all it's immediate \"touch points\" with other Components). Or if a system is simple, the team might decide to create a single Component Diagram capturing all Components in the diagram. When to Create? Because Component Diagrams represent a high level overview of the entire system from a Component focus, it is recommended to begin the creation of this diagram from the beginning of an engagement, and update it as the various Components are identified, developed, and introduced into the system. Otherwise, if this is left till later, then there is risk that: the team won't be able to identify areas of improvement the team or other necessary stakeholders won't have a full understanding on how the system works as it is being developed Because of the inherent granularity of the system, the Component Diagrams won't have to be updated as often as Class Diagrams . 
Things that might merit updating a Component Diagram could be: A deletion or addition of a new Component into the system A change to a system Component's interaction APIs A change to a system Component's immediate \"touch points\" with other Components Because Component Diagrams focuses on informing the various \"touch points\" between Components, it requires some upfront thought in order to determine what Components are needed and what interaction mechanisms are most effective per the system requirements. This amount of upfront thought should be approached in a pragmatic manner - as the design may evolve over time, and that is perfectly fine, as long as changes are influenced based on functional requirements and non-functional requirements. Examples Below are some basic examples: Versioning Because Component Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the published diagram periodically. Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches. Resources Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Component Diagrams"},{"location":"design/diagram-types/component-diagrams/#component-diagrams","text":"","title":"Component Diagrams"},{"location":"design/diagram-types/component-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Component Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Component Diagrams through VSCode. Wikipedia defines UML Component Diagrams as: a component diagram depicts how components are wired together to form larger components or software systems. Component Diagrams are a type of a static structure because it focuses on the responsibility and relationships between components as part of the overall system or solution. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. ...Hold on a second... what is a Component? A Component is a runnable solution that performs a set of operations and can possibly be interfaced through a particular API. 
One can think of Components as a \"stand alone\" piece of software - think datastores, microservices, serverless functions, user interfaces, etc...","title":"Purpose"},{"location":"design/diagram-types/component-diagrams/#essential-takeaways","text":"The primary two takeaways from a Component Diagram should be: A quick view of all the various components (User Interface, Service, Data Storage) involved in the system The immediate \"touch points\" that a particular Component has with other Components, including how that \"touch point\" is accomplished (HTTP, FTP, etc...) Depending on the complexity of the system, a team might decide to create several Component Diagrams. Where there is one diagram per Component (depicting all it's immediate \"touch points\" with other Components). Or if a system is simple, the team might decide to create a single Component Diagram capturing all Components in the diagram.","title":"Essential Takeaways"},{"location":"design/diagram-types/component-diagrams/#when-to-create","text":"Because Component Diagrams represent a high level overview of the entire system from a Component focus, it is recommended to begin the creation of this diagram from the beginning of an engagement, and update it as the various Components are identified, developed, and introduced into the system. Otherwise, if this is left till later, then there is risk that: the team won't be able to identify areas of improvement the team or other necessary stakeholders won't have a full understanding on how the system works as it is being developed Because of the inherent granularity of the system, the Component Diagrams won't have to be updated as often as Class Diagrams . Things that might merit updating a Component Diagram could be: A deletion or addition of a new Component into the system A change to a system Component's interaction APIs A change to a system Component's immediate \"touch points\" with other Components Because Component Diagrams focuses on informing the various \"touch points\" between Components, it requires some upfront thought in order to determine what Components are needed and what interaction mechanisms are most effective per the system requirements. This amount of upfront thought should be approached in a pragmatic manner - as the design may evolve over time, and that is perfectly fine, as long as changes are influenced based on functional requirements and non-functional requirements.","title":"When to Create?"},{"location":"design/diagram-types/component-diagrams/#examples","text":"Below are some basic examples:","title":"Examples"},{"location":"design/diagram-types/component-diagrams/#versioning","text":"Because Component Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the published diagram periodically. Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. 
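As a rough, hypothetical illustration (the component names and protocols are invented, not prescribed), a minimal PlantUML component diagram that captures the components and their immediate touch points might look like the sketch below; the value is seeing every component and the mechanism of each interaction at a glance:

```plantuml
@startuml
' Hypothetical components and touch points, for illustration only
component "Web UI" as ui
component "Order API" as api
database "Order Store" as db
queue "Event Bus" as bus

ui --> api : HTTPS / REST
api --> db : SQL
api --> bus : AMQP
@enduml
```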
If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches.","title":"Versioning"},{"location":"design/diagram-types/component-diagrams/#resources","text":"Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Resources"},{"location":"design/diagram-types/deployment-diagrams/","text":"Deployment Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Deployment Diagrams as part of your engagement. Wikipedia defines UML Deployment Diagrams as: models the physical deployment of artifacts on nodes Deployment Diagrams are a type of a static structure because it focuses on the infrastructure and hosting where all aspects of the system reside in. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. Essential Takeaways The Deployment diagram should contain all Components identified in the Component Diagram(s) , but captured alongside the following elements: Firewalls VNETs and subnets Virtual machines Cloud Services Data Stores Servers (Web, proxy) Load Balancers This diagram should inform the audience: where things are hosted / running in what network boundaries are involved in the system When to Create? Because Deployment Diagrams represent the final \"hosting\" architecture, it's recommended to create the \"final envisioned\" diagram from the beginning of an engagement. This allows the team to have a shared idea on what the team is working towards. Keep in mind that this might change if any non-functional requirement was not considered at the start of the engagement. This is okay, but requires creating the necessary Backlog Items and updating the Deployment diagram in order to capture these changes. It's also worthwhile to create and maintain a Deployment Diagram depicting the \"current\" state of the system. At times, it may be beneficial for there to be a Deployment Diagram per each environment (Dev, QA, Staging, Prod, etc...). However, this adds to the amount of maintenance required and should only be performed if there are substantial differences across environments. The \"current\" Deployment diagram should be updated when: A new element has been introduced or removed in the system (see the \"Essential Takeaways\" section for a list of possible elements) Examples Below are some basic examples: Versioning Because Deployment Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the \"actual / current\" diagram (state represented from the \"main\" branch) periodically. Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components. 
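One low-friction way to keep the published diagram current is to store the diagram source as plain text next to the code. Purely as a minimal, hypothetical sketch (the component names and file path are illustrative, and PlantUML is only one of the tools listed in the Resources sections below), a small Python helper could write the PlantUML source that the team then renders and publishes:

```python
from pathlib import Path

# Hypothetical component names, used purely for illustration.
COMPONENT_DIAGRAM = '\n'.join([
    '@startuml',
    '[Web Client] --> [Order API] : HTTP/JSON',
    '[Order API] --> [Order Store] : SQL',
    '[Order API] --> [Notification Service] : AMQP',
    '@enduml',
])


def publish_diagram(target: Path) -> None:
    # Keeping the diagram source as plain text in the repository means every
    # change to it is reviewed and versioned alongside the code it describes.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(COMPONENT_DIAGRAM + '\n')


if __name__ == '__main__':
    publish_diagram(Path('docs/diagrams/components.puml'))
```

Because the diagram source lives in the repository, every change to it is reviewed and versioned like any other code change, which helps keep the latest published image accurate. 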
Resources Wikipedia Visual Paradigm PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Deployment Diagrams"},{"location":"design/diagram-types/deployment-diagrams/#deployment-diagrams","text":"","title":"Deployment Diagrams"},{"location":"design/diagram-types/deployment-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Deployment Diagrams as part of your engagement. Wikipedia defines UML Deployment Diagrams as: models the physical deployment of artifacts on nodes Deployment Diagrams are a type of a static structure because it focuses on the infrastructure and hosting where all aspects of the system reside in. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics.","title":"Purpose"},{"location":"design/diagram-types/deployment-diagrams/#essential-takeaways","text":"The Deployment diagram should contain all Components identified in the Component Diagram(s) , but captured alongside the following elements: Firewalls VNETs and subnets Virtual machines Cloud Services Data Stores Servers (Web, proxy) Load Balancers This diagram should inform the audience: where things are hosted / running in what network boundaries are involved in the system","title":"Essential Takeaways"},{"location":"design/diagram-types/deployment-diagrams/#when-to-create","text":"Because Deployment Diagrams represent the final \"hosting\" architecture, it's recommended to create the \"final envisioned\" diagram from the beginning of an engagement. This allows the team to have a shared idea on what the team is working towards. Keep in mind that this might change if any non-functional requirement was not considered at the start of the engagement. This is okay, but requires creating the necessary Backlog Items and updating the Deployment diagram in order to capture these changes. It's also worthwhile to create and maintain a Deployment Diagram depicting the \"current\" state of the system. At times, it may be beneficial for there to be a Deployment Diagram per each environment (Dev, QA, Staging, Prod, etc...). However, this adds to the amount of maintenance required and should only be performed if there are substantial differences across environments. The \"current\" Deployment diagram should be updated when: A new element has been introduced or removed in the system (see the \"Essential Takeaways\" section for a list of possible elements)","title":"When to Create?"},{"location":"design/diagram-types/deployment-diagrams/#examples","text":"Below are some basic examples:","title":"Examples"},{"location":"design/diagram-types/deployment-diagrams/#versioning","text":"Because Deployment Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the \"actual / current\" diagram (state represented from the \"main\" branch) periodically. 
Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components.","title":"Versioning"},{"location":"design/diagram-types/deployment-diagrams/#resources","text":"Wikipedia Visual Paradigm PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Resources"},{"location":"design/diagram-types/sequence-diagrams/","text":"Sequence Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Sequence Diagrams as part of an engagement. Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Sequence Diagrams through VSCode. Wikipedia defines UML Sequence Diagrams responsible to: depict the objects involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario What is a scenario ? It can be: an actual user persona performing an action a system specific trigger (time based, condition based) that results in an action to occur What is a message in this context? It can be: a synchronous or asynchronous request a transfer of any form of data between any objects What is an object in this context? It can be: any specific user persona any service any data store a system (black box composed of unknown services, data stores or other components) an abstract sub-scenario (in order to minimize high complexity of a scenario) Essential Takeaways A Sequence Diagram should: start with a scenario indicate which object or \"actor\" initiated that scenario have the scenario clearly indicate what the \"end\" state is, even if it doesn't necessarily end back with the object that initiated the scenario It is okay for a single Sequence Diagram to have many different scenarios if they have some related context that merits them being grouped. Another important thing to keep in mind, is that the objects involved in a Sequence Diagram should refer to existing Components from a Component Diagram . There are 2 areas where complexity can result in an overly \"crowded\" Sequence Diagram, making it costly to maintain. They are: Large number of objects / components involved in a particular scenario Capturing all the possible \"failure\" situations that a scenario may encounter Large Number of Objects A Sequence Diagram typically starts with an end user persona performing an action, and then shows all the various components and request/data transfers that are involved in that scenario. However, more often than not, the complete end-to-end flow for that scenario may be too complex in order to capture within a single Sequence Diagram. When this level of complexity occurs, consider creating separate sub-scenario Sequence Diagrams , and using it as an object in a particular Sequence Diagram. Examples for this are \"Authentication\" or \"Authorization\". Almost all user persona scenarios will have several objects/components involved in either of these sub-scenarios, but it is not necessary to include them in every Sequence Diagram once the sub-scenarios have a stand-alone Sequence Diagram created. Be sure that when using this approach of sub-scenarios to give it a name that encapsulates what the sub-scenarios is performing, and to determine the appropriate \"actor\" and \"action\" that initiates the sub-scenarios. 
The combination and story telling between these end user Sequence Diagrams and the sub-scenarios Sequence Diagrams can greatly improve readability by distributing the level of complexity across multiple diagrams and take advantage of reusability of common sub-scenarios. Handling Large Number of Failure Situations Another factor of high complexity is the possible failure situations that a particular scenario may encounter. Each object / component involved in the scenario could have several different \"failure\" situations, which could result in a very crowded and messy Sequence Diagram. In order to make it realistic to manage all these scenarios, try to: Identify the most common failure situations that an \"actor\" may face as part of a scenario. Capturing these in a sequence diagram and documenting the other scenarios without having to manage them in a diagram will accomplish the goal of awareness \"Bubble up\" and \"abstract\" all the vast number of failure situations that can occur downstream in the system, and depict how the object / component closest to the \"actor\" handles all these failures and informs the \"actor\" of them When to Create? Because Sequence Diagrams represent a detailed overview of the behavior of the system, outlining the various messages/requests sent within the system, it is recommended to begin the creation of these diagrams from the beginning of an engagement. While updating it as the various communications between Components are introduced into the system. The risks of not creating Sequence Diagrams early on are that: the team will not create any because of it being perceived more as a \"chore\" instead of adding value the team will be unable to gain insights in time, from visualizing the various messages and requests sent between Components, in order to perform any potential refactoring the team or other necessary stakeholders won't have a complete understanding of the request/message/data flow within the system Because of the inherent granularity of the system, the Sequence Diagrams won't have to be updated as often as Class Diagrams , but may require more maintenance than Component Diagrams . Things that might merit updating a Sequence Diagram could be: A new request/message/data being sent across Components involved in a scenario A change to one or several Components involved in a Sequence Diagram. 
Such as splitting a component into multiple ones, or consolidating many Components into a single one The introduction of a new Use Case or scenario that the system now supports Examples Place Order Scenario: A \"Member\" user persona places an order, which can be composed of many \"order items\" The \"Member\" user persona can be either of type \"VIP\" or \"Ordinary\" Depending on the \"Member type\", each \"order item\" will be shipped using either a Courier or via Mail If the \"Member\" user persona selected the option to be informed once all \"order items\" have been shipped, then the system will send a notification Facebook User Authentication Scenario: A user persona uses a Web Browser to interact with an \"application\" which tries to access a specific \"Facebook resource\" The \"Facebook Authorization Server\" is involved in order to have the user to authenticate with Facebook The user persona then receives a \"permission form\" in order to authorize the \"application\" access to the \"Facebook resource\" If the \"application\" was not authorized, then the \"application\" returns back an error If the \"application\" was authorized, then the \"application\" retrieves an \"access token\" from the \"Facebook Authorization Server\" and uses it to securely access the \"Facebook resource\" from the \"Facebook Content Server\". Once the content is obtained, the \"application\" sends it to the Web Browser Versioning Because Sequence Diagrams are more expensive to maintain, it's recommended to \"publish\" an image of the generated diagram often, whenever a new \"use case\" or \"scenario\" is identified as part of the system behavior or requirements. The most important element to these diagrams is to ensure that the latest version is accurate . If the latest diagram shows a sequence of communication between components that are no longer valid, then the diagram causes more harm than good. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Sequence Diagram will provide a common visual to all engineers when working on the different parts of the solution (focusing on the data flow and request flow) Throughout the engagement, update the published diagram periodically. Ideally whenever a new \"use case\" or \"scenario\" is identified, or when a Component is introduced or removed in the system, or when a change in data/request flow is made in the system Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches. Resources Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Sequence Diagrams"},{"location":"design/diagram-types/sequence-diagrams/#sequence-diagrams","text":"","title":"Sequence Diagrams"},{"location":"design/diagram-types/sequence-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Sequence Diagrams as part of an engagement. 
Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Sequence Diagrams through VSCode. Wikipedia defines UML Sequence Diagrams responsible to: depict the objects involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario What is a scenario ? It can be: an actual user persona performing an action a system specific trigger (time based, condition based) that results in an action to occur What is a message in this context? It can be: a synchronous or asynchronous request a transfer of any form of data between any objects What is an object in this context? It can be: any specific user persona any service any data store a system (black box composed of unknown services, data stores or other components) an abstract sub-scenario (in order to minimize high complexity of a scenario)","title":"Purpose"},{"location":"design/diagram-types/sequence-diagrams/#essential-takeaways","text":"A Sequence Diagram should: start with a scenario indicate which object or \"actor\" initiated that scenario have the scenario clearly indicate what the \"end\" state is, even if it doesn't necessarily end back with the object that initiated the scenario It is okay for a single Sequence Diagram to have many different scenarios if they have some related context that merits them being grouped. Another important thing to keep in mind, is that the objects involved in a Sequence Diagram should refer to existing Components from a Component Diagram . There are 2 areas where complexity can result in an overly \"crowded\" Sequence Diagram, making it costly to maintain. They are: Large number of objects / components involved in a particular scenario Capturing all the possible \"failure\" situations that a scenario may encounter","title":"Essential Takeaways"},{"location":"design/diagram-types/sequence-diagrams/#large-number-of-objects","text":"A Sequence Diagram typically starts with an end user persona performing an action, and then shows all the various components and request/data transfers that are involved in that scenario. However, more often than not, the complete end-to-end flow for that scenario may be too complex in order to capture within a single Sequence Diagram. When this level of complexity occurs, consider creating separate sub-scenario Sequence Diagrams , and using it as an object in a particular Sequence Diagram. Examples for this are \"Authentication\" or \"Authorization\". Almost all user persona scenarios will have several objects/components involved in either of these sub-scenarios, but it is not necessary to include them in every Sequence Diagram once the sub-scenarios have a stand-alone Sequence Diagram created. Be sure that when using this approach of sub-scenarios to give it a name that encapsulates what the sub-scenarios is performing, and to determine the appropriate \"actor\" and \"action\" that initiates the sub-scenarios. The combination and story telling between these end user Sequence Diagrams and the sub-scenarios Sequence Diagrams can greatly improve readability by distributing the level of complexity across multiple diagrams and take advantage of reusability of common sub-scenarios.","title":"Large Number of Objects"},{"location":"design/diagram-types/sequence-diagrams/#handling-large-number-of-failure-situations","text":"Another factor of high complexity is the possible failure situations that a particular scenario may encounter. 
Each object / component involved in the scenario could have several different \"failure\" situations, which could result in a very crowded and messy Sequence Diagram. In order to make it realistic to manage all these scenarios, try to: Identify the most common failure situations that an \"actor\" may face as part of a scenario. Capturing these in a sequence diagram and documenting the other scenarios without having to manage them in a diagram will accomplish the goal of awareness \"Bubble up\" and \"abstract\" all the vast number of failure situations that can occur downstream in the system, and depict how the object / component closest to the \"actor\" handles all these failures and informs the \"actor\" of them","title":"Handling Large Number of Failure Situations"},{"location":"design/diagram-types/sequence-diagrams/#when-to-create","text":"Because Sequence Diagrams represent a detailed overview of the behavior of the system, outlining the various messages/requests sent within the system, it is recommended to begin the creation of these diagrams from the beginning of an engagement. While updating it as the various communications between Components are introduced into the system. The risks of not creating Sequence Diagrams early on are that: the team will not create any because of it being perceived more as a \"chore\" instead of adding value the team will be unable to gain insights in time, from visualizing the various messages and requests sent between Components, in order to perform any potential refactoring the team or other necessary stakeholders won't have a complete understanding of the request/message/data flow within the system Because of the inherent granularity of the system, the Sequence Diagrams won't have to be updated as often as Class Diagrams , but may require more maintenance than Component Diagrams . Things that might merit updating a Sequence Diagram could be: A new request/message/data being sent across Components involved in a scenario A change to one or several Components involved in a Sequence Diagram. Such as splitting a component into multiple ones, or consolidating many Components into a single one The introduction of a new Use Case or scenario that the system now supports","title":"When to Create?"},{"location":"design/diagram-types/sequence-diagrams/#examples","text":"Place Order Scenario: A \"Member\" user persona places an order, which can be composed of many \"order items\" The \"Member\" user persona can be either of type \"VIP\" or \"Ordinary\" Depending on the \"Member type\", each \"order item\" will be shipped using either a Courier or via Mail If the \"Member\" user persona selected the option to be informed once all \"order items\" have been shipped, then the system will send a notification Facebook User Authentication Scenario: A user persona uses a Web Browser to interact with an \"application\" which tries to access a specific \"Facebook resource\" The \"Facebook Authorization Server\" is involved in order to have the user to authenticate with Facebook The user persona then receives a \"permission form\" in order to authorize the \"application\" access to the \"Facebook resource\" If the \"application\" was not authorized, then the \"application\" returns back an error If the \"application\" was authorized, then the \"application\" retrieves an \"access token\" from the \"Facebook Authorization Server\" and uses it to securely access the \"Facebook resource\" from the \"Facebook Content Server\". 
Once the content is obtained, the \"application\" sends it to the Web Browser","title":"Examples"},{"location":"design/diagram-types/sequence-diagrams/#versioning","text":"Because Sequence Diagrams are more expensive to maintain, it's recommended to \"publish\" an image of the generated diagram often, whenever a new \"use case\" or \"scenario\" is identified as part of the system behavior or requirements. The most important element to these diagrams is to ensure that the latest version is accurate . If the latest diagram shows a sequence of communication between components that are no longer valid, then the diagram causes more harm than good. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Sequence Diagram will provide a common visual to all engineers when working on the different parts of the solution (focusing on the data flow and request flow) Throughout the engagement, update the published diagram periodically. Ideally whenever a new \"use case\" or \"scenario\" is identified, or when a Component is introduced or removed in the system, or when a change in data/request flow is made in the system Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches.","title":"Versioning"},{"location":"design/diagram-types/sequence-diagrams/#resources","text":"Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Resources"},{"location":"design/sustainability/","text":"Sustainable Software Engineering The choices made throughout the engineering process regarding cloud services, software architecture design and automation can have a big impact on the carbon footprint of a solution. Some choices are always beneficial, like turning off unused resources. Other choices require a more nuanced understanding of the business case at hand and its potential carbon impact. Goal One goal of this section is to provide tangible guidance for what sustainable actions you can apply in certain situations and the tools to be able to implement those recommendations. Another goal is to highlight the many resources available to learn about the wider domain of sustainable software. Sustainable Engineering Checklist This checklist should be used to quickly identify scenarios for which common sustainable actions exist. Check the box if the scenario applies to your project, then go through the actions and tools you can use to build more sustainable software for those cases. If there are important nuances to consider, they will be linked in the Disclaimers section. For readability some considerations are blank, indicating that the action applies to the first consideration above it. \u2705 Consideration Action Principle Tools Disclaimers For any running software/services Shutdown unused resources. Electricity Consumption Identify Unassociated Resources Resize physical or virtual machines to improve utilization. 
Energy Proportionality Azure Advisor Cost Recommendations Understanding Advisor Recommendations For development and testing VMs Configure VMs to shutdown during off-hours Electricity Consumption Start/Stop VMs during off-hours For VMs with attached volumes Limit the amount of attached storage capacity to what you expect to use and expand as necessary Electricity Consumption Expanding storage of active VMs Understanding the energy cost of storage For systems using object storage (Azure Blob Storage, AWS S3, GCP Cloud Storage, etc) Compress infrequently accessed data Electricity Consumption , Embodied Carbon Compressing and extracting files in .NET Understanding the energy cost of storage Delete data when it is no longer needed Electricity Consumption Configuring a lifecycle management policy Understanding the energy cost of storage For systems running in on-premise data centers Migrate to hyperscale cloud provider Embodied Carbon , Electricity Consumption Cloud Adoption Approaches Carbon benefits of cloud computing For systems migrating to a hyperscale cloud provider Consider physically shipping data to the provider Networking Azure Data Box Understanding data shipping tradeoffs For time-flexible workloads Utilize \"Spot VMs\" for compute Demand Shaping How to use Spot VMs For services with varied utilization patterns Configure Autoscaling Energy Proportionality Autoscaling Documentation Use serverless functions Energy Proportionality Serverless Architecture Design For services with geographically co-located users (EG internal employee apps) Select a data center region that is physically close to them Networking Azure products available by region Consider running edge devices to reduce excessive data transfer Networking Azure Stack Edge Understanding edge tradeoffs For systems sending data over the network Use caching policies to keep data on the local machine Networking HTTP caching APIs , Cache Management in .NET Understanding caching tradeoffs Consider caching data close to end users with a CDN Networking Benefits of a CDN Understanding CDN tradeoffs Send only the data that will be used Networking Compress data to reduce the size Networking Compressing and extracting files in .NET When designing for the end user Consider giving users visibility and control over their energy usage Electricity Consumption Demand Shaping Designing for eco-mode Design and test your application to be compatible for a wide variety of devices, especially older devices Embodied Carbon Extending device lifespan Compatibility Testing When selecting a programming language Consider the energy efficiency of languages Electricity Consumption Reasoning about the energy consumption of programming languages , Programming Language Energy Efficiency (PDF) Making informed programming language choices Resources Principles of Green Software Engineering Green Software Foundation Microsoft Cloud for Sustainability Learning Module: Sustainable Software Engineering Tools Carbon-Aware SDK \"Awesome List\" of Green Software Emissions Impact Azure GreenAI Carbon-Intensity API Projects Sustainability through SpotVMs","title":"Sustainable Software Engineering"},{"location":"design/sustainability/#sustainable-software-engineering","text":"The choices made throughout the engineering process regarding cloud services, software architecture design and automation can have a big impact on the carbon footprint of a solution. Some choices are always beneficial, like turning off unused resources. 
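As a concrete illustration of one of the simpler checklist actions above, compressing infrequently accessed data before it is stored can be done with nothing more than the standard library. The checklist links to .NET guidance; the following is purely an illustrative Python sketch, and the file path is hypothetical:

```python
import gzip
import shutil
from pathlib import Path


def compress_for_archive(source: Path) -> Path:
    # Gzip-compress a rarely read file before moving it to colder storage;
    # smaller files mean less energy spent on storage and transfer.
    target = source.with_name(source.name + '.gz')
    with source.open('rb') as src, gzip.open(target, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return target


if __name__ == '__main__':
    # Hypothetical path, used purely for illustration.
    print(compress_for_archive(Path('reports/2023-usage.csv')))
```

Smaller payloads reduce both the storage footprint and the energy spent moving the data across the network, though the linked disclaimers on the energy cost of storage still apply. 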
Other choices require a more nuanced understanding of the business case at hand and its potential carbon impact.","title":"Sustainable Software Engineering"},{"location":"design/sustainability/#goal","text":"One goal of this section is to provide tangible guidance for what sustainable actions you can apply in certain situations and the tools to be able to implement those recommendations. Another goal is to highlight the many resources available to learn about the wider domain of sustainable software.","title":"Goal"},{"location":"design/sustainability/#sustainable-engineering-checklist","text":"This checklist should be used to quickly identify scenarios for which common sustainable actions exist. Check the box if the scenario applies to your project, then go through the actions and tools you can use to build more sustainable software for those cases. If there are important nuances to consider, they will be linked in the Disclaimers section. For readability some considerations are blank, indicating that the action applies to the first consideration above it. \u2705 Consideration Action Principle Tools Disclaimers For any running software/services Shutdown unused resources. Electricity Consumption Identify Unassociated Resources Resize physical or virtual machines to improve utilization. Energy Proportionality Azure Advisor Cost Recommendations Understanding Advisor Recommendations For development and testing VMs Configure VMs to shutdown during off-hours Electricity Consumption Start/Stop VMs during off-hours For VMs with attached volumes Limit the amount of attached storage capacity to what you expect to use and expand as necessary Electricity Consumption Expanding storage of active VMs Understanding the energy cost of storage For systems using object storage (Azure Blob Storage, AWS S3, GCP Cloud Storage, etc) Compress infrequently accessed data Electricity Consumption , Embodied Carbon Compressing and extracting files in .NET Understanding the energy cost of storage Delete data when it is no longer needed Electricity Consumption Configuring a lifecycle management policy Understanding the energy cost of storage For systems running in on-premise data centers Migrate to hyperscale cloud provider Embodied Carbon , Electricity Consumption Cloud Adoption Approaches Carbon benefits of cloud computing For systems migrating to a hyperscale cloud provider Consider physically shipping data to the provider Networking Azure Data Box Understanding data shipping tradeoffs For time-flexible workloads Utilize \"Spot VMs\" for compute Demand Shaping How to use Spot VMs For services with varied utilization patterns Configure Autoscaling Energy Proportionality Autoscaling Documentation Use serverless functions Energy Proportionality Serverless Architecture Design For services with geographically co-located users (EG internal employee apps) Select a data center region that is physically close to them Networking Azure products available by region Consider running edge devices to reduce excessive data transfer Networking Azure Stack Edge Understanding edge tradeoffs For systems sending data over the network Use caching policies to keep data on the local machine Networking HTTP caching APIs , Cache Management in .NET Understanding caching tradeoffs Consider caching data close to end users with a CDN Networking Benefits of a CDN Understanding CDN tradeoffs Send only the data that will be used Networking Compress data to reduce the size Networking Compressing and extracting files in .NET When designing for the 
end user Consider giving users visibility and control over their energy usage Electricity Consumption Demand Shaping Designing for eco-mode Design and test your application to be compatible for a wide variety of devices, especially older devices Embodied Carbon Extending device lifespan Compatibility Testing When selecting a programming language Consider the energy efficiency of languages Electricity Consumption Reasoning about the energy consumption of programming languages , Programming Language Energy Efficiency (PDF) Making informed programming language choices","title":"Sustainable Engineering Checklist"},{"location":"design/sustainability/#resources","text":"Principles of Green Software Engineering Green Software Foundation Microsoft Cloud for Sustainability Learning Module: Sustainable Software Engineering","title":"Resources"},{"location":"design/sustainability/#tools","text":"Carbon-Aware SDK \"Awesome List\" of Green Software Emissions Impact Azure GreenAI Carbon-Intensity API","title":"Tools"},{"location":"design/sustainability/#projects","text":"Sustainability through SpotVMs","title":"Projects"},{"location":"design/sustainability/sustainable-action-disclaimers/","text":"Disclaimers The following disclaimers provide more details about how to consider the impact of particular actions recommended by the Sustainable Engineering Checklist . ACTION: Resize Physical or Virtual Machines to Improve Utilization Recommendations from cost-savings tools are usually aligned with carbon-reduction, but as sustainability is not the purpose of such tools, carbon-savings are not guaranteed. How a cloud provider or data center manages unused capacity is also a factor in determining how impactful this action may be. For example: The sustainable impact of using smaller VMs in the same family are typically beneficial or neutral. When cores are no longer reserved they can be used by others instead of bringing new servers online. The sustainable impact of changing VM families can be harder to reason about because the underlying hardware and reserved cores may be changing with them. ACTION: Migrate to a Hyperscale Cloud Provider Carbon savings from hyperscale cloud providers are generally attributable to four key features: IT operational efficiency, IT equipment efficiency, data center infrastructure efficiency, and renewable electricity. Microsoft Cloud, for example, is between 22 and 93 percent more energy efficient than traditional enterprise data centers, depending on the specific comparison being made. When taking into account renewable energy purchases, the Microsoft Cloud is between 72 and 98 percent more carbon efficient. Source (PDF) ACTION: Consider Running an Edge Device Running an edge device negates many of the benefits of hyperscale compute facilities, so considering the local energy grid mix and the typical timing of the workloads is important to determine if this is beneficial overall. The larger volume of data that needs to be transmitted, the more this solution becomes appealing. For example, sending large amounts of audio and video content for processing. ACTION: Consider Physically Shipping Data to the Provider Shipping physical items has its own carbon impact, depending on the mode of transportation, which needs to be understood before making this decision. The larger the volume of data that needs to be transmitted the more this options may be beneficial. 
ACTION: Consider the Energy Efficiency of Languages When selecting a programming language, the most energy efficient programming language may not always be the best choice for development speed, maintenance, integration with dependent systems, and other project factors. But when deciding between languages that all meet the project needs, energy efficiency can be a helpful consideration. ACTION: Use Caching Policies A cache provides temporary storage of resources that have been requested by an application. Caching can improve application performance by reducing the time required to get a requested resource. Caching can also improve sustainability by decreasing the amount of network traffic. While caching provides these benefits, it also increases the risk that the resource returned to the application is stale, meaning that it is not identical to the resource that would have been sent by the server if caching were not in use. This can create poor user experiences when data accuracy is critical. Additionally, caching may allow unauthorized users or processes to read sensitive data. An authenticated response that is cached may be retrieved from the cache without an additional authorization. Due to security concerns like this, caching is not recommended for middle tier scenarios. ACTION: Consider Caching Data Close to End Users with a CDN Including CDNs in your network architecture adds many additional servers to your software footprint, each with their own local energy grid mix. The details of CDN hardware and the impact of the power that runs it is important to determine if the carbon emissions from running them is lower than the emissions from sending the data over the wire from a more distant source. The larger the volume of data, distance it needs to travel, and frequency of requests, the more this solution becomes appealing.","title":"Disclaimers"},{"location":"design/sustainability/sustainable-action-disclaimers/#disclaimers","text":"The following disclaimers provide more details about how to consider the impact of particular actions recommended by the Sustainable Engineering Checklist .","title":"Disclaimers"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-resize-physical-or-virtual-machines-to-improve-utilization","text":"Recommendations from cost-savings tools are usually aligned with carbon-reduction, but as sustainability is not the purpose of such tools, carbon-savings are not guaranteed. How a cloud provider or data center manages unused capacity is also a factor in determining how impactful this action may be. For example: The sustainable impact of using smaller VMs in the same family are typically beneficial or neutral. When cores are no longer reserved they can be used by others instead of bringing new servers online. The sustainable impact of changing VM families can be harder to reason about because the underlying hardware and reserved cores may be changing with them.","title":"ACTION: Resize Physical or Virtual Machines to Improve Utilization"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-migrate-to-a-hyperscale-cloud-provider","text":"Carbon savings from hyperscale cloud providers are generally attributable to four key features: IT operational efficiency, IT equipment efficiency, data center infrastructure efficiency, and renewable electricity. Microsoft Cloud, for example, is between 22 and 93 percent more energy efficient than traditional enterprise data centers, depending on the specific comparison being made. 
When taking into account renewable energy purchases, the Microsoft Cloud is between 72 and 98 percent more carbon efficient. Source (PDF)","title":"ACTION: Migrate to a Hyperscale Cloud Provider"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-running-an-edge-device","text":"Running an edge device negates many of the benefits of hyperscale compute facilities, so considering the local energy grid mix and the typical timing of the workloads is important to determine if this is beneficial overall. The larger volume of data that needs to be transmitted, the more this solution becomes appealing. For example, sending large amounts of audio and video content for processing.","title":"ACTION: Consider Running an Edge Device"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-physically-shipping-data-to-the-provider","text":"Shipping physical items has its own carbon impact, depending on the mode of transportation, which needs to be understood before making this decision. The larger the volume of data that needs to be transmitted the more this options may be beneficial.","title":"ACTION: Consider Physically Shipping Data to the Provider"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-the-energy-efficiency-of-languages","text":"When selecting a programming language, the most energy efficient programming language may not always be the best choice for development speed, maintenance, integration with dependent systems, and other project factors. But when deciding between languages that all meet the project needs, energy efficiency can be a helpful consideration.","title":"ACTION: Consider the Energy Efficiency of Languages"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-use-caching-policies","text":"A cache provides temporary storage of resources that have been requested by an application. Caching can improve application performance by reducing the time required to get a requested resource. Caching can also improve sustainability by decreasing the amount of network traffic. While caching provides these benefits, it also increases the risk that the resource returned to the application is stale, meaning that it is not identical to the resource that would have been sent by the server if caching were not in use. This can create poor user experiences when data accuracy is critical. Additionally, caching may allow unauthorized users or processes to read sensitive data. An authenticated response that is cached may be retrieved from the cache without an additional authorization. Due to security concerns like this, caching is not recommended for middle tier scenarios.","title":"ACTION: Use Caching Policies"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-caching-data-close-to-end-users-with-a-cdn","text":"Including CDNs in your network architecture adds many additional servers to your software footprint, each with their own local energy grid mix. The details of CDN hardware and the impact of the power that runs it is important to determine if the carbon emissions from running them is lower than the emissions from sending the data over the wire from a more distant source. 
The larger the volume of data, distance it needs to travel, and frequency of requests, the more this solution becomes appealing.","title":"ACTION: Consider Caching Data Close to End Users with a CDN"},{"location":"design/sustainability/sustainable-engineering-principles/","text":"Sustainable Principles The following principle overviews provide the foundations supporting specific actions in the Sustainable Engineering Checklist . More details about each principle can be found by following the links in the headings or visiting the Principles of Green Software Engineering website . Electricity Consumption Most electricity is still produced through the burning of fossil fuels and is responsible for 49% of the carbon emitted into the atmosphere. Software consumes electricity in its execution. Running hardware consumes electricity even at zero percent utilization. Some of the best ways we can reduce electricity consumption and the subsequent emissions of carbon pollution is to make our applications more energy efficient when they are running and limit idle hardware. Energy Proportionality The relationship between power and utilization is not proportional. The more you utilize a computer, the more efficient it becomes at converting electricity to useful computing operations. Running your work on as few servers as possible with the highest utilization rate maximizes their energy efficiency. An idle computer, even running at zero percent utilization, still draws electricity. Embodied Carbon Embodied carbon (otherwise referred to as \"Embedded Carbon\") is the amount of carbon pollution emitted during the creation and disposal of a device. When calculating the total carbon pollution for the computers running your software, account for both the carbon pollution to run the computer and the embodied carbon of the computer. Therefore a great way to reduce embodied carbon is to prevent the need for new devices to be manufactured by extending the usefulness of existing ones. Demand Shaping Demand shaping is a strategy of shaping our demand for resources so it matches the existing supply. If supply is high, increase the demand by doing more in your applications. If the supply is low, decrease demand. This means doing less in your applications or delaying work until supply is higher. Networking A network is a series of switches, routers, and servers. All the computers and network equipment in a network consume electricity and have embedded carbon . The internet is a global network of devices typically run off the standard local grid energy mix. When you send data across the internet, you are sending that data through many devices in the network, each one of those devices consuming electricity. As a result, any data you send or receive over the internet emits carbon. The amount of carbon emitted to send data depends on many factors including: Distance the data travels Number of hops between network devices Energy efficiency of the network devices Carbon intensity of energy used by each device at the time the data is transmitted. Network protocol used to coordinate data transmission - e.g. multiplex, header compression, TLS/Quic Recent networking studies - Cloud Carbon Footprint","title":"Sustainable Principles"},{"location":"design/sustainability/sustainable-engineering-principles/#sustainable-principles","text":"The following principle overviews provide the foundations supporting specific actions in the Sustainable Engineering Checklist . 
More details about each principle can be found by following the links in the headings or visiting the Principles of Green Software Engineering website .","title":"Sustainable Principles"},{"location":"design/sustainability/sustainable-engineering-principles/#electricity-consumption","text":"Most electricity is still produced through the burning of fossil fuels and is responsible for 49% of the carbon emitted into the atmosphere. Software consumes electricity in its execution. Running hardware consumes electricity even at zero percent utilization. Some of the best ways we can reduce electricity consumption and the subsequent emissions of carbon pollution is to make our applications more energy efficient when they are running and limit idle hardware.","title":"Electricity Consumption"},{"location":"design/sustainability/sustainable-engineering-principles/#energy-proportionality","text":"The relationship between power and utilization is not proportional. The more you utilize a computer, the more efficient it becomes at converting electricity to useful computing operations. Running your work on as few servers as possible with the highest utilization rate maximizes their energy efficiency. An idle computer, even running at zero percent utilization, still draws electricity.","title":"Energy Proportionality"},{"location":"design/sustainability/sustainable-engineering-principles/#embodied-carbon","text":"Embodied carbon (otherwise referred to as \"Embedded Carbon\") is the amount of carbon pollution emitted during the creation and disposal of a device. When calculating the total carbon pollution for the computers running your software, account for both the carbon pollution to run the computer and the embodied carbon of the computer. Therefore a great way to reduce embodied carbon is to prevent the need for new devices to be manufactured by extending the usefulness of existing ones.","title":"Embodied Carbon"},{"location":"design/sustainability/sustainable-engineering-principles/#demand-shaping","text":"Demand shaping is a strategy of shaping our demand for resources so it matches the existing supply. If supply is high, increase the demand by doing more in your applications. If the supply is low, decrease demand. This means doing less in your applications or delaying work until supply is higher.","title":"Demand Shaping"},{"location":"design/sustainability/sustainable-engineering-principles/#networking","text":"A network is a series of switches, routers, and servers. All the computers and network equipment in a network consume electricity and have embedded carbon . The internet is a global network of devices typically run off the standard local grid energy mix. When you send data across the internet, you are sending that data through many devices in the network, each one of those devices consuming electricity. As a result, any data you send or receive over the internet emits carbon. The amount of carbon emitted to send data depends on many factors including: Distance the data travels Number of hops between network devices Energy efficiency of the network devices Carbon intensity of energy used by each device at the time the data is transmitted. Network protocol used to coordinate data transmission - e.g. 
multiplex, header compression, TLS/Quic Recent networking studies - Cloud Carbon Footprint","title":"Networking"},{"location":"developer-experience/","text":"Developer Experience (DevEx) Developer experience refers to how easy or difficult it is for a developer to perform essential tasks needed to implement a change. A positive developer experience would mean these tasks are relatively easy for the team (see measures below). The essential tasks are identified below. Build - Verify that changes are free of syntax error and compile. Test - Verify that all automated tests pass. Start - Launch end-to-end to simulate execution in a deployed environment. Debug - Attach debugger to started solution, set breakpoints, step through code, and inspect variables. If effort is invested to make these activities as easy as possible, the returns on that effort will increase the longer the project runs, and the larger the team is . Defining End-to-End This document makes several references to running a solution end-to-end (aka E2E). End-to-end for the purposes of this document is scoped to the software that is owned, built, and shipped by the team. Systems owned by other teams or third-party vendors is not within the E2E scope for the purposes of this document. Goals Maximize the amount of time engineers spend on writing code that fulfills story acceptance and done-done criteria. Minimize the amount of time spent manual setup and configuration of tooling Minimize regressions and new defects by making end-to-end testing easy Impact Developer experience can have a significant impact on the efficiency of the day-to-day execution of the team. A positive experience can pay dividends throughout the lifetime of the project; especially as new developers join the team. Increased Velocity - Team spends less time on non-value-add activities such as dev/local environment setup, waiting on remote environments to test, and rework (fixing defects). Improved Quality - When it's easy to debug and test, developers will do more of it. This will translate to fewer defects being introduced. Easier Onboarding & Adoption - When dev essential tasks are automated, there is less documentation to write and, subsequently, less to read to get started! Most importantly, the customer will continue to accrue these benefits long after the code-with engagement. Measures Time to First E2E Result (aka F5 Contract) Assuming a laptop/pc that has never run the solution, how long does it take to set up and run the whole system end-to-end and see a result. Time To First Commit How long does it take to make a change that can be verified/tested locally. A locally verified/tested change is one that passes test cases without introducing regression or breaking changes. Participation Providing a positive developer experience is a team effort. However, certain members can take ownership of different areas to help hold the entire team accountable. Dev Lead - Set the Bar The following are examples of how the Dev Lead might set the bar for dev experience Determines development environment (suggested IDE, hosting, etc) Determines source control environment and number of repos required Given development environment and repo structure, sets expectations for team to meet in terms of steps to perform the essential dev tasks Nominates the DevEx Champion IDE choice is NOT intended to mandate that all team members must use the same IDE. However, this choice will direct where tight-integration investment will be prioritized. 
For example, if Visual Studio Code is the suggested IDE then, the team would focus on integrating VS code tasks and launch configurations over similar integrations for other IDEs. Team members should still feel free to use their preferred IDE as long as it does not negatively impact the team. DevEx Champion - Identify Iterative Improvements The DevEx champion takes ownership in holding the team accountable for providing a positive developer experience. The following outline responsibilities for the DevEx champion. Actively seek opportunities for improving the solution developer experience Work with the Dev Lead to iteratively improve team expectations for developer experience Curate a backlog actionable stories that identify areas for improvement and prioritize with respect to project delivery goals by engaging directly with the Product Owner and Customer. Serve as subject-matter expert for the rest of the team. Help the team determine how to implement DevEx expectations and identify deviations. Team Members - Assert Expectations The team members of the team can also help hold each other accountable for providing a positive developer experience. The following are examples of areas team members can help identify where the team's DevEx expectations are not being met. Pull requests. Try the changes locally to see if they are adhering to the team's DevEx expectations. Design Reviews. Look for proposals that may negatively affect the solution's DevEx. These might include Introduction of new tech whose testability is limited to manual steps in a deployed environment. Addition of new repository New Team Members - Identify Iterative Improvements New team members are uniquely positioned to identify instances of undocumented Collective Wisdom . The following outlines responsibilities of new team members as it relates to DevEx: If you come across missing, incomplete or incorrect documentation while onboarding, you should record the issue as a new defect(s) and assign it to the product owner to triage. If no onboarding documentation exists, note the steps you took in a new user story. Assign the new story to the product owner to triage. Facilitation Guidance The following outline examples of several strategies that can be adopted to promote a positive developer experience. It is expected that each team should define what a positive dev experience means within the context of their project. Additionally, refine that over time via feedback mechanisms such as sprint and project retrospectives. Establish Hotkeys Assign hotkeys to each of the essential tasks. Task Windows Build CTRL+SHIFT+B Test CTRL+R,T Start With Debugging F5 The F5 Contract The F5 contract aims for the ability to run the end-to-end solution with the following steps. Clone - git clone [ my-repo-url-here ] Configure - set any configuration values that need to be unique to the individual (i.e. update a .env file) Press F5 - launch the solution with debugging attached. Most IDEs have some form of a task runner that can be used to automate the build, execute, and attach steps. Try to leverage these such that the steps can all be run with as few manual steps as possible. DevEx Champion Actively Seek Improvements The DevEx champion should actively seek areas where the team has opportunity to improve. For example, do they need to deploy their changes to an environment off their laptop before they can validate if what they did worked. Rather than debugging locally, do they have to do this repetitively to get to a working solution? 
Does this take several minutes each iteration? Does this block other developers due to the contention on the environment? The following are ceremonies that the DevEx champion can use to find potential opportunities Retrospectives. Is feedback being raised that relates to the essential tasks being difficult or unwieldy? Standup Blockers. Are individuals getting blocked or stumbling on the essential tasks? As opportunities are identified, the DevEx champion can translate these into actionable stories for the product backlog. Make Tasks Cross Platform For essential tasks being standardized during the engagement, ensure that different platforms are accounted for. Team members may have different operating systems and ensuring the tasks are cross-platform will provide an additional opportunity to improve the experience. See the making tasks cross platform recipe for guidance on how tasks can be configured to include different platforms. Create an Onboarding Guide When welcoming new team members to the engagement, there are many areas for them to get adjusted to and bring them up to speed including codebase, coding standards, team agreements, and team culture. By adopting a strong onboarding practice such as an onboarding guide in a centralized location that explains the scope of the project, processes, setup details, and software required, new members can have all the necessary resources for them to be efficient, successful and a valuable team member from the start. See the onboarding guide recipe for guidance on what an onboarding guide may look like. Standardize Essential Tasks Apply a common strategy across solution components for performing the essential tasks Standardize the configuration for solution components Standardize the way tests are run for each component Standardize the way each component is started and stopped locally Standardize how to document the essential tasks for each component This standardization will enable the team to more easily automate these tasks across all components at the solution level. See Solution-level Essential Tasks below. Solution-level Essential Tasks Automate the ability to execute each essential task across all solution components. An example would be mapping the build action in the IDE to run the build task for each component in the solution. More importantly, configure the IDE start action to start all components within the solution. This will provide significant efficiency for the engineering team when dealing with multi-component solutions. When this is not implemented, the engineers must repeat each of the essential tasks manually for each component in the solution. In this situation, the number of steps required to perform each essential task is multiplied by the number of components in the system [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [many solution components] = TOO MANY STEPS VS. [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [1 solution] = MINIMUM NUMBER OF STEPS Observability Observability alleviates unforeseen challenges for the developer in a complex distributed system. It identifies project bottlenecks quicker and with more precision, enhancing performance as the developer seeks to deploy code changes. Adding observability improves the experience when identifying and resolving bugs or broken code. This results in fewer or less severe current and future production failures. 
There are many observability strategies a developer can use alongside best engineering practices. These resources improve the DevEx by ensuring a shared view of the complex system throughout the entire lifecycle. Observability in code via logging, exception handling and exposing of relevant application metrics for example, promotes the consistent visibility of real time performance. The observability pillars, logging , metrics , and tracing , detail when to enable each of the three specific types of observability. Minimize the Number of Repositories Splitting a solution across multiple repositories can negatively impact the above measures. This can also negatively impact other areas such as Pull Requests, Automated Testing, Continuous Integration, and Continuous Delivery. Similar to the IDE instances, the negative impact is multiplied by the number of repositories. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [many source code repositories] = TOO MANY STEPS VS. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [1 source code repository] = MINIMUM NUMBER OF STEPS Atomic Pull Requests When the solution is encapsulated within a single repository, it also allows pull requests to represent a change across multiple layers. This is especially helpful when a change requires changes to a shared contract between multiple components. For example, a story requires that an api endpoint is changed. With this strategy the api and web client could be updated with the same pull request. This avoids the main branch being broken temporarily while waiting on dependent pull requests to merge. Minimize Remote Dependencies for Local Development The fewer dependencies on components that cannot run a developer's machine translate to fewer steps required to get started. Therefore, fewer dependencies will positively impact the measures above. The following strategies can be used to reduce these dependencies Use an Emulator If available, emulators are implementations of technologies that are typically only available in cloud environments. A good example is the CosmosDB emulator . Use DI + Toggle to Mock Remote Dependencies When the solution depends on a technology that cannot be run on a developer's machine, the setup and testing of that solution can be challenging. One strategy that can be employed is to create the ability to swap that dependency for one that can run locally. Abstract the layer that has the remote dependency behind an interface owned by the solution (not the remote dependency). Create an implementation of that interface using a technology that can be run locally. Create a factory that decides which instance to use. This decision could be based on environment configuration (i.e. the toggle). Then, the original class that depends on the remote tech instead should depend on the factory to provide which instance to use. Much of this strategy can be simplified with proper dependency injection technique and/or framework. See example below that swaps Azure Service Bus implementation for RabbitMQ which can be run locally. 
interface IPublisher { send ( message : string ) : void } class RabbitMQPublisher implements IPublisher { send ( message : string ) { //todo: send the message via RabbitMQ } } class AzureServiceBusPublisher implements IPublisher { send ( message : string ) { //todo: send the message via Azure Service Bus } } interface IPublisherFactory { create () : IPublisher } class PublisherFactory implements IPublisherFactory { create () : IPublisher { // use env var value to determine which instance should be used if ( process . env . UseAsb ){ return new AzureServiceBusPublisher (); } else { return new RabbitMQPublisher (); } } } class MyService { //inject the factory constructor ( private readonly publisherFactory : IPublisherFactory ){ } sendAMessage ( message : string ) : void { //use the factory to determine which instance to use const publisher : IPublisher = this . publisherFactory . create (); publisher . send ( message ); } } The recipes section has a more complete discussion on DI as part of a high productivity inner dev loop","title":"Developer Experience (DevEx)"},{"location":"developer-experience/#developer-experience-devex","text":"Developer experience refers to how easy or difficult it is for a developer to perform essential tasks needed to implement a change. A positive developer experience would mean these tasks are relatively easy for the team (see measures below). The essential tasks are identified below. Build - Verify that changes are free of syntax errors and compile. Test - Verify that all automated tests pass. Start - Launch end-to-end to simulate execution in a deployed environment. Debug - Attach debugger to started solution, set breakpoints, step through code, and inspect variables. If effort is invested to make these activities as easy as possible, the returns on that effort will increase the longer the project runs, and the larger the team is .","title":"Developer Experience (DevEx)"},{"location":"developer-experience/#defining-end-to-end","text":"This document makes several references to running a solution end-to-end (aka E2E). End-to-end for the purposes of this document is scoped to the software that is owned, built, and shipped by the team. Systems owned by other teams or third-party vendors are not within the E2E scope for the purposes of this document.","title":"Defining End-to-End"},{"location":"developer-experience/#goals","text":"Maximize the amount of time engineers spend on writing code that fulfills story acceptance and done-done criteria. Minimize the amount of time spent on manual setup and configuration of tooling Minimize regressions and new defects by making end-to-end testing easy","title":"Goals"},{"location":"developer-experience/#impact","text":"Developer experience can have a significant impact on the efficiency of the day-to-day execution of the team. A positive experience can pay dividends throughout the lifetime of the project; especially as new developers join the team. Increased Velocity - Team spends less time on non-value-add activities such as dev/local environment setup, waiting on remote environments to test, and rework (fixing defects). Improved Quality - When it's easy to debug and test, developers will do more of it. This will translate to fewer defects being introduced. Easier Onboarding & Adoption - When dev essential tasks are automated, there is less documentation to write and, subsequently, less to read to get started!
Most importantly, the customer will continue to accrue these benefits long after the code-with engagement.","title":"Impact"},{"location":"developer-experience/#measures","text":"","title":"Measures"},{"location":"developer-experience/#time-to-first-e2e-result-aka-f5-contract","text":"Assuming a laptop/pc that has never run the solution, how long does it take to set up and run the whole system end-to-end and see a result.","title":"Time to First E2E Result (aka F5 Contract)"},{"location":"developer-experience/#time-to-first-commit","text":"How long does it take to make a change that can be verified/tested locally. A locally verified/tested change is one that passes test cases without introducing regression or breaking changes.","title":"Time To First Commit"},{"location":"developer-experience/#participation","text":"Providing a positive developer experience is a team effort. However, certain members can take ownership of different areas to help hold the entire team accountable.","title":"Participation"},{"location":"developer-experience/#dev-lead-set-the-bar","text":"The following are examples of how the Dev Lead might set the bar for dev experience Determines development environment (suggested IDE, hosting, etc) Determines source control environment and number of repos required Given development environment and repo structure, sets expectations for team to meet in terms of steps to perform the essential dev tasks Nominates the DevEx Champion IDE choice is NOT intended to mandate that all team members must use the same IDE. However, this choice will direct where tight-integration investment will be prioritized. For example, if Visual Studio Code is the suggested IDE then, the team would focus on integrating VS code tasks and launch configurations over similar integrations for other IDEs. Team members should still feel free to use their preferred IDE as long as it does not negatively impact the team.","title":"Dev Lead - Set the Bar"},{"location":"developer-experience/#devex-champion-identify-iterative-improvements","text":"The DevEx champion takes ownership in holding the team accountable for providing a positive developer experience. The following outline responsibilities for the DevEx champion. Actively seek opportunities for improving the solution developer experience Work with the Dev Lead to iteratively improve team expectations for developer experience Curate a backlog actionable stories that identify areas for improvement and prioritize with respect to project delivery goals by engaging directly with the Product Owner and Customer. Serve as subject-matter expert for the rest of the team. Help the team determine how to implement DevEx expectations and identify deviations.","title":"DevEx Champion - Identify Iterative Improvements"},{"location":"developer-experience/#team-members-assert-expectations","text":"The team members of the team can also help hold each other accountable for providing a positive developer experience. The following are examples of areas team members can help identify where the team's DevEx expectations are not being met. Pull requests. Try the changes locally to see if they are adhering to the team's DevEx expectations. Design Reviews. Look for proposals that may negatively affect the solution's DevEx. These might include Introduction of new tech whose testability is limited to manual steps in a deployed environment. 
Addition of new repository","title":"Team Members - Assert Expectations"},{"location":"developer-experience/#new-team-members-identify-iterative-improvements","text":"New team members are uniquely positioned to identify instances of undocumented Collective Wisdom . The following outlines responsibilities of new team members as it relates to DevEx: If you come across missing, incomplete or incorrect documentation while onboarding, you should record the issue as a new defect(s) and assign it to the product owner to triage. If no onboarding documentation exists, note the steps you took in a new user story. Assign the new story to the product owner to triage.","title":"New Team Members - Identify Iterative Improvements"},{"location":"developer-experience/#facilitation-guidance","text":"The following outline examples of several strategies that can be adopted to promote a positive developer experience. It is expected that each team should define what a positive dev experience means within the context of their project. Additionally, refine that over time via feedback mechanisms such as sprint and project retrospectives.","title":"Facilitation Guidance"},{"location":"developer-experience/#establish-hotkeys","text":"Assign hotkeys to each of the essential tasks. Task Windows Build CTRL+SHIFT+B Test CTRL+R,T Start With Debugging F5","title":"Establish Hotkeys"},{"location":"developer-experience/#the-f5-contract","text":"The F5 contract aims for the ability to run the end-to-end solution with the following steps. Clone - git clone [ my-repo-url-here ] Configure - set any configuration values that need to be unique to the individual (i.e. update a .env file) Press F5 - launch the solution with debugging attached. Most IDEs have some form of a task runner that can be used to automate the build, execute, and attach steps. Try to leverage these such that the steps can all be run with as few manual steps as possible.","title":"The F5 Contract"},{"location":"developer-experience/#devex-champion-actively-seek-improvements","text":"The DevEx champion should actively seek areas where the team has opportunity to improve. For example, do they need to deploy their changes to an environment off their laptop before they can validate if what they did worked. Rather than debugging locally, do they have to do this repetitively to get to a working solution? Does this take several minutes each iteration? Does this block other developers due to the contention on the environment? The following are ceremonies that the DevEx champion can use to find potential opportunities Retrospectives. Is feedback being raised that relates to the essential tasks being difficult or unwieldy? Standup Blockers. Are individuals getting blocked or stumbling on the essential tasks? As opportunities are identified, the DevEx champion can translate these into actionable stories for the product backlog.","title":"DevEx Champion Actively Seek Improvements"},{"location":"developer-experience/#make-tasks-cross-platform","text":"For essential tasks being standardized during the engagement, ensure that different platforms are accounted for. Team members may have different operating systems and ensuring the tasks are cross-platform will provide an additional opportunity to improve the experience. 
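One lightweight way to standardize this (a minimal sketch only; the wrapper-script idea, the scripts/test.sh and test.cmd paths, and the use of Node's built-in child_process module are illustrative assumptions, not part of the recipe itself) is to hide the platform difference behind a small script that every task invokes: import { execSync } from 'child_process' ; /* pick the platform-appropriate test script */ const script = process . platform === 'win32' ? '.\\\\scripts\\\\test.cmd' : './scripts/test.sh' ; /* run it in the current shell and stream its output */ execSync ( script , { stdio : 'inherit' } ); Each IDE task or CI step can then call the same wrapper regardless of operating system.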
See the making tasks cross platform recipe for guidance on how tasks can be configured to include different platforms.","title":"Make Tasks Cross Platform"},{"location":"developer-experience/#create-an-onboarding-guide","text":"When welcoming new team members to the engagement, there are many areas for them to get adjusted to and bring them up to speed including codebase, coding standards, team agreements, and team culture. By adopting a strong onboarding practice such as an onboarding guide in a centralized location that explains the scope of the project, processes, setup details, and software required, new members can have all the necessary resources for them to be efficient, successful and a valuable team member from the start. See the onboarding guide recipe for guidance on what an onboarding guide may look like.","title":"Create an Onboarding Guide"},{"location":"developer-experience/#standardize-essential-tasks","text":"Apply a common strategy across solution components for performing the essential tasks Standardize the configuration for solution components Standardize the way tests are run for each component Standardize the way each component is started and stopped locally Standardize how to document the essential tasks for each component This standardization will enable the team to more easily automate these tasks across all components at the solution level. See Solution-level Essential Tasks below.","title":"Standardize Essential Tasks"},{"location":"developer-experience/#solution-level-essential-tasks","text":"Automate the ability to execute each essential task across all solution components. An example would be mapping the build action in the IDE to run the build task for each component in the solution. More importantly, configure the IDE start action to start all components within the solution. This will provide significant efficiency for the engineering team when dealing with multi-component solutions. When this is not implemented, the engineers must repeat each of the essential tasks manually for each component in the solution. In this situation, the number of steps required to perform each essential task is multiplied by the number of components in the system [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [many solution components] = TOO MANY STEPS VS. [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [1 solution] = MINIMUM NUMBER OF STEPS","title":"Solution-level Essential Tasks"},{"location":"developer-experience/#observability","text":"Observability alleviates unforeseen challenges for the developer in a complex distributed system. It identifies project bottlenecks quicker and with more precision, enhancing performance as the developer seeks to deploy code changes. Adding observability improves the experience when identifying and resolving bugs or broken code. This results in fewer or less severe current and future production failures. There are many observability strategies a developer can use alongside best engineering practices. These resources improve the DevEx by ensuring a shared view of the complex system throughout the entire lifecycle. Observability in code via logging, exception handling and exposing of relevant application metrics for example, promotes the consistent visibility of real time performance. 
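As a minimal illustration of what observability in code can look like (a sketch only; the OrderService name is hypothetical and a plain console logger stands in for whatever logging or telemetry library the project actually uses): class OrderService { processOrder ( orderId : string ) : void { const start = Date . now (); console . info ( `processing order ${ orderId }` ); try { /* ...business logic... */ } catch ( err ) { /* log the failure with enough context to debug it */ console . error ( `order ${ orderId } failed` , err ); throw err ; } finally { /* emit a simple duration measurement for every attempt */ console . info ( `order ${ orderId } took ${ Date . now () - start } ms` ); } } } Even this small amount of structure makes local debugging and production triage noticeably easier.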
The observability pillars, logging , metrics , and tracing , detail when to enable each of the three specific types of observability.","title":"Observability"},{"location":"developer-experience/#minimize-the-number-of-repositories","text":"Splitting a solution across multiple repositories can negatively impact the above measures. This can also negatively impact other areas such as Pull Requests, Automated Testing, Continuous Integration, and Continuous Delivery. Similar to the IDE instances, the negative impact is multiplied by the number of repositories. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [many source code repositories] = TOO MANY STEPS VS. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [1 source code repository] = MINIMUM NUMBER OF STEPS","title":"Minimize the Number of Repositories"},{"location":"developer-experience/#atomic-pull-requests","text":"When the solution is encapsulated within a single repository, it also allows pull requests to represent a change across multiple layers. This is especially helpful when a change requires changes to a shared contract between multiple components. For example, a story requires that an api endpoint is changed. With this strategy the api and web client could be updated with the same pull request. This avoids the main branch being broken temporarily while waiting on dependent pull requests to merge.","title":"Atomic Pull Requests"},{"location":"developer-experience/#minimize-remote-dependencies-for-local-development","text":"The fewer dependencies on components that cannot run a developer's machine translate to fewer steps required to get started. Therefore, fewer dependencies will positively impact the measures above. The following strategies can be used to reduce these dependencies","title":"Minimize Remote Dependencies for Local Development"},{"location":"developer-experience/#use-an-emulator","text":"If available, emulators are implementations of technologies that are typically only available in cloud environments. A good example is the CosmosDB emulator .","title":"Use an Emulator"},{"location":"developer-experience/#use-di-toggle-to-mock-remote-dependencies","text":"When the solution depends on a technology that cannot be run on a developer's machine, the setup and testing of that solution can be challenging. One strategy that can be employed is to create the ability to swap that dependency for one that can run locally. Abstract the layer that has the remote dependency behind an interface owned by the solution (not the remote dependency). Create an implementation of that interface using a technology that can be run locally. Create a factory that decides which instance to use. This decision could be based on environment configuration (i.e. the toggle). Then, the original class that depends on the remote tech instead should depend on the factory to provide which instance to use. Much of this strategy can be simplified with proper dependency injection technique and/or framework. See example below that swaps Azure Service Bus implementation for RabbitMQ which can be run locally. 
interface IPublisher { send ( message : string ) : void } class RabbitMQPublisher implements IPublisher { send ( message : string ) { //todo: send the message via RabbitMQ } } class AzureServiceBusPublisher implements IPublisher { send ( message : string ) { //todo: send the message via Azure Service Bus } } interface IPublisherFactory { create () : IPublisher } class PublisherFactory implements IPublisherFactory { create () : IPublisher { // use env var value to determine which instance should be used if ( process . env . UseAsb ){ return new AzureServiceBusPublisher (); } else { return new RabbitMQPublisher (); } } } class MyService { //inject the factory constructor ( private readonly publisherFactory : IPublisherFactory ){ } sendAMessage ( message : string ) : void { //use the factory to determine which instance to use const publisher : IPublisher = this . publisherFactory . create (); publisher . send ( message ); } } The recipes section has a more complete discussion on DI as part of a high productivity inner dev loop","title":"Use DI + Toggle to Mock Remote Dependencies"},{"location":"developer-experience/client-app-inner-loop/","text":"Separating Client Apps from the Services They Consume During Development Client Apps typically rely on remote services to power their apps. However, development schedules between the client app and the services don't always fully align. For a high velocity inner dev loop, client app development must be decoupled from the backend services while still allowing the app to \"invoke\" the services for local testing. Options Several options exist to decouple client app development from the backend services. The options range from embedding mock implementations of the services into the application to relying on simplified versions of the services. This document lists several options and discusses trade-offs. Embedded Mocks An embedded mock solution includes classes that implement the service interfaces locally. Interfaces and data classes, also called models or data transfer objects or DTOs, are often generated from the services' API specs using tools like nswag ( RicoSuter/NSwag: The Swagger/OpenAPI toolchain for .NET, ASP.NET Core and TypeScript. (github.com) ) or autorest ( Azure/autorest: OpenAPI (f.k.a Swagger) Specification code generator. Supports C#, PowerShell, Go, Java, Node.js, TypeScript, Python, Ruby (github.com) ). A simple service implementation can return a static response. For RESTful services, the JSON responses for the stubs can be stored as application resources or simply as static strings. public Task < UserProfile > GetUserAsync ( long userId , CancellationToken cancellationToken ) { UserProfile result = Newtonsoft . Json . JsonConvert . DeserializeObject < UserProfile > ( MockUserProfile . UserProfile , new Newtonsoft . Json . JsonSerializerSettings ()); return Task . FromResult ( result ); } More sophisticated implementations can randomly return errors to test the app's resiliency code paths. Mocks can be activated via conditional compilation or dynamically via app configuration. In either case, it is recommended to ensure that mocks, service responses and externalized configurations are not included in the final release to avoid confusing behavior and inclusion of potential vulnerabilities. Sample: Registering Mocks via Dependency Injection Dependency Injection Containers like Unity ( Unity Container Introduction | Unity Container ) make it easy to switch between mock services and real service client implementations.
Since both implement the same interface, implementations can be registered with the Unity container. public static void Bootstrap ( IUnityContainer container ) { #if DEBUG container . RegisterSingleton < IUserServiceClient , MockUserService > (); #else container . RegisterSingleton < IUserServiceClient , UserServiceClient > (); #endif } Consuming Mocks via Dependency Injection The code consuming the interfaces will not notice the difference. public class UserPageModel { private readonly IUserServiceClient userServiceClient ; public UserPageModel ( IUserServiceClient userServiceClient ) { this . userServiceClient = userServiceClient ; } // ... } Local Services The approach with Locally Running Services is to replace the call in the client from pointing to the actual endpoint (whether dev, QA, prod, etc.) to a local endpoint. This approach also enables injecting traffic capture and shaping proxies like Postman ( Postman API Platform | Sign Up for Free ) or Fiddler ( Fiddler | Web Debugging Proxy and Troubleshooting Solutions (telerik.com) ). The advantage of this approach is that the APIs are decoupled from the client and can be independently updated/modified (e.g. changing response codes, changing data) without requiring changes to the client. This helps to unlock new development scenarios and provides flexibility during the development phase. The challenge with this approach is that it does require setup, configuration, and running of the services locally. There are tools that help to simplify that process (e.g. JsonServer , Postman Mock Server ). High-Fidelity Local Services A local service stub implements the expected APIs. Just like the embedded mock, it can be generated based on existing API contracts (e.g. OpenAPI). A high-fidelity approach packages the real services together with simplified data in docker containers that can be run locally using docker-compose before the client app is started for local debugging and testing. To enable running services fully local the \"local version\" substitutes dependent cloud services with local alternatives, e.g. file storage instead of blobs, locally running SQL Server instead of SQL AzureDB. This approach also enables full fidelity integration testing without spinning up distributed deployments. Stub / Fake Services Lower fidelity approaches run stub services, that could be generated from API specs, or run fake servers like JsonServer ( JsonServer.io: A fake json server API Service for prototyping and testing. ) or Postman. All these services would respond with predetermined and configured JSON messages. How to Decide Pros Cons Example when developing for: Example When not to Use Embedded Mocks Simplifies the F5 developer experience Tightly coupled with Client More static type data scenarios Testing (e.g. unit tests, integration tests) No external dependencies to manage Hard coded data Initial integration with services Mocking via Dependency Injection can be a non-trivial effort High-Fidelity Local Services Loosely Coupled from Client Extra tooling required i.e. local infrastructure overhead URL Routes When API contract are not available Easier to independently modify response Extra setup and configuration of services Independent updates to services Can utilize HTTP traffic Easier to replace with real services at a later time Stub/Fake Services Loosely coupled from client Extra tooling required i.e. 
local infrastructure overhead Response Codes When API Contracts are available Easier to independently modify response Extra setup and configuration of services Complex/variable data scenarios When API Contracts are not available Independent updates to services Might not provide full fidelity of expected API Can utilize HTTP traffic Easier to replace with real services at a later time","title":"Separating Client Apps from the Services They Consume During Development"},{"location":"developer-experience/client-app-inner-loop/#separating-client-apps-from-the-services-they-consume-during-development","text":"Client Apps typically rely on remote services to power their apps. However, development schedules between the client app and the services don't always fully align. For a high velocity inner dev loop, client app development must be decoupled from the backend services while still allowing the app to \"invoke\" the services for local testing.","title":"Separating Client Apps from the Services They Consume During Development"},{"location":"developer-experience/client-app-inner-loop/#options","text":"Several options exist to decouple client app development from the backend services. The options range from embedding mock implementations of the services into the application to relying on simplified versions of the services. This document lists several options and discusses trade-offs.","title":"Options"},{"location":"developer-experience/client-app-inner-loop/#embedded-mocks","text":"An embedded mock solution includes classes that implement the service interfaces locally. Interfaces and data classes, also called models or data transfer objects or DTOs, are often generated from the services' API specs using tools like nswag ( RicoSuter/NSwag: The Swagger/OpenAPI toolchain for .NET, ASP.NET Core and TypeScript. (github.com) ) or autorest ( Azure/autorest: OpenAPI (f.k.a Swagger) Specification code generator. Supports C#, PowerShell, Go, Java, Node.js, TypeScript, Python, Ruby (github.com) ). A simple service implementation can return a static response. For RESTful services, the JSON responses for the stubs can be stored as application resources or simply as static strings. public Task < UserProfile > GetUserAsync ( long userId , CancellationToken cancellationToken ) { UserProfile result = Newtonsoft . Json . JsonConvert . DeserializeObject < UserProfile > ( MockUserProfile . UserProfile , new Newtonsoft . Json . JsonSerializerSettings ()); return Task . FromResult ( result ); } More sophisticated implementations can randomly return errors to test the app's resiliency code paths. Mocks can be activated via conditional compilation or dynamically via app configuration. In either case, it is recommended to ensure that mocks, service responses and externalized configurations are not included in the final release to avoid confusing behavior and inclusion of potential vulnerabilities.","title":"Embedded Mocks"},{"location":"developer-experience/client-app-inner-loop/#sample-registering-mocks-via-dependency-injection","text":"Dependency Injection Containers like Unity ( Unity Container Introduction | Unity Container ) make it easy to switch between mock services and real service client implementations. Since both implement the same interface, implementations can be registered with the Unity container. public static void Bootstrap ( IUnityContainer container ) { #if DEBUG container . RegisterSingleton < IUserServiceClient , MockUserService > (); #else container .
RegisterSingleton < IUserServiceClient , UserServiceClient > (); #endif }","title":"Sample: Registering Mocks via Dependency Injection"},{"location":"developer-experience/client-app-inner-loop/#consuming-mocks-via-dependency-injection","text":"The code consuming the interfaces will not notice the difference. public class UserPageModel { private readonly IUserServiceClient userServiceClient ; public UserPageModel ( IUserServiceClient userServiceClient ) { this . userServiceClient = userServiceClient ; } // ... }","title":"Consuming Mocks via Dependency Injection"},{"location":"developer-experience/client-app-inner-loop/#local-services","text":"The approach with Locally Running Services is to replace the call in the client from pointing to the actual endpoint (whether dev, QA, prod, etc.) to a local endpoint. This approach also enables injecting traffic capture and shaping proxies like Postman ( Postman API Platform | Sign Up for Free ) or Fiddler ( Fiddler | Web Debugging Proxy and Troubleshooting Solutions (telerik.com) ). The advantage of this approach is that the APIs are decoupled from the client and can be independently updated/modified (e.g. changing response codes, changing data) without requiring changes to the client. This helps to unlock new development scenarios and provides flexibility during the development phase. The challenge with this approach is that it does require setup, configuration, and running of the services locally. There are tools that help to simplify that process (e.g. JsonServer , Postman Mock Server ).","title":"Local Services"},{"location":"developer-experience/client-app-inner-loop/#high-fidelity-local-services","text":"A local service stub implements the expected APIs. Just like the embedded mock, it can be generated based on existing API contracts (e.g. OpenAPI). A high-fidelity approach packages the real services together with simplified data in docker containers that can be run locally using docker-compose before the client app is started for local debugging and testing. To enable running services fully local the \"local version\" substitutes dependent cloud services with local alternatives, e.g. file storage instead of blobs, locally running SQL Server instead of SQL AzureDB. This approach also enables full fidelity integration testing without spinning up distributed deployments.","title":"High-Fidelity Local Services"},{"location":"developer-experience/client-app-inner-loop/#stub-fake-services","text":"Lower fidelity approaches run stub services, that could be generated from API specs, or run fake servers like JsonServer ( JsonServer.io: A fake json server API Service for prototyping and testing. ) or Postman. All these services would respond with predetermined and configured JSON messages.","title":"Stub / Fake Services"},{"location":"developer-experience/client-app-inner-loop/#how-to-decide","text":"Pros Cons Example when developing for: Example When not to Use Embedded Mocks Simplifies the F5 developer experience Tightly coupled with Client More static type data scenarios Testing (e.g. unit tests, integration tests) No external dependencies to manage Hard coded data Initial integration with services Mocking via Dependency Injection can be a non-trivial effort High-Fidelity Local Services Loosely Coupled from Client Extra tooling required i.e. 
local infrastructure overhead URL Routes When API contracts are not available Easier to independently modify response Extra setup and configuration of services Independent updates to services Can utilize HTTP traffic Easier to replace with real services at a later time Stub/Fake Services Loosely coupled from client Extra tooling required i.e. local infrastructure overhead Response Codes When API Contracts are available Easier to independently modify response Extra setup and configuration of services Complex/variable data scenarios When API Contracts are not available Independent updates to services Might not provide full fidelity of expected API Can utilize HTTP traffic Easier to replace with real services at a later time","title":"How to Decide"},{"location":"developer-experience/copilots/","text":"Copilots There are a number of AI tools that can improve the developer experience. This article will discuss tooling that is available as well as advice on when it might be appropriate to use such tooling. GitHub Copilot The current version of GitHub Copilot can provide code completion in many popular IDEs. For instance, the VS Code extension that can be installed from the VS Code Marketplace. It requires a GitHub account to use. For more information about what IDEs are supported, what languages are supported, cost, features, etc., please check out the information on Copilot and Copilot for Business . Some example use-cases for GitHub Copilot include: Write Documentation . For example, the above paragraph was written using Copilot. Write Unit Tests . Given that setup and assertions are often consistent across unit tests, Copilot tends to be very accurate. Unblock . It is often hard to start writing when staring at a blank page; Copilot can fill the space with something that may or may not be what you ultimately want to do, but it can help get you in the right head space. If you want Copilot to write something useful for you, try writing a comment that describes what your code is going to do - it can often take it from there. GitHub Copilot Labs Copilot has a GitHub Copilot labs extension that offers additional features that are not yet ready for prime-time. For VS Code, you can install it from the VS Code Marketplace. These features include: Explain . Copilot can explain what the code is doing in natural language. Translate . Copilot can translate code from one language to another. Brushes . You can select code that Copilot then modifies inline based on a \"brush\" you select, for example, to make the code more readable, fix bugs, improve debugging, document, etc. Generate Tests . Copilot can generate unit tests for your code. Though currently this is limited to JavaScript and TypeScript. GitHub Copilot X The next version of Copilot offers a number of new use-cases beyond code completion. These include: Chat . Rather than just providing code completion, Copilot will be able to have a conversation with you about what you want to do. It has context about the code you are working on and can provide suggestions based on that context. Beyond just writing code, consider using chat to: Build SQL Indexes . Given a query, Copilot can generate a SQL index that will improve the performance of the query. Write Regular Expressions . These are notoriously difficult to write, but Copilot can generate them for you if you give some sample input and describe what you want to extract. Improve and Validate . If you are unsure of the implications of writing code a particular way, you can ask questions about it.
For instance, you might ask if there is a way to write the code that is more performant or uses less memory. Once it gives you an opinion, you can ask it to provide documentation validating that assertion. Explain . Copilot can explain what the code is doing in natural language. Write Code . Given prompting by the developer it can write code that you can one-click deploy into existing or new files. Debug . Copilot can analyze your code and propose solutions to fix bugs. It can do most of what Labs can do with \"brushes\" as \"topics\", but whereas Labs changes the code in your file, the chat functionality just shows what it would change in the window. However, there is also an \"inline mode\" for GitHub Copilot Chat that allows you to make changes to your code inline which does not have this same limitation. ChatGPT / Bing Chat For coding, generic AI chat tools such as ChatGPT and Bing Chat are less useful, but they still have their place. GitHub Copilot will only answer \"questions about coding\" and it's interpretation of that rule can be a little restrictive. Some cases for using ChatGPT or Bing Chat include: Write Documentation . Copilot can write documentation, but using ChatGPT or Bing Chat, you can expand your documentation to include business information, use-cases, additional context, etc. Change Perspective . ChatGPT can impersonate a persona or even a system and answer questions from that perspective. For example, you can ask it to explain what a particular piece of code does from the perspective of a user. You might have ChatGPT imagine it is a database administrator and ask it to explain how to improve a particular query. When using Bing Chat, experiment with modes, sometimes changing to Creative Mode can give the results you need. Prompt Engineering Chat AI tools are only as good as the prompts you give them. The quality and appropriateness of the output can vary greatly depending on the prompt. In addition, many of these tools restrict the number of prompts you can send in a given amount of time. To learn more about prompt engineering, you might review some open source documentation here . Considerations It is important when using AI tools to understand how the data (including private or commercial code) might be used by the system. Read more about how GitHub Copilot handles your data and code here .","title":"Copilots"},{"location":"developer-experience/copilots/#copilots","text":"There are a number of AI tools that can improve the developer experience. This article will discuss tooling that is available as well as advice on when it might be appropriate to use such tooling.","title":"Copilots"},{"location":"developer-experience/copilots/#github-copilot","text":"The current version of GitHub Copilot can provide code completion in many popular IDEs. For instance, the VS Code extension that can be installed from the VS Code Marketplace. It requires a GitHub account to use. For more information about what IDEs are supported, what languages are supported, cost, features, etc., please checkout out the information on Copilot and Copilot for Business . Some example use-cases for GitHub Copilot include: Write Documentation . For example, the above paragraph was written using Copilot. Write Unit Tests . Given that setup and assertions are often consistent across unit tests, Copilot tends to be very accurate. Unblock . 
It is often hard start writing when staring at a blank page, Copilot can fill the space with something that may or may not be what you ultimately want to do, but it can help get you in the right head space. If you want Copilot to write something useful for you, try writing a comment that describes what your code is going to do - it can often take it from there.","title":"GitHub Copilot"},{"location":"developer-experience/copilots/#github-copilot-labs","text":"Copilot has a GitHub Copilot labs extension that offers additional features that are not yet ready for prime-time. For VS Code, you can install it from the VS Code Marketplace. These features include: Explain . Copilot can explain what the code is doing in natural language. Translate . Copilot can translate code from one language to another. Brushes . You can select code that Copilot then modifies inline based on a \"brush\" you select, for example, to make the code more readable, fix bugs, improve debugging, document, etc. Generate Tests . Copilot can generate unit tests for your code. Though currently this is limited to JavaScript and TypeScript.","title":"GitHub Copilot Labs"},{"location":"developer-experience/copilots/#github-copilot-x","text":"The next version of Copilot offers a number of new use-cases beyond code completion. These include: Chat . Rather than just providing code completion, Copilot will be able to have a conversation with you about what you want to do. It has context about the code you are working on and can provide suggestions based on that context. Beyond just writing code, consider using chat to: Build SQL Indexes . Given a query, Copilot can generate a SQL index that will improve the performance of the query. Write Regular Expressions . These are notoriously difficult to write, but Copilot can generate them for you if you give some sample input and describe what you want to extract. Improve and Validate . If you are unsure of the implications of writing code a particular way, you can ask questions about it. For instance, you might ask if there is a way to write the code that is more performant or uses less memory. Once it gives you an opinion, you can ask it to provide documentation validating that assertion. Explain . Copilot can explain what the code is doing in natural language. Write Code . Given prompting by the developer it can write code that you can one-click deploy into existing or new files. Debug . Copilot can analyze your code and propose solutions to fix bugs. It can do most of what Labs can do with \"brushes\" as \"topics\", but whereas Labs changes the code in your file, the chat functionality just shows what it would change in the window. However, there is also an \"inline mode\" for GitHub Copilot Chat that allows you to make changes to your code inline which does not have this same limitation.","title":"GitHub Copilot X"},{"location":"developer-experience/copilots/#chatgpt-bing-chat","text":"For coding, generic AI chat tools such as ChatGPT and Bing Chat are less useful, but they still have their place. GitHub Copilot will only answer \"questions about coding\" and it's interpretation of that rule can be a little restrictive. Some cases for using ChatGPT or Bing Chat include: Write Documentation . Copilot can write documentation, but using ChatGPT or Bing Chat, you can expand your documentation to include business information, use-cases, additional context, etc. Change Perspective . ChatGPT can impersonate a persona or even a system and answer questions from that perspective. 
For example, you can ask it to explain what a particular piece of code does from the perspective of a user. You might have ChatGPT imagine it is a database administrator and ask it to explain how to improve a particular query. When using Bing Chat, experiment with modes, sometimes changing to Creative Mode can give the results you need.","title":"ChatGPT / Bing Chat"},{"location":"developer-experience/copilots/#prompt-engineering","text":"Chat AI tools are only as good as the prompts you give them. The quality and appropriateness of the output can vary greatly depending on the prompt. In addition, many of these tools restrict the number of prompts you can send in a given amount of time. To learn more about prompt engineering, you might review some open source documentation here .","title":"Prompt Engineering"},{"location":"developer-experience/copilots/#considerations","text":"It is important when using AI tools to understand how the data (including private or commercial code) might be used by the system. Read more about how GitHub Copilot handles your data and code here .","title":"Considerations"},{"location":"developer-experience/cross-platform-tasks/","text":"Cross Platform Tasks There are several options to alleviate cross-platform compatibility issues. Running tasks in a container Using the tasks-system in VS Code which provides options to allow commands to be executed specific to an operating system. Docker or Container Based Using containers as development machines allows developers to get started with minimal setup and abstracts the development environment from the host OS by having it run in a container. DevContainers can also help in standardizing the local developer experience across the team. The following are some good resources to get started with running tasks in DevContainers Developing inside a container . Tutorial on Development in Containers For samples projects and dev container templates see VS Code Dev Containers Recipe Dev Containers Library Tasks in VSCode Running Node.js The example below offers insight into running Node.js executable as a command with tasks.json and how it can be treated differently on Windows and Linux. { \"label\" : \"Run Node\" , \"type\" : \"process\" , \"windows\" : { \"command\" : \"C:\\\\Program Files\\\\nodejs\\\\node.exe\" }, \"linux\" : { \"command\" : \"/usr/bin/node\" } } In this example, to run Node.js, there is a specific windows command, and a specific linux command. This allows for platform specific properties. When these are defined, they will be used instead of the default properties when the command is executed on the Windows operating system or on Linux. Custom Tasks Not all scripts or tasks can be auto-detected in the workspace. It may be necessary at times to defined your own custom tasks. In this example, we have a script to run in order to set up some environment correctly. The script is stored in a folder inside your workspace and named test.sh for Linux & macOS and test.cmd for Windows. With the tasks.json file, the execution of this script can be made possible with a custom task that defines what to do on different operating systems. { \"version\" : \"2.0.0\" , \"tasks\" : [ { \"label\" : \"Run tests\" , \"type\" : \"shell\" , \"command\" : \"./scripts/test.sh\" , \"windows\" : { \"command\" : \".\\\\scripts\\\\test.cmd\" }, \"group\" : \"test\" , \"presentation\" : { \"reveal\" : \"always\" , \"panel\" : \"new\" } } ] } The command here is a shell command and tells the system to run either the test.sh or test.cmd. 
By default, it will run test.sh with that given path. This example here also defines Windows specific properties and tells it to execute test.cmd instead of the default. Resources VS Code Docs - operating system specific properties","title":"Cross Platform Tasks"},{"location":"developer-experience/cross-platform-tasks/#cross-platform-tasks","text":"There are several options to alleviate cross-platform compatibility issues. Running tasks in a container Using the tasks-system in VS Code which provides options to allow commands to be executed specific to an operating system.","title":"Cross Platform Tasks"},{"location":"developer-experience/cross-platform-tasks/#docker-or-container-based","text":"Using containers as development machines allows developers to get started with minimal setup and abstracts the development environment from the host OS by having it run in a container. DevContainers can also help in standardizing the local developer experience across the team. The following are some good resources to get started with running tasks in DevContainers Developing inside a container . Tutorial on Development in Containers For sample projects and dev container templates see VS Code Dev Containers Recipe Dev Containers Library","title":"Docker or Container Based"},{"location":"developer-experience/cross-platform-tasks/#tasks-in-vscode","text":"","title":"Tasks in VSCode"},{"location":"developer-experience/cross-platform-tasks/#running-nodejs","text":"The example below offers insight into running the Node.js executable as a command with tasks.json and how it can be treated differently on Windows and Linux. { \"label\" : \"Run Node\" , \"type\" : \"process\" , \"windows\" : { \"command\" : \"C:\\\\Program Files\\\\nodejs\\\\node.exe\" }, \"linux\" : { \"command\" : \"/usr/bin/node\" } } In this example, to run Node.js, there is a specific windows command, and a specific linux command. This allows for platform specific properties. When these are defined, they will be used instead of the default properties when the command is executed on the Windows operating system or on Linux.","title":"Running Node.js"},{"location":"developer-experience/cross-platform-tasks/#custom-tasks","text":"Not all scripts or tasks can be auto-detected in the workspace. It may be necessary at times to define your own custom tasks. In this example, we have a script to run in order to set up some environment correctly. The script is stored in a folder inside your workspace and named test.sh for Linux & macOS and test.cmd for Windows. With the tasks.json file, the execution of this script can be made possible with a custom task that defines what to do on different operating systems. { \"version\" : \"2.0.0\" , \"tasks\" : [ { \"label\" : \"Run tests\" , \"type\" : \"shell\" , \"command\" : \"./scripts/test.sh\" , \"windows\" : { \"command\" : \".\\\\scripts\\\\test.cmd\" }, \"group\" : \"test\" , \"presentation\" : { \"reveal\" : \"always\" , \"panel\" : \"new\" } } ] } The command here is a shell command and tells the system to run either the test.sh or test.cmd. By default, it will run test.sh with that given path.
This example here also defines Windows specific properties and tells it execute test.cmd instead of the default.","title":"Custom Tasks"},{"location":"developer-experience/cross-platform-tasks/#resources","text":"VS Code Docs - operating system specific properties","title":"Resources"},{"location":"developer-experience/devcontainers-getting-started/","text":"Dev Containers: Getting Started If you are a developer and have experience with Visual Studio Code (VS Code) or Docker, then it's probably time you look at development containers (dev containers). This readme is intended to assist developers in the decision-making process needed to build dev containers. The guidance provided should be especially helpful if you are experiencing VS Code dev containers for the first time. Note: This guide is not about setting up a Docker file for deploying a running Python program for CI/CD. Prerequisites Experience with VS Code Experience with Docker What are Dev Containers? Development containers are a VS Code feature that allows developers to package a local development tool stack into the internals of a Docker container while also bringing the VS Code UI experience with them. Have you ever set a breakpoint inside a Docker container? Maybe not. Dev containers make that possible. This is all made possible through a VS Code extension called the Remote Development Extension Pack that works together with Docker to spin-up a VS Code Server within a Docker container. The VS Code UI component remains local, but your working files are volume mounted into the container. The diagram below, taken directly from the official VS Code docs , illustrates this: If the above diagram is not clear, a basic analogy that might help you intuitively understand dev containers is to think of them as a union between Docker's interactive mode ( docker exec -it 987654e0ff32 ), and the VS Code UI experience that you are used to. To set yourself up for the dev container experience described above, use your VS Code's Extension Marketplace to install the Remote Development Extension Pack . How can Dev Containers Improve Project Collaboration? VS Code dev containers have improved project collaboration between developers on recent team projects by addressing two very specific problems: Inconsistent local developer experiences within a team. Slow onboarding of developers joining a project. The problems listed above were addressed by configuring and then sharing a dev container definition. Dev containers are defined by their base image, and the artifacts that support that base image. The base image and the artifacts that come with it live in the .devcontainer directory. This directory is where configuration begins. A central artifact to the dev container definition is a configuration file called devcontainer.json . This file orchestrates the artifacts needed to support the base image and the dev container lifecycle. Installation of the Remote Development Extension Pack is required to enable this orchestration within a project repo. All developers on the team are expected to share and use the dev container definition (.devcontainer directory) in order to spin-up a container. This definition provides consistent tooling for locally developing an application across a team. The code snippets below demonstrate the common location of a .devcontainer directory and devcontainer.json file within a project repository. They also highlight the correct way to reference a Docker file. 
$ tree vs-code-remote-try-python # main repo directory \u2514\u2500\u2500\u2500.devcontainers \u251c\u2500\u2500\u2500Dockerfile \u251c\u2500\u2500\u2500devcontainer.json # devco nta i ner .jso n { \"name\" : \"Python 3\" , \"build\" : { \"dockerfile\" : \"Dockerfile\" , \"context\" : \"..\" , // Update 'VARIANT' to pick a Python version: 3, 3.6, 3.7, 3.8 \"args\" : { \"VARIANT\" : \"3.8\" } }, } For a list of devcontainer.json configuration properties, visit VS Code documentation on dev container properties . How do I Decide Which Dev Container is Right for my Use Case? Fortunately, VS Code has a repo gallery of platform specific folders that host dev container definitions (.devcontainer directories) to make getting started with dev containers easier. The code snippet below shows a list of gallery folders that come directly from the VS Code dev container gallery repo : $ tree vs-code-dev-containers # main repo directory \u2514\u2500\u2500\u2500containers \u251c\u2500\u2500\u2500dotnetcore | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500python-3 | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500ubuntu | \u2514\u2500\u2500\u2500.devcontainers # dev container \u2514\u2500\u2500\u2500.... Here are the final high-level steps it takes to build a dev container: Decide which platform you'd like to build a local development tool stack around. Browse the VS Code provided dev container gallery of project folders that target your platform and choose the most appropriate one. Inspect the dev container definitions (.devcontainer directory) of a project for the base image, and the artifacts that support that base image. Use what you've discovered to begin setting up the dev container as it is, extending it or building your own from scratch. Going further There are use cases where you would want to go further in configuring your Dev Container. More details here","title":"Dev Containers: Getting Started"},{"location":"developer-experience/devcontainers-getting-started/#dev-containers-getting-started","text":"If you are a developer and have experience with Visual Studio Code (VS Code) or Docker, then it's probably time you look at development containers (dev containers). This readme is intended to assist developers in the decision-making process needed to build dev containers. The guidance provided should be especially helpful if you are experiencing VS Code dev containers for the first time. Note: This guide is not about setting up a Docker file for deploying a running Python program for CI/CD.","title":"Dev Containers: Getting Started"},{"location":"developer-experience/devcontainers-getting-started/#prerequisites","text":"Experience with VS Code Experience with Docker","title":"Prerequisites"},{"location":"developer-experience/devcontainers-getting-started/#what-are-dev-containers","text":"Development containers are a VS Code feature that allows developers to package a local development tool stack into the internals of a Docker container while also bringing the VS Code UI experience with them. Have you ever set a breakpoint inside a Docker container? Maybe not. Dev containers make that possible. This is all made possible through a VS Code extension called the Remote Development Extension Pack that works together with Docker to spin-up a VS Code Server within a Docker container. The VS Code UI component remains local, but your working files are volume mounted into the container. 
The diagram below, taken directly from the official VS Code docs , illustrates this: If the above diagram is not clear, a basic analogy that might help you intuitively understand dev containers is to think of them as a union between Docker's interactive mode ( docker exec -it 987654e0ff32 ), and the VS Code UI experience that you are used to. To set yourself up for the dev container experience described above, use your VS Code's Extension Marketplace to install the Remote Development Extension Pack .","title":"What are Dev Containers?"},{"location":"developer-experience/devcontainers-getting-started/#how-can-dev-containers-improve-project-collaboration","text":"VS Code dev containers have improved project collaboration between developers on recent team projects by addressing two very specific problems: Inconsistent local developer experiences within a team. Slow onboarding of developers joining a project. The problems listed above were addressed by configuring and then sharing a dev container definition. Dev containers are defined by their base image, and the artifacts that support that base image. The base image and the artifacts that come with it live in the .devcontainer directory. This directory is where configuration begins. A central artifact to the dev container definition is a configuration file called devcontainer.json . This file orchestrates the artifacts needed to support the base image and the dev container lifecycle. Installation of the Remote Development Extension Pack is required to enable this orchestration within a project repo. All developers on the team are expected to share and use the dev container definition (.devcontainer directory) in order to spin-up a container. This definition provides consistent tooling for locally developing an application across a team. The code snippets below demonstrate the common location of a .devcontainer directory and devcontainer.json file within a project repository. They also highlight the correct way to reference a Docker file. $ tree vs-code-remote-try-python # main repo directory \u2514\u2500\u2500\u2500.devcontainers \u251c\u2500\u2500\u2500Dockerfile \u251c\u2500\u2500\u2500devcontainer.json # devco nta i ner .jso n { \"name\" : \"Python 3\" , \"build\" : { \"dockerfile\" : \"Dockerfile\" , \"context\" : \"..\" , // Update 'VARIANT' to pick a Python version: 3, 3.6, 3.7, 3.8 \"args\" : { \"VARIANT\" : \"3.8\" } }, } For a list of devcontainer.json configuration properties, visit VS Code documentation on dev container properties .","title":"How can Dev Containers Improve Project Collaboration?"},{"location":"developer-experience/devcontainers-getting-started/#how-do-i-decide-which-dev-container-is-right-for-my-use-case","text":"Fortunately, VS Code has a repo gallery of platform specific folders that host dev container definitions (.devcontainer directories) to make getting started with dev containers easier. The code snippet below shows a list of gallery folders that come directly from the VS Code dev container gallery repo : $ tree vs-code-dev-containers # main repo directory \u2514\u2500\u2500\u2500containers \u251c\u2500\u2500\u2500dotnetcore | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500python-3 | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500ubuntu | \u2514\u2500\u2500\u2500.devcontainers # dev container \u2514\u2500\u2500\u2500.... 
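Before the final high-level steps below, note that a dev container definition does not have to build from a Dockerfile at all: `devcontainer.json` can point straight at a prebuilt image. Here is a minimal sketch (the image tag is illustrative, pick one that matches the platform you chose from the gallery):

```json
{
    "name": "Python 3 (image-based sketch)",
    "image": "mcr.microsoft.com/devcontainers/python:3"
}
```

Starting from an image like this is often the quickest way to try dev containers before extending a gallery definition or building your own.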
Here are the final high-level steps it takes to build a dev container: Decide which platform you'd like to build a local development tool stack around. Browse the VS Code provided dev container gallery of project folders that target your platform and choose the most appropriate one. Inspect the dev container definitions (.devcontainer directory) of a project for the base image, and the artifacts that support that base image. Use what you've discovered to begin setting up the dev container as it is, extending it or building your own from scratch.","title":"How do I Decide Which Dev Container is Right for my Use Case?"},{"location":"developer-experience/devcontainers-getting-started/#going-further","text":"There are use cases where you would want to go further in configuring your Dev Container. More details here","title":"Going further"},{"location":"developer-experience/devcontainers-going-further/","text":"Dev Containers: Going further Dev Containers allow developers to share a common working environment, ensuring that the runtime and all dependencies versions are consistent for all developers. Dev containers also allow us to: Leverage existing tools to enhance the Dev Containers with more features, Provide custom tools (such as scripts) for other developers. Existing tools In the development phase, you will most probably need to use tools not installed by default in your Dev Container. For instance, if your project's target is to be deployed on Azure, you will need Azure-cli and maybe Terraform for resources and application deployment. You can find such Dev Containers in the VS Code dev container gallery repo . Some other tools may be: Linters for markdown files, Linters for bash scripts, Etc... Linting files that are not the source code can ensure a common format with common rules for each developer. These checks should be also run in a Continuous Integration Pipeline , but it is a good practice to run them prior opening a Pull Request . Limitation of custom tools If you decide to include Azure-cli in your Dev Container, developers will be able to run commands against their tenant. However, to make the developers' lives easier, we could go further by letting them prefill their connection information, such as the tenant ID and the subscription ID in a secure and persistent way (do not forget that your Dev Container, being a Docker container, might get deleted, or the image could be rebuilt, hence, all customization inside will be lost). One way to achieve this is to leverage environment variables, with untracked .env file part of the solution being injected in the Dev Container. Consider the following files structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500config | \u251c\u2500\u2500\u2500.env | \u251c\u2500\u2500\u2500.env-sample The file config/.env-sample is a tracked file where anyone can find environment variables to set (with no values, obviously): TENANT_ID = SUBSCRIPTION_ID = Then, each developer who clones the repository can create the file config/.env and fills it in with the appropriate values. In order now to inject the .env file into the container, you can update the file devcontainer.json with the following: { ... \"runArgs\" : [ \"--env-file\" , \"config/.env\" ], ... } As soon as the Dev Container is started, these environment variables are sent to the container. 
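To illustrate how those injected values might then be consumed, here is a hypothetical helper script (for example `scripts/az-login.sh`, a name assumed for this sketch and not part of the original layout) that a developer could run inside the Dev Container. It only uses standard Azure CLI commands:

```bash
#!/usr/bin/env bash
# Minimal sketch: sign in to Azure inside the Dev Container using the
# TENANT_ID and SUBSCRIPTION_ID injected from config/.env via "runArgs".
set -euo pipefail

# Fail fast with a clear message if the untracked config/.env was not created.
: "${TENANT_ID:?TENANT_ID is not set - did you create config/.env?}"
: "${SUBSCRIPTION_ID:?SUBSCRIPTION_ID is not set - did you create config/.env?}"

az login --tenant "$TENANT_ID"
az account set --subscription "$SUBSCRIPTION_ID"
```

Because the values live in an untracked file on the host, this keeps working even after the container or its image is rebuilt.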
Another approach would be to use Docker Compose, a little bit more complex, and probably too much for just environment variables. Using Docker Compose can unlock other settings such as custom dns, ports forwarding or multiple containers. To achieve this, you need to add a file .devcontainer/docker-compose.yml with the following: version : '3' services : my-workspace : env_file : ../config/.env build : context : . dockerfile : Dockerfile command : sleep infinity To use the docker-compose.yml file instead of Dockerfile , we need to adjust devcontainer.json with: { \"name\" : \"My Application\" , \"dockerComposeFile\" : [ \"docker-compose.yml\" ], \"service\" : \"my-workspace\" ... } This approach can be applied for many other tools by preparing what would be required. The idea is to simplify developers' lives and new developers joining the project. Custom tools While working on a project, any developer might end up writing a script to automate a task. This script can be in bash , python or whatever scripting language they are comfortable with. Let's say you want to ensure that all markdown files written are validated against specific rules you have set up. As we have seen above, you can include the tool markdownlint in your Dev Container . Having the tool installed does not mean developer will know how to use it! Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500scripts | \u251c\u2500\u2500\u2500check-markdown.sh \u2514\u2500\u2500\u2500.markdownlint.json The file .devcontainer/Dockerfile installs markdownlint ... RUN apt-get update \\ && export DEBIAN_FRONTEND = noninteractive \\ && apt-get install -y nodejs npm # Add NodeJS tools RUN npm install -g markdownlint-cli ... The file .markdownlint.json contains the rules you want to validate in your markdown files (please refer to the markdownlint site for details). And finally, the script scripts/check-markdown.sh contains the following code to execute markdownlint : # Get the repository root repoRoot = \" $( cd \" $( dirname \" ${ BASH_SOURCE [0] } \" ) /..\" >/dev/null 2 > & 1 && pwd ) \" # Execute markdownlint for the entire solution markdownlint -c \" ${ repoRoot } \" /.markdownlint.json When the Dev Container is loaded, any developer can now run this script in their terminal: /> ./scripts/check-markdown.sh This is a small use case, there are unlimited other possibilities to capitalize on work done by developers to save time. Other considerations Platform architecture When installing tooling, you also need to ensure that you know what host computers developers are using. All Intel based computers, whether they are running Windows, Linux or MacOs will have the same behavior. However, the latest Mac architecture (Apple M1/Silicon) being ARM64, means that the behavior is not the same when building Dev Containers. For instance, if you want to install Azure-cli in your Dev Container, you won't be able to do it the same way you do it for Intel based machines. On Intel based computers you can install the deb package. However, this package is not available on ARM architecture. The only way to install Azure-cli on Linux ARM is via the Python installer pip . 
To achieve this you need to check the architecture of the host building the Dev Container, either in the Dockerfile, or by calling an external bash script to install remaining tools not having a universal version. Here is a snippet to call from the Dockerfile: # If Intel based, then use the deb file if [[ ` dpkg --print-architecture ` == \"amd64\" ]] ; then sudo curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash ; else # arm based, install pip (and gcc) then azure-cli sudo apt-get -y install gcc python3 -m pip install --upgrade pip python3 -m pip install azure-cli fi Reuse of credentials for GitHub If you develop inside a Dev Container, you will also want to share your GitHub credentials between your host and the Dev Container. Doing so, you would avoid copying your ssh keys back and forth (if you are using ssh to access your repositories). One approach would be to mount your local ~/.ssh folder into your Dev Container. You can either use the mounts option of the devcontainer.json , or use Docker Compose Using mounts : { ... \"mounts\" : [ \"source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind\" ], ... } As you can see, ${localEnv:HOME} returns the host home folder, and it maps it to the container home folder. Using Docker Compose: version : '3' services : my-worspace : env_file : ../configs/.env build : context : . dockerfile : Dockerfile volumes : - \"~/.ssh:/home/alex/.ssh\" command : sleep infinity Please note that using Docker Compose requires to edit the devcontainer.json file as we have seen above. You can now access GitHub using the same credentials as your host machine, without worrying of persistence. Allow some customization As a final note, it is also interesting to leave developers some flexibility in their environment for customization. For instance, one might want to add aliases to their environment. However, changing the ~/.bashrc file in the Dev Container is not a good approach as the container might be destroyed. There are numerous ways to set persistence, here is one approach. Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500me | \u251c\u2500\u2500\u2500bashrc_extension The folder me is untracked in the repository, leaving developers the flexibility to add personal resources. One of these resources can be a .bashrc extension containing customization. For instance: # Sample alias alias gaa = \"git add --all\" We can now adapt our Dockerfile to load these changes when the Docker image is built (and of course, do nothing if there is no file): ... RUN echo \"[ -f PATH_TO_WORKSPACE/me/bashrc_extension ] && . PATH_TO_WORKSPACE/me/bashrc_extension\" >> ~/.bashrc ; ...","title":"Dev Containers: Going further"},{"location":"developer-experience/devcontainers-going-further/#dev-containers-going-further","text":"Dev Containers allow developers to share a common working environment, ensuring that the runtime and all dependencies versions are consistent for all developers. 
Dev containers also allow us to: Leverage existing tools to enhance the Dev Containers with more features, Provide custom tools (such as scripts) for other developers.","title":"Dev Containers: Going further"},{"location":"developer-experience/devcontainers-going-further/#existing-tools","text":"In the development phase, you will most probably need to use tools not installed by default in your Dev Container. For instance, if your project's target is to be deployed on Azure, you will need Azure-cli and maybe Terraform for resources and application deployment. You can find such Dev Containers in the VS Code dev container gallery repo . Some other tools may be: Linters for markdown files, Linters for bash scripts, Etc... Linting files that are not the source code can ensure a common format with common rules for each developer. These checks should be also run in a Continuous Integration Pipeline , but it is a good practice to run them prior opening a Pull Request .","title":"Existing tools"},{"location":"developer-experience/devcontainers-going-further/#limitation-of-custom-tools","text":"If you decide to include Azure-cli in your Dev Container, developers will be able to run commands against their tenant. However, to make the developers' lives easier, we could go further by letting them prefill their connection information, such as the tenant ID and the subscription ID in a secure and persistent way (do not forget that your Dev Container, being a Docker container, might get deleted, or the image could be rebuilt, hence, all customization inside will be lost). One way to achieve this is to leverage environment variables, with untracked .env file part of the solution being injected in the Dev Container. Consider the following files structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500config | \u251c\u2500\u2500\u2500.env | \u251c\u2500\u2500\u2500.env-sample The file config/.env-sample is a tracked file where anyone can find environment variables to set (with no values, obviously): TENANT_ID = SUBSCRIPTION_ID = Then, each developer who clones the repository can create the file config/.env and fills it in with the appropriate values. In order now to inject the .env file into the container, you can update the file devcontainer.json with the following: { ... \"runArgs\" : [ \"--env-file\" , \"config/.env\" ], ... } As soon as the Dev Container is started, these environment variables are sent to the container. Another approach would be to use Docker Compose, a little bit more complex, and probably too much for just environment variables. Using Docker Compose can unlock other settings such as custom dns, ports forwarding or multiple containers. To achieve this, you need to add a file .devcontainer/docker-compose.yml with the following: version : '3' services : my-workspace : env_file : ../config/.env build : context : . dockerfile : Dockerfile command : sleep infinity To use the docker-compose.yml file instead of Dockerfile , we need to adjust devcontainer.json with: { \"name\" : \"My Application\" , \"dockerComposeFile\" : [ \"docker-compose.yml\" ], \"service\" : \"my-workspace\" ... } This approach can be applied for many other tools by preparing what would be required. 
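For example (a sketch only, the extra variable names below are assumptions and not part of the original sample), the tracked config/.env-sample can simply grow by one line per tool that needs a developer-specific value:

```bash
# config/.env-sample - one entry per value each developer must provide locally
TENANT_ID=
SUBSCRIPTION_ID=
# hypothetical additions for other tooling
ACR_NAME=
TF_BACKEND_RESOURCE_GROUP=
```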
The idea is to simplify developers' lives and new developers joining the project.","title":"Limitation of custom tools"},{"location":"developer-experience/devcontainers-going-further/#custom-tools","text":"While working on a project, any developer might end up writing a script to automate a task. This script can be in bash , python or whatever scripting language they are comfortable with. Let's say you want to ensure that all markdown files written are validated against specific rules you have set up. As we have seen above, you can include the tool markdownlint in your Dev Container . Having the tool installed does not mean developer will know how to use it! Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500scripts | \u251c\u2500\u2500\u2500check-markdown.sh \u2514\u2500\u2500\u2500.markdownlint.json The file .devcontainer/Dockerfile installs markdownlint ... RUN apt-get update \\ && export DEBIAN_FRONTEND = noninteractive \\ && apt-get install -y nodejs npm # Add NodeJS tools RUN npm install -g markdownlint-cli ... The file .markdownlint.json contains the rules you want to validate in your markdown files (please refer to the markdownlint site for details). And finally, the script scripts/check-markdown.sh contains the following code to execute markdownlint : # Get the repository root repoRoot = \" $( cd \" $( dirname \" ${ BASH_SOURCE [0] } \" ) /..\" >/dev/null 2 > & 1 && pwd ) \" # Execute markdownlint for the entire solution markdownlint -c \" ${ repoRoot } \" /.markdownlint.json When the Dev Container is loaded, any developer can now run this script in their terminal: /> ./scripts/check-markdown.sh This is a small use case, there are unlimited other possibilities to capitalize on work done by developers to save time.","title":"Custom tools"},{"location":"developer-experience/devcontainers-going-further/#other-considerations","text":"","title":"Other considerations"},{"location":"developer-experience/devcontainers-going-further/#platform-architecture","text":"When installing tooling, you also need to ensure that you know what host computers developers are using. All Intel based computers, whether they are running Windows, Linux or MacOs will have the same behavior. However, the latest Mac architecture (Apple M1/Silicon) being ARM64, means that the behavior is not the same when building Dev Containers. For instance, if you want to install Azure-cli in your Dev Container, you won't be able to do it the same way you do it for Intel based machines. On Intel based computers you can install the deb package. However, this package is not available on ARM architecture. The only way to install Azure-cli on Linux ARM is via the Python installer pip . To achieve this you need to check the architecture of the host building the Dev Container, either in the Dockerfile, or by calling an external bash script to install remaining tools not having a universal version. 
Here is a snippet to call from the Dockerfile: # If Intel based, then use the deb file if [[ ` dpkg --print-architecture ` == \"amd64\" ]] ; then sudo curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash ; else # arm based, install pip (and gcc) then azure-cli sudo apt-get -y install gcc python3 -m pip install --upgrade pip python3 -m pip install azure-cli fi","title":"Platform architecture"},{"location":"developer-experience/devcontainers-going-further/#reuse-of-credentials-for-github","text":"If you develop inside a Dev Container, you will also want to share your GitHub credentials between your host and the Dev Container. Doing so, you would avoid copying your ssh keys back and forth (if you are using ssh to access your repositories). One approach would be to mount your local ~/.ssh folder into your Dev Container. You can either use the mounts option of the devcontainer.json , or use Docker Compose Using mounts : { ... \"mounts\" : [ \"source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind\" ], ... } As you can see, ${localEnv:HOME} returns the host home folder, and it maps it to the container home folder. Using Docker Compose: version : '3' services : my-worspace : env_file : ../configs/.env build : context : . dockerfile : Dockerfile volumes : - \"~/.ssh:/home/alex/.ssh\" command : sleep infinity Please note that using Docker Compose requires to edit the devcontainer.json file as we have seen above. You can now access GitHub using the same credentials as your host machine, without worrying of persistence.","title":"Reuse of credentials for GitHub"},{"location":"developer-experience/devcontainers-going-further/#allow-some-customization","text":"As a final note, it is also interesting to leave developers some flexibility in their environment for customization. For instance, one might want to add aliases to their environment. However, changing the ~/.bashrc file in the Dev Container is not a good approach as the container might be destroyed. There are numerous ways to set persistence, here is one approach. Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500me | \u251c\u2500\u2500\u2500bashrc_extension The folder me is untracked in the repository, leaving developers the flexibility to add personal resources. One of these resources can be a .bashrc extension containing customization. For instance: # Sample alias alias gaa = \"git add --all\" We can now adapt our Dockerfile to load these changes when the Docker image is built (and of course, do nothing if there is no file): ... RUN echo \"[ -f PATH_TO_WORKSPACE/me/bashrc_extension ] && . PATH_TO_WORKSPACE/me/bashrc_extension\" >> ~/.bashrc ; ...","title":"Allow some customization"},{"location":"developer-experience/execute-local-pipeline-with-docker/","text":"Executing Pipelines Locally Abstract Having the ability to execute pipeline activities locally has been identified as an opportunity to promote positive developer experience. In this document we will explore a solution which will allow us to have the local CI experience to be as similar as possible to the remote process in the CI server. Using the suggested method will allow us to: Build Lint Unit test E2E test Run Solution Be OS and environment agnostic. Enter Docker Compose Docker Compose allows you to build push or run multi-container Docker applications. 
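As a quick reminder of the entry points this gives us (the service name app is illustrative and matches the compose file defined later on this page):

```bash
docker-compose build        # build every image declared in docker-compose.yml
docker-compose up -d        # create and start all services detached, building images that do not exist yet
docker-compose up -d app    # start only the "app" service (skip the e2e tests)
docker-compose down         # stop and remove the containers
```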
Method of Work Dockerize your application(s), including a build step if possible. Add a step in your docker file to execute unit tests. Add a step in the docker file for linting. Create a new dockerfile, possibly in a different folder, which executes end-to-end tests against the cluster. Make sure the default endpoints are configurable (This will become handy in your remote CI server, where you will be able to test against a live environment, if you choose to). Create a docker-compose file which allows you to choose which of the services to run. The default will run all applications and tests, and an optional parameter can run specific services, for example only the application without the tests. Prerequisites Docker Optional: if you clone the sample app, you need to have dotnet core installed. Step by Step with Examples For this tutorial we are going to use a sample dotnet core api application . Here is the docker file for the sample app: # https://hub.docker.com/_/microsoft-dotnet FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build WORKDIR /app # copy csproj and restore as distinct layers COPY ./ ./ RUN dotnet restore RUN dotnet test # copy everything else and build app COPY SampleApp/. ./ RUN dotnet publish -c release -o out --no-restore # final stage/image FROM mcr.microsoft.com/dotnet/aspnet:5.0 WORKDIR /app COPY --from = build /app/out . ENTRYPOINT [ \"dotnet\" , \"SampleNetApi.dll\" ] This script restores all dependencies, builds and runs tests. The dotnet app includes stylecop which fails the build in case of linting issues. Next we will also create a dockerfile to perform an end-to-end test. Usually this will look like a set of scripts, or a dedicated app which performs actual HTTP calls to a running application. For the sake of simplicity the dockerfile itself will run a simple curl command: FROM alpine:3.7 RUN apk --no-cache add curl ENTRYPOINT [ \"curl\" , \"0.0.0.0:8080/weatherforecast\" ] Now we are ready to combine both of the dockerfiles in a docker-compose script: version: '3' services: app: image: app:0.01 build: context: . ports: - \"8080:80\" e2e: image: e2e:0.01 build: context: ./E2E The docker-compose script will launch the 2 dockerfiles, and it will build them if they were not built before. The following command will run docker compose: docker-compose up --build -d Once the images are up, you can make calls to the service. The e2e image will perform the set of e2e tests. If you want to skip the tests, you can simply tell compose to run a specific service by appending the name of the service, as follows: docker-compose up --build -d app Now you have a local script which builds and tests you application. The next step would be make your CI run the docker-compose script. Here is an example of a yaml file used by Azure DevOps pipelines: trigger: - master pool: vmImage: 'ubuntu-latest' variables: solution: '**/*.sln' buildPlatform: 'Any CPU' buildConfiguration: 'Release' steps: - task: DockerCompose@0 displayName: Build, Test, E2E inputs: action: Run services dockerComposeFile: docker-compose.yml - script: dotnet restore SampleApp - script: dotnet build --configuration $( buildConfiguration ) SampleApp displayName: 'dotnet build $(buildConfiguration)' In this script the first step is docker-compose, which uses the same file we created the previous steps. The next steps, do the same using scripts, and are here for comparison. 
By the end of this step, your CI effectively runs the same build and test commands you run locally.","title":"Executing Pipelines Locally"},{"location":"developer-experience/execute-local-pipeline-with-docker/#executing-pipelines-locally","text":"","title":"Executing Pipelines Locally"},{"location":"developer-experience/execute-local-pipeline-with-docker/#abstract","text":"Having the ability to execute pipeline activities locally has been identified as an opportunity to promote positive developer experience. In this document we will explore a solution which will allow us to have the local CI experience to be as similar as possible to the remote process in the CI server. Using the suggested method will allow us to: Build Lint Unit test E2E test Run Solution Be OS and environment agnostic.","title":"Abstract"},{"location":"developer-experience/execute-local-pipeline-with-docker/#enter-docker-compose","text":"Docker Compose allows you to build push or run multi-container Docker applications.","title":"Enter Docker Compose"},{"location":"developer-experience/execute-local-pipeline-with-docker/#method-of-work","text":"Dockerize your application(s), including a build step if possible. Add a step in your docker file to execute unit tests. Add a step in the docker file for linting. Create a new dockerfile, possibly in a different folder, which executes end-to-end tests against the cluster. Make sure the default endpoints are configurable (This will become handy in your remote CI server, where you will be able to test against a live environment, if you choose to). Create a docker-compose file which allows you to choose which of the services to run. The default will run all applications and tests, and an optional parameter can run specific services, for example only the application without the tests.","title":"Method of Work"},{"location":"developer-experience/execute-local-pipeline-with-docker/#prerequisites","text":"Docker Optional: if you clone the sample app, you need to have dotnet core installed.","title":"Prerequisites"},{"location":"developer-experience/execute-local-pipeline-with-docker/#step-by-step-with-examples","text":"For this tutorial we are going to use a sample dotnet core api application . Here is the docker file for the sample app: # https://hub.docker.com/_/microsoft-dotnet FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build WORKDIR /app # copy csproj and restore as distinct layers COPY ./ ./ RUN dotnet restore RUN dotnet test # copy everything else and build app COPY SampleApp/. ./ RUN dotnet publish -c release -o out --no-restore # final stage/image FROM mcr.microsoft.com/dotnet/aspnet:5.0 WORKDIR /app COPY --from = build /app/out . ENTRYPOINT [ \"dotnet\" , \"SampleNetApi.dll\" ] This script restores all dependencies, builds and runs tests. The dotnet app includes stylecop which fails the build in case of linting issues. Next we will also create a dockerfile to perform an end-to-end test. Usually this will look like a set of scripts, or a dedicated app which performs actual HTTP calls to a running application. For the sake of simplicity the dockerfile itself will run a simple curl command: FROM alpine:3.7 RUN apk --no-cache add curl ENTRYPOINT [ \"curl\" , \"0.0.0.0:8080/weatherforecast\" ] Now we are ready to combine both of the dockerfiles in a docker-compose script: version: '3' services: app: image: app:0.01 build: context: . 
ports: - \"8080:80\" e2e: image: e2e:0.01 build: context: ./E2E The docker-compose script will launch the 2 dockerfiles, and it will build them if they were not built before. The following command will run docker compose: docker-compose up --build -d Once the images are up, you can make calls to the service. The e2e image will perform the set of e2e tests. If you want to skip the tests, you can simply tell compose to run a specific service by appending the name of the service, as follows: docker-compose up --build -d app Now you have a local script which builds and tests you application. The next step would be make your CI run the docker-compose script. Here is an example of a yaml file used by Azure DevOps pipelines: trigger: - master pool: vmImage: 'ubuntu-latest' variables: solution: '**/*.sln' buildPlatform: 'Any CPU' buildConfiguration: 'Release' steps: - task: DockerCompose@0 displayName: Build, Test, E2E inputs: action: Run services dockerComposeFile: docker-compose.yml - script: dotnet restore SampleApp - script: dotnet build --configuration $( buildConfiguration ) SampleApp displayName: 'dotnet build $(buildConfiguration)' In this script the first step is docker-compose, which uses the same file we created the previous steps. The next steps, do the same using scripts, and are here for comparison. By the end of this step, your CI effectively runs the same build and test commands you run locally.","title":"Step by Step with Examples"},{"location":"developer-experience/fake-services-inner-loop/","text":"Fake Services Inner Dev Loop Introduction Consumers of remote services often find that their development cycle is not in sync with development of remote services, leaving developers of these consumers waiting for the remote services to \"catch up\". One approach to mitigate this issue and improve the inner dev loop is by decoupling and using Mock Services. Various Mock Service options are detailed here . This document will focus on providing an example using the Fake Services approach. API For our example API, we will work against a /User endpoint and the properties for User will be: id - int username - string firstName - string lastName - string email - string password - string phone - string userStatus - int Tooling For the Fake Service approach, we will be using Json-Server . Json-Server is a tool that provides the ability to fully fake REST APIs and run the server locally. It is designed to spin up REST APIs with CRUD functionality with minimal setup. Json-Server requires NodeJS and is installed via NPM. npm install -g json-server Setup In order to run Json-Server, it simply requires a source for data and will infer routes, etc. based on the data file. Note that additional customization can be performed for more advanced scenarios (e.g. custom routes). Details can be found here . 
For our example, we will use the following data file, db.json : { \"user\" : [ { \"id\" : 0 , \"username\" : \"user1\" , \"firstName\" : \"Kobe\" , \"lastName\" : \"Bryant\" , \"email\" : \"kobe@example.com\" , \"password\" : \"superSecure1\" , \"phone\" : \"(123) 123-1234\" , \"userStatus\" : 0 }, { \"id\" : 1 , \"username\" : \"user2\" , \"firstName\" : \"Shaquille\" , \"lastName\" : \"O'Neal\" , \"email\" : \"shaq@example.com\" , \"password\" : \"superSecure2\" , \"phone\" : \"(123) 123-1235\" , \"userStatus\" : 0 } ] } Run Running Json-Server can be performed by simply running: json-server --watch src/db.json Once running, the User endpoint can be hit on the default localhost port: http:/localhost:3000/user Note that Json-Server can be configured to use other ports using the following syntax: json-server --watch db.json --port 3004 Endpoint The endpoint can be tested by running curl against it and we can narrow down which user object to get back with the following command: curl http://localhost:3000/user/1 which, as expected, returns: { \"id\": 1, \"username\": \"user2\", \"firstName\": \"Shaquille\", \"lastName\": \"O'Neal\", \"email\": \"shaq@example.com\", \"password\": \"superSecure2\", \"phone\": \"(123) 123-1235\", \"userStatus\": 0 }","title":"Fake Services Inner Dev Loop"},{"location":"developer-experience/fake-services-inner-loop/#fake-services-inner-dev-loop","text":"","title":"Fake Services Inner Dev Loop"},{"location":"developer-experience/fake-services-inner-loop/#introduction","text":"Consumers of remote services often find that their development cycle is not in sync with development of remote services, leaving developers of these consumers waiting for the remote services to \"catch up\". One approach to mitigate this issue and improve the inner dev loop is by decoupling and using Mock Services. Various Mock Service options are detailed here . This document will focus on providing an example using the Fake Services approach.","title":"Introduction"},{"location":"developer-experience/fake-services-inner-loop/#api","text":"For our example API, we will work against a /User endpoint and the properties for User will be: id - int username - string firstName - string lastName - string email - string password - string phone - string userStatus - int","title":"API"},{"location":"developer-experience/fake-services-inner-loop/#tooling","text":"For the Fake Service approach, we will be using Json-Server . Json-Server is a tool that provides the ability to fully fake REST APIs and run the server locally. It is designed to spin up REST APIs with CRUD functionality with minimal setup. Json-Server requires NodeJS and is installed via NPM. npm install -g json-server","title":"Tooling"},{"location":"developer-experience/fake-services-inner-loop/#setup","text":"In order to run Json-Server, it simply requires a source for data and will infer routes, etc. based on the data file. Note that additional customization can be performed for more advanced scenarios (e.g. custom routes). Details can be found here . 
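As a hedged illustration of that customization (not required for the rest of this walkthrough), custom routes can be declared in a separate file, for instance routes.json, and passed to Json-Server:

```json
{
  "/api/*": "/$1"
}
```

Started with `json-server --watch db.json --routes routes.json`, a request to /api/user/1 is then rewritten to /user/1. This relies on the --routes option of the classic Json-Server releases; newer major versions have changed this behavior, so check the version you installed.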
For our example, we will use the following data file, db.json : { \"user\" : [ { \"id\" : 0 , \"username\" : \"user1\" , \"firstName\" : \"Kobe\" , \"lastName\" : \"Bryant\" , \"email\" : \"kobe@example.com\" , \"password\" : \"superSecure1\" , \"phone\" : \"(123) 123-1234\" , \"userStatus\" : 0 }, { \"id\" : 1 , \"username\" : \"user2\" , \"firstName\" : \"Shaquille\" , \"lastName\" : \"O'Neal\" , \"email\" : \"shaq@example.com\" , \"password\" : \"superSecure2\" , \"phone\" : \"(123) 123-1235\" , \"userStatus\" : 0 } ] }","title":"Setup"},{"location":"developer-experience/fake-services-inner-loop/#run","text":"Running Json-Server can be performed by simply running: json-server --watch src/db.json Once running, the User endpoint can be hit on the default localhost port: http:/localhost:3000/user Note that Json-Server can be configured to use other ports using the following syntax: json-server --watch db.json --port 3004","title":"Run"},{"location":"developer-experience/fake-services-inner-loop/#endpoint","text":"The endpoint can be tested by running curl against it and we can narrow down which user object to get back with the following command: curl http://localhost:3000/user/1 which, as expected, returns: { \"id\": 1, \"username\": \"user2\", \"firstName\": \"Shaquille\", \"lastName\": \"O'Neal\", \"email\": \"shaq@example.com\", \"password\": \"superSecure2\", \"phone\": \"(123) 123-1235\", \"userStatus\": 0 }","title":"Endpoint"},{"location":"developer-experience/onboarding-guide-template/","text":"Onboarding Guide Template When developing an onboarding document for a team, it should contain details of engagement scope, team processes, codebase, coding standards, team agreements, software requirements and setup details. The onboarding guide can be used as an index to project specific content if it already exists elsewhere. Allowing this guide to be utilized as a foundation with the links will help keep the guide concise and effective. Overview and Goals List a few sentences explaining the high-level summary and the scope of the engagement. Consider adding any additional background and context as needed. Include the value proposition of the project, goals, what success looks like, and what the team is trying to achieve and why. Contacts List a few of the main contacts for the team and project overall such as the Dev Lead and Product Owner. Consider including the roles of these main contacts so that the team knows who to reach out to depending on the situation. Team Agreement and Code of Conduct Include the team's code of conduct or agreement that defines a set of expectation from each team member and how the team has agreed to operate. Working Agreement Template - working agreement Dev Environment Setup Consider adding steps to run the project end-to-end. This could be in form of a separate wiki page or document that can be linked here. Include any software that needs to be downloaded and specify if a specific version of the software is needed. Project Building Blocks This can include a more in depth description with different areas of the project to help increase the project understanding. It can include different sections on the various components of the project including deployment, e2e testing, repositories. 
Resources This can include any additional links to documents related to the project It may include links to backlog items, work items, wiki pages or project history.","title":"Onboarding Guide Template"},{"location":"developer-experience/onboarding-guide-template/#onboarding-guide-template","text":"When developing an onboarding document for a team, it should contain details of engagement scope, team processes, codebase, coding standards, team agreements, software requirements and setup details. The onboarding guide can be used as an index to project specific content if it already exists elsewhere. Allowing this guide to be utilized as a foundation with the links will help keep the guide concise and effective.","title":"Onboarding Guide Template"},{"location":"developer-experience/onboarding-guide-template/#overview-and-goals","text":"List a few sentences explaining the high-level summary and the scope of the engagement. Consider adding any additional background and context as needed. Include the value proposition of the project, goals, what success looks like, and what the team is trying to achieve and why.","title":"Overview and Goals"},{"location":"developer-experience/onboarding-guide-template/#contacts","text":"List a few of the main contacts for the team and project overall such as the Dev Lead and Product Owner. Consider including the roles of these main contacts so that the team knows who to reach out to depending on the situation.","title":"Contacts"},{"location":"developer-experience/onboarding-guide-template/#team-agreement-and-code-of-conduct","text":"Include the team's code of conduct or agreement that defines a set of expectation from each team member and how the team has agreed to operate. Working Agreement Template - working agreement","title":"Team Agreement and Code of Conduct"},{"location":"developer-experience/onboarding-guide-template/#dev-environment-setup","text":"Consider adding steps to run the project end-to-end. This could be in form of a separate wiki page or document that can be linked here. Include any software that needs to be downloaded and specify if a specific version of the software is needed.","title":"Dev Environment Setup"},{"location":"developer-experience/onboarding-guide-template/#project-building-blocks","text":"This can include a more in depth description with different areas of the project to help increase the project understanding. It can include different sections on the various components of the project including deployment, e2e testing, repositories.","title":"Project Building Blocks"},{"location":"developer-experience/onboarding-guide-template/#resources","text":"This can include any additional links to documents related to the project It may include links to backlog items, work items, wiki pages or project history.","title":"Resources"},{"location":"developer-experience/toggle-vnet-dev-environment/","text":"Toggle VNet On and Off for Production and Development Environment Problem Statement When deploying resources on Azure in a secure environment, resources are usually created behind a Private Network (VNet), without public access and with private endpoints to consume resources. This is the recommended approach for pre-production or production environments. 
Accessing protected resources from a local machine implies one of the following options: Use a VPN Use a jump box With SSH activated (less secure) With Bastion (recommended approach) However, a developer may want to deploy a test environment (in a non-production subscription) for their tests during development phase, without the complexity of networking. In addition, infrastructure code should not be duplicated: it has to be the same whether resources are deployed in a production like environment or in development environment. Option The idea is to offer, via a single boolean variable , the option to deploy resources behind a VNet or not using one infrastructure code base. Securing resources behind a VNet usually implies that public accesses are disabled and private endpoints are created. This is something to have in mind because, as a developer, public access must be activated in order to connect to this environment. The deployment pipeline will set these resources behind a VNet and will secure them by removing public accesses. Developers will be able to run the same deployment script, specifying that resources will not be behind a VNet nor have public accesses disabled. Let's consider the following use case: we want to deploy a VNet, a subnet, a storage account with no public access and a private endpoint for the table. The magic variable that will help toggling security will be called behind_vnet , of type boolean. Let's implement this use case using Terraform . The code below does not contain everything, the purpose is to show the pattern and not how to deploy these resources. For more information on Terraform, please refer to the official documentation . There is no if per se in Terraform to define whether a specific resource should be deployed or not based on a variable value. However, we can use the count meta-argument. The strength of this meta-argument is if its value is 0 , the block is skipped. Here is below the code snippets for this deployment: variables.tf variable \"behind_vnet\" { type = bool } main.tf resource \"azurerm_virtual_network\" \"vnet\" { count = var.behind_vnet ? 1 : 0 name = \"MyVnet\" address_space = [ x.x.x.x / 16 ] resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" ... subnet { name = \"subnet_1\" address_prefix = \"x.x.x.x/24\" } } resource \"azurerm_storage_account\" \"storage_account\" { name = \"storage\" resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" tags = var.tags ... public_network_access_enabled = var.behind_vnet ? false : true } resource \"azurerm_private_endpoint\" \"storage_account_table_private_endpoint\" { count = var.behind_vnet ? 1 : 0 name = \"pe-storage\" subnet_id = azurerm_virtual_network.vnet[0].subnet[0].id ... private_service_connection { name = \"psc-storage\" private_connection_resource_id = azurerm_storage_account.storage_account.id subresource_names = [ \"table\" ] ... } private_dns_zone_group { name = \"privateDnsZoneGroup\" ... } } If we run terraform apply -var behind_vnet = true then all the resources above will be deployed, and it is what we want on a pre-production or production environment. The instruction count = var.behind_vnet ? 1 : 0 will set count with the value 1 , therefore blocks will be executed. However, if we run terraform apply -var behind_vnet = false the azurerm_virtual_network and azurerm_private_endpoint resources will be skipped (because count will be 0 ). 
The resource azurerm_storage_account will be created, with minor differences in some properties: for instance, here, public_network_access_enabled will be set to true (and this is the goal for a developer to be able to access resources created). The same pattern can be applied over and over for the entire infrastructure code. Conclusion With this approach, the same infrastructure code base can be used to target a production like environment with secured resources behind a VNet with no public accesses and also a more permissive development environment. However, there are a couple of trade-offs with this approach: if a resource has the count argument, it needs to be treated as a list, and not a single item. In the example above, if there is a need to reference the resource azurerm_virtual_network later in the code, azurerm_virtual_network.vnet.id will not work. The following must be used azurerm_virtual_network.vnet[0].id # First (and only) item of the collection The meta-argument count cannot be used with for_each for a whole block. That means that the use of loops to deploy multiple endpoints for instance will not work. Each private endpoints will need to be deployed individually.","title":"Toggle VNet On and Off for Production and Development Environment"},{"location":"developer-experience/toggle-vnet-dev-environment/#toggle-vnet-on-and-off-for-production-and-development-environment","text":"","title":"Toggle VNet On and Off for Production and Development Environment"},{"location":"developer-experience/toggle-vnet-dev-environment/#problem-statement","text":"When deploying resources on Azure in a secure environment, resources are usually created behind a Private Network (VNet), without public access and with private endpoints to consume resources. This is the recommended approach for pre-production or production environments. Accessing protected resources from a local machine implies one of the following options: Use a VPN Use a jump box With SSH activated (less secure) With Bastion (recommended approach) However, a developer may want to deploy a test environment (in a non-production subscription) for their tests during development phase, without the complexity of networking. In addition, infrastructure code should not be duplicated: it has to be the same whether resources are deployed in a production like environment or in development environment.","title":"Problem Statement"},{"location":"developer-experience/toggle-vnet-dev-environment/#option","text":"The idea is to offer, via a single boolean variable , the option to deploy resources behind a VNet or not using one infrastructure code base. Securing resources behind a VNet usually implies that public accesses are disabled and private endpoints are created. This is something to have in mind because, as a developer, public access must be activated in order to connect to this environment. The deployment pipeline will set these resources behind a VNet and will secure them by removing public accesses. Developers will be able to run the same deployment script, specifying that resources will not be behind a VNet nor have public accesses disabled. Let's consider the following use case: we want to deploy a VNet, a subnet, a storage account with no public access and a private endpoint for the table. The magic variable that will help toggling security will be called behind_vnet , of type boolean. Let's implement this use case using Terraform . 
The code below does not contain everything, the purpose is to show the pattern and not how to deploy these resources. For more information on Terraform, please refer to the official documentation . There is no if per se in Terraform to define whether a specific resource should be deployed or not based on a variable value. However, we can use the count meta-argument. The strength of this meta-argument is if its value is 0 , the block is skipped. Here is below the code snippets for this deployment: variables.tf variable \"behind_vnet\" { type = bool } main.tf resource \"azurerm_virtual_network\" \"vnet\" { count = var.behind_vnet ? 1 : 0 name = \"MyVnet\" address_space = [ x.x.x.x / 16 ] resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" ... subnet { name = \"subnet_1\" address_prefix = \"x.x.x.x/24\" } } resource \"azurerm_storage_account\" \"storage_account\" { name = \"storage\" resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" tags = var.tags ... public_network_access_enabled = var.behind_vnet ? false : true } resource \"azurerm_private_endpoint\" \"storage_account_table_private_endpoint\" { count = var.behind_vnet ? 1 : 0 name = \"pe-storage\" subnet_id = azurerm_virtual_network.vnet[0].subnet[0].id ... private_service_connection { name = \"psc-storage\" private_connection_resource_id = azurerm_storage_account.storage_account.id subresource_names = [ \"table\" ] ... } private_dns_zone_group { name = \"privateDnsZoneGroup\" ... } } If we run terraform apply -var behind_vnet = true then all the resources above will be deployed, and it is what we want on a pre-production or production environment. The instruction count = var.behind_vnet ? 1 : 0 will set count with the value 1 , therefore blocks will be executed. However, if we run terraform apply -var behind_vnet = false the azurerm_virtual_network and azurerm_private_endpoint resources will be skipped (because count will be 0 ). The resource azurerm_storage_account will be created, with minor differences in some properties: for instance, here, public_network_access_enabled will be set to true (and this is the goal for a developer to be able to access resources created). The same pattern can be applied over and over for the entire infrastructure code.","title":"Option"},{"location":"developer-experience/toggle-vnet-dev-environment/#conclusion","text":"With this approach, the same infrastructure code base can be used to target a production like environment with secured resources behind a VNet with no public accesses and also a more permissive development environment. However, there are a couple of trade-offs with this approach: if a resource has the count argument, it needs to be treated as a list, and not a single item. In the example above, if there is a need to reference the resource azurerm_virtual_network later in the code, azurerm_virtual_network.vnet.id will not work. The following must be used azurerm_virtual_network.vnet[0].id # First (and only) item of the collection The meta-argument count cannot be used with for_each for a whole block. That means that the use of loops to deploy multiple endpoints for instance will not work. Each private endpoints will need to be deployed individually.","title":"Conclusion"},{"location":"documentation/","text":"Documentation Every software development project requires documentation. Agile Software Development values working software over comprehensive documentation . 
Still, projects should include the key information needed to understand the development and the use of the generated software. Documentation shouldn't be an afterthought. Different written documents and materials should be created during the whole life cycle of the project, as per the project needs. Goals Facilitate onboarding of new team members. Improve communication and collaboration between teams (especially when distributed across time zones). Improve the transition of the project to another team. Challenges When working in an engineering project, we typically encounter one or more of these challenges related to documentation (including some examples): Non-existent . No onboarding documentation, so it takes a long time to set up the environment when you join the project. No document in the wiki explaining existing repositories, so you cannot tell which of the 10 available repositories you should clone. No main README, so you don't know where to start when you clone a repository. No \"how to contribute\" section, so you don't know which is the branch policy, where to add new documents, etc. No code guidelines, so everyone follows different naming conventions, etc. Hidden . Impossible to find useful documentation as it\u2019s scattered all over the place. E.g., no idea how to compile, run and test the code as the README is hidden in a folder within a folder within a folder. Useful processes (e.g., grooming process) explained outside the backlog management tool and not linked anywhere. Decisions taken in different channels other than the backlog management tool and not recorded anywhere else. Incomplete . No clear branch policy, so everyone names their branches differently. Missing settings in the \"how to run this\" document that are required to run the application. Inaccurate . Documents not updated along with the code, so they don't mention the right folders, settings, etc. Obsolete . Design documents that don't apply anymore, sitting next to valid documents. Which one shows the latest decisions? Out of order (subject / date) . Documents not organized per subject/workstream so not easy to find relevant information when you change to a new workstream. Design decision logs out of order and without a date that helps to determine which is the final decision on something. Duplicate . No settings file available in a centralized place as a single source of truth, so developers must keep sharing their own versions, and we end up with many files that might or might not work. Afterthought . Key documents created several weeks into the project: onboarding, how to run the app, etc. Documents created last minute just before the end of a project, forgetting that they also help the team while working on the project. 
What Documentation Should Exist Project and Repositories Commit Messages Pull Requests Code Work Items REST APIs Engineering Feedback Best Practices Establishing and managing documentation Creating good documentation Replacing documentation with automation Tools Wikis Languages markdown mermaid How to automate simple checks Integration with Teams/Slack Recipes How to sync a wiki between repositories Using DocFx and Companion Tools to generate a Documentation website Deploy the DocFx Documentation website to an Azure Website automatically How to create a static website for your documentation based on MkDocs and Material for MkDocs Resources Software Documentation Types and Best Practices","title":"Documentation"},{"location":"documentation/#documentation","text":"Every software development project requires documentation. Agile Software Development values working software over comprehensive documentation . Still, projects should include the key information needed to understand the development and the use of the generated software. Documentation shouldn't be an afterthought. Different written documents and materials should be created during the whole life cycle of the project, as per the project needs.","title":"Documentation"},{"location":"documentation/#goals","text":"Facilitate onboarding of new team members. Improve communication and collaboration between teams (especially when distributed across time zones). Improve the transition of the project to another team.","title":"Goals"},{"location":"documentation/#challenges","text":"When working in an engineering project, we typically encounter one or more of these challenges related to documentation (including some examples): Non-existent . No onboarding documentation, so it takes a long time to set up the environment when you join the project. No document in the wiki explaining existing repositories, so you cannot tell which of the 10 available repositories you should clone. No main README, so you don't know where to start when you clone a repository. No \"how to contribute\" section, so you don't know which is the branch policy, where to add new documents, etc. No code guidelines, so everyone follows different naming conventions, etc. Hidden . Impossible to find useful documentation as it\u2019s scattered all over the place. E.g., no idea how to compile, run and test the code as the README is hidden in a folder within a folder within a folder. Useful processes (e.g., grooming process) explained outside the backlog management tool and not linked anywhere. Decisions taken in different channels other than the backlog management tool and not recorded anywhere else. Incomplete . No clear branch policy, so everyone names their branches differently. Missing settings in the \"how to run this\" document that are required to run the application. Inaccurate . Documents not updated along with the code, so they don't mention the right folders, settings, etc. Obsolete . Design documents that don't apply anymore, sitting next to valid documents. Which one shows the latest decisions? Out of order (subject / date) . Documents not organized per subject/workstream so not easy to find relevant information when you change to a new workstream. Design decision logs out of order and without a date that helps to determine which is the final decision on something. Duplicate . No settings file available in a centralized place as a single source of truth, so developers must keep sharing their own versions, and we end up with many files that might or might not work. 
Afterthought . Key documents created several weeks into the project: onboarding, how to run the app, etc. Documents created last minute just before the end of a project, forgetting that they also help the team while working on the project.","title":"Challenges"},{"location":"documentation/#what-documentation-should-exist","text":"Project and Repositories Commit Messages Pull Requests Code Work Items REST APIs Engineering Feedback","title":"What Documentation Should Exist"},{"location":"documentation/#best-practices","text":"Establishing and managing documentation Creating good documentation Replacing documentation with automation","title":"Best Practices"},{"location":"documentation/#tools","text":"Wikis Languages markdown mermaid How to automate simple checks Integration with Teams/Slack","title":"Tools"},{"location":"documentation/#recipes","text":"How to sync a wiki between repositories Using DocFx and Companion Tools to generate a Documentation website Deploy the DocFx Documentation website to an Azure Website automatically How to create a static website for your documentation based on MkDocs and Material for MkDocs","title":"Recipes"},{"location":"documentation/#resources","text":"Software Documentation Types and Best Practices","title":"Resources"},{"location":"documentation/best-practices/automation/","text":"Replacing Documentation with Automation You can document how to set up your dev machine with the right version of the framework required to run the code, which extensions are useful to develop the application with your editor, or how to configure your editor to launch and debug the application. If it is possible, a better solution is to provide the means to automate tool installs, application startup, etc., instead. Some examples are provided below: Dev Containers in Visual Studio Code The Visual Studio Code Remote - Containers extension lets you use a Docker container as a full-featured development environment. It allows you to open any folder inside (or mounted into) a container and take advantage of Visual Studio Code's full feature set. Additional information: Developing inside a Container . Launch Configurations and Tasks in Visual Studio Code Launch configurations allows you to configure and save debugging setup details. Tasks can be configured to run scripts and start processes so that many of these existing tools can be used from within VS Code without having to enter a command line or write new code.","title":"Replacing Documentation with Automation"},{"location":"documentation/best-practices/automation/#replacing-documentation-with-automation","text":"You can document how to set up your dev machine with the right version of the framework required to run the code, which extensions are useful to develop the application with your editor, or how to configure your editor to launch and debug the application. If it is possible, a better solution is to provide the means to automate tool installs, application startup, etc., instead. Some examples are provided below:","title":"Replacing Documentation with Automation"},{"location":"documentation/best-practices/automation/#dev-containers-in-visual-studio-code","text":"The Visual Studio Code Remote - Containers extension lets you use a Docker container as a full-featured development environment. It allows you to open any folder inside (or mounted into) a container and take advantage of Visual Studio Code's full feature set. 
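As a rough illustration of the pattern (a minimal sketch; the image, extension and post-create command below are assumptions, not taken from this playbook), a .devcontainer/devcontainer.json checked into the repository could look like this:
{
  \"name\": \"project-dev\",
  \"image\": \"mcr.microsoft.com/devcontainers/python:3.11\",
  \"customizations\": {
    \"vscode\": {
      \"extensions\": [ \"ms-python.python\" ]
    }
  },
  \"postCreateCommand\": \"pip install -r requirements.txt\"
}
With a file like this, a new team member gets a working, consistently configured environment from the container definition instead of from a written setup guide.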
Additional information: Developing inside a Container .","title":"Dev Containers in Visual Studio Code"},{"location":"documentation/best-practices/automation/#launch-configurations-and-tasks-in-visual-studio-code","text":"Launch configurations allows you to configure and save debugging setup details. Tasks can be configured to run scripts and start processes so that many of these existing tools can be used from within VS Code without having to enter a command line or write new code.","title":"Launch Configurations and Tasks in Visual Studio Code"},{"location":"documentation/best-practices/establish-and-manage/","text":"Establishing and Managing Documentation Documentation should be source-controlled. Pull Requests can be used to tell others about the changes, so they can be reviewed and discussed. E.g., Async Design Reviews . Tools: Wikis .","title":"Establishing and Managing Documentation"},{"location":"documentation/best-practices/establish-and-manage/#establishing-and-managing-documentation","text":"Documentation should be source-controlled. Pull Requests can be used to tell others about the changes, so they can be reviewed and discussed. E.g., Async Design Reviews . Tools: Wikis .","title":"Establishing and Managing Documentation"},{"location":"documentation/best-practices/good-documentation/","text":"Creating Good Documentation Review the Documentation Review Checklist for advice on how to write good documentation. Good documentation should follow good writing guidelines: Writing Style Guidelines .","title":"Creating Good Documentation"},{"location":"documentation/best-practices/good-documentation/#creating-good-documentation","text":"Review the Documentation Review Checklist for advice on how to write good documentation. Good documentation should follow good writing guidelines: Writing Style Guidelines .","title":"Creating Good Documentation"},{"location":"documentation/guidance/code/","text":"Code You might have heard more than once that you should write self-documenting code . This doesn't mean that you should never comment your code. There are two types of code comments, implementation comments and documentation comments. Implementation Comments They are used for internal documentation, and are intended for anyone who may need to maintain the code in the future, including your future self. There can be single line and multi-line comments (e.g., C# Comments ). Comments are human-readable and not executed, thus ignored by the compiler. So you could potentially add as many as you want. Now, the use of these comments is often considered a code smell. If you need to clarify your code, that may mean the code is too complex. So you should work towards the removal of the clarification by making the code simpler, easier to read, and understand. Still, these comments can be useful to give overviews of the code, or provide additional context information that is not available in the code itself. Examples of useful comments: Single line comment in C# that explains why that piece of code is there (from a private method in System.Text.Json.JsonSerializer ): // For performance, avoid obtaining actual byte count unless memory usage is higher than the threshold. Span < byte > utf8 = json . Length <= ( ArrayPoolMaxSizeBeforeUsingNormalAlloc / JsonConstants . MaxExpansionFactorWhileTranscoding ) ? ... 
Multi-line comment in C# that provides additional context (from a private method in System.Text.Json.Utf8JsonReader ): // Transcoding from UTF-16 to UTF-8 will change the length by somewhere between 1x and 3x. // Un-escaping the token value will at most shrink its length by 6x. // There is no point incurring the transcoding/un-escaping/comparing cost if: // - The token value is smaller than charTextLength // - The token value needs to be transcoded AND unescaped and it is more than 6x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, escaping = 6x => 6x factor // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, but they are represented as a single escaped hex value, \\uXXXX => 6x factor // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 4x, but the surrogate pair (2 characters) are represented by 16 bytes \\uXXXX\\uXXXX => 6x factor // - The token value needs to be transcoded, but NOT escaped and it is more than 3x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 2x, (surrogate pairs - 2 characters transcode to 4 UTF-8 bytes) if ( sourceLength < charTextLength || sourceLength / ( _stringHasEscaping ? JsonConstants . MaxExpansionFactorWhileEscaping : JsonConstants . MaxExpansionFactorWhileTranscoding ) > charTextLength ) { Documentation Comments Doc comments are a special kind of comment, added above the definition of any user-defined type or member, and are intended for anyone who may need to use those types or members in their own code. If, for example, you are building a library or framework, doc comments can be used to generate their documentation. This documentation should serve as API specification, and/or programming guide. Doc comments won't be included by the compiler in the final executable, as with single and multi-line comments. Example of a doc comment in C# (from Deserialize method in System.Text.Json.JsonSerializer ): /// /// Parse the text representing a single JSON value into a . /// /// A representation of the JSON value. /// JSON text to parse. /// Options to control the behavior during parsing. /// /// is . /// /// /// The JSON is invalid. /// /// -or- /// /// is not compatible with the JSON. /// /// -or- /// /// There is remaining data in the string beyond a single JSON value. /// /// There is no compatible /// for or its serializable members. /// /// Using a is not as efficient as using the /// UTF-8 methods since the implementation natively uses UTF-8. /// [RequiresUnreferencedCode(SerializationUnreferencedCodeMessage)] public static TValue ? Deserialize < TValue > ( string json , JsonSerializerOptions ? options = null ) { In C# , doc comments can be processed by the compiler to generate XML documentation files. These files can be distributed alongside your libraries so that Visual Studio and other IDEs can use IntelliSense to show quick information about types or members. Additionally, these files can be run through tools like DocFx to generate API reference websites. More information: Recommended XML tags for C# documentation comments . In other languages, you may require external tools. For example, Java doc comments can be processed by Javadoc tool to generate HTML documentation files. 
More information: How to Write Doc Comments for the Javadoc Tool Javadoc Tool","title":"Code"},{"location":"documentation/guidance/code/#code","text":"You might have heard more than once that you should write self-documenting code . This doesn't mean that you should never comment your code. There are two types of code comments, implementation comments and documentation comments.","title":"Code"},{"location":"documentation/guidance/code/#implementation-comments","text":"They are used for internal documentation, and are intended for anyone who may need to maintain the code in the future, including your future self. There can be single line and multi-line comments (e.g., C# Comments ). Comments are human-readable and not executed, thus ignored by the compiler. So you could potentially add as many as you want. Now, the use of these comments is often considered a code smell. If you need to clarify your code, that may mean the code is too complex. So you should work towards the removal of the clarification by making the code simpler, easier to read, and understand. Still, these comments can be useful to give overviews of the code, or provide additional context information that is not available in the code itself. Examples of useful comments: Single line comment in C# that explains why that piece of code is there (from a private method in System.Text.Json.JsonSerializer ): // For performance, avoid obtaining actual byte count unless memory usage is higher than the threshold. Span < byte > utf8 = json . Length <= ( ArrayPoolMaxSizeBeforeUsingNormalAlloc / JsonConstants . MaxExpansionFactorWhileTranscoding ) ? ... Multi-line comment in C# that provides additional context (from a private method in System.Text.Json.Utf8JsonReader ): // Transcoding from UTF-16 to UTF-8 will change the length by somewhere between 1x and 3x. // Un-escaping the token value will at most shrink its length by 6x. // There is no point incurring the transcoding/un-escaping/comparing cost if: // - The token value is smaller than charTextLength // - The token value needs to be transcoded AND unescaped and it is more than 6x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, escaping = 6x => 6x factor // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, but they are represented as a single escaped hex value, \\uXXXX => 6x factor // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 4x, but the surrogate pair (2 characters) are represented by 16 bytes \\uXXXX\\uXXXX => 6x factor // - The token value needs to be transcoded, but NOT escaped and it is more than 3x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 2x, (surrogate pairs - 2 characters transcode to 4 UTF-8 bytes) if ( sourceLength < charTextLength || sourceLength / ( _stringHasEscaping ? JsonConstants . MaxExpansionFactorWhileEscaping : JsonConstants . MaxExpansionFactorWhileTranscoding ) > charTextLength ) {","title":"Implementation Comments"},{"location":"documentation/guidance/code/#documentation-comments","text":"Doc comments are a special kind of comment, added above the definition of any user-defined type or member, and are intended for anyone who may need to use those types or members in their own code. If, for example, you are building a library or framework, doc comments can be used to generate their documentation. 
This documentation should serve as API specification, and/or programming guide. Doc comments won't be included by the compiler in the final executable, as with single and multi-line comments. Example of a doc comment in C# (from Deserialize method in System.Text.Json.JsonSerializer ): /// /// Parse the text representing a single JSON value into a . /// /// A representation of the JSON value. /// JSON text to parse. /// Options to control the behavior during parsing. /// /// is . /// /// /// The JSON is invalid. /// /// -or- /// /// is not compatible with the JSON. /// /// -or- /// /// There is remaining data in the string beyond a single JSON value. /// /// There is no compatible /// for or its serializable members. /// /// Using a is not as efficient as using the /// UTF-8 methods since the implementation natively uses UTF-8. /// [RequiresUnreferencedCode(SerializationUnreferencedCodeMessage)] public static TValue ? Deserialize < TValue > ( string json , JsonSerializerOptions ? options = null ) { In C# , doc comments can be processed by the compiler to generate XML documentation files. These files can be distributed alongside your libraries so that Visual Studio and other IDEs can use IntelliSense to show quick information about types or members. Additionally, these files can be run through tools like DocFx to generate API reference websites. More information: Recommended XML tags for C# documentation comments . In other languages, you may require external tools. For example, Java doc comments can be processed by Javadoc tool to generate HTML documentation files. More information: How to Write Doc Comments for the Javadoc Tool Javadoc Tool","title":"Documentation Comments"},{"location":"documentation/guidance/engineering-feedback/","text":"Engineering Feedback Good engineering feedback is: Actionable Specific Detailed Includes assets (script, data, code, etc.) to reproduce scenario and validate solution Includes details about the customer scenario / what the customer was trying to achieve Refer to Microsoft Engineering Feedback for more details, including guidance , FAQ and examples .","title":"Engineering Feedback"},{"location":"documentation/guidance/engineering-feedback/#engineering-feedback","text":"Good engineering feedback is: Actionable Specific Detailed Includes assets (script, data, code, etc.) to reproduce scenario and validate solution Includes details about the customer scenario / what the customer was trying to achieve Refer to Microsoft Engineering Feedback for more details, including guidance , FAQ and examples .","title":"Engineering Feedback"},{"location":"documentation/guidance/project-and-repositories/","text":"Projects and Repositories Every source code repository should include documentation that is specific to it (e.g., in a Wiki within the repository), while the project itself should include general documentation that is common to all its associated repositories (e.g., in a Wiki within the backlog management tool). Documentation Specific to a Repository Introduction Getting started Onboarding Setup: programming language, frameworks, platforms, tools, etc. Sandbox environment Working agreement Contributing guide Structure: folders, projects, etc. 
How to compile, test, build, deploy the solution/each project Different OS versions Command line + editors/IDEs Design Decision Logs Architecture Decision Record (ADRs) Trade Studies Some sections in the documentation of the repository might point to the project\u2019s documentation (e.g., Onboarding, Working Agreement, Contributing Guide). Common Documentation to all Repositories Introduction Project Stakeholders Definitions Requirements Onboarding Repository guide Production, Spikes Team agreements Team Manifesto Short summary of expectations around the technical way of working and supported mindset in the team. E.g., ownership, respect, collaboration, transparency. Working Agreement How we work together as a team and what our expectations and principles are. E.g., communication, work-life balance, scrum rhythm, backlog management, code management. Definition of Done List of tasks that must be completed to close a user story, a sprint, or a milestone. Definition of Ready How complete a user story should be in order to be selected as candidate for estimation in the sprint planning. Contributing Guide Repo structure Design documents Branching and branch name strategy Merge and commit history strategy Pull Requests Code Review Process Code Review Checklist Language Specific Checklists Project Design High Level / Game Plan Milestone / Epic Design Review Design Review Recipes Milestone / Epic Design Review Template Feature / Story Design Review Template Task Design Review Template Decision Log Template Architecture Decision Record (ADR) Template ( Example 1 , Example 2 ) Trade Study Template","title":"Projects and Repositories"},{"location":"documentation/guidance/project-and-repositories/#projects-and-repositories","text":"Every source code repository should include documentation that is specific to it (e.g., in a Wiki within the repository), while the project itself should include general documentation that is common to all its associated repositories (e.g., in a Wiki within the backlog management tool).","title":"Projects and Repositories"},{"location":"documentation/guidance/project-and-repositories/#documentation-specific-to-a-repository","text":"Introduction Getting started Onboarding Setup: programming language, frameworks, platforms, tools, etc. Sandbox environment Working agreement Contributing guide Structure: folders, projects, etc. How to compile, test, build, deploy the solution/each project Different OS versions Command line + editors/IDEs Design Decision Logs Architecture Decision Record (ADRs) Trade Studies Some sections in the documentation of the repository might point to the project\u2019s documentation (e.g., Onboarding, Working Agreement, Contributing Guide).","title":"Documentation Specific to a Repository"},{"location":"documentation/guidance/project-and-repositories/#common-documentation-to-all-repositories","text":"Introduction Project Stakeholders Definitions Requirements Onboarding Repository guide Production, Spikes Team agreements Team Manifesto Short summary of expectations around the technical way of working and supported mindset in the team. E.g., ownership, respect, collaboration, transparency. Working Agreement How we work together as a team and what our expectations and principles are. E.g., communication, work-life balance, scrum rhythm, backlog management, code management. Definition of Done List of tasks that must be completed to close a user story, a sprint, or a milestone. 
Definition of Ready How complete a user story should be in order to be selected as candidate for estimation in the sprint planning. Contributing Guide Repo structure Design documents Branching and branch name strategy Merge and commit history strategy Pull Requests Code Review Process Code Review Checklist Language Specific Checklists Project Design High Level / Game Plan Milestone / Epic Design Review Design Review Recipes Milestone / Epic Design Review Template Feature / Story Design Review Template Task Design Review Template Decision Log Template Architecture Decision Record (ADR) Template ( Example 1 , Example 2 ) Trade Study Template","title":"Common Documentation to all Repositories"},{"location":"documentation/guidance/pull-requests/","text":"Pull Requests When we create Pull Requests , we must ensure they are properly documented: Title and Description Pull Request Description Pull Request Template Linked worked items Comments As an author, address all comments As a reviewer, make comments clear","title":"Pull Requests"},{"location":"documentation/guidance/pull-requests/#pull-requests","text":"When we create Pull Requests , we must ensure they are properly documented: Title and Description Pull Request Description Pull Request Template Linked worked items Comments As an author, address all comments As a reviewer, make comments clear","title":"Pull Requests"},{"location":"documentation/guidance/rest-apis/","text":"REST APIs When creating REST APIs , you can leverage the OpenAPI-Specification (OAI) (originally known as the Swagger Specification) to describe them: The OpenAPI Specification (OAS) defines a standard, programming language-agnostic interface description for HTTP APIs, which allows both humans and computers to discover and understand the capabilities of a service without requiring access to source code, additional documentation, or inspection of network traffic. When properly defined via OpenAPI, a consumer can understand and interact with the remote service with a minimal amount of implementation logic. Use cases for machine-readable API definition documents include, but are not limited to: interactive documentation; code generation for documentation, clients, and servers; and automation of test cases. OpenAPI documents describe an APIs services and are represented in either YAML or JSON formats. These documents may either be produced and served statically or be generated dynamically from an application. There are implementations available for many languages like C#, including low-level tooling, editors, user interfaces, code generators, etc. Here you can find a list of known tooling for the different languages: OpenAPI-Specification/IMPLEMENTATIONS.md . Using Microsoft TypeSpec While the OpenAPI-Specification (OAI) is a popular method for defining and documenting RESTful APIs, there are other languages available that can simplify and expedite the documentation process. Microsoft TypeSpec is one such language that allows for the description of cloud service APIs and the generation of API description languages, client and service code, documentation, and other assets. Microsoft TypeSpec is a highly extensible language that offers a set of core primitives that can describe API shapes common among REST, OpenAPI, GraphQL, gRPC, and other protocols. This makes it a versatile option for developers who need to work with a range of different API styles and technologies. 
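As a concrete point of reference for the OpenAPI documents mentioned above, and for the kind of output TypeSpec can emit, a minimal hand-written OpenAPI 3.0 description in YAML could look like the following sketch (the service, path and schema names are purely illustrative):
openapi: 3.0.3
info:
  title: Widget Service
  version: 1.0.0
paths:
  /widgets/{id}:
    get:
      summary: Get a widget by id
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: The requested widget
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Widget'
components:
  schemas:
    Widget:
      type: object
      properties:
        id:
          type: string
        name:
          type: string
A document like this can be served statically or generated from the application, and it drives the use cases listed above: interactive documentation, client and server code generation, and test automation.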
Microsoft TypeSpec is a widely adopted tool within Azure teams, particularly for generating OpenAPI Specifications in complex and interconnected APIs that span multiple teams. To ensure consistency across different parts of the API, teams commonly leverage shared libraries which contain reusable patterns. This makes it easier to follow best practices rather than deviating from them. By promoting highly regular API designs that adhere to best practices by construction, TypeSpec can help improve the quality and consistency of APIs developed within an organization. Resources ASP.NET Core web API documentation with Swagger / OpenAPI . Microsoft TypeSpec . Design Patterns - REST API Guidance","title":"REST APIs"},{"location":"documentation/guidance/rest-apis/#rest-apis","text":"When creating REST APIs , you can leverage the OpenAPI-Specification (OAI) (originally known as the Swagger Specification) to describe them: The OpenAPI Specification (OAS) defines a standard, programming language-agnostic interface description for HTTP APIs, which allows both humans and computers to discover and understand the capabilities of a service without requiring access to source code, additional documentation, or inspection of network traffic. When properly defined via OpenAPI, a consumer can understand and interact with the remote service with a minimal amount of implementation logic. Use cases for machine-readable API definition documents include, but are not limited to: interactive documentation; code generation for documentation, clients, and servers; and automation of test cases. OpenAPI documents describe an API's services and are represented in either YAML or JSON formats. These documents may either be produced and served statically or be generated dynamically from an application. There are implementations available for many languages like C#, including low-level tooling, editors, user interfaces, code generators, etc. Here you can find a list of known tooling for the different languages: OpenAPI-Specification/IMPLEMENTATIONS.md .","title":"REST APIs"},{"location":"documentation/guidance/rest-apis/#using-microsoft-typespec","text":"While the OpenAPI-Specification (OAI) is a popular method for defining and documenting RESTful APIs, there are other languages available that can simplify and expedite the documentation process. Microsoft TypeSpec is one such language that allows for the description of cloud service APIs and the generation of API description languages, client and service code, documentation, and other assets. Microsoft TypeSpec is a highly extensible language that offers a set of core primitives that can describe API shapes common among REST, OpenAPI, GraphQL, gRPC, and other protocols. This makes it a versatile option for developers who need to work with a range of different API styles and technologies. Microsoft TypeSpec is a widely adopted tool within Azure teams, particularly for generating OpenAPI Specifications in complex and interconnected APIs that span multiple teams. To ensure consistency across different parts of the API, teams commonly leverage shared libraries which contain reusable patterns. This makes it easier to follow best practices rather than deviating from them.
By promoting highly regular API designs that adhere to best practices by construction, TypeSpec can help improve the quality and consistency of APIs developed within an organization.","title":"Using Microsoft TypeSpec"},{"location":"documentation/guidance/rest-apis/#resources","text":"ASP.NET Core web API documentation with Swagger / OpenAPI . Microsoft TypeSpec . Design Patterns - REST API Guidance","title":"Resources"},{"location":"documentation/guidance/work-items/","text":"Work Items While many teams can work with a flat list of items, sometimes it helps to group related items into a hierarchical structure. You can use portfolio backlogs to bring more order to your backlog. Agile process backlog work item hierarchy: Scrum process backlog work item hierarchy: Bugs can be set at the same level as User Stories / Product Backlog Items or Tasks. Epics and Features User stories / Product Backlog Items roll up into Features , which typically represent a shippable deliverable that addresses a customer need e.g., \"Add shopping cart\". And Features roll up into Epics , which represent a business initiative to be accomplished e.g., \"Increase customer engagement\". Take that into account when naming them. Each Feature or Epic should include as much detail as the team needs to: Understand the scope. Estimate the work required. Develop tests. Ensure the end product meets acceptance criteria. Details that should be added: Value Area : Business (directly deliver customer value) vs. Architectural (technical services to implement business features). Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. Time Criticality : Higher values indicate an item is more time critical than items with lower values. Target Date by which the feature should be implemented. You may use work item tags to support queries and filtering. User Stories / Product Backlog Items Each User Story / Product Backlog Item should be sized so that they can be completed within a sprint. You should add the following details to the items: Title : Usually expressed as \"As a [persona], I want [to perform an action], so that [I can achieve an end result].\". Description : Provide enough detail to create shared understanding of scope and support estimation efforts. Focus on the user, what they want to accomplish, and why. Don't describe how to develop the product. Provide enough details so the team can write tasks and test cases to implement the item. Include Design Reviews. Acceptance Criteria : Define what \"Done\" means. Activity : Deployment, Design, Development, Documentation, Requirements, Testing. Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. Original Estimate : The amount of estimated work required to complete a task. Remember to use the Discussion section of the items to keep track of related comments, and mention individuals, groups, work items or pull requests when required. Tasks Each Task should be sized so that they can be completed within a day. You should at least add the following details to the items: Title . Description : Provide enough detail to create shared understanding of scope. Any developer should be able to take the item and know what needs to be implemented. Include Design Reviews. Reference to the working branch in related code repository. 
Remember to use the Discussion section of the tasks to keep track of related comments. Bugs You should use bugs to capture both the initial issue and ongoing discoveries. You should at least add the following details to the bug items: Title . Description . Steps to Reproduce . System Info / Found in Build : Software and system configuration that is relevant to the bug and tests to apply. Acceptance Criteria : Criteria to meet so the bug can be closed. Integrated in Build : Name of the build that incorporates the code that fixes the bug. Priority : 1: Product should not ship without the successful resolution of the work item. The bug should be addressed as soon as possible. 2: Product should not ship without the successful resolution of the work item, but it does not need to be addressed immediately. 3: Resolution of the work item is optional based on resources, time, and risk. Severity : 1 - Critical: Must fix. No acceptable alternative methods. 2 - High: Consider fix. An acceptable alternative method exists. 3 - Medium: (Default). 4 - Low. Issues / Impediments Don't confuse with bugs. They represent unplanned activities that may block work from getting done. For example: feature ambiguity, personnel or resource issues, problems with environments, or other risks that impact scope, quality, or schedule. In general, you link these items to user stories or other work items. Actions from Retrospectives After a retrospective, every action that requires work should be tracked with its own Task or Issue / Impediment. These items might be unparented (without link to parent backlog item or user story). Related information Best practices for Agile project management - Azure Boards | Microsoft Docs . Define features and epics, organize backlog items - Azure Boards | Microsoft Docs . Create your product backlog - Azure Boards | Microsoft Docs . Add tasks to support sprint planning - Azure Boards | Microsoft Docs . Define, capture, triage, and manage bugs or code defects - Azure Boards | Microsoft Docs . Add and manage issues or impediments - Azure Boards | Microsoft Docs .","title":"Work Items"},{"location":"documentation/guidance/work-items/#work-items","text":"While many teams can work with a flat list of items, sometimes it helps to group related items into a hierarchical structure. You can use portfolio backlogs to bring more order to your backlog. Agile process backlog work item hierarchy: Scrum process backlog work item hierarchy: Bugs can be set at the same level as User Stories / Product Backlog Items or Tasks.","title":"Work Items"},{"location":"documentation/guidance/work-items/#epics-and-features","text":"User stories / Product Backlog Items roll up into Features , which typically represent a shippable deliverable that addresses a customer need e.g., \"Add shopping cart\". And Features roll up into Epics , which represent a business initiative to be accomplished e.g., \"Increase customer engagement\". Take that into account when naming them. Each Feature or Epic should include as much detail as the team needs to: Understand the scope. Estimate the work required. Develop tests. Ensure the end product meets acceptance criteria. Details that should be added: Value Area : Business (directly deliver customer value) vs. Architectural (technical services to implement business features). Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. 
Time Criticality : Higher values indicate an item is more time critical than items with lower values. Target Date by which the feature should be implemented. You may use work item tags to support queries and filtering.","title":"Epics and Features"},{"location":"documentation/guidance/work-items/#user-stories-product-backlog-items","text":"Each User Story / Product Backlog Item should be sized so that they can be completed within a sprint. You should add the following details to the items: Title : Usually expressed as \"As a [persona], I want [to perform an action], so that [I can achieve an end result].\". Description : Provide enough detail to create shared understanding of scope and support estimation efforts. Focus on the user, what they want to accomplish, and why. Don't describe how to develop the product. Provide enough details so the team can write tasks and test cases to implement the item. Include Design Reviews. Acceptance Criteria : Define what \"Done\" means. Activity : Deployment, Design, Development, Documentation, Requirements, Testing. Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. Original Estimate : The amount of estimated work required to complete a task. Remember to use the Discussion section of the items to keep track of related comments, and mention individuals, groups, work items or pull requests when required.","title":"User Stories / Product Backlog Items"},{"location":"documentation/guidance/work-items/#tasks","text":"Each Task should be sized so that they can be completed within a day. You should at least add the following details to the items: Title . Description : Provide enough detail to create shared understanding of scope. Any developer should be able to take the item and know what needs to be implemented. Include Design Reviews. Reference to the working branch in related code repository. Remember to use the Discussion section of the tasks to keep track of related comments.","title":"Tasks"},{"location":"documentation/guidance/work-items/#bugs","text":"You should use bugs to capture both the initial issue and ongoing discoveries. You should at least add the following details to the bug items: Title . Description . Steps to Reproduce . System Info / Found in Build : Software and system configuration that is relevant to the bug and tests to apply. Acceptance Criteria : Criteria to meet so the bug can be closed. Integrated in Build : Name of the build that incorporates the code that fixes the bug. Priority : 1: Product should not ship without the successful resolution of the work item. The bug should be addressed as soon as possible. 2: Product should not ship without the successful resolution of the work item, but it does not need to be addressed immediately. 3: Resolution of the work item is optional based on resources, time, and risk. Severity : 1 - Critical: Must fix. No acceptable alternative methods. 2 - High: Consider fix. An acceptable alternative method exists. 3 - Medium: (Default). 4 - Low.","title":"Bugs"},{"location":"documentation/guidance/work-items/#issues-impediments","text":"Don't confuse with bugs. They represent unplanned activities that may block work from getting done. For example: feature ambiguity, personnel or resource issues, problems with environments, or other risks that impact scope, quality, or schedule. 
In general, you link these items to user stories or other work items.","title":"Issues / Impediments"},{"location":"documentation/guidance/work-items/#actions-from-retrospectives","text":"After a retrospective, every action that requires work should be tracked with its own Task or Issue / Impediment. These items might be unparented (without link to parent backlog item or user story).","title":"Actions from Retrospectives"},{"location":"documentation/guidance/work-items/#related-information","text":"Best practices for Agile project management - Azure Boards | Microsoft Docs . Define features and epics, organize backlog items - Azure Boards | Microsoft Docs . Create your product backlog - Azure Boards | Microsoft Docs . Add tasks to support sprint planning - Azure Boards | Microsoft Docs . Define, capture, triage, and manage bugs or code defects - Azure Boards | Microsoft Docs . Add and manage issues or impediments - Azure Boards | Microsoft Docs .","title":"Related information"},{"location":"documentation/recipes/deploy-docfx-azure-website/","text":"Deploy the DocFx Documentation Website to an Azure Website Automatically In the article Using DocFx and Companion Tools to generate a Documentation website the process is described to generate content of a documentation website using DocFx. This document describes how to setup an Azure Website to host the content and automate the deployment to it using a pipeline in Azure DevOps. The QuickStart sample that is provided for a quick setup of DocFx generation also contains the files explained in this document. Especially the .pipelines and infrastructure folders. The following steps can be followed when using the Quick Start folder. In the infrastructure folder you can find the Terraform files to create the website in an Azure environment. Out of the box, the script will create a website where the documentation content can be deployed to. 1. Install Terraform You can use tools like Chocolatey to install Terraform: choco install terraform 2. Set the Proper Variables Note: Make sure you modify the value of the app_name , rg_name and rg_location variables. The app_name value is appended by azurewebsites.net and must be unique. Otherwise the script will fail that it cannot create the website. In the Quick Start, authentication is disabled. If you want that enabled, make sure you have create an Application in the Azure AD and have the client ID . This client id must be set as the value of the client_id variable in variables.tf . In the main.tf make sure you uncomment the authentication settings in the app-service . For more information see Configure Azure AD authentication - Azure App Service . If you want to set a custom domain for your documentation website with an SSL certificate you have to do some extra steps. You have to create a Key Vault and store the certificate there. Next step is to uncomment and set the values in variables.tf . You also have to uncomment the necessary steps in main.tf . All is indicated by comment-boxes. For more information see Add a TLS/SSL certificate in Azure App Service . Some extra information on SSL certificate, custom domain and Azure App Service can be found in the following paragraphs. If you are familiar with that or don't need it, go ahead and continue with Step 3 . SSL Certificate To secure a website with a custom domain name and a certificate, you can find the steps to take in the article Add a TLS/SSL certificate in Azure App Service . 
That article also contains a description of ways to obtain a certificate and the requirements for a certificate. Usually you'll get a certificate from the customer's IT department. If you want to start with a development certificate to test the process, you can create one yourself. You can do that in PowerShell with the script below. Replace: [YOUR DOMAIN] with the domain you would like to register, e.g. docs.somewhere.com [PASSWORD] with a password for the certificate. A password is required to upload a certificate into the Key Vault. You'll need this password in that step. [FILENAME] for the output file name of the certificate. You can even insert the path here where it should be stored on your machine. You can store this script in a PowerShell script file (ps1 extension). $cert = New-SelfSignedCertificate -CertStoreLocation cert:\\currentuser\\my -Subject \"cn=[YOUR DOMAIN]\" -DnsName \"[YOUR DOMAIN]\" $pwd = ConvertTo-SecureString -String '[PASSWORD]' -Force -AsPlainText $path = 'cert:\\currentuser\\my\\' + $cert.thumbprint Export-PfxCertificate -cert $path -FilePath [FILENAME].pfx -Password $pwd The certificate needs to be stored in the common Key Vault. Go to Settings > Certificates in the left menu of the Key Vault and click Generate/Import . Provide these details: Method of Certificate Creation: Import Certificate name: e.g. ssl-certificate Upload Certificate File: select the file on disk for this. Password: this is the [PASSWORD] we referenced earlier. Custom Domain Registration To use a custom domain a few things need to be done. The process in the Azure portal is described in the article Tutorial: Map an existing custom DNS name to Azure App Service . An important part is described under the header Get a domain verification ID . This ID needs to be registered in the DNS configuration as a TXT record. It is important to know that this Custom Domain Verification ID is the same for all web resources in the same Azure subscription. See this StackOverflow issue . This means that this ID needs to be registered only once for one Azure Subscription. This enables (re)creation of an App Service with the custom domain through script. Add Get-permissions for Microsoft Azure App Service The Azure App Service needs to access the Key Vault to get the certificate. This is needed for the first run, but also when the certificate is renewed in the Key Vault. For this purpose the Azure App Service accesses the Key Vault with the App Service resource provider identity. This identity can be found with the service principal name abfa0a7c-a6b6-4736-8310-5855508787cd or Microsoft Azure App Service and is of type Application . This ID is the same for all Azure subscriptions. It needs to have Get-permissions on secrets and certificates. For more information see this article Import a certificate from Key Vault . Add the Custom Domain and SSL Certificate to the App Service Once we have the SSL certificate and there is a complete DNS registration as described, we can uncomment the code in the Terraform script from the Quick Start folder to attach this to the App Service. In this script you need to reference the certificate in the common Key Vault and use it in the custom hostname binding. The custom hostname is assigned in the script as well. The setting ssl_state needs to be SniEnabled if you're using an SSL certificate. Now the creation of the authenticated website with a custom domain is automated. 3. Deploy Azure Resources from Your Local Machine Open up a command prompt.
For the commands to be executed, you need to have a connection to your Azure subscription. This can be done using Azure Cli . Type this command: az login This will use the web browser to login to your account. You can check the connected subscription with this command: az account show If you have to change to another subscription, use this command where you replace [id] with the id of the subscription to select: az account set --subscription [ id ] Once this is done run this command to initialize: terraform init Now you can run the command to plan what the script will do. You run this command every time changes are made to the terraform scripts: terraform plan Inspect the result shown. If that is what you expect, apply these changes with this command: terraform apply When asked for approval, type \"yes\" and ENTER. You can also add the -auto-approve flag to the apply command. The deployment using Terraform is not included in the pipeline from the Quick Start folder as described in the next step, as that asks for more configuration. But of course that can always be added. 4. Deploy the Website from a Pipeline The best way to create the resources and deploy to it, is to do this automatically in a pipeline. For this purpose the .pipelines/documentation.yml pipeline is provided. This pipeline is built for an Azure DevOps environment. Create a pipeline and reference this YAML file. Note: the Quick Start folder contains a web.config that is needed for deployment to IIS or Azure App Service. This enables the use of the json file for search requests. If you don't have this in place, the search of text will never return anything and result in 404's under the hood. You have to create a Service Connection in your DevOps environment to connect to the Azure Subscription you want to deploy to. Note: set the variables AzureConnectionName to the name of the Service Connection and the AzureAppServiceName to the name you determined in the infrastructure/variables.tf . In the Quick Start folder the pipeline uses master as trigger, which means that any push being done to master triggers the pipeline. You will probably change this to another branch.","title":"Deploy the DocFx Documentation Website to an Azure Website Automatically"},{"location":"documentation/recipes/deploy-docfx-azure-website/#deploy-the-docfx-documentation-website-to-an-azure-website-automatically","text":"In the article Using DocFx and Companion Tools to generate a Documentation website the process is described to generate content of a documentation website using DocFx. This document describes how to setup an Azure Website to host the content and automate the deployment to it using a pipeline in Azure DevOps. The QuickStart sample that is provided for a quick setup of DocFx generation also contains the files explained in this document. Especially the .pipelines and infrastructure folders. The following steps can be followed when using the Quick Start folder. In the infrastructure folder you can find the Terraform files to create the website in an Azure environment. Out of the box, the script will create a website where the documentation content can be deployed to.","title":"Deploy the DocFx Documentation Website to an Azure Website Automatically"},{"location":"documentation/recipes/deploy-docfx-azure-website/#1-install-terraform","text":"You can use tools like Chocolatey to install Terraform: choco install terraform","title":"1. 
Install Terraform"},{"location":"documentation/recipes/deploy-docfx-azure-website/#2-set-the-proper-variables","text":"Note: Make sure you modify the value of the app_name , rg_name and rg_location variables. The app_name value is appended by azurewebsites.net and must be unique. Otherwise the script will fail that it cannot create the website. In the Quick Start, authentication is disabled. If you want that enabled, make sure you have create an Application in the Azure AD and have the client ID . This client id must be set as the value of the client_id variable in variables.tf . In the main.tf make sure you uncomment the authentication settings in the app-service . For more information see Configure Azure AD authentication - Azure App Service . If you want to set a custom domain for your documentation website with an SSL certificate you have to do some extra steps. You have to create a Key Vault and store the certificate there. Next step is to uncomment and set the values in variables.tf . You also have to uncomment the necessary steps in main.tf . All is indicated by comment-boxes. For more information see Add a TLS/SSL certificate in Azure App Service . Some extra information on SSL certificate, custom domain and Azure App Service can be found in the following paragraphs. If you are familiar with that or don't need it, go ahead and continue with Step 3 .","title":"2. Set the Proper Variables"},{"location":"documentation/recipes/deploy-docfx-azure-website/#ssl-certificate","text":"To secure a website with a custom domain name and a certificate, you can find the steps to take in the article Add a TLS/SSL certificate in Azure App Service . That article also contains a description of ways to obtain a certificate and the requirements for a certificate. Usually you'll get a certificate from the customers IT department. If you want to start with a development certificate to test the process, you can create one yourself. You can do that in PowerShell with the script below. Replace: [YOUR DOMAIN] with the domain you would like to register, e.g. docs.somewhere.com [PASSWORD] with a password of the certificate. It's required for uploading a certificate in the Key Vault to have a password. You'll need this password in that step. [FILENAME] for the output file name of the certificate. You can even insert the path here where it should be store on your machine. You can store this script in a PowerShell script file (ps1 extension). $cert = New-SelfSignedCertificate -CertStoreLocation cert :\\ currentuser \\ my -Subject \"cn=[YOUR DOMAIN]\" -DnsName \"[YOUR DOMAIN]\" $pwd = ConvertTo-SecureString -String '[PASSWORD]' -Force -AsPlainText $path = 'cert:\\currentuser\\my\\' + $cert . thumbprint Export-PfxCertificate -cert $path -FilePath [FILENAME] . pfx -Password $pwd The certificate needs to be stored in the common Key Vault. Go to Settings > Certificates in the left menu of the Key Vault and click Generate/Import . Provide these details: Method of Certificate Creation: Import Certificate name: e.g. ssl-certificate Upload Certificate File: select the file on disc for this. Password: this is the [PASSWORD] we reference earlier.","title":"SSL Certificate"},{"location":"documentation/recipes/deploy-docfx-azure-website/#custom-domain-registration","text":"To use a custom domain a few things need to be done. The process in the Azure portal is described in the article Tutorial: Map an existing custom DNS name to Azure App Service . An important part is described under the header Get a domain verification ID . 
This ID needs to be registered with the DNS description as a TXT record. Important to know is that this Custom Domain Verification ID is the same for all web resources in the same Azure subscription. See this StackOverflow issue . This means that this ID needs to be registered only once for one Azure Subscription. And this enables (re)creation of an App Service with the custom domain though script.","title":"Custom Domain Registration"},{"location":"documentation/recipes/deploy-docfx-azure-website/#add-get-permissions-for-microsoft-azure-app-service","text":"The Azure App Service needs to access the Key Vault to get the certificate. This is needed for the first run, but also when the certificate is renewed in the Key Vault. For this purpose the Azure App Service accesses the Key Vault with the App Service resource provided identity. This identity can be found with the service principal name abfa0a7c-a6b6-4736-8310-5855508787cd or Microsoft Azure App Service and is of type Application . This ID is the same for all Azure subscriptions. It needs to have Get-permissions on secrets and certificates. For more information see this article Import a certificate from Key Vault .","title":"Add Get-permissions for Microsoft Azure App Service"},{"location":"documentation/recipes/deploy-docfx-azure-website/#add-the-custom-domain-and-ssl-certificate-to-the-app-service","text":"Once we have the SSL certificate and there is a complete DNS registration as described, we can uncomment the code in the Terraform script from the Quick Start folder to attach this to the App Service. In this script you need to reference the certificate in the common Key Vault and use it in the custom hostname binding. The custom hostname is assigned in the script as well. The settings ssl_state needs to be SniEnabled if you're using an SSL certificate. Now the creation of the authenticated website with a custom domain is automated.","title":"Add the Custom Domain and SSL Certificate to the App Service"},{"location":"documentation/recipes/deploy-docfx-azure-website/#3-deploy-azure-resources-from-your-local-machine","text":"Open up a command prompt. For the commands to be executed, you need to have a connection to your Azure subscription. This can be done using Azure Cli . Type this command: az login This will use the web browser to login to your account. You can check the connected subscription with this command: az account show If you have to change to another subscription, use this command where you replace [id] with the id of the subscription to select: az account set --subscription [ id ] Once this is done run this command to initialize: terraform init Now you can run the command to plan what the script will do. You run this command every time changes are made to the terraform scripts: terraform plan Inspect the result shown. If that is what you expect, apply these changes with this command: terraform apply When asked for approval, type \"yes\" and ENTER. You can also add the -auto-approve flag to the apply command. The deployment using Terraform is not included in the pipeline from the Quick Start folder as described in the next step, as that asks for more configuration. But of course that can always be added.","title":"3. Deploy Azure Resources from Your Local Machine"},{"location":"documentation/recipes/deploy-docfx-azure-website/#4-deploy-the-website-from-a-pipeline","text":"The best way to create the resources and deploy to it, is to do this automatically in a pipeline. 
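To give an idea of the shape of such a pipeline, here is a heavily trimmed, hypothetical Azure DevOps YAML sketch; AzureWebApp@1 is a standard Azure Pipelines task and the AzureConnectionName / AzureAppServiceName variables are the ones mentioned in this recipe, but the build script and paths are assumptions, and the actual .pipelines/documentation.yml in the Quick Start folder is more complete:
trigger:
  - master

pool:
  vmImage: ubuntu-latest

steps:
  # Build the DocFx website into the _site folder (script name is illustrative)
  - script: ./build-docs.sh
    displayName: Build documentation website

  # Deploy the generated content to the Azure App Service
  - task: AzureWebApp@1
    inputs:
      azureSubscription: $(AzureConnectionName)
      appName: $(AzureAppServiceName)
      package: $(System.DefaultWorkingDirectory)/_site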
For this purpose the .pipelines/documentation.yml pipeline is provided. This pipeline is built for an Azure DevOps environment. Create a pipeline and reference this YAML file. Note: the Quick Start folder contains a web.config that is needed for deployment to IIS or Azure App Service. This enables the use of the json file for search requests. If you don't have this in place, the search of text will never return anything and result in 404's under the hood. You have to create a Service Connection in your DevOps environment to connect to the Azure Subscription you want to deploy to. Note: set the variables AzureConnectionName to the name of the Service Connection and the AzureAppServiceName to the name you determined in the infrastructure/variables.tf . In the Quick Start folder the pipeline uses master as trigger, which means that any push being done to master triggers the pipeline. You will probably change this to another branch.","title":"4. Deploy the Website from a Pipeline"},{"location":"documentation/recipes/static-website-with-mkdocs/","text":"How to Create a Static Website for Your Documentation Based on mkdocs and mkdocs-material MkDocs is a tool built to create static websites from raw markdown files. Other alternatives include Sphinx , and Jekyll . We used MkDocs to create ISE Engineering Fundamentals Playbook static website from the contents in the GitHub repository . Then we deployed it to GitHub Pages . We found MkDocs to be a good choice since: It's easy to set up and looks great even with the vanilla version. It works well with markdown, which is what we already have in the Playbook. It uses a Python stack which is friendly to many contributors of this Playbook. For comparison, Sphinx mainly generates docs from restructured-text (rst) format, and Jekyll is written in Ruby. To setup an MkDocs website, the main assets needed are: An mkdocs.yaml file, similar to the one we have in the Playbook . This is the configuration file that defines the appearance of the website, the navigation, the plugins used and more. A folder named docs (the default value for the directory) that contains the documentation source files. A GitHub Action for automatically generating the website (e.g. on every commit to main), similar to this one from the Playbook . A list of plugins used during the build phase of the website. We specified ours here . And these are the plugins we've used: - Material for MkDocs : Material design appearance and user experience. - pymdown-extensions : Improves the appearance of markdown based content. - mdx_truly_sane_lists : For defining the indent level for lists without having to refactor the entire documentation we already had in the Playbook. Setting up locally is very easy. See Getting Started with MkDocs for details. For publishing the website, there's a good integration with GitHub for storing the website as a GitHub Page . Resources MkDocs Plugins The best MkDocs plugins and customizations","title":"How to Create a Static Website for Your Documentation Based on mkdocs and mkdocs-material"},{"location":"documentation/recipes/static-website-with-mkdocs/#how-to-create-a-static-website-for-your-documentation-based-on-mkdocs-and-mkdocs-material","text":"MkDocs is a tool built to create static websites from raw markdown files. Other alternatives include Sphinx , and Jekyll . We used MkDocs to create ISE Engineering Fundamentals Playbook static website from the contents in the GitHub repository . Then we deployed it to GitHub Pages . 
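As an illustration of how little is needed to get going locally, a minimal sketch looks like this (assuming Python and pip are available; mkdocs-material is the Material for MkDocs theme referenced below):
# Install MkDocs and the Material theme
pip install mkdocs mkdocs-material
# Preview the site locally with live reload at http://127.0.0.1:8000
mkdocs serve
# Build the static site into the site/ folder
mkdocs build
# Optionally publish the built site to the gh-pages branch for GitHub Pages
mkdocs gh-deploy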
We found MkDocs to be a good choice since: It's easy to set up and looks great even with the vanilla version. It works well with markdown, which is what we already have in the Playbook. It uses a Python stack which is friendly to many contributors of this Playbook. For comparison, Sphinx mainly generates docs from restructured-text (rst) format, and Jekyll is written in Ruby. To setup an MkDocs website, the main assets needed are: An mkdocs.yaml file, similar to the one we have in the Playbook . This is the configuration file that defines the appearance of the website, the navigation, the plugins used and more. A folder named docs (the default value for the directory) that contains the documentation source files. A GitHub Action for automatically generating the website (e.g. on every commit to main), similar to this one from the Playbook . A list of plugins used during the build phase of the website. We specified ours here . And these are the plugins we've used: - Material for MkDocs : Material design appearance and user experience. - pymdown-extensions : Improves the appearance of markdown based content. - mdx_truly_sane_lists : For defining the indent level for lists without having to refactor the entire documentation we already had in the Playbook. Setting up locally is very easy. See Getting Started with MkDocs for details. For publishing the website, there's a good integration with GitHub for storing the website as a GitHub Page .","title":"How to Create a Static Website for Your Documentation Based on mkdocs and mkdocs-material"},{"location":"documentation/recipes/static-website-with-mkdocs/#resources","text":"MkDocs Plugins The best MkDocs plugins and customizations","title":"Resources"},{"location":"documentation/recipes/sync-wiki-between-repos/","text":"How to Sync a Wiki Between Repositories This is a quick guide to mirroring a Project Wiki to another repository. # Clone the wiki git clone < source wiki repo url> # Add mirror repository as a remote cd < source wiki repo working folder> git remote add mirror Now each time you wish to sync run the following to get latest from the source wiki repo: # Get everything git pull -v Warning : Check that the output of the pull shows \"From source repo URL\". If this shows the mirror repo url then you've forgotten to reset the tracking. Run git branch -u origin/wikiMaster then continue. Then run this to push it to the mirror repo and reset the branch to track the source repo again: # Push all branches up to mirror remote git push -u mirror # Reset local to track source remote git branch -u origin/wikiMaster Your output should look like this when run: PS C:\\Git\\MyProject.wiki> git pull -v POST git-upload-pack (909 bytes) remote: Azure Repos remote: Found 5 objects to send. (0 ms) Unpacking objects: 100% (5/5), done. From https://..... wikiMaster -> origin/wikiMaster Updating 7412b94..a0f543b Fast-forward .../dffffds.md | 4 ++++ 1 file changed, 4 insertions(+) PS C:\\Git\\MyProject.wiki> git push -u mirror Enumerating objects: 9, done. Counting objects: 100% (9/9), done. Delta compression using up to 8 threads Compressing objects: 100% (5/5), done. Writing objects: 100% (5/5), 2.08 KiB | 2.08 MiB/s, done. Total 5 (delta 4), reused 0 (delta 0) remote: Analyzing objects... (5/5) (6 ms) remote: Storing packfile... done (48 ms) remote: Storing index... done (59 ms) To https://...... 7412b94..a0f543b wikiMaster -> wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'mirror'. 
PS C:\\Git\\MyProject.wiki> git branch -u origin/wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'origin'.","title":"How to Sync a Wiki Between Repositories"},{"location":"documentation/recipes/sync-wiki-between-repos/#how-to-sync-a-wiki-between-repositories","text":"This is a quick guide to mirroring a Project Wiki to another repository. # Clone the wiki git clone < source wiki repo url> # Add mirror repository as a remote cd < source wiki repo working folder> git remote add mirror Now each time you wish to sync run the following to get latest from the source wiki repo: # Get everything git pull -v Warning : Check that the output of the pull shows \"From source repo URL\". If this shows the mirror repo url then you've forgotten to reset the tracking. Run git branch -u origin/wikiMaster then continue. Then run this to push it to the mirror repo and reset the branch to track the source repo again: # Push all branches up to mirror remote git push -u mirror # Reset local to track source remote git branch -u origin/wikiMaster Your output should look like this when run: PS C:\\Git\\MyProject.wiki> git pull -v POST git-upload-pack (909 bytes) remote: Azure Repos remote: Found 5 objects to send. (0 ms) Unpacking objects: 100% (5/5), done. From https://..... wikiMaster -> origin/wikiMaster Updating 7412b94..a0f543b Fast-forward .../dffffds.md | 4 ++++ 1 file changed, 4 insertions(+) PS C:\\Git\\MyProject.wiki> git push -u mirror Enumerating objects: 9, done. Counting objects: 100% (9/9), done. Delta compression using up to 8 threads Compressing objects: 100% (5/5), done. Writing objects: 100% (5/5), 2.08 KiB | 2.08 MiB/s, done. Total 5 (delta 4), reused 0 (delta 0) remote: Analyzing objects... (5/5) (6 ms) remote: Storing packfile... done (48 ms) remote: Storing index... done (59 ms) To https://...... 7412b94..a0f543b wikiMaster -> wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'mirror'. PS C:\\Git\\MyProject.wiki> git branch -u origin/wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'origin'.","title":"How to Sync a Wiki Between Repositories"},{"location":"documentation/recipes/using-docfx-and-tools/","text":"Using DocFx and Companion Tools to Generate a Documentation Website If you want an easy way to have a website with all your documentation coming from Markdown files and comments coming from code, you can use DocFx . The website generated by DocFx also includes fast search capabilities. There are some gaps in the DocFx solution, but we've provided companion tools that help you fill those gaps. Also see the blog post Providing quality documentation in your project with DocFx and Companion Tools for more explanation about the solution. Prerequisites This document is followed best by cloning the sample from https://github.com/mtirionMSFT/DocFxQuickStart first. Copy the contents of the QuickStart folder to the root of your own repository to get started in your own environment. Quick Start TLDR; If you want a really quick start using Azure DevOps and Azure App Service without reading the what and how, follow these steps: Azure DevOps: If you don't have it yet, create a project in Azure DevOps and create a Service Connection to your Azure environment . Clone the repository. QuickStart folder: Copy the contents of the QuickStart folder in there repository that can be found on https://github.com/mtirionMSFT/DocFxQuickStart to the root of the repository. 
Azure: Create a resource group in your Azure environment where the documentation website resources should be created. Create Azure resources: Fill in the default values in infrastructure/variables.tf and run the commands from Step 3 - Deploy Azure resources from your local machine to create the Azure Resources. Pipeline: Fill in the variables in .pipelines/documentation.yml , commit the changes and push the contents of the repository to your branch (possibly through a PR). Now you can create a pipeline in your Azure DevOps project that uses the .pipelines/documentation.yml and run it. Documents and Projects Folder Structure The easiest is to work with a mono repository where documentation and code live together. If that's not the case in your situation but you still want to combine multiple repositories into one documentation website, you'll have to clone all repositories first to be able to combine the information. In this recipe we'll assume a monorepo is used. In the steps below we'll consider the generation of the documentation website from this content structure: \u251c\u2500\u2500 .pipelines // Azure DevOps pipeline for automatic generation and deployment \u2502 \u251c\u2500\u2500 docs // all documents \u2502 \u251c\u2500\u2500 .attachments // all images and other attachments used by documents \u2502 \u251c\u2500\u2500 infrastructure // Terraform scripts for creation of the Azure website \u2502 \u251c\u2500\u2500 src // all projects \u2502 \u251c\u2500\u2500 build // build settings \u2502 \u251c\u2500\u2500 dotnet // .NET build settings \u2502 \u251c\u2500\u2500 Directory.Build.props // project settings for all .NET projects in sub folders \u2502 \u251c\u2500\u2500 [ Project folders ] \u2502 \u251c\u2500\u2500 x-cross \u2502 \u251c\u2500\u2500 toc.yml // Cross reference definition ( optional ) \u2502 \u251c\u2500\u2500 .markdownlint.json // Markdownlinter settings \u251c\u2500\u2500 docfx.json // DocFx configuration \u251c\u2500\u2500 index.md // Website landing page \u251c\u2500\u2500 toc.yml // Definition of the website header content links \u251c\u2500\u2500 web.config // web.config to enable search in deployed website We'll be using the DocLinkChecker tool to validate all links in documentation and for orphaned attachments. That's the reason we have all attachments in the .attachments folder. In the generated website from the QuickStart folder you'll see that the hierarchies of documentation and references is combined in the left table of contents. This is achieved by the definition and use of x-cross\\toc.yml . If you don't want the hierarchies combined, just remove the folder and file from your environment and (re)generate the website. A .markdownlint.json is included with the contents below. The MD013 setting is set to false to prevent checking for maximum line length. You can modify this file to your likings to include or exclude certain tests. { \"MD013\" : false } The contents of the .pipelines and infrastructure folders are explained in the recipe Deploy the DocFx Documentation website to an Azure Website automatically . Reference Documentation from Source Code DocFx can generate reference documentation from code, where C# and Typescript are supported best at the moment. In the QuickStart folder we only used C# projects. For DocFx to generate quality reference documentation, quality triple slash-comments are required. See Triple-slash (///) Code Comments Support . To enforce this, it's a good idea to enforce the use of StyleCop . 
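As a quick illustration (the project path is hypothetical), the analyzers can be added to an individual project from the command line:
# Add the StyleCop analyzers NuGet package to a single project
dotnet add src/MyProject/MyProject.csproj package StyleCop.Analyzers
The QuickStart sample instead wires this up centrally for all projects, as described next.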
There are a few steps that will give you an easy start with this. First, you can use the Directory.Build.props file in the /src folder in combination with the files in the build/dotnet folder. By having this, you enforce StyleCop in all Visual Studio project files in it's sub folders with a configuration of which rules should be used or ignored. You can tailor this to your needs of course. For more information, see Customize your build and Use rule sets to group code analysis rules . To make sure developers are forced to add the triple-slash comments by throwing compiler errors and to have the proper settings for the generation of documentation XML-files, add the TreatWarningsAsErrors and GenerateDocumentationFile settings to every .csproj file. You can add that in the first PropertyGroup settings like this: ... true true ... Now you are all set to generate documentation from your C# code. For more information about languages supported by DocFx and how to configure it, see Introduction to Multiple Languages Support . Note: You can also add a PropertyGroup definition with the two settings in Directory.Build.props to have that settings in all projects. But in that case it will also be inherited in your Test projects. 1. Install DocFx and markdownlint-cli Go to the DocFx website to the Download section and download the latest version of DocFx. Go to the github page of markdownlint-cli to find download and install options. You can also use tools like Chocolatey to install: choco install docfx choco install markdownlint-cli 2. Configure DocFx Configuration for DocFx is done in a docfx.json file. Store this file in the root of your repository. Note: You can store the docfx.json somewhere else in the hierarchy, but then you need to provide the path of the file as an argument to the docfx command so it can be located. Below is a good configuration to start with, where documentation is in the /docs folder and the sources are in the /src folder: { \"metadata\" : [ { \"src\" : [ { \"files\" : [ \"src/**.csproj\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"dest\" : \"reference\" , \"disableGitFeatures\" : false } ], \"build\" : { \"content\" : [ { \"files\" : [ \"reference/**\" ] }, { \"files\" : [ \"**.md\" , \"**/toc.yml\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"resource\" : [ { \"files\" : [ \"docs/.attachments/**\" ] }, { \"files\" : [ \"web.config\" ] } ], \"template\" : [ \"templates/cse\" ], \"globalMetadata\" : { \"_appTitle\" : \"CSE Documentation\" , \"_enableSearch\" : true }, \"markdownEngineName\" : \"markdig\" , \"dest\" : \"_site\" , \"xrefService\" : [ \"https://xref.learn.microsoft.com/query?uid={uid}\" ] } } 3. Setup Some Basic Documents We suggest starting with a basic documentation structure in the /docs folder. In the provided QuickStart folder we have a basic setup: \u251c\u2500\u2500 docs \u2502 \u251c\u2500\u2500 .attachments // All images and other attachments used by documents \u2502 \u2502 \u251c\u2500\u2500 architecture-decisions \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 decision-log.md // Sample index into all ADRs \u2502 \u2514\u2500\u2500 README.md // Landing page architecture decisions \u2502 \u2502 \u251c\u2500\u2500 getting-started \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // This recipe document. 
Replace the content with something meaningful to the project \u2502 \u2502 \u251c\u2500\u2500 guidelines \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 docs-guidelines.md // General documentation guidelines \u2502 \u2514\u2500\u2500 README.md // Landing page guidelines \u2502 \u2502 \u251c\u2500\u2500 templates // all templates like ADR template and such \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page templates \u2502 \u2502 \u251c\u2500\u2500 working-agreements \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page working agreements \u2502 \u2502 \u251c\u2500\u2500 .order // Providing a fixed order of files and directories \u2502 \u251c\u2500\u2500 index.md // Landing page documentation You can use templates like working agreements and such from the ISE Playbook . To have a proper landing page of your documentation website, you can use a markdown file called INDEX.MD in the root of your repository. Contents can be something like this: # ISE Documentation This is the landing page of the ISE Documentation website. This is the page to introduce everything on the website. You can add specific links that are important to provide direct access. > Try not to duplicate the links on the top of the page, unless it really makes sense. To get started with the setup of this website, read the getting started document with the title [ Using DocFx and Companion Tools ]( using-docfx-and-tools.md ). 4. Compile the Companion Tools and Run Them Note: To explain each step, we'll be going through the various steps in the next few paragraphs. In the provided sample, a batch-file called GenerateDocWebsite.cmd is included. This script will take all the necessary steps to compile the tools, execute the checks, generate the table of contents and execute docfx to generate the website. To check for proper markdown formatting the markdownlint-cli tool is used. The command takes it's configuration from the .markdownlint.json file in the root of the project. To check all markdown files, simply execute this command: markdownlint **/*.md In the QuickStart folder you should have copied in the two companion tools TocDocFxCreation and DocLinkChecker as described in the introduction of this article. You can compile the tools from Visual Studio, but you can also run dotnet build in both tool folders. The DocLinkChecker companion tool is used to validate what's in the docs folder. It validates links between documents and attachments in the docs folder and checks if there aren't orphaned attachments. An example of executing this tool, where the check of attachments is included: DocLinkChecker.exe -d ./docs -a The TocDocFxCreation tool is needed to generate a table of contents for your documentation, so users can navigate between folders and documents. If you have compiled the tool, use this command to generate a table of content file toc.yml . To generate a table of contents with the use of the .order files for determining the sequence of articles and to automatically generate index.md documents if no default document is available in a folder, this command can be used: TocDocFxCreation.exe -d ./docs -sri 5. Run DocFx to Generate the Website Run the docfx command to generate the website, by default in the _site folder. TIP: If you want to check the website in your local environment, provide the --serve option to either the docfx command or the GenerateDocWebsite script. 
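For example, assuming the docfx.json file is in the root of the repository, a local preview can be started with:
# Build the site and serve it locally
docfx docfx.json --serve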
A small webserver is launched that hosts your website, which is accessible on localhost. Style of the Website If you started with the QuickStart folder, the website is generated using a custom theme using material design and the Microsoft logo. You can change this to your likings. For more information see How-to: Create A Custom Template | DocFX website (dotnet.github.io) . Deploy to an Azure Website After you completed the steps, you should have a default website generated in the _site folder. But of course, you want this to be accessible for everyone. So, the next step is to create for instance an Azure Website and have a process to automatically generate and deploy the contents to that website. That process is described in the recipe Deploy the DocFx Documentation website to an Azure Website automatically . Resources DocFX - static documentation generator Deploy the DocFx Documentation website to an Azure Website automatically Providing quality documentation in your project with DocFx and Companion Tools Monorepo For Beginners","title":"Using DocFx and Companion Tools to Generate a Documentation Website"},{"location":"documentation/recipes/using-docfx-and-tools/#using-docfx-and-companion-tools-to-generate-a-documentation-website","text":"If you want an easy way to have a website with all your documentation coming from Markdown files and comments coming from code, you can use DocFx . The website generated by DocFx also includes fast search capabilities. There are some gaps in the DocFx solution, but we've provided companion tools that help you fill those gaps. Also see the blog post Providing quality documentation in your project with DocFx and Companion Tools for more explanation about the solution.","title":"Using DocFx and Companion Tools to Generate a Documentation Website"},{"location":"documentation/recipes/using-docfx-and-tools/#prerequisites","text":"This document is followed best by cloning the sample from https://github.com/mtirionMSFT/DocFxQuickStart first. Copy the contents of the QuickStart folder to the root of your own repository to get started in your own environment.","title":"Prerequisites"},{"location":"documentation/recipes/using-docfx-and-tools/#quick-start","text":"TLDR; If you want a really quick start using Azure DevOps and Azure App Service without reading the what and how, follow these steps: Azure DevOps: If you don't have it yet, create a project in Azure DevOps and create a Service Connection to your Azure environment . Clone the repository. QuickStart folder: Copy the contents of the QuickStart folder in there repository that can be found on https://github.com/mtirionMSFT/DocFxQuickStart to the root of the repository. Azure: Create a resource group in your Azure environment where the documentation website resources should be created. Create Azure resources: Fill in the default values in infrastructure/variables.tf and run the commands from Step 3 - Deploy Azure resources from your local machine to create the Azure Resources. Pipeline: Fill in the variables in .pipelines/documentation.yml , commit the changes and push the contents of the repository to your branch (possibly through a PR). Now you can create a pipeline in your Azure DevOps project that uses the .pipelines/documentation.yml and run it.","title":"Quick Start"},{"location":"documentation/recipes/using-docfx-and-tools/#documents-and-projects-folder-structure","text":"The easiest is to work with a mono repository where documentation and code live together. 
If that's not the case in your situation but you still want to combine multiple repositories into one documentation website, you'll have to clone all repositories first to be able to combine the information. In this recipe we'll assume a monorepo is used. In the steps below we'll consider the generation of the documentation website from this content structure: \u251c\u2500\u2500 .pipelines // Azure DevOps pipeline for automatic generation and deployment \u2502 \u251c\u2500\u2500 docs // all documents \u2502 \u251c\u2500\u2500 .attachments // all images and other attachments used by documents \u2502 \u251c\u2500\u2500 infrastructure // Terraform scripts for creation of the Azure website \u2502 \u251c\u2500\u2500 src // all projects \u2502 \u251c\u2500\u2500 build // build settings \u2502 \u251c\u2500\u2500 dotnet // .NET build settings \u2502 \u251c\u2500\u2500 Directory.Build.props // project settings for all .NET projects in sub folders \u2502 \u251c\u2500\u2500 [ Project folders ] \u2502 \u251c\u2500\u2500 x-cross \u2502 \u251c\u2500\u2500 toc.yml // Cross reference definition ( optional ) \u2502 \u251c\u2500\u2500 .markdownlint.json // Markdownlinter settings \u251c\u2500\u2500 docfx.json // DocFx configuration \u251c\u2500\u2500 index.md // Website landing page \u251c\u2500\u2500 toc.yml // Definition of the website header content links \u251c\u2500\u2500 web.config // web.config to enable search in deployed website We'll be using the DocLinkChecker tool to validate all links in documentation and for orphaned attachments. That's the reason we have all attachments in the .attachments folder. In the generated website from the QuickStart folder you'll see that the hierarchies of documentation and references is combined in the left table of contents. This is achieved by the definition and use of x-cross\\toc.yml . If you don't want the hierarchies combined, just remove the folder and file from your environment and (re)generate the website. A .markdownlint.json is included with the contents below. The MD013 setting is set to false to prevent checking for maximum line length. You can modify this file to your likings to include or exclude certain tests. { \"MD013\" : false } The contents of the .pipelines and infrastructure folders are explained in the recipe Deploy the DocFx Documentation website to an Azure Website automatically .","title":"Documents and Projects Folder Structure"},{"location":"documentation/recipes/using-docfx-and-tools/#reference-documentation-from-source-code","text":"DocFx can generate reference documentation from code, where C# and Typescript are supported best at the moment. In the QuickStart folder we only used C# projects. For DocFx to generate quality reference documentation, quality triple slash-comments are required. See Triple-slash (///) Code Comments Support . To enforce this, it's a good idea to enforce the use of StyleCop . There are a few steps that will give you an easy start with this. First, you can use the Directory.Build.props file in the /src folder in combination with the files in the build/dotnet folder. By having this, you enforce StyleCop in all Visual Studio project files in it's sub folders with a configuration of which rules should be used or ignored. You can tailor this to your needs of course. For more information, see Customize your build and Use rule sets to group code analysis rules . 
To make sure developers are forced to add the triple-slash comments by throwing compiler errors and to have the proper settings for the generation of documentation XML-files, add the TreatWarningsAsErrors and GenerateDocumentationFile settings to every .csproj file. You can add that in the first PropertyGroup settings like this: ... true true ... Now you are all set to generate documentation from your C# code. For more information about languages supported by DocFx and how to configure it, see Introduction to Multiple Languages Support . Note: You can also add a PropertyGroup definition with the two settings in Directory.Build.props to have that settings in all projects. But in that case it will also be inherited in your Test projects.","title":"Reference Documentation from Source Code"},{"location":"documentation/recipes/using-docfx-and-tools/#1-install-docfx-and-markdownlint-cli","text":"Go to the DocFx website to the Download section and download the latest version of DocFx. Go to the github page of markdownlint-cli to find download and install options. You can also use tools like Chocolatey to install: choco install docfx choco install markdownlint-cli","title":"1. Install DocFx and markdownlint-cli"},{"location":"documentation/recipes/using-docfx-and-tools/#2-configure-docfx","text":"Configuration for DocFx is done in a docfx.json file. Store this file in the root of your repository. Note: You can store the docfx.json somewhere else in the hierarchy, but then you need to provide the path of the file as an argument to the docfx command so it can be located. Below is a good configuration to start with, where documentation is in the /docs folder and the sources are in the /src folder: { \"metadata\" : [ { \"src\" : [ { \"files\" : [ \"src/**.csproj\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"dest\" : \"reference\" , \"disableGitFeatures\" : false } ], \"build\" : { \"content\" : [ { \"files\" : [ \"reference/**\" ] }, { \"files\" : [ \"**.md\" , \"**/toc.yml\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"resource\" : [ { \"files\" : [ \"docs/.attachments/**\" ] }, { \"files\" : [ \"web.config\" ] } ], \"template\" : [ \"templates/cse\" ], \"globalMetadata\" : { \"_appTitle\" : \"CSE Documentation\" , \"_enableSearch\" : true }, \"markdownEngineName\" : \"markdig\" , \"dest\" : \"_site\" , \"xrefService\" : [ \"https://xref.learn.microsoft.com/query?uid={uid}\" ] } }","title":"2. Configure DocFx"},{"location":"documentation/recipes/using-docfx-and-tools/#3-setup-some-basic-documents","text":"We suggest starting with a basic documentation structure in the /docs folder. In the provided QuickStart folder we have a basic setup: \u251c\u2500\u2500 docs \u2502 \u251c\u2500\u2500 .attachments // All images and other attachments used by documents \u2502 \u2502 \u251c\u2500\u2500 architecture-decisions \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 decision-log.md // Sample index into all ADRs \u2502 \u2514\u2500\u2500 README.md // Landing page architecture decisions \u2502 \u2502 \u251c\u2500\u2500 getting-started \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // This recipe document. 
Replace the content with something meaningful to the project \u2502 \u2502 \u251c\u2500\u2500 guidelines \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 docs-guidelines.md // General documentation guidelines \u2502 \u2514\u2500\u2500 README.md // Landing page guidelines \u2502 \u2502 \u251c\u2500\u2500 templates // all templates like ADR template and such \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page templates \u2502 \u2502 \u251c\u2500\u2500 working-agreements \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page working agreements \u2502 \u2502 \u251c\u2500\u2500 .order // Providing a fixed order of files and directories \u2502 \u251c\u2500\u2500 index.md // Landing page documentation You can use templates like working agreements and such from the ISE Playbook . To have a proper landing page of your documentation website, you can use a markdown file called INDEX.MD in the root of your repository. Contents can be something like this: # ISE Documentation This is the landing page of the ISE Documentation website. This is the page to introduce everything on the website. You can add specific links that are important to provide direct access. > Try not to duplicate the links on the top of the page, unless it really makes sense. To get started with the setup of this website, read the getting started document with the title [ Using DocFx and Companion Tools ]( using-docfx-and-tools.md ).","title":"3. Setup Some Basic Documents"},{"location":"documentation/recipes/using-docfx-and-tools/#4-compile-the-companion-tools-and-run-them","text":"Note: To explain each step, we'll be going through the various steps in the next few paragraphs. In the provided sample, a batch-file called GenerateDocWebsite.cmd is included. This script will take all the necessary steps to compile the tools, execute the checks, generate the table of contents and execute docfx to generate the website. To check for proper markdown formatting the markdownlint-cli tool is used. The command takes it's configuration from the .markdownlint.json file in the root of the project. To check all markdown files, simply execute this command: markdownlint **/*.md In the QuickStart folder you should have copied in the two companion tools TocDocFxCreation and DocLinkChecker as described in the introduction of this article. You can compile the tools from Visual Studio, but you can also run dotnet build in both tool folders. The DocLinkChecker companion tool is used to validate what's in the docs folder. It validates links between documents and attachments in the docs folder and checks if there aren't orphaned attachments. An example of executing this tool, where the check of attachments is included: DocLinkChecker.exe -d ./docs -a The TocDocFxCreation tool is needed to generate a table of contents for your documentation, so users can navigate between folders and documents. If you have compiled the tool, use this command to generate a table of content file toc.yml . To generate a table of contents with the use of the .order files for determining the sequence of articles and to automatically generate index.md documents if no default document is available in a folder, this command can be used: TocDocFxCreation.exe -d ./docs -sri","title":"4. Compile the Companion Tools and Run Them"},{"location":"documentation/recipes/using-docfx-and-tools/#5-run-docfx-to-generate-the-website","text":"Run the docfx command to generate the website, by default in the _site folder. 
TIP: If you want to check the website in your local environment, provide the --serve option to either the docfx command or the GenerateDocWebsite script. A small webserver is launched that hosts your website, which is accessible on localhost.","title":"5. Run DocFx to Generate the Website"},{"location":"documentation/recipes/using-docfx-and-tools/#style-of-the-website","text":"If you started with the QuickStart folder, the website is generated using a custom theme using material design and the Microsoft logo. You can change this to your likings. For more information see How-to: Create A Custom Template | DocFX website (dotnet.github.io) .","title":"Style of the Website"},{"location":"documentation/recipes/using-docfx-and-tools/#deploy-to-an-azure-website","text":"After you completed the steps, you should have a default website generated in the _site folder. But of course, you want this to be accessible for everyone. So, the next step is to create for instance an Azure Website and have a process to automatically generate and deploy the contents to that website. That process is described in the recipe Deploy the DocFx Documentation website to an Azure Website automatically .","title":"Deploy to an Azure Website"},{"location":"documentation/recipes/using-docfx-and-tools/#resources","text":"DocFX - static documentation generator Deploy the DocFx Documentation website to an Azure Website automatically Providing quality documentation in your project with DocFx and Companion Tools Monorepo For Beginners","title":"Resources"},{"location":"documentation/tools/automation/","text":"How to Automate Simple Checks If you want to automate some checks on your Markdown documents, there are several tools that you could leverage. For example: Code Analysis / Linting markdownlint to verify Markdown syntax and enforce rules that make the text more readable. markdown-link-check to extract links from markdown texts and check whether each link is alive (200 OK) or dead. write-good to check English prose. Docker image for node-markdown-spellcheck , a lightweight docker image to spellcheck markdown files. static code analysis VS Code Extensions Write Good Linter to get grammar and language advice while editing a document. markdownlint to examine Markdown documents and get warnings for rule violations while editing. Automation pre-commit to use Git hook scripts to identify simple issues before submitting our code or documentation for review. Check Build validation to automate linting for PRs. Check CI Pipeline for better documentation for a sample pipeline with markdownlint , markdown-link-check and write-good . Sample output: On Linting Rules The team needs to be clear what linting rules are required and shouldn't be overridden with tooling or comments. The team should have consensus on when to override tooling rules.","title":"How to Automate Simple Checks"},{"location":"documentation/tools/automation/#how-to-automate-simple-checks","text":"If you want to automate some checks on your Markdown documents, there are several tools that you could leverage. For example: Code Analysis / Linting markdownlint to verify Markdown syntax and enforce rules that make the text more readable. markdown-link-check to extract links from markdown texts and check whether each link is alive (200 OK) or dead. write-good to check English prose. Docker image for node-markdown-spellcheck , a lightweight docker image to spellcheck markdown files. 
static code analysis VS Code Extensions Write Good Linter to get grammar and language advice while editing a document. markdownlint to examine Markdown documents and get warnings for rule violations while editing. Automation pre-commit to use Git hook scripts to identify simple issues before submitting our code or documentation for review. Check Build validation to automate linting for PRs. Check CI Pipeline for better documentation for a sample pipeline with markdownlint , markdown-link-check and write-good . Sample output:","title":"How to Automate Simple Checks"},{"location":"documentation/tools/automation/#on-linting-rules","text":"The team needs to be clear what linting rules are required and shouldn't be overridden with tooling or comments. The team should have consensus on when to override tooling rules.","title":"On Linting Rules"},{"location":"documentation/tools/integrations/","text":"Integration with Teams/Slack Monitor your Azure repositories and receive notifications in your channel whenever code is pushed/checked in and whenever a pull request (PR) is created, updated, or a merge is attempted. Azure Repos with Microsoft Teams Azure Repos with Slack","title":"Integration with Teams/Slack"},{"location":"documentation/tools/integrations/#integration-with-teamsslack","text":"Monitor your Azure repositories and receive notifications in your channel whenever code is pushed/checked in and whenever a pull request (PR) is created, updated, or a merge is attempted. Azure Repos with Microsoft Teams Azure Repos with Slack","title":"Integration with Teams/Slack"},{"location":"documentation/tools/languages/","text":"Languages Markdown Markdown is one of the most popular markup languages to add rich formatting, tables and images to your documentation using plain text documents. Markdown files (.md) can be source-controlled along with your code. More information: Getting Started Cheat Sheet Basic Syntax Extended Syntax Wiki Markdown Syntax Tools: Markdown and Visual Studio Code How to automate simple checks Mermaid Mermaid lets you create diagrams using text definitions that can later be rendered with a diagramming and charting tool. Mermaid files (.mmd) can be source-controlled along with your code. It's also recommended to include image files (.png) with the rendered diagrams under source control. Your markdown files should link the image files, so they can be read without the need of a Mermaid rendering tool (e.g., during Pull Request review). Example Mermaid Diagram This is an example of a Mermaid flowchart diagram written as code. graph LR A[Diagram Idea] -->|Write mermaid code| B(mermaid.mmd file) B -->|Add to source control| C{Code repo} B -->|Export as .png| G(.png file of diagram) G -->|Add to source control| C This is an example of how it can be rendered as an image. More information: About Mermaid Diagram syntax Tools: Mermaid Live Editor Markdown Preview Mermaid Support for Visual Studio Code","title":"Languages"},{"location":"documentation/tools/languages/#languages","text":"","title":"Languages"},{"location":"documentation/tools/languages/#markdown","text":"Markdown is one of the most popular markup languages to add rich formatting, tables and images to your documentation using plain text documents. Markdown files (.md) can be source-controlled along with your code. 
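The checks from the automation recipe above can be run against these Markdown files locally as well. A sketch, assuming the npm-based CLIs are installed globally and the documents live in a docs folder (both assumptions):
# Install the linters once
npm install -g markdownlint-cli markdown-link-check write-good
# Verify Markdown syntax and style
markdownlint '**/*.md'
# Check a document for dead links
markdown-link-check ./docs/README.md
# Check the English prose
write-good ./docs/*.md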
More information: Getting Started Cheat Sheet Basic Syntax Extended Syntax Wiki Markdown Syntax Tools: Markdown and Visual Studio Code How to automate simple checks","title":"Markdown"},{"location":"documentation/tools/languages/#mermaid","text":"Mermaid lets you create diagrams using text definitions that can later be rendered with a diagramming and charting tool. Mermaid files (.mmd) can be source-controlled along with your code. It's also recommended to include image files (.png) with the rendered diagrams under source control. Your markdown files should link the image files, so they can be read without the need of a Mermaid rendering tool (e.g., during Pull Request review).","title":"Mermaid"},{"location":"documentation/tools/languages/#example-mermaid-diagram","text":"This is an example of a Mermaid flowchart diagram written as code. graph LR A[Diagram Idea] -->|Write mermaid code| B(mermaid.mmd file) B -->|Add to source control| C{Code repo} B -->|Export as .png| G(.png file of diagram) G -->|Add to source control| C This is an example of how it can be rendered as an image. More information: About Mermaid Diagram syntax Tools: Mermaid Live Editor Markdown Preview Mermaid Support for Visual Studio Code","title":"Example Mermaid Diagram"},{"location":"documentation/tools/wikis/","text":"Wikis Use a team project wiki to share information with other team members. When you provision a wiki from scratch, a new Git repository stores your Markdown files, images, attachments, and sequence of pages. This wiki supports collaborative editing of its content and structure. In Azure DevOps, you have the following options for maintaining wiki content : Provision a wiki for your team project. This option supports only one wiki for the team project. Publish Markdown files defined in a Git repository to a wiki. With this option, you can maintain several versioned wikis to support your content needs. More information: About Wikis, READMEs, and Markdown . Provisioned wikis vs. published code as a wiki . Create a Wiki for your project . Manage wikis . Wikis vs. Digital Notebooks (e.g., OneNote) When you work on a project, you may decide to document relevant details or record important decisions about the project in a digital notebook. Tools like OneNote allows you to easily organize, navigate and search your notes. You can provide type, highlighting, or ink annotations to your notes. These notes can easily be shared and created together with others. Still, Wikis greatly facilitate the process of establishing and managing documentation by allowing us to source control the documentation.","title":"Wikis"},{"location":"documentation/tools/wikis/#wikis","text":"Use a team project wiki to share information with other team members. When you provision a wiki from scratch, a new Git repository stores your Markdown files, images, attachments, and sequence of pages. This wiki supports collaborative editing of its content and structure. In Azure DevOps, you have the following options for maintaining wiki content : Provision a wiki for your team project. This option supports only one wiki for the team project. Publish Markdown files defined in a Git repository to a wiki. With this option, you can maintain several versioned wikis to support your content needs. More information: About Wikis, READMEs, and Markdown . Provisioned wikis vs. published code as a wiki . Create a Wiki for your project . 
Manage wikis .","title":"Wikis"},{"location":"documentation/tools/wikis/#wikis-vs-digital-notebooks-eg-onenote","text":"When you work on a project, you may decide to document relevant details or record important decisions about the project in a digital notebook. Tools like OneNote allows you to easily organize, navigate and search your notes. You can provide type, highlighting, or ink annotations to your notes. These notes can easily be shared and created together with others. Still, Wikis greatly facilitate the process of establishing and managing documentation by allowing us to source control the documentation.","title":"Wikis vs. Digital Notebooks (e.g., OneNote)"},{"location":"engineering-feedback/","text":"Microsoft Engineering Feedback Why is it Important to Submit Microsoft Engineering Feedback Engineering Feedback captures the \"voice of the customer\" and is an important mechanism to provide actionable insights and help Microsoft product groups continuously improve the platform and cloud services to enable all customers to be as productive as possible. Please note that Engineering Feedback is an asynchronous (i.e. not real-time) method to capture and aggregate friction points across multiple customers and code-with engagements. Therefore, if you need to report a service outage, or an immediately-blocking bug, you should file an official Azure support ticket and, if possible, reference the ticket id in the feedback that you submit later. Even if the feedback has already been raised directly with a product group or on through online channels like GitHub or Stack Overflow, it is still important to raise it via Microsoft Engineering feedback, so it can be consolidated with other customer projects that have the same feedback to help with prioritization. When to Submit Engineering Feedback Capturing and providing high-quality actionable Engineering Feedback is an integral ongoing part of all code-with engagements. It is recommended to submit feedback on an ongoing basis instead of batching it up for submission at the end of the engagement. You should jot down the details of the feedback close to the time when you encounter the specific blockers, challenges, and friction since that is when it is freshest in your mind. The project team can then decide how to prioritize and when to submit the feedback into the official CSE Feedback system (accessible to ISE team members) during each sprint. What is Good and High-quality Engineering Feedback Good engineering feedback provides enough information for those who are not part of the code-with engagement to understand the customer pain, the associated product issues, the impact and priority of these issues, and any potential workarounds that exist to minimize that impact. High-Quality Engineering Feedback is Goal Oriented - states what the customer is trying to accomplish Specific - details the scenario, observation, or challenge faced by the customer Actionable - includes the necessary clarifying information to enable a decision Examples of Good Engineering Feedback For example, here is an evolution of transforming a fictitious feedback with the above high-quality engineering feedback guidance in mind: Stage Feedback Evolution Initial feedback Azure Functions Service Bus Trigger is slow for in-order scenarios Making it Goal Oriented Customer requests batch receiving for Azure Functions Service Bus trigger with sessions enabled to better support higher throughput messaging. 
They want to use Azure Functions to process as many messages per second as possible with minimum latency and in a given order. Adding Specifics Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. Batch receiving is not supported in Azure Functions Service Bus Trigger. Making it Actionable Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in the Azure Functions Service Bus Trigger. The impact and workaround was choosing containers over Functions. The desired outcome is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. For real-world examples please follow Feedback Examples . How to Submit Engineering Feedback Please follow the Engineering Feedback Guidance to ensure that you provide feedback that can be triaged and processed most efficiently. Please review the Frequently Asked Questions page for additional information on the engineering feedback process.","title":"Microsoft Engineering Feedback"},{"location":"engineering-feedback/#microsoft-engineering-feedback","text":"","title":"Microsoft Engineering Feedback"},{"location":"engineering-feedback/#why-is-it-important-to-submit-microsoft-engineering-feedback","text":"Engineering Feedback captures the \"voice of the customer\" and is an important mechanism to provide actionable insights and help Microsoft product groups continuously improve the platform and cloud services to enable all customers to be as productive as possible. Please note that Engineering Feedback is an asynchronous (i.e. not real-time) method to capture and aggregate friction points across multiple customers and code-with engagements. Therefore, if you need to report a service outage, or an immediately-blocking bug, you should file an official Azure support ticket and, if possible, reference the ticket id in the feedback that you submit later. Even if the feedback has already been raised directly with a product group or on through online channels like GitHub or Stack Overflow, it is still important to raise it via Microsoft Engineering feedback, so it can be consolidated with other customer projects that have the same feedback to help with prioritization.","title":"Why is it Important to Submit Microsoft Engineering Feedback"},{"location":"engineering-feedback/#when-to-submit-engineering-feedback","text":"Capturing and providing high-quality actionable Engineering Feedback is an integral ongoing part of all code-with engagements. It is recommended to submit feedback on an ongoing basis instead of batching it up for submission at the end of the engagement. You should jot down the details of the feedback close to the time when you encounter the specific blockers, challenges, and friction since that is when it is freshest in your mind. 
The project team can then decide how to prioritize and when to submit the feedback into the official CSE Feedback system (accessible to ISE team members) during each sprint.","title":"When to Submit Engineering Feedback"},{"location":"engineering-feedback/#what-is-good-and-high-quality-engineering-feedback","text":"Good engineering feedback provides enough information for those who are not part of the code-with engagement to understand the customer pain, the associated product issues, the impact and priority of these issues, and any potential workarounds that exist to minimize that impact.","title":"What is Good and High-quality Engineering Feedback"},{"location":"engineering-feedback/#high-quality-engineering-feedback-is","text":"Goal Oriented - states what the customer is trying to accomplish Specific - details the scenario, observation, or challenge faced by the customer Actionable - includes the necessary clarifying information to enable a decision","title":"High-Quality Engineering Feedback is"},{"location":"engineering-feedback/#examples-of-good-engineering-feedback","text":"For example, here is an evolution of transforming a fictitious feedback with the above high-quality engineering feedback guidance in mind: Stage Feedback Evolution Initial feedback Azure Functions Service Bus Trigger is slow for in-order scenarios Making it Goal Oriented Customer requests batch receiving for Azure Functions Service Bus trigger with sessions enabled to better support higher throughput messaging. They want to use Azure Functions to process as many messages per second as possible with minimum latency and in a given order. Adding Specifics Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. Batch receiving is not supported in Azure Functions Service Bus Trigger. Making it Actionable Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in the Azure Functions Service Bus Trigger. The impact and workaround was choosing containers over Functions. The desired outcome is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. For real-world examples please follow Feedback Examples .","title":"Examples of Good Engineering Feedback"},{"location":"engineering-feedback/#how-to-submit-engineering-feedback","text":"Please follow the Engineering Feedback Guidance to ensure that you provide feedback that can be triaged and processed most efficiently. Please review the Frequently Asked Questions page for additional information on the engineering feedback process.","title":"How to Submit Engineering Feedback"},{"location":"engineering-feedback/feedback-examples/","text":"Engineering Feedback Examples The following are real-world examples of Engineering Feedback that have led to product improvements and unblocked customers. Windows Server Container Support for Azure Kubernetes Service The Azure Kubernetes Service should have first class Windows container support so solutions that require Windows workloads can be deployed on a wildly popular container orchestration platform. 
The need was to be able to deploy Windows Server containers on AKS the managed Azure Kubernetes Service. According to this FAQ (and in parallel confirmation) it is not available yet. We tried to deploy anyway as a test, and it did not work \u2013 the deployment would be pending without success. More than a dozen large partners/customers are blocked in deploying Windows workloads to AKS due to a lack of support for Windows Server containers. They need this feature so solutions requiring Windows workloads can be deployed to this popular container orchestration platform. We are seeing an emergence of companies beginning to try Windows containers as an option to move their Windows workloads to the cloud.\u202f Gartner is claiming that 80% of enterprise apps run on Windows. Containers have become the de facto deployment mechanism in the industry, and deployment consistency and speed are a few of the important factors companies are looking for. Enabling Windows applications and ensuring that developers have a good experience when moving their workloads to Azure via Windows containers is key to keeping existing Windows customers within the Azure ecosystem and driving Azure adoption for new workloads. We are also seeing increased interest, particularly among enterprise customers, in using a single orchestrator control plane for managing both Linux and Windows workloads. This feedback was created as a high priority feedback and followed up internally until addressed. Here is the announcement . Support Batch Receiving with Sessions in Azure Functions Service Bus Trigger Customer scenario was to receive a total of 250 messages per second from 50 producers with requirement for ordering & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in Azure Functions Service Bus Trigger. The impact (and work around) was choosing containers over Functions. The Acceptance Criteria is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed. Stream Analytics - No Support for Zero-Downtime Scale-Down In order to update the Streaming Unit number in Stream Analytics you need to stop the service and wait for minutes for it to restart. This unacceptable by customers who need near real-time analysis\u200b. In order to have a job re-started, up to 2 minutes are needed and this is not acceptable for a real-time streaming solution. It would also be optimal if scale-up and scale-down could be done automatically, by setting threshold values that when reached increase or decrease automatically the amount of RU available. This feedback is for customers' request for zero down-time scale-down capability in stream analytics. Problem Statement: In order to update the \"Streaming Unit\" number, partners must stop the service and wait until it restarts. The partner needs to be able to update the number without stopping the service. Desired Experience: Partners should be able to update the Streaming Unit number without stopping the associated service. This feedback was created as a high priority feedback and followed up until addressed in December 2019. 
Python Support for Azure Functions Several customers already use Python as part of their workflow, and would like to be able to use Python for Azure Functions. This is especially true since many of them already have scripts running on other clouds and services. In addition, Python support has been in Preview for a very long time, and it's missing a lot of functionality. This is one of the most requested features, and has huge upside potential to pull through Machine Learning (ML) based workloads. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed. Here is the announcement .","title":"Engineering Feedback Examples"},{"location":"engineering-feedback/feedback-examples/#engineering-feedback-examples","text":"The following are real-world examples of Engineering Feedback that have led to product improvements and unblocked customers.","title":"Engineering Feedback Examples"},{"location":"engineering-feedback/feedback-examples/#windows-server-container-support-for-azure-kubernetes-service","text":"The Azure Kubernetes Service should have first class Windows container support so solutions that require Windows workloads can be deployed on a wildly popular container orchestration platform. The need was to be able to deploy Windows Server containers on AKS, the managed Azure Kubernetes Service. According to this FAQ (and in parallel confirmation) it is not available yet. We tried to deploy anyway as a test, and it did not work \u2013 the deployment would be pending without success. More than a dozen large partners/customers are blocked in deploying Windows workloads to AKS due to a lack of support for Windows Server containers. They need this feature so solutions requiring Windows workloads can be deployed to this popular container orchestration platform. We are seeing an emergence of companies beginning to try Windows containers as an option to move their Windows workloads to the cloud.\u202f Gartner is claiming that 80% of enterprise apps run on Windows. Containers have become the de facto deployment mechanism in the industry, and deployment consistency and speed are a few of the important factors companies are looking for. Enabling Windows applications and ensuring that developers have a good experience when moving their workloads to Azure via Windows containers is key to keeping existing Windows customers within the Azure ecosystem and driving Azure adoption for new workloads. We are also seeing increased interest, particularly among enterprise customers, in using a single orchestrator control plane for managing both Linux and Windows workloads. This feedback was created as a high priority feedback and followed up internally until addressed. Here is the announcement .","title":"Windows Server Container Support for Azure Kubernetes Service"},{"location":"engineering-feedback/feedback-examples/#support-batch-receiving-with-sessions-in-azure-functions-service-bus-trigger","text":"Customer scenario was to receive a total of 250 messages per second from 50 producers with requirement for ordering & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in Azure Functions Service Bus Trigger. The impact (and workaround) was choosing containers over Functions.
The Acceptance Criteria is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed.","title":"Support Batch Receiving with Sessions in Azure Functions Service Bus Trigger"},{"location":"engineering-feedback/feedback-examples/#stream-analytics-no-support-for-zero-downtime-scale-down","text":"In order to update the Streaming Unit number in Stream Analytics you need to stop the service and wait several minutes for it to restart. This is unacceptable to customers who need near real-time analysis. In order to have a job re-started, up to 2 minutes are needed and this is not acceptable for a real-time streaming solution. It would also be optimal if scale-up and scale-down could be done automatically, by setting threshold values that when reached automatically increase or decrease the amount of RU available. This feedback is for customers' request for zero down-time scale-down capability in Stream Analytics. Problem Statement: In order to update the \"Streaming Unit\" number, partners must stop the service and wait until it restarts. The partner needs to be able to update the number without stopping the service. Desired Experience: Partners should be able to update the Streaming Unit number without stopping the associated service. This feedback was created as a high priority feedback and followed up until addressed in December 2019.","title":"Stream Analytics - No Support for Zero-Downtime Scale-Down"},{"location":"engineering-feedback/feedback-examples/#python-support-for-azure-functions","text":"Several customers already use Python as part of their workflow, and would like to be able to use Python for Azure Functions. This is especially true since many of them already have scripts running on other clouds and services. In addition, Python support has been in Preview for a very long time, and it's missing a lot of functionality. This is one of the most requested features, and has huge upside potential to pull through Machine Learning (ML) based workloads. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed. Here is the announcement .","title":"Python Support for Azure Functions"},{"location":"engineering-feedback/feedback-faq/","text":"Engineering Feedback Frequently Asked Questions (F.A.Q.) The questions below are common questions related to the feedback process. The answers are intended to help both Microsoft employees and customers. When Should I Submit Feedback vs. Creating an Issue on GitHub, UserVoice, or Sending an Email Directly to a Microsoft Employee? It is appropriate to do both. As a customer or Microsoft employee, you are empowered to create an issue or submit feedback via the medium appropriate for the service. In addition to an issue on GitHub, feedback on UserVoice, or a personal email, Microsoft employees in CSE should submit feedback via CSE Feedback. In doing so, please reference the GitHub issue, UserVoice feedback, or email by including a link to the item or attaching the email. Submitting to ISE Feedback allows the ISE Feedback team to coalesce feedback across a wide range of sources, and thus create a unified case to submit to the appropriate Azure engineering team(s). How can a Customer Track the Status of a Specific Feedback Item?
At this time, customers are not able to directly track the status of feedback submitted via ISE Feedback. The ISE Feedback process is internal to Microsoft, and as such, available only to Microsoft employees. Customers may request an update from their ISE engineering partner or Microsoft account representative(s). Customers can also submit their feedback directly via GitHub or UserVoice (as appropriate for the specific service), and inform their ISE engineering partner. The ISE engineer should submit the feedback via the ISE Feedback process, and in doing so reference the previously created issue. Customers can follow the GitHub or UserVoice item to be alerted on updates. How can a Microsoft Employee Track the Status of a Specific Feedback Item? The easiest way for a Microsoft employee within ISE to track a specific feedback item is to follow the feedback (a work item) in Azure DevOps. As a Microsoft Employee Within ISE, if I Submit a Feedback and Move to Another Dev Crew Engagement, how Would my Customer get an Update on that Feedback? If the feedback is also submitted via GitHub or UserVoice, the customer may elect to follow that item for publicly available updates. The customer may also contact their Microsoft account representative to request an update. As a Microsoft Employee Within ISE, what Should I Expect/Do After Submitting Feedback via ISE Feedback? After submitting the feedback, it is recommended to follow the feedback (a work item) in Azure DevOps. If you have configured Azure DevOps notifications to send an email on work item updates, you will receive an email when the feedback is updated. If more information about the feedback is needed, a member of the ISE Feedback team will contact you to gather more information. How/When are Feedback Aggregated? Members of the ISE Feedback team will make a best effort to triage and review new ISE Feedback items within two weeks of the original submission date. If there is similarity across multiple feedback items, a member of the ISE Feedback team may decide to create a new feedback item which is an aggregate of similar items. This is done to aid in the creation of a unified feedback item to present to the appropriate Microsoft engineering team. On a monthly basis, the ISE Feedback team will review all feedback and generate a report consisting of the highest priority feedback. The report is presented to appropriate ISE and Microsoft leadership teams.","title":"Engineering Feedback Frequently Asked Questions (F.A.Q.)"},{"location":"engineering-feedback/feedback-faq/#engineering-feedback-frequently-asked-questions-faq","text":"The questions below are common questions related to the feedback process. The answers are intended to help both Microsoft employees and customers.","title":"Engineering Feedback Frequently Asked Questions (F.A.Q.)"},{"location":"engineering-feedback/feedback-faq/#when-should-i-submit-feedback-vs-creating-an-issue-on-github-uservoice-or-sending-an-email-directly-to-a-microsoft-employee","text":"It is appropriate to do both. As a customer or Microsoft employee, you are empowered to create an issue or submit feedback via the medium appropriate for service. In addition to an issue on GitHub, feedback on UserVoice, or a personal email, Microsoft employees in CSE should submit feedback via CSE Feedback. In doing so, please reference the GitHub issue, UserVoice feedback, or email by including a link to the item or attaching the email. 
Submitting to ISE Feedback allows the ISE Feedback team to coalesce feedback across a wide range of sources, and thus create a unified case to submit to the appropriate Azure engineering team(s).","title":"When Should I Submit Feedback vs. Creating an Issue on GitHub, UserVoice, or Sending an Email Directly to a Microsoft Employee?"},{"location":"engineering-feedback/feedback-faq/#how-can-a-customer-track-the-status-of-a-specific-feedback-item","text":"At this time, customers are not able to directly track the status of feedback submitted via ISE Feedback. The ISE Feedback process is internal to Microsoft, and as such, available only to Microsoft employees. Customers may request an update from their ISE engineering partner or Microsoft account representative(s). Customers can also submit their feedback directly via GitHub or UserVoice (as appropriate for the specific service), and inform their ISE engineering partner. The ISE engineer should submit the feedback via the ISE Feedback process, and in doing so reference the previously created issue. Customers can follow the GitHub or UserVoice item to be alerted on updates.","title":"How can a Customer Track the Status of a Specific Feedback Item?"},{"location":"engineering-feedback/feedback-faq/#how-can-a-microsoft-employee-track-the-status-of-a-specific-feedback-item","text":"The easiest way for a Microsoft employee within ISE to track a specific feedback item is to follow the feedback (a work item) in Azure DevOps.","title":"How can a Microsoft Employee Track the Status of a Specific Feedback Item?"},{"location":"engineering-feedback/feedback-faq/#as-a-microsoft-employee-within-ise-if-i-submit-a-feedback-and-move-to-another-dev-crew-engagement-how-would-my-customer-get-an-update-on-that-feedback","text":"If the feedback is also submitted via GitHub or UserVoice, the customer may elect to follow that item for publicly available updates. The customer may also contact their Microsoft account representative to request an update.","title":"As a Microsoft Employee Within ISE, if I Submit a Feedback and Move to Another Dev Crew Engagement, how Would my Customer get an Update on that Feedback?"},{"location":"engineering-feedback/feedback-faq/#as-a-microsoft-employee-within-ise-what-should-i-expectdo-after-submitting-feedback-via-ise-feedback","text":"After submitting the feedback, it is recommended to follow the feedback (a work item) in Azure DevOps. If you have configured Azure DevOps notifications to send an email on work item updates, you will receive an email when the feedback is updated. If more information about the feedback is needed, a member of the ISE Feedback team will contact you to gather more information.","title":"As a Microsoft Employee Within ISE, what Should I Expect/Do After Submitting Feedback via ISE Feedback?"},{"location":"engineering-feedback/feedback-faq/#howwhen-are-feedback-aggregated","text":"Members of the ISE Feedback team will make a best effort to triage and review new ISE Feedback items within two weeks of the original submission date. If there is similarity across multiple feedback items, a member of the ISE Feedback team may decide to create a new feedback item which is an aggregate of similar items. This is done to aid in the creation of a unified feedback item to present to the appropriate Microsoft engineering team. On a monthly basis, the ISE Feedback team will review all feedback and generate a report consisting of the highest priority feedback. 
The report is presented to appropriate ISE and Microsoft leadership teams.","title":"How/When are Feedback Aggregated?"},{"location":"engineering-feedback/feedback-guidance/","text":"Engineering Feedback Guidance The following guidance provides a minimum set of details that will result in actionable engineering feedback. Ensure that you provide as much detail for each of the following sections as relevant and possible. Title Provide a meaningful and descriptive title. There is no need to include the Azure service in the title as this will be included as part of the Categorization section. Good examples: Supported X versions not documented Require all-in-one Y story Summary Summarize the feedback in a short paragraph. Categorization Azure Service Which Azure service does this feedback item refer to? If there are multiple Azure services involved, pick the primary service and include the details of the others in the Notes section. Type Select one of the following to describe what type of feedback is being provided: Business Blocker (e.g. No SLA on X, Service Y not GA, Service A not in Region B) Technical Blocker (e.g. Accelerated networking not available on Service X) Documentation (e.g. Instructions for configuring scenario X missing) Feature Request (e.g. Enable simple integration to X on Service Y) Stage Select one of the following to describe the lifecycle stage of the engagement that has generated this feedback: Production Staging Testing Development Impact Describe the impact to the customer and engagement that this feedback implies. Time Frame Provide a time frame that this feedback item needs to be resolved within (if relevant). Priority Please provide the customer perspective priority of the feedback. Feedback is prioritized at one of the following four levels: P0 - Impact is critical and large : Needs to be addressed immediately; impact is critical and large in scope (i.e. major service outage; makes service or functions unusable/unavailable to a high portion of addressable space; no known workaround). P1 - Impact is high and significant : Needs to be addressed quickly; impacts a large percentage of addressable space and impedes progress. A partial workaround exists or is overly painful. P2 - Impact is moderate and varies in scope : Needs to be addressed in a reasonable time frame (i.e. issues that are impeding adoption and usage with no reasonable workarounds). For example, feedback may be related to feature-level issue to solve for friction. P3 - Impact is low : Issue can be address when able or eventually (i.e. relevant to core addressable space but issue does not impede progress or has reasonable workaround). For example, feedback may be related to feature ideas or opportunities. Reproduction Steps The reproduction steps are important since they help confirm and replay the issue, and are essential in demonstrating success once there is a resolution. Pre-requisites Provide a clear set of all conditions and pre-requisites required before following the set of reproduction steps. These could include: Platform (e.g. AKS 1.16.4 cluster with Azure CNI, Ubuntu 19.04 VM) Services (e.g. Azure Key Vault, Azure Monitor) Networking (e.g. VNET with subnet) Steps Provide a clear set of repeatable steps that will allow for this feedback to be reproduced. This can take the form of: Scripts (e.g. bash, PowerShell, terraform, arm template) Command line instructions (e.g. az, helm, terraform) Screen shots (e.g. 
azure portal screens) Notes Include items like architecture diagrams, screenshots, logs, traces etc which can help with understanding your notes and the feedback item. Also include details about the scenario customer/partner verbatim as much as possible in the main content. What Didn't Work Describe what didn't work or what feature gap you identified. What was Your Expectation or the Desired Outcome Describe what you expected to happen. What was the outcome that was expected? Describe the Steps you Took Provide a clear description of the steps taken and the outcome/description at each point.","title":"Engineering Feedback Guidance"},{"location":"engineering-feedback/feedback-guidance/#engineering-feedback-guidance","text":"The following guidance provides a minimum set of details that will result in actionable engineering feedback. Ensure that you provide as much detail for each of the following sections as relevant and possible.","title":"Engineering Feedback Guidance"},{"location":"engineering-feedback/feedback-guidance/#title","text":"Provide a meaningful and descriptive title. There is no need to include the Azure service in the title as this will be included as part of the Categorization section. Good examples: Supported X versions not documented Require all-in-one Y story","title":"Title"},{"location":"engineering-feedback/feedback-guidance/#summary","text":"Summarize the feedback in a short paragraph.","title":"Summary"},{"location":"engineering-feedback/feedback-guidance/#categorization","text":"","title":"Categorization"},{"location":"engineering-feedback/feedback-guidance/#azure-service","text":"Which Azure service does this feedback item refer to? If there are multiple Azure services involved, pick the primary service and include the details of the others in the Notes section.","title":"Azure Service"},{"location":"engineering-feedback/feedback-guidance/#type","text":"Select one of the following to describe what type of feedback is being provided: Business Blocker (e.g. No SLA on X, Service Y not GA, Service A not in Region B) Technical Blocker (e.g. Accelerated networking not available on Service X) Documentation (e.g. Instructions for configuring scenario X missing) Feature Request (e.g. Enable simple integration to X on Service Y)","title":"Type"},{"location":"engineering-feedback/feedback-guidance/#stage","text":"Select one of the following to describe the lifecycle stage of the engagement that has generated this feedback: Production Staging Testing Development","title":"Stage"},{"location":"engineering-feedback/feedback-guidance/#impact","text":"Describe the impact to the customer and engagement that this feedback implies.","title":"Impact"},{"location":"engineering-feedback/feedback-guidance/#time-frame","text":"Provide a time frame that this feedback item needs to be resolved within (if relevant).","title":"Time Frame"},{"location":"engineering-feedback/feedback-guidance/#priority","text":"Please provide the customer perspective priority of the feedback. Feedback is prioritized at one of the following four levels: P0 - Impact is critical and large : Needs to be addressed immediately; impact is critical and large in scope (i.e. major service outage; makes service or functions unusable/unavailable to a high portion of addressable space; no known workaround). P1 - Impact is high and significant : Needs to be addressed quickly; impacts a large percentage of addressable space and impedes progress. A partial workaround exists or is overly painful. 
P2 - Impact is moderate and varies in scope : Needs to be addressed in a reasonable time frame (i.e. issues that are impeding adoption and usage with no reasonable workarounds). For example, feedback may be related to feature-level issue to solve for friction. P3 - Impact is low : Issue can be address when able or eventually (i.e. relevant to core addressable space but issue does not impede progress or has reasonable workaround). For example, feedback may be related to feature ideas or opportunities.","title":"Priority"},{"location":"engineering-feedback/feedback-guidance/#reproduction-steps","text":"The reproduction steps are important since they help confirm and replay the issue, and are essential in demonstrating success once there is a resolution.","title":"Reproduction Steps"},{"location":"engineering-feedback/feedback-guidance/#pre-requisites","text":"Provide a clear set of all conditions and pre-requisites required before following the set of reproduction steps. These could include: Platform (e.g. AKS 1.16.4 cluster with Azure CNI, Ubuntu 19.04 VM) Services (e.g. Azure Key Vault, Azure Monitor) Networking (e.g. VNET with subnet)","title":"Pre-requisites"},{"location":"engineering-feedback/feedback-guidance/#steps","text":"Provide a clear set of repeatable steps that will allow for this feedback to be reproduced. This can take the form of: Scripts (e.g. bash, PowerShell, terraform, arm template) Command line instructions (e.g. az, helm, terraform) Screen shots (e.g. azure portal screens)","title":"Steps"},{"location":"engineering-feedback/feedback-guidance/#notes","text":"Include items like architecture diagrams, screenshots, logs, traces etc which can help with understanding your notes and the feedback item. Also include details about the scenario customer/partner verbatim as much as possible in the main content.","title":"Notes"},{"location":"engineering-feedback/feedback-guidance/#what-didnt-work","text":"Describe what didn't work or what feature gap you identified.","title":"What Didn't Work"},{"location":"engineering-feedback/feedback-guidance/#what-was-your-expectation-or-the-desired-outcome","text":"Describe what you expected to happen. What was the outcome that was expected?","title":"What was Your Expectation or the Desired Outcome"},{"location":"engineering-feedback/feedback-guidance/#describe-the-steps-you-took","text":"Provide a clear description of the steps taken and the outcome/description at each point.","title":"Describe the Steps you Took"},{"location":"machine-learning/","text":"Machine Learning Fundamentals at ISE This guideline documents the Machine Learning (ML) practices in ISE. ISE works with customers on developing ML models and putting them in production, with an emphasis on engineering and research best practices throughout the project's life cycle. Goals Provide a set of ML practices to follow in an ML project. Provide clarity on ML process and how it fits within a software engineering project. Provide best practices for the different stages of an ML project. How to use these Fundamentals If you are starting a new ML project, consider reading through the general guidance documents . For specific aspects of an ML project, refer to the guidelines for different project phases . ML Project Phases The diagram below shows different phases in an ideal ML project. Due to practical constraints and requirements, it might not always be possible to have a project structured in such a manner, however best practices should be followed for each individual phase. 
Envisioning : Initial problem understanding, customer goals and objectives. Feasibility Study : Assess whether the problem in question is feasible to solve satisfactorily using ML with the available data. Model Milestone : There is a basic model that is achieving the minimum required performance, both in terms of ML performance and system performance. Using the knowledge gathered to this milestone, define the scope, objectives, high-level architecture, definition of done and plan for the entire project. Model(s) experimentation : Tools and best practices for conducting successful model experimentation. Model(s) Operationalization : Model readiness for production checklist. General Guidance ML Process Guidance ML Fundamentals checklist Data Exploration Agile ML development Testing Data Science and ML Ops code Profiling Machine Learning and ML Ops code Responsible AI Program Management for ML projects Resources Model Operationalization","title":"Machine Learning Fundamentals at ISE"},{"location":"machine-learning/#machine-learning-fundamentals-at-ise","text":"This guideline documents the Machine Learning (ML) practices in ISE. ISE works with customers on developing ML models and putting them in production, with an emphasis on engineering and research best practices throughout the project's life cycle.","title":"Machine Learning Fundamentals at ISE"},{"location":"machine-learning/#goals","text":"Provide a set of ML practices to follow in an ML project. Provide clarity on ML process and how it fits within a software engineering project. Provide best practices for the different stages of an ML project.","title":"Goals"},{"location":"machine-learning/#how-to-use-these-fundamentals","text":"If you are starting a new ML project, consider reading through the general guidance documents . For specific aspects of an ML project, refer to the guidelines for different project phases .","title":"How to use these Fundamentals"},{"location":"machine-learning/#ml-project-phases","text":"The diagram below shows different phases in an ideal ML project. Due to practical constraints and requirements, it might not always be possible to have a project structured in such a manner, however best practices should be followed for each individual phase. Envisioning : Initial problem understanding, customer goals and objectives. Feasibility Study : Assess whether the problem in question is feasible to solve satisfactorily using ML with the available data. Model Milestone : There is a basic model that is achieving the minimum required performance, both in terms of ML performance and system performance. Using the knowledge gathered to this milestone, define the scope, objectives, high-level architecture, definition of done and plan for the entire project. Model(s) experimentation : Tools and best practices for conducting successful model experimentation. 
Model(s) Operationalization : Model readiness for production checklist.","title":"ML Project Phases"},{"location":"machine-learning/#general-guidance","text":"ML Process Guidance ML Fundamentals checklist Data Exploration Agile ML development Testing Data Science and ML Ops code Profiling Machine Learning and ML Ops code Responsible AI Program Management for ML projects","title":"General Guidance"},{"location":"machine-learning/#resources","text":"Model Operationalization","title":"Resources"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/","text":"Agile Development Considerations for ML Projects Overview When running ML projects, we follow the Agile methodology for software development with some adaptations, as we acknowledge that research and experimentation are sometimes difficult to plan and estimate. Goals Run and manage ML projects effectively Create effective collaboration between the ML team and the other teams working on the project To learn more about how ISE runs the Agile process for software development teams, refer to this doc . Within this framework, the team follows these Agile ceremonies: Backlog management Retrospectives Scrum of Scrums (where applicable) Sprint planning Stand-ups Working agreement Agile Process During Exploration and Experimentation While acknowledging the fact that ML user stories and research spikes are less predictable than software development ones, we strive to have a deliverable for every user story in every sprint. User stories and spikes are usually estimated using T-shirt sizes or similar, and not in actual days/hours. ML design sessions should be included in each sprint. Examples of ML Deliverables for each Sprint Working code (e.g. models, pipelines, exploratory code) Documentation of new hypotheses, and the acceptance or rejection of previous hypotheses as part of a Hypothesis Driven Analysis (HDA). For more information see Hypothesis Driven Development on Barry Oreilly's website Exploratory Data Analysis (EDA) results and learnings documented Collaboration Between Data Scientists and Software Developers Data scientists and software developers work together on the project. The team uses one backlog and attend the same Agile ceremonies. In cases where the project has many participants, we will divide into working groups, but still have the entire team join the Agile ceremonies. If possible, feasibility study and initial model experimentation takes place before the operationalization work kicks off. Everyone shares the accountability for the MLOps solution. The ML model interface (API) is determined as early as possible, to allow the developers to consider its integration into the production pipeline. MLOps artifacts are developed with a continuous collaboration and review of the data scientists, to ensure the appropriate approaches for experimentation and productization are used. 
Retrospectives and sprint planning are performed on the entire team level, and not the specific work groups level.","title":"Agile Development Considerations for ML Projects"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#agile-development-considerations-for-ml-projects","text":"","title":"Agile Development Considerations for ML Projects"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#overview","text":"When running ML projects, we follow the Agile methodology for software development with some adaptations, as we acknowledge that research and experimentation are sometimes difficult to plan and estimate.","title":"Overview"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#goals","text":"Run and manage ML projects effectively Create effective collaboration between the ML team and the other teams working on the project To learn more about how ISE runs the Agile process for software development teams, refer to this doc . Within this framework, the team follows these Agile ceremonies: Backlog management Retrospectives Scrum of Scrums (where applicable) Sprint planning Stand-ups Working agreement","title":"Goals"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#agile-process-during-exploration-and-experimentation","text":"While acknowledging the fact that ML user stories and research spikes are less predictable than software development ones, we strive to have a deliverable for every user story in every sprint. User stories and spikes are usually estimated using T-shirt sizes or similar, and not in actual days/hours. ML design sessions should be included in each sprint.","title":"Agile Process During Exploration and Experimentation"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#examples-of-ml-deliverables-for-each-sprint","text":"Working code (e.g. models, pipelines, exploratory code) Documentation of new hypotheses, and the acceptance or rejection of previous hypotheses as part of a Hypothesis Driven Analysis (HDA). For more information see Hypothesis Driven Development on Barry Oreilly's website Exploratory Data Analysis (EDA) results and learnings documented","title":"Examples of ML Deliverables for each Sprint"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#collaboration-between-data-scientists-and-software-developers","text":"Data scientists and software developers work together on the project. The team uses one backlog and attend the same Agile ceremonies. In cases where the project has many participants, we will divide into working groups, but still have the entire team join the Agile ceremonies. If possible, feasibility study and initial model experimentation takes place before the operationalization work kicks off. Everyone shares the accountability for the MLOps solution. The ML model interface (API) is determined as early as possible, to allow the developers to consider its integration into the production pipeline. MLOps artifacts are developed with a continuous collaboration and review of the data scientists, to ensure the appropriate approaches for experimentation and productization are used. 
Retrospectives and sprint planning are performed on the entire team level, and not the specific work groups level.","title":"Collaboration Between Data Scientists and Software Developers"},{"location":"machine-learning/data-exploration/","text":"Data Exploration After envisioning , and typically as part of the ML feasibility study , the next step is to confirm resource access and then dive deep into the available data through data exploration workshops. Purpose of the Data Exploration Workshop The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop Accessing Resources Prior to diving into data exploration workshops, it is important to confirm that you have access to the necessary resources (including data). Below is an example list of questions to consider before starting a data exploration workshop. What are the requirements for an account to be set up in order for the team to access data and compute resources? Are there security requirements around accessing resources (Subscriptions, Azure Resources, project management, etc.) such as VPN, 2FA, jump boxes, etc.? Data access: * Is it on-prem or on Azure already? * If it is on-prem, can we move the needed data to Azure under the appropriate subscription? Who has permission to move the data? * Is the data access approved from a legal/compliance perspective? Computation: * Is a VPN needed for the project team to access these computation nodes (Virtual Machines, Databricks clusters, etc) from their work PCs/Macs? * Any restrictions on accessing the source data system from these computation nodes? * If we want to create some compute resources, who has permissions to do so? Source code repository: * Do you have any preference on source code repository location? Backlog management and work planning: * Do you have any preference on backlog management and work planning, such as Azure DevOps, Jira or anything else? * If an existing system, are special accounts / system setups required to access? Programming Language: * Is Python/PySpark a preferred language? * Is there any internal approval processes for the Python/PySpark libraries we want to use for this engagement? Data Exploration Workshop Key objectives of the exploration workshops include the following: Understand and document the features, location, and availability of the data. What order of magnitude is the current data (e.g., GB, TB)? Is this all relevant? How does the organization decide when to collect additional data or purchase external data? Are there any examples of this? Understand the quality of the data. Is there already a data validation strategy in place? What data has been used so far to analyze recent data-driven projects? What has been found to be most useful? What was not useful? How was this judged? What additional internal data may provide insights useful for data-driven decision-making for proposed projects? What external data could be useful? What are the possible constraints or challenges in accessing or incorporating this data? How was the data collected? Are there any obvious biases due to how the data was collected? 
What changes to data collection, coding, integration, etc has occurred in the last 2 years that may impact the interpretation or availability of the collected data","title":"Data Exploration"},{"location":"machine-learning/data-exploration/#data-exploration","text":"After envisioning , and typically as part of the ML feasibility study , the next step is to confirm resource access and then dive deep into the available data through data exploration workshops.","title":"Data Exploration"},{"location":"machine-learning/data-exploration/#purpose-of-the-data-exploration-workshop","text":"The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop","title":"Purpose of the Data Exploration Workshop"},{"location":"machine-learning/data-exploration/#accessing-resources","text":"Prior to diving into data exploration workshops, it is important to confirm that you have access to the necessary resources (including data). Below is an example list of questions to consider before starting a data exploration workshop. What are the requirements for an account to be set up in order for the team to access data and compute resources? Are there security requirements around accessing resources (Subscriptions, Azure Resources, project management, etc.) such as VPN, 2FA, jump boxes, etc.? Data access: * Is it on-prem or on Azure already? * If it is on-prem, can we move the needed data to Azure under the appropriate subscription? Who has permission to move the data? * Is the data access approved from a legal/compliance perspective? Computation: * Is a VPN needed for the project team to access these computation nodes (Virtual Machines, Databricks clusters, etc) from their work PCs/Macs? * Any restrictions on accessing the source data system from these computation nodes? * If we want to create some compute resources, who has permissions to do so? Source code repository: * Do you have any preference on source code repository location? Backlog management and work planning: * Do you have any preference on backlog management and work planning, such as Azure DevOps, Jira or anything else? * If an existing system, are special accounts / system setups required to access? Programming Language: * Is Python/PySpark a preferred language? * Is there any internal approval processes for the Python/PySpark libraries we want to use for this engagement?","title":"Accessing Resources"},{"location":"machine-learning/data-exploration/#data-exploration-workshop","text":"Key objectives of the exploration workshops include the following: Understand and document the features, location, and availability of the data. What order of magnitude is the current data (e.g., GB, TB)? Is this all relevant? How does the organization decide when to collect additional data or purchase external data? Are there any examples of this? Understand the quality of the data. Is there already a data validation strategy in place? What data has been used so far to analyze recent data-driven projects? What has been found to be most useful? What was not useful? How was this judged? 
What additional internal data may provide insights useful for data-driven decision-making for proposed projects? What external data could be useful? What are the possible constraints or challenges in accessing or incorporating this data? How was the data collected? Are there any obvious biases due to how the data was collected? What changes to data collection, coding, integration, etc. have occurred in the last 2 years that may impact the interpretation or availability of the collected data?","title":"Data Exploration Workshop"},{"location":"machine-learning/envisioning-and-problem-formulation/","text":"Envisioning and Problem Formulation Before beginning a data science investigation, we need to define a problem statement which the data science team can explore; this problem statement can have a significant influence on whether the project is likely to be successful. Envisioning Goals The main goals of the envisioning process are: Establish a clear understanding of the problem domain and the underlying business objective Define how a potential solution would be used and how its performance should be measured Determine what data is available to solve the problem Understand the capabilities and working practices of the data science team Ensure all parties have the same understanding of the scope and next steps (e.g., onboarding, data exploration workshop) The envisioning process usually entails a series of 'envisioning' sessions where the data science team works alongside subject-matter experts to formulate the problem in such a way that there is a shared understanding of the problem domain, a clear goal, and a predefined approach to evaluating a potential solution. Understanding the Problem Domain Generally, before defining a project scope for a data science investigation, we must first understand the problem domain: What is the problem? Why does the problem need to be solved? Does this problem require a machine learning solution? How would a potential solution be used? However, establishing this understanding can prove difficult, especially for those unfamiliar with the problem domain. To ease this process, we can approach problems in a structured way by taking the following steps: Identify a measurable problem and define this in business terms. The objective should be clear, and we should have a good understanding of the factors that we can control - that can be used as inputs - and how they affect the objective. Be as specific as possible. Decide how the performance of a solution should be measured and identify whether this is possible within the restrictions of this problem. Make sure it aligns with the business objective and that you have identified the data required to evaluate the solution. Note that the data required to evaluate a solution may differ from the data needed to create a solution. Thinking about the solution as a black box, detail the function that a solution to this problem should perform to fulfil the objective and verify that the relevant data is available to solve the problem. One way of approaching this is by thinking about how a subject-matter expert could solve the problem manually, and the data that would be required; if a human subject-matter expert is unable to solve the problem given the available data, this is indicative that additional information is required and/or more data needs to be collected.
Based on the available data, define specific hypothesis statements - which can be proved or disproved - to guide the exploration of the data science team. Where possible, each hypothesis statement should have a clearly defined success criteria (e.g., with an accuracy of over 60% ), however, this is not always possible - especially for projects where no solution to the problem currently exists. In these cases, the measure of success could be based on a subject-matter expert verifying that the results meet their expectations. Document all the above information, to ensure alignment between stakeholders and establish a clear understanding of the problem to be solved. Try to ensure that as much relevant domain knowledge is captured as possible, and that the features present in available data - and the way that the data was collected - are clearly explained, such that they can be understood by a non-subject matter expert. Once an understanding of the problem domain has been established, it may be necessary to break down the overall problem into smaller, meaningful chunks of work to maintain team focus and ensure a realistic project scope within the given time frame. Listening to the End User These problems are complex and require understanding from a variety of perspectives. It is not uncommon for the stakeholders to not be the end user of the solution framework. In these cases, listening to the actual end users is critical to the success of the project. The following questions can help guide discussion in understanding the stakeholders' perspectives: Who is the end user? What is the current practice related to the business problem? What's the performance of the current solution? What are their pain points? What is their toughest problem? What is the state of the data used to build the solution? How does the end user or SME envision the solution? Envisioning Guidance During envisioning sessions, the following may prove useful for guiding the discussion. Many of these points are taken directly, or adapted from, [1] and [2] . Problem Framing Define the objective in business terms. How will the solution be used? What are the current solutions/workarounds (if any)? What work has been done in this area so far? Does this solution need to fit into an existing system? How should performance be measured? Is the performance measure aligned with the business objective? What would be the minimum performance needed to reach the business objective? Are there any known constraints around non-functional requirements that would have to be taken into account? (e.g., computation times) Frame this problem (supervised/unsupervised, online/offline, etc.) Is human expertise available? How would you solve the problem manually? Are there any restrictions on the type of approaches which can be used? (e.g., does the solution need to be completely explainable?) List the assumptions you or others have made so far. Verify these assumptions if possible. Define some initial hypothesis statements to be explored. Highlight and discuss any responsible AI concerns if appropriate. Workflow What data science skills exist in the organization? How many data scientists/engineers would be available to work on this project? In what capacity would these resources be available (full-time, part-time, etc.)? What does the team's current workflow practices look like? Do they work on the cloud/on-prem? In notebooks/IDE? Is version control used? How are data, experiments and models currently tracked? Does the team employ an Agile methodology? 
How is work tracked? Are there any ML solutions currently running in production? Who is responsible for maintaining these solutions? Who would be responsible for maintaining a solution produced during this project? Are there any restrictions on tooling that must/cannot be used? Example: A Recommendation Engine Problem To illustrate how the above process can be applied to a tangible problem domain, as an example, consider that we are looking at implementing a recommendation engine for a clothing retailer. This example was, in part, inspired by [3] . Often, the objective may be simply presented, in a form such as \"to improve sales\". However, whilst this is ultimately the main goal, we would benefit from being more specific here. Suppose that we were to deploy a solution in November and then observed a December sales surge; how would we be able to distinguish how much of this was as a result of the new recommendation engine, as opposed to the fact that December is a peak buying season? A better objective, in this case, would be \"to drive additional sales by presenting the customer with items that they would not otherwise have purchased without the recommendation \". Here, the inputs that we can control are the choice of items that are presented to each customer, and the order in which they are displayed; considering factors such as how frequently these should change, seasonality, etc. The data required to evaluate a potential solution in this case would be which recommendations resulted in new sales, and an estimation of a customer's likeliness to purchase a specific item without a recommendation. Note that, whilst this data could also be used to build a recommendation engine, it is unlikely that this data will be available before a recommendation system has been implemented, so it is likely that we will have to use an alternate data source to build the model. We can get an initial idea of how to approach a solution to this problem by considering how it would be solved by a subject-matter expert. Thinking of how a personal stylist may provide a recommendation, they are likely to recommend items based on one or more of the following: generally popular items items similar to those liked/purchased by the customer items that were liked/purchased by similar customers items which are complementary to those owned by the customer Whilst this list is by no means exhaustive, it provides a good indication of the data that is likely to be useful to us: item sales data customer purchase histories customer demographics item descriptions and tags previous outfits, or sets, which have been curated by the stylist We would then be able to use this data to explore: a method of measuring similarity between items a method of measuring similarity between customers a method of measuring how complementary items are relative to one another which can be used to create and rank recommendations. Depending on the project scope, and available data, one or more of these areas could be selected to create hypotheses to be explored by the data science team. Some examples of such hypothesis statements could be: From the descriptions of each item, we can determine a measure of similarity between different items to a degree of accuracy which is specified by a stylist. Based on the behavior of customers with similar purchasing histories, we are able to predict certain items that a customer is likely to purchase; with a certainty which is greater than random choice. 
Using sets of items which have previously been sold together, we can formulate rules around the features which determine whether items are complementary or not which can be verified by a stylist. Next Steps To ensure clarity and alignment, it is useful to summarize the envisioning stage findings focusing on proposed detailed scenarios, assumptions and agreed decisions as well as next steps. We suggest confirming that you have access to all necessary resources (including data) as a next step before proceeding with data exploration workshops. Below are the links to the exit document template and to some questions which may be helpful in confirming resource access. Summary of Scope Exit Document Template List of Resource Access Questions List of Data Exploration Workshop Questions Resources Many of the ideas presented here - and much more - were inspired by, and can be found in the following resources; all of which are highly recommended. Aur\u00e9lien G\u00e9ron's Machine learning project checklist Fast.ai's Data project checklist Designing great data products. Jeremy Howard, Margit Zwemer and Mike Loukides","title":"Envisioning and Problem Formulation"},{"location":"machine-learning/envisioning-and-problem-formulation/#envisioning-and-problem-formulation","text":"Before beginning a data science investigation, we need to define a problem statement which the data science team can explore; this problem statement can have a significant influence on whether the project is likely to be successful.","title":"Envisioning and Problem Formulation"},{"location":"machine-learning/envisioning-and-problem-formulation/#envisioning-goals","text":"The main goals of the envisioning process are: Establish a clear understanding of the problem domain and the underlying business objective Define how a potential solution would be used and how its performance should be measured Determine what data is available to solve the problem Understand the capabilities and working practices of the data science team Ensure all parties have the same understanding of the scope and next steps (e.g., onboarding, data exploration workshop) The envisioning process usually entails a series of 'envisioning' sessions where the data science team works alongside subject-matter experts to formulate the problem in such a way that there is a shared understanding of the problem domain, a clear goal, and a predefined approach to evaluating a potential solution.","title":"Envisioning Goals"},{"location":"machine-learning/envisioning-and-problem-formulation/#understanding-the-problem-domain","text":"Generally, before defining a project scope for a data science investigation, we must first understand the problem domain: What is the problem? Why does the problem need to be solved? Does this problem require a machine learning solution? How would a potential solution be used? However, establishing this understanding can prove difficult, especially for those unfamiliar with the problem domain. To ease this process, we can approach problems in a structured way by taking the following steps: Identify a measurable problem and define this in business terms. The objective should be clear, and we should have a good understanding of the factors that we can control - that can be used as inputs - and how they affect the objective. Be as specific as possible. Decide how the performance of a solution should be measured and identify whether this is possible within the restrictions of this problem.
Make sure it aligns with the business objective and that you have identified the data required to evaluate the solution. Note that the data required to evaluate a solution may differ from the data needed to create a solution. Thinking about the solution as a black box, detail the function that a solution to this problem should perform to fulfil the objective and verify that the relevant data is available to solve the problem. One way of approaching this is by thinking about how a subject-matter expert could solve the problem manually, and the data that would be required; if a human subject-matter expert is unable to solve the problem given the available data, this is indicative that additional information is required and/or more data needs to be collected. Based on the available data, define specific hypothesis statements - which can be proved or disproved - to guide the exploration of the data science team. Where possible, each hypothesis statement should have a clearly defined success criteria (e.g., with an accuracy of over 60% ), however, this is not always possible - especially for projects where no solution to the problem currently exists. In these cases, the measure of success could be based on a subject-matter expert verifying that the results meet their expectations. Document all the above information, to ensure alignment between stakeholders and establish a clear understanding of the problem to be solved. Try to ensure that as much relevant domain knowledge is captured as possible, and that the features present in available data - and the way that the data was collected - are clearly explained, such that they can be understood by a non-subject matter expert. Once an understanding of the problem domain has been established, it may be necessary to break down the overall problem into smaller, meaningful chunks of work to maintain team focus and ensure a realistic project scope within the given time frame.","title":"Understanding the Problem Domain"},{"location":"machine-learning/envisioning-and-problem-formulation/#listening-to-the-end-user","text":"These problems are complex and require understanding from a variety of perspectives. It is not uncommon for the stakeholders to not be the end user of the solution framework. In these cases, listening to the actual end users is critical to the success of the project. The following questions can help guide discussion in understanding the stakeholders' perspectives: Who is the end user? What is the current practice related to the business problem? What's the performance of the current solution? What are their pain points? What is their toughest problem? What is the state of the data used to build the solution? How does the end user or SME envision the solution?","title":"Listening to the End User"},{"location":"machine-learning/envisioning-and-problem-formulation/#envisioning-guidance","text":"During envisioning sessions, the following may prove useful for guiding the discussion. Many of these points are taken directly, or adapted from, [1] and [2] .","title":"Envisioning Guidance"},{"location":"machine-learning/envisioning-and-problem-formulation/#problem-framing","text":"Define the objective in business terms. How will the solution be used? What are the current solutions/workarounds (if any)? What work has been done in this area so far? Does this solution need to fit into an existing system? How should performance be measured? Is the performance measure aligned with the business objective? 
What would be the minimum performance needed to reach the business objective? Are there any known constraints around non-functional requirements that would have to be taken into account? (e.g., computation times) Frame this problem (supervised/unsupervised, online/offline, etc.) Is human expertise available? How would you solve the problem manually? Are there any restrictions on the type of approaches which can be used? (e.g., does the solution need to be completely explainable?) List the assumptions you or others have made so far. Verify these assumptions if possible. Define some initial hypothesis statements to be explored. Highlight and discuss any responsible AI concerns if appropriate.","title":"Problem Framing"},{"location":"machine-learning/envisioning-and-problem-formulation/#workflow","text":"What data science skills exist in the organization? How many data scientists/engineers would be available to work on this project? In what capacity would these resources be available (full-time, part-time, etc.)? What does the team's current workflow practices look like? Do they work on the cloud/on-prem? In notebooks/IDE? Is version control used? How are data, experiments and models currently tracked? Does the team employ an Agile methodology? How is work tracked? Are there any ML solutions currently running in production? Who is responsible for maintaining these solutions? Who would be responsible for maintaining a solution produced during this project? Are there any restrictions on tooling that must/cannot be used?","title":"Workflow"},{"location":"machine-learning/envisioning-and-problem-formulation/#example-a-recommendation-engine-problem","text":"To illustrate how the above process can be applied to a tangible problem domain, as an example, consider that we are looking at implementing a recommendation engine for a clothing retailer. This example was, in part, inspired by [3] . Often, the objective may be simply presented, in a form such as \"to improve sales\". However, whilst this is ultimately the main goal, we would benefit from being more specific here. Suppose that we were to deploy a solution in November and then observed a December sales surge; how would we be able to distinguish how much of this was as a result of the new recommendation engine, as opposed to the fact that December is a peak buying season? A better objective, in this case, would be \"to drive additional sales by presenting the customer with items that they would not otherwise have purchased without the recommendation \". Here, the inputs that we can control are the choice of items that are presented to each customer, and the order in which they are displayed; considering factors such as how frequently these should change, seasonality, etc. The data required to evaluate a potential solution in this case would be which recommendations resulted in new sales, and an estimation of a customer's likeliness to purchase a specific item without a recommendation. Note that, whilst this data could also be used to build a recommendation engine, it is unlikely that this data will be available before a recommendation system has been implemented, so it is likely that we will have to use an alternate data source to build the model. We can get an initial idea of how to approach a solution to this problem by considering how it would be solved by a subject-matter expert. 
Thinking of how a personal stylist may provide a recommendation, they are likely to recommend items based on one or more of the following: generally popular items items similar to those liked/purchased by the customer items that were liked/purchased by similar customers items which are complementary to those owned by the customer Whilst this list is by no means exhaustive, it provides a good indication of the data that is likely to be useful to us: item sales data customer purchase histories customer demographics item descriptions and tags previous outfits, or sets, which have been curated by the stylist We would then be able to use this data to explore: a method of measuring similarity between items a method of measuring similarity between customers a method of measuring how complementary items are relative to one another which can be used to create and rank recommendations. Depending on the project scope, and available data, one or more of these areas could be selected to create hypotheses to be explored by the data science team. Some examples of such hypothesis statements could be: From the descriptions of each item, we can determine a measure of similarity between different items to a degree of accuracy which is specified by a stylist. Based on the behavior of customers with similar purchasing histories, we are able to predict certain items that a customer is likely to purchase; with a certainty which is greater than random choice. Using sets of items which have previously been sold together, we can formulate rules around the features which determine whether items are complementary or not which can be verified by a stylist.","title":"Example: A Recommendation Engine Problem"},{"location":"machine-learning/envisioning-and-problem-formulation/#next-steps","text":"To ensure clarity and alignment, it is useful to summarize the envisioning stage findings focusing on proposed detailed scenarios, assumptions and agreed decisions as well next steps. We suggest confirming that you have access to all necessary resources (including data) as a next step before proceeding with data exploration workshops. Below are the links to the exit document template and to some questions which may be helpful in confirming resource access. Summary of Scope Exit Document Template List of Resource Access Questions List of Data Exploration Workshop Questions","title":"Next Steps"},{"location":"machine-learning/envisioning-and-problem-formulation/#resources","text":"Many of the ideas presented here - and much more - were inspired by, and can be found in the following resources; all of which are highly recommended. Aur\u00e9lien G\u00e9ron's Machine learning project checklist Fast.ai's Data project checklist Designing great data products. Jeremy Howard, Margit Zwemer and Mike Loukides","title":"Resources"},{"location":"machine-learning/envisioning-summary-template/","text":"Generic Envisioning Summary Purpose of this Template This is an example of an envisioning summary completed after envisioning sessions have concluded. It summarizes the materials reviewed, application scenarios discussed and decided, and the next steps in the process. Summary of Envisioning Introduction This document is to summarize what we have discussed in these envisioning sessions, and what we have decided to work on in this machine learning (ML) engagement. With this document, we hope that everyone can be on the same page regarding the scope of this ML engagement, and will ensure a successful start for the project. 
Materials Shared with the Team List materials shared with you here. The list below contains some examples. You will want to be more specific. Business vision statement Sample Data Current problem statement Also discuss: How the current solution is built and implemented Details about the current state of the systems and processes. Applications Scenarios that Can Help [People] Achieve [Task] The following application scenarios were discussed: Scenario 1: Scenario 2: Add more scenarios as needed For each scenario, provide an appropriately descriptive name and then follow up with more details. For each scenario, discuss: What problem statement was discussed How we propose to solve the problem (there may be several proposals) Who would use the solution What would it look like to use our solution? An example of how it would bring value to the end user. Selected Scenario for this ML Engagement Which scenario was selected? Why was this scenario prioritised over the others? Will other scenarios be considered in the future? When will we revisit them / what conditions need to be met to pursue them? More Details of the Scope for Selected Scenario What is in scope? What data is available? Which performance metric to use? Bar of performance metrics What are deliverables? What\u2019s Next? Legal Documents to be Signed State documents and timeline Responsible AI Review Plan when to conduct a responsible AI process. What are the prerequisites to start this process? Data Exploration Workshop A data exploration workshop is planned for DATE RANGE . This data exploration workshops will be X - Y days, not including the time to gain access resources. The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop ML Feasibility Study til [date] Objectives State what we expect to be the objective in the feasibility study Timeline Give a possible timeline for the feasibility study Personnel Needed What sorts of people/roles are needed for the feasibility study? What\u2019s After ML Feasibility Study Detail here Summary of Timeline Below is a high-level summary of the upcoming timeline: Discuss dates for the data exploration workshop, and feasibility study along with any to-do items such as starting responsible AI process, identifying engineering resources. We suggest using a concise bulleted list or a table to easily convey the information.","title":"Generic Envisioning Summary"},{"location":"machine-learning/envisioning-summary-template/#generic-envisioning-summary","text":"","title":"Generic Envisioning Summary"},{"location":"machine-learning/envisioning-summary-template/#purpose-of-this-template","text":"This is an example of an envisioning summary completed after envisioning sessions have concluded. 
It summarizes the materials reviewed, application scenarios discussed and decided, and the next steps in the process.","title":"Purpose of this Template"},{"location":"machine-learning/envisioning-summary-template/#summary-of-envisioning","text":"","title":"Summary of Envisioning"},{"location":"machine-learning/envisioning-summary-template/#introduction","text":"This document is to summarize what we have discussed in these envisioning sessions, and what we have decided to work on in this machine learning (ML) engagement. With this document, we hope that everyone can be on the same page regarding the scope of this ML engagement, and will ensure a successful start for the project.","title":"Introduction"},{"location":"machine-learning/envisioning-summary-template/#materials-shared-with-the-team","text":"List materials shared with you here. The list below contains some examples. You will want to be more specific. Business vision statement Sample Data Current problem statement Also discuss: How the current solution is built and implemented Details about the current state of the systems and processes.","title":"Materials Shared with the Team"},{"location":"machine-learning/envisioning-summary-template/#applications-scenarios-that-can-help-people-achieve-task","text":"The following application scenarios were discussed: Scenario 1: Scenario 2: Add more scenarios as needed For each scenario, provide an appropriately descriptive name and then follow up with more details. For each scenario, discuss: What problem statement was discussed How we propose to solve the problem (there may be several proposals) Who would use the solution What would it look like to use our solution? An example of how it would bring value to the end user.","title":"Applications Scenarios that Can Help [People] Achieve [Task]"},{"location":"machine-learning/envisioning-summary-template/#selected-scenario-for-this-ml-engagement","text":"Which scenario was selected? Why was this scenario prioritised over the others? Will other scenarios be considered in the future? When will we revisit them / what conditions need to be met to pursue them?","title":"Selected Scenario for this ML Engagement"},{"location":"machine-learning/envisioning-summary-template/#more-details-of-the-scope-for-selected-scenario","text":"What is in scope? What data is available? Which performance metric to use? Bar of performance metrics What are deliverables?","title":"More Details of the Scope for Selected Scenario"},{"location":"machine-learning/envisioning-summary-template/#whats-next","text":"","title":"What\u2019s Next?"},{"location":"machine-learning/envisioning-summary-template/#legal-documents-to-be-signed","text":"State documents and timeline","title":"Legal Documents to be Signed"},{"location":"machine-learning/envisioning-summary-template/#responsible-ai-review","text":"Plan when to conduct a responsible AI process. What are the prerequisites to start this process?","title":"Responsible AI Review"},{"location":"machine-learning/envisioning-summary-template/#data-exploration-workshop","text":"A data exploration workshop is planned for DATE RANGE . This data exploration workshops will be X - Y days, not including the time to gain access resources. 
The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop","title":"Data Exploration Workshop"},{"location":"machine-learning/envisioning-summary-template/#ml-feasibility-study-til-date","text":"","title":"ML Feasibility Study til [date]"},{"location":"machine-learning/envisioning-summary-template/#objectives","text":"State what we expect to be the objective in the feasibility study","title":"Objectives"},{"location":"machine-learning/envisioning-summary-template/#timeline","text":"Give a possible timeline for the feasibility study","title":"Timeline"},{"location":"machine-learning/envisioning-summary-template/#personnel-needed","text":"What sorts of people/roles are needed for the feasibility study?","title":"Personnel Needed"},{"location":"machine-learning/envisioning-summary-template/#whats-after-ml-feasibility-study","text":"Detail here","title":"What\u2019s After ML Feasibility Study"},{"location":"machine-learning/envisioning-summary-template/#summary-of-timeline","text":"Below is a high-level summary of the upcoming timeline: Discuss dates for the data exploration workshop, and feasibility study along with any to-do items such as starting responsible AI process, identifying engineering resources. We suggest using a concise bulleted list or a table to easily convey the information.","title":"Summary of Timeline"},{"location":"machine-learning/feasibility-studies/","text":"Feasibility Studies The main goal of feasibility studies is to assess whether it is feasible to solve the problem satisfactorily using ML with the available data. We want to avoid investing too much in the solution before we have: Sufficient evidence that a solution would be the best technical solution given the business case Sufficient evidence that a solution is compatible with the problem context Sufficient evidence that a solution is possible Some vetted direction on what a solution should look like This effort ensures quality solutions backed by the appropriate, thorough amount of consideration and evidence. When are Feasibility Studies Useful? Every engagement can benefit from a feasibility study early in the project. Architectural discussions can still occur in parallel as the team works towards gaining a solid understanding and definition of what will be built. Feasibility studies can last between 4-16 weeks, depending on specific problem details, volume of data, state of the data etc. Starting with a 4-week milestone might be useful, during which it can be determined how much more time, if any, is required for completion. Who Collaborates on Feasibility Studies? Collaboration from individuals with diverse skill sets is desired at this stage, including data scientists, data engineers, software engineers, PMs, human experience researchers, and domain experts. It embraces the use of engineering fundamentals, with some flexibility. For example, not all experimentation requires full test coverage and code review. Experimentation is typically not part of a CI/CD pipeline. 
Artifacts may live in the main branch as a folder excluded from the CI/CD pipeline, or as a separate experimental branch, depending on customer/team preferences. What do Feasibility Studies Entail? Problem Definition and Desired Outcome Ensure that the problem is complex enough that coding rules or manual scaling is unrealistic Clear definition of the problem from business and technical perspectives Deep Contextual Understanding Confirm that the following questions can be answered based on what was learned during the Discovery Phase of the project. For items that can not be satisfactorily answered, undertake additional investigation to answer. Understanding the people who are using and/or affected by the solution Understanding the contextual forces at play around the problem, including goals, culture, and historical context To accomplish this a researcher will: Collaborate with customers and colleagues to explore the landscape of people who relate to and may be affected by the problem space being explored (Users, stakeholders, subject matter experts, etc) Formulate the research question(s) to be addressed Select and design research to best serve the research question(s) Identify and select representative research participants across the problem space with whom to conduct the research Construct a research plan and necessary preparation documents for the selected research method(s) Conduct research activity with the participants via the selected method(s) Synthesize, analyze, and interpret research findings Where relevant, build frameworks, artifacts and processes that help explore the findings and implications of the research across the team Share what was uncovered and understood, and the implications thereof across the engagement team and relevant stakeholders. If the above research was conducted during the Discovery phase, it should be reviewed, and any substantial knowledge gaps should be identified and filled by following the above process. Data Access Verify that the full team has access to the data Set up a dedicated and/or restricted environment if required Perform any required de-identification or redaction of sensitive information Understand data access requirements (retention, role-based access, etc.) Data Discovery Hold a data exploration workshop and deep dive with domain experts Understand data availability and confirm the team's access Understand the data dictionary, if available Understand the quality of the data. Is there already a data validation strategy in place? Ensure required data is present in reasonable volumes For supervised problems (most common), assess the availability of labels or data that can be used to effectively approximate labels If applicable, ensure all data can be joined as required and understand how Ideally obtain or create an entity relationship diagram (ERD) Potentially uncover new useful data sources Architecture Discovery Clear picture of existing architecture Infrastructure spikes Concept Ideation and Iteration Develop value proposition(s) for users and stakeholders based on the contextual understanding developed through the discovery process (e.g. 
key elements of value, benefits) As relevant, make use of Co-creation with team Co-creation with users and stakeholders As relevant, create vignettes, narratives or other materials to communicate the concept Identify the next set of hypotheses or unknowns to be tested (see concept testing) Revisit and iterate on the concept throughout discovery as understanding of the problem space evolves Exploratory Data Analysis (EDA) Data deep dive Understand feature and label value distributions Understand correlations among features and between features and labels Understand data specific problem constraints like missing values, categorical cardinality, potential for data leakage etc. Identify any gaps in data that couldn't be identified in the data discovery phase Pave the way of further understanding of what techniques are applicable Establish a mutual understanding of what data is in or out of scope for feasibility, ensuring that the data in scope is significant for the business Data Pre-Processing Happens during EDA and hypothesis testing Feature engineering Sampling Scaling and/or discretization Noise handling Hypothesis Testing Design several potential solutions using theoretically applicable algorithms and techniques, starting with the simplest reasonable baseline Train model(s) Evaluate performance and determine if satisfactory Tweak experimental solution designs based on outcomes Iterate Thoroughly document each step and outcome, plus any resulting hypotheses for easy following of the decision-making process Concept Testing Where relevant, to test the value proposition, concepts or aspects of the experience Plan user, stakeholder and expert research Develop and design necessary research materials Synthesize and evaluate feedback to incorporate into concept development Continue to iterate and test different elements of the concept as necessary, including testing to best serve RAI goals and guidelines Ensure that the proposed solution and framing are compatible with and acceptable to affected people Ensure that the proposed solution and framing is compatible with existing business goals and context Risk Assessment Identification and assessment of risks and constraints Responsible AI Consideration of responsible AI principles Understanding of users and stakeholders\u2019 contexts, needs and concerns to inform development of RAI Testing AI concept and experience elements with users and stakeholders Discussion and feedback from diverse perspectives around any responsible AI concerns Output of a Feasibility Study The main outcome is a feasibility study report, with a recommendation on next steps: If there is not enough evidence to support the hypothesis that this problem can be solved using ML, as aligned with the pre-determined performance measures and business impact: We detail the gaps and challenges that prevented us from reaching a positive outcome We may scope down the project, if applicable We may look at re-scoping the problem taking into account the findings of the feasibility study We assess the possibility to collect more data or improve data quality If there is enough evidence to support the hypothesis that this problem can be solved using ML Provide recommendations and technical assets for moving to the operationalization phase","title":"Feasibility Studies"},{"location":"machine-learning/feasibility-studies/#feasibility-studies","text":"The main goal of feasibility studies is to assess whether it is feasible to solve the problem satisfactorily using ML with the available data. 
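The exploratory data analysis activities described above (value and label distributions, missing values, categorical cardinality, and feature/label correlations) can often be started with a few lines of pandas. The sketch below is a minimal illustration on a small synthetic DataFrame; the column names and data are assumptions made for the example, not part of any engagement.

```python
# Minimal EDA sketch: value distributions, missing values, categorical
# cardinality, and feature/label correlations on a synthetic DataFrame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "age": rng.integers(18, 80, size=500).astype(float),      # hypothetical numeric feature
        "spend": rng.gamma(shape=2.0, scale=50.0, size=500),      # hypothetical numeric feature
        "segment": rng.choice(["retail", "online", "wholesale"], size=500),  # hypothetical categorical feature
        "label": rng.integers(0, 2, size=500),                    # hypothetical binary label
    }
)
df.loc[df.sample(frac=0.05, random_state=0).index, "spend"] = np.nan  # inject some missing values

# Feature and label value distributions
print(df.describe(include="all"))
print(df["label"].value_counts(normalize=True))

# Missing values and categorical cardinality
print(df.isna().mean().sort_values(ascending=False))
print(df.select_dtypes(include="object").nunique())

# Correlations among numeric features and with the (numeric) label
print(df.corr(numeric_only=True)["label"].sort_values(ascending=False))
```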
We want to avoid investing too much in the solution before we have: Sufficient evidence that a solution would be the best technical solution given the business case Sufficient evidence that a solution is compatible with the problem context Sufficient evidence that a solution is possible Some vetted direction on what a solution should look like This effort ensures quality solutions backed by the appropriate, thorough amount of consideration and evidence.","title":"Feasibility Studies"},{"location":"machine-learning/feasibility-studies/#when-are-feasibility-studies-useful","text":"Every engagement can benefit from a feasibility study early in the project. Architectural discussions can still occur in parallel as the team works towards gaining a solid understanding and definition of what will be built. Feasibility studies can last between 4-16 weeks, depending on specific problem details, volume of data, state of the data etc. Starting with a 4-week milestone might be useful, during which it can be determined how much more time, if any, is required for completion.","title":"When are Feasibility Studies Useful?"},{"location":"machine-learning/feasibility-studies/#who-collaborates-on-feasibility-studies","text":"Collaboration from individuals with diverse skill sets is desired at this stage, including data scientists, data engineers, software engineers, PMs, human experience researchers, and domain experts. It embraces the use of engineering fundamentals, with some flexibility. For example, not all experimentation requires full test coverage and code review. Experimentation is typically not part of a CI/CD pipeline. Artifacts may live in the main branch as a folder excluded from the CI/CD pipeline, or as a separate experimental branch, depending on customer/team preferences.","title":"Who Collaborates on Feasibility Studies?"},{"location":"machine-learning/feasibility-studies/#what-do-feasibility-studies-entail","text":"","title":"What do Feasibility Studies Entail?"},{"location":"machine-learning/feasibility-studies/#problem-definition-and-desired-outcome","text":"Ensure that the problem is complex enough that coding rules or manual scaling is unrealistic Clear definition of the problem from business and technical perspectives","title":"Problem Definition and Desired Outcome"},{"location":"machine-learning/feasibility-studies/#deep-contextual-understanding","text":"Confirm that the following questions can be answered based on what was learned during the Discovery Phase of the project. For items that can not be satisfactorily answered, undertake additional investigation to answer. 
Understanding the people who are using and/or affected by the solution Understanding the contextual forces at play around the problem, including goals, culture, and historical context To accomplish this a researcher will: Collaborate with customers and colleagues to explore the landscape of people who relate to and may be affected by the problem space being explored (Users, stakeholders, subject matter experts, etc) Formulate the research question(s) to be addressed Select and design research to best serve the research question(s) Identify and select representative research participants across the problem space with whom to conduct the research Construct a research plan and necessary preparation documents for the selected research method(s) Conduct research activity with the participants via the selected method(s) Synthesize, analyze, and interpret research findings Where relevant, build frameworks, artifacts and processes that help explore the findings and implications of the research across the team Share what was uncovered and understood, and the implications thereof across the engagement team and relevant stakeholders. If the above research was conducted during the Discovery phase, it should be reviewed, and any substantial knowledge gaps should be identified and filled by following the above process.","title":"Deep Contextual Understanding"},{"location":"machine-learning/feasibility-studies/#data-access","text":"Verify that the full team has access to the data Set up a dedicated and/or restricted environment if required Perform any required de-identification or redaction of sensitive information Understand data access requirements (retention, role-based access, etc.)","title":"Data Access"},{"location":"machine-learning/feasibility-studies/#data-discovery","text":"Hold a data exploration workshop and deep dive with domain experts Understand data availability and confirm the team's access Understand the data dictionary, if available Understand the quality of the data. Is there already a data validation strategy in place? Ensure required data is present in reasonable volumes For supervised problems (most common), assess the availability of labels or data that can be used to effectively approximate labels If applicable, ensure all data can be joined as required and understand how Ideally obtain or create an entity relationship diagram (ERD) Potentially uncover new useful data sources","title":"Data Discovery"},{"location":"machine-learning/feasibility-studies/#architecture-discovery","text":"Clear picture of existing architecture Infrastructure spikes","title":"Architecture Discovery"},{"location":"machine-learning/feasibility-studies/#concept-ideation-and-iteration","text":"Develop value proposition(s) for users and stakeholders based on the contextual understanding developed through the discovery process (e.g. 
key elements of value, benefits) As relevant, make use of Co-creation with team Co-creation with users and stakeholders As relevant, create vignettes, narratives or other materials to communicate the concept Identify the next set of hypotheses or unknowns to be tested (see concept testing) Revisit and iterate on the concept throughout discovery as understanding of the problem space evolves","title":"Concept Ideation and Iteration"},{"location":"machine-learning/feasibility-studies/#exploratory-data-analysis-eda","text":"Data deep dive Understand feature and label value distributions Understand correlations among features and between features and labels Understand data specific problem constraints like missing values, categorical cardinality, potential for data leakage etc. Identify any gaps in data that couldn't be identified in the data discovery phase Pave the way of further understanding of what techniques are applicable Establish a mutual understanding of what data is in or out of scope for feasibility, ensuring that the data in scope is significant for the business","title":"Exploratory Data Analysis (EDA)"},{"location":"machine-learning/feasibility-studies/#data-pre-processing","text":"Happens during EDA and hypothesis testing Feature engineering Sampling Scaling and/or discretization Noise handling","title":"Data Pre-Processing"},{"location":"machine-learning/feasibility-studies/#hypothesis-testing","text":"Design several potential solutions using theoretically applicable algorithms and techniques, starting with the simplest reasonable baseline Train model(s) Evaluate performance and determine if satisfactory Tweak experimental solution designs based on outcomes Iterate Thoroughly document each step and outcome, plus any resulting hypotheses for easy following of the decision-making process","title":"Hypothesis Testing"},{"location":"machine-learning/feasibility-studies/#concept-testing","text":"Where relevant, to test the value proposition, concepts or aspects of the experience Plan user, stakeholder and expert research Develop and design necessary research materials Synthesize and evaluate feedback to incorporate into concept development Continue to iterate and test different elements of the concept as necessary, including testing to best serve RAI goals and guidelines Ensure that the proposed solution and framing are compatible with and acceptable to affected people Ensure that the proposed solution and framing is compatible with existing business goals and context","title":"Concept Testing"},{"location":"machine-learning/feasibility-studies/#risk-assessment","text":"Identification and assessment of risks and constraints","title":"Risk Assessment"},{"location":"machine-learning/feasibility-studies/#responsible-ai","text":"Consideration of responsible AI principles Understanding of users and stakeholders\u2019 contexts, needs and concerns to inform development of RAI Testing AI concept and experience elements with users and stakeholders Discussion and feedback from diverse perspectives around any responsible AI concerns","title":"Responsible AI"},{"location":"machine-learning/feasibility-studies/#output-of-a-feasibility-study","text":"The main outcome is a feasibility study report, with a recommendation on next steps: If there is not enough evidence to support the hypothesis that this problem can be solved using ML, as aligned with the pre-determined performance measures and business impact: We detail the gaps and challenges that prevented us from reaching a positive outcome We 
may scope down the project, if applicable We may look at re-scoping the problem taking into account the findings of the feasibility study We assess the possibility to collect more data or improve data quality If there is enough evidence to support the hypothesis that this problem can be solved using ML Provide recommendations and technical assets for moving to the operationalization phase","title":"Output of a Feasibility Study"},{"location":"machine-learning/ml-fundamentals-checklist/","text":"ML Fundamentals Checklist This checklist helps ensure that our ML projects meet our ML Fundamentals. The items below are not sequential, but rather organized by different parts of an ML project. Data Quality and Governance There is access to data. Labels exist for dataset of interest. Data quality evaluation. Able to track data lineage. Understanding of where the data is coming from and any policies related to data access. Gather Security and Compliance requirements. Feasibility Study A feasibility study was performed to assess if the data supports the proposed tasks. Rigorous Exploratory data analysis was performed (including analysis of data distribution). Hypotheses were tested producing sufficient evidence to either support or reject that an ML approach is feasible to solve the problem. ROI estimation and risk analysis was performed for the project. ML outputs/assets can be integrated within the production system. Recommendations on how to proceed have been documented. Evaluation and Metrics Clear definition of how performance will be measured. The evaluation metrics are somewhat connected to the success criteria. The metrics can be calculated with the datasets available. Evaluation flow can be applied to all versions of the model. Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis. Model Baseline Well-defined baseline model exists and its performance is calculated. ( More details on well defined baselines ) The performance of other ML models can be compared with the model baseline. Experimentation setup Well-defined train/test dataset with labels. Reproducible and logged experiments in an environment accessible by all data scientists to quickly iterate. Defined experiments/hypothesis to test. Results of experiments are documented. Model hyper parameters are tuned systematically. Same performance evaluation metrics and consistent datasets are used when comparing candidate models. Production Model readiness checklist reviewed. Model reviews were performed (covering model debugging, reviews of training and evaluation approaches, model performance). Data pipeline for inferencing, including an end-to-end tests. SLAs requirements for models are gathered and documented. Monitoring of data feeds and model output. Ensure consistent schema is used across the system with expected input/output defined for each component of the pipelines (data processing as well as models). Responsible AI reviewed.","title":"ML Fundamentals Checklist"},{"location":"machine-learning/ml-fundamentals-checklist/#ml-fundamentals-checklist","text":"This checklist helps ensure that our ML projects meet our ML Fundamentals. The items below are not sequential, but rather organized by different parts of an ML project.","title":"ML Fundamentals Checklist"},{"location":"machine-learning/ml-fundamentals-checklist/#data-quality-and-governance","text":"There is access to data. Labels exist for dataset of interest. Data quality evaluation. 
Able to track data lineage. Understanding of where the data is coming from and any policies related to data access. Gather Security and Compliance requirements.","title":"Data Quality and Governance"},{"location":"machine-learning/ml-fundamentals-checklist/#feasibility-study","text":"A feasibility study was performed to assess if the data supports the proposed tasks. Rigorous Exploratory data analysis was performed (including analysis of data distribution). Hypotheses were tested producing sufficient evidence to either support or reject that an ML approach is feasible to solve the problem. ROI estimation and risk analysis was performed for the project. ML outputs/assets can be integrated within the production system. Recommendations on how to proceed have been documented.","title":"Feasibility Study"},{"location":"machine-learning/ml-fundamentals-checklist/#evaluation-and-metrics","text":"Clear definition of how performance will be measured. The evaluation metrics are somewhat connected to the success criteria. The metrics can be calculated with the datasets available. Evaluation flow can be applied to all versions of the model. Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis.","title":"Evaluation and Metrics"},{"location":"machine-learning/ml-fundamentals-checklist/#model-baseline","text":"Well-defined baseline model exists and its performance is calculated. ( More details on well defined baselines ) The performance of other ML models can be compared with the model baseline.","title":"Model Baseline"},{"location":"machine-learning/ml-fundamentals-checklist/#experimentation-setup","text":"Well-defined train/test dataset with labels. Reproducible and logged experiments in an environment accessible by all data scientists to quickly iterate. Defined experiments/hypothesis to test. Results of experiments are documented. Model hyper parameters are tuned systematically. Same performance evaluation metrics and consistent datasets are used when comparing candidate models.","title":"Experimentation setup"},{"location":"machine-learning/ml-fundamentals-checklist/#production","text":"Model readiness checklist reviewed. Model reviews were performed (covering model debugging, reviews of training and evaluation approaches, model performance). Data pipeline for inferencing, including an end-to-end tests. SLAs requirements for models are gathered and documented. Monitoring of data feeds and model output. Ensure consistent schema is used across the system with expected input/output defined for each component of the pipelines (data processing as well as models). Responsible AI reviewed.","title":"Production"},{"location":"machine-learning/ml-model-checklist/","text":"ML Model Production Checklist The purpose of this checklist is to make sure that: The team assessed if the model is ready for production before moving to the scoring process The team has prepared a production plan for the model The checklist provides guidelines for creating this production plan. It should be used by teams/organizations that already built/trained an ML model and are now considering putting it into production. Checklist Before putting an individual ML model into production, the following aspects should be considered: Is there a well defined baseline? Is the model performing better than the baseline? Are machine learning performance metrics defined for both training and scoring? Is the model benchmarked? 
Can ground truth be obtained or inferred in production? Has the data distribution of training, testing and validation sets been analyzed? Have goals and hard limits for performance, speed of prediction and costs been established so they can be considered if trade-offs need to be made? How will the model be integrated into other systems, and what impact will it have? How will incoming data quality be monitored? How will drift in data characteristics be monitored? How will performance be monitored? Have any ethical concerns been taken into account? Please note that there might be scenarios where it is not possible to check all the items on this checklist. However, it is advised to go through all items and make informed decisions based on your specific use case. Will Your Model Performance be Different in Production than During the Training Phase Once deployed into production, the model might be performing much worse than expected. This poor performance could be a result of: The data to be scored in production is significantly different from the train and test datasets The feature engineering steps are different or inconsistent in production compared to the training process The performance measure is not consistent (for example your test set covers several months of data where the performance metric for production has been calculated for one month of data) Is there a Well-Defined Baseline? Is the Model Performing Better than the Baseline? A good way to think of a model baseline is the simplest model one can come up with: either a simple threshold, a random guess or a very basic linear model. This baseline is the reference point your model needs to outperform. A well-defined baseline is different for each problem type and there is no one size fits all approach. As an example, let's consider some common types of machine learning problems: Classification : Predicting between a positive and a negative class. Either the class with the most observations or a simple logistic regression model can be the baseline. Regression : Predicting the house prices in a city. The average house price for the last year or last month, a simple linear regression model, or the previous median house price in a neighborhood could be the baseline. Image classification : Building an image classifier to distinguish between cats and no cats in an image. If your classes are unbalanced: 70% cats and 30% no cats and if you always predict cats, your naive classifier has 70% accuracy and this can be your baseline. If your classes are balanced: 52% cats and 48% no cats, then a simple convolutional architecture can be the baseline (1 conv layer + 1 max pooling + 1 dense). Additionally, human accuracy at labelling can also be the baseline in an image classification scenario. Some questions to ask when comparing to a baseline: How does your model compare to a random guess? How does your model performance compare to applying a simple threshold? How does your model compare with always predicting the most common value? Note : In some cases, human parity might be too ambitious as a baseline, but this should be decided on a case by case basis. Human accuracy is one of the available options, but not the only one. Resources: \"How To Get Baseline Results And Why They Matter\" article \"Always start with a stupid model, no exceptions.\" article Are Machine Learning Performance Metrics Defined for Both Training and Scoring? The methodology of translating the training metrics to scoring metrics should be well-defined and understood. 
Depending on the data type and model, the model metrics calculation might differ in production and in training. For example, the training procedure calculated metrics for a long period of time (a year, a decade) with different seasonal characteristics while the scoring procedure will calculate the metrics per a restricted time interval (for example a week, a month, a quarter). Well-defined ML performance metrics are essential in production so that a decrease or increase in model performance can be accurately detected. Things to consider: In forecasting, if you change the period of assessing the performance, from one month to a year for example, then you might get a different result. For example, if your model is predicting sales of a product per day and the RMSE (Root Mean Squared Error) is very low for the first month the model is in production. As the model is live for longer, the RMSE is increasing, becoming 10x the RMSE for the first year compared to the first month. In a classification scenario, the overall accuracy is good, but the model is performing poorly for some subgroups. For example, a classifier has an accuracy of 80% overall, but only 55% for the 20-30 age group. If this is a significant age group for the production data, then your accuracy might suffer greatly when in production. In scene classification scenario, the model is trying to identify a specific scene in a video, and the model has been trained and tested (80-20 split) on 50000 segments where half are segments containing the scene and half of the segments do not contain the scene. The accuracy on the training set is 85% and 84% on the test set. However, when an entire video is scored, scores are obtained on all segments, and we expect few segments to contain the scene. The accuracy for an entire video is not comparable with the training/test set procedure in this case, hence different metrics should be considered. If sampling techniques (over-sampling, under-sampling) are used to train model when classes are imbalanced, ensure the metrics used during training are comparable with the ones used in scoring. If the number of samples used for training and testing is small, the performance metrics might change significantly as new data is scored. Is the Model Benchmarked? The trained model to be put into production is well benchmarked if machine learning performance metrics (such as accuracy, recall, RMSE or whatever is appropriate) are measured on the train and test set. Furthermore, the train and test set split should be well documented and reproducible. Can Ground Truth be Obtained or Inferred in Production? Without a reliable ground truth, the machine learning metrics cannot be calculated. It is important to identify if the ground truth can be obtained as the model is scoring new data by either manual or automatic means. If the ground truth cannot be obtained systematically, other proxies and methodology should be investigated in order to obtain some measure of model performance. One option is to use humans to manually label samples. One important aspect of human labelling is to take into account the human accuracy. If there are two different individuals labelling an image, the labels will likely be different for some samples. It is important to understand how the labels were obtained to assess the reliability of the ground truth (that is why we talk about human accuracy). 
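One lightweight way to put a number on the "human accuracy" point above is to have two labellers annotate the same sample set and measure how often they agree. The sketch below uses raw agreement plus Cohen's kappa from scikit-learn; the labels are made up, and the choice of kappa is our illustrative assumption rather than something this checklist prescribes.

```python
# Illustrative sketch: quantifying how much two human labellers agree, as one
# way to assess ground-truth reliability ("human accuracy").
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Made-up labels from two hypothetical labellers on the same eight samples.
labeller_a = ["cat", "cat", "no_cat", "cat", "no_cat", "cat", "no_cat", "no_cat"]
labeller_b = ["cat", "no_cat", "no_cat", "cat", "no_cat", "cat", "cat", "no_cat"]

# Raw agreement: fraction of samples where both labellers chose the same label.
print("raw agreement:", accuracy_score(labeller_a, labeller_b))

# Cohen's kappa corrects for the agreement expected by chance.
print("Cohen's kappa:", cohen_kappa_score(labeller_a, labeller_b))
```

Kappa values close to 1 suggest the labels are reliable enough to serve as ground truth; values near 0 suggest the labelling guidelines need work before the labels can be trusted.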
For clarity, let's consider the following examples (by no means an exhaustive list): Forecasting : Forecasting scenarios are an example of machine learning problems where the ground truth could be obtained in most cases even though a delay might occur. For example, for a model predicting the sales of ice cream in a local shop, the ground truth will be obtained as the sales are happening, but it might appear in the system at a later time than as the model prediction. Recommender systems : For recommender system, obtaining the ground truth is a complex problem in most cases as there is no way of identifying the ideal recommendation. For a retail website for example, click/not click, buy/not buy or other user interaction with recommendation can be used as ground truth proxies. Object detection in images : For an object detection model, as new images are scored, there are no new labels being generated automatically. One option to obtain the ground truth for the new images is to use people to manually label the images. Human labelling is costly, time-consuming and not 100% accurate, so in most cases, only a subset of images can be labelled. These samples can be chosen at random or by using active learning techniques of selecting the most informative unlabeled samples. Has the Data Distribution of Training, Testing and Validation Sets Been Analyzed? The data distribution of your training, test and validation (if applicable) dataset (including labels) should be analyzed to ensure they all come from the same distribution. If this is not the case, some options to consider are: re-shuffling, re-sampling, modifying the data, more samples need to be gathered or features removed from the dataset. Significant differences in the data distributions of the different datasets can greatly impact the performance of the model. Some potential questions to ask: How much does the training and test data represent the end result? Is the distribution of each individual feature consistent across all your datasets? (i.e. same representation of age groups, gender, race etc.) Is there any data lineage information? Where did the data come from? How was the data collected? Can collection and labelling be automated? Resources: \"Splitting into train, dev and test\" tutorial Have Goals and Hard Limits for Performance, Speed of Prediction and Costs been Established, so they can be Considered if Trade-Offs Need to be Made? Some machine learning models achieve high ML performance, but they are costly and time-consuming to run. In those cases, a less performant and cheaper model could be preferred. Hence, it is important to calculate the model performance metrics (accuracy, precision, recall, RMSE etc), but also to gather data on how expensive it will be to run the model and how long it will take to run. Once this data is gathered, an informed decision should be made on what model to productionize. System metrics to consider: CPU/GPU/memory usage Cost per prediction Time taken to make a prediction How Will the Model be Integrated into Other Systems, and what Impact will it Have? Machine Learning models do not exist in isolation, but rather they are part of a much larger system. These systems could be old, proprietary systems or new systems being developed as a results of the creation a new machine learning model. In both of those cases, it is important to understand where the actual model is going to fit in, what output is expected from the model and how that output is going to be used by the larger system. 
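To act on the distribution question raised above - whether training, test and validation sets come from the same distribution - a simple per-feature comparison is often enough to flag obvious mismatches. The sketch below applies a two-sample Kolmogorov-Smirnov test to synthetic train/test splits; the data, the KS test and the 5% threshold are all illustrative assumptions.

```python
# Illustrative sketch: checking that numeric features follow a similar
# distribution in the train and test splits using a two-sample KS test.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = pd.DataFrame({"age": rng.normal(40, 10, 2_000), "income": rng.normal(55, 12, 2_000)})
test = pd.DataFrame({"age": rng.normal(40, 10, 500), "income": rng.normal(62, 12, 500)})  # shifted on purpose

for column in train.columns:
    statistic, p_value = ks_2samp(train[column], test[column])
    verdict = "POSSIBLE MISMATCH" if p_value < 0.05 else "looks consistent"
    print(f"{column}: KS statistic={statistic:.3f}, p-value={p_value:.3g} -> {verdict}")
```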
Additionally, it is essential to decide if the model will be used for batch and/or real-time inference as production paths might differ. Possible questions to assess model impact: Is there a human in the loop? How is feedback collected through the system? (for example how do we know if a prediction is wrong) Is there a fallback mechanism when things go wrong? Is the system transparent that there is a model making a prediction and what data is used to make this prediction? What is the cost of a wrong prediction? How Will Incoming Data Quality be Monitored? As data systems become increasingly complex in the mainstream, it is especially vital to employ data quality monitoring, alerting and rectification protocols. Following data validation best practices can prevent insidious issues from creeping into machine learning models that, at best, reduce the usefulness of the model, and at worst, introduce harm. Data validation, reduces the risk of data downtime (increasing headroom) and technical debt and supports long-term success of machine learning models and other applications that rely on the data. Data validation best practices include: Employing automated data quality testing processes at each stage of the data pipeline Re-routing data that fails quality tests to a separate data store for diagnosis and resolution Employing end-to-end data observability on data freshness, distribution, volume, schema and lineage Note that data validation is distinct from data drift detection. Data validation detects errors in the data (ex. a datum is outside of the expected range), while data drift detection uncovers legitimate changes in the data that are truly representative of the phenomenon being modeled (ex. user preferences change). Data validation issues should trigger re-routing and rectification, while data drift should trigger adaptation or retraining of a model. Resources: \"Data Quality Fundamentals\" by Moses et al. How Will Drift in Data Characteristics be Monitored? Data drift detection uncovers legitimate changes in incoming data that are truly representative of the phenomenon being modeled,and are not erroneous (ex. user preferences change). It is imperative to understand if the new data in production will be significantly different from the data in the training phase. It is also important to check that the data distribution information can be obtained for any of the new data coming in. Drift monitoring can inform when changes are occurring and what their characteristics are (ex. abrupt vs gradual) and guide effective adaptation or retraining strategies to maintain performance. Possible questions to ask: What are some examples of drift, or deviation from the norm, that have been experience in the past or that might be expected? Is there a drift detection strategy in place? Does it align with expected types of changes? Are there warnings when anomalies in input data are occurring? Is there an adaptation strategy in place? Does it align with expected types of changes? Resources: \"Learning Under Concept Drift: A Review\" by Lu at al. Understanding dataset shift How Will Performance be Monitored? It is important to define how the model will be monitored when it is in production and how that data is going to be used to make decisions. For example, when will a model need retraining as the performance has degraded and how to identify what are the underlying causes of this degradation could be part of this monitoring methodology. Ideally, model monitoring should be done automatically. 
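As one concrete example of an automated drift check like those discussed above, the sketch below computes a Population Stability Index (PSI) between a training-time sample of a feature and a recent production window. PSI, the synthetic data and the 0.2 alert threshold are our illustrative choices, not requirements of this checklist.

```python
# Illustrative sketch: Population Stability Index (PSI) as a simple drift check
# comparing a reference (training-time) feature sample with a production window.
import numpy as np

def population_stability_index(reference, current, bins=10):
    """Compare two 1-D samples by binning both on the reference quantiles."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip both samples into the reference range so every value falls in a bin.
    reference = np.clip(reference, edges[0], edges[-1])
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) and division by zero
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # stand-in for a training-time feature
current = rng.normal(loc=0.4, scale=1.2, size=1_000)    # stand-in for a recent production window

psi = population_stability_index(reference, current)
print(f"PSI = {psi:.3f}", "-> investigate possible drift" if psi > 0.2 else "-> looks stable")
```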
However, if this is not possible, then there should be a manual periodical check of the model performance. Model monitoring should lead to: Ability to identify changes in model performance Warnings when anomalies in model output are occurring Retraining decisions and adaptation strategy Have any Ethical Concerns Been Taken into Account? Every ML project goes through the Responsible AI process to ensure that it upholds Microsoft's 6 Responsible AI principles .","title":"ML Model Production Checklist"},{"location":"machine-learning/ml-model-checklist/#ml-model-production-checklist","text":"The purpose of this checklist is to make sure that: The team assessed if the model is ready for production before moving to the scoring process The team has prepared a production plan for the model The checklist provides guidelines for creating this production plan. It should be used by teams/organizations that already built/trained an ML model and are now considering putting it into production.","title":"ML Model Production Checklist"},{"location":"machine-learning/ml-model-checklist/#checklist","text":"Before putting an individual ML model into production, the following aspects should be considered: Is there a well defined baseline? Is the model performing better than the baseline? Are machine learning performance metrics defined for both training and scoring? Is the model benchmarked? Can ground truth be obtained or inferred in production? Has the data distribution of training, testing and validation sets been analyzed? Have goals and hard limits for performance, speed of prediction and costs been established so they can be considered if trade-offs need to be made? How will the model be integrated into other systems, and what impact will it have? How will incoming data quality be monitored? How will drift in data characteristics be monitored? How will performance be monitored? Have any ethical concerns been taken into account? Please note that there might be scenarios where it is not possible to check all the items on this checklist. However, it is advised to go through all items and make informed decisions based on your specific use case.","title":"Checklist"},{"location":"machine-learning/ml-model-checklist/#will-your-model-performance-be-different-in-production-than-during-the-training-phase","text":"Once deployed into production, the model might be performing much worse than expected. This poor performance could be a result of: The data to be scored in production is significantly different from the train and test datasets The feature engineering steps are different or inconsistent in production compared to the training process The performance measure is not consistent (for example your test set covers several months of data where the performance metric for production has been calculated for one month of data)","title":"Will Your Model Performance be Different in Production than During the Training Phase"},{"location":"machine-learning/ml-model-checklist/#is-there-a-well-defined-baseline-is-the-model-performing-better-than-the-baseline","text":"A good way to think of a model baseline is the simplest model one can come up with: either a simple threshold, a random guess or a very basic linear model. This baseline is the reference point your model needs to outperform. A well-defined baseline is different for each problem type and there is no one size fits all approach. 
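Before looking at problem-type-specific examples, here is a minimal sketch of what such a baseline comparison can look like in practice: a majority-class baseline next to a very basic linear model on a synthetic, imbalanced dataset. The dataset and model choices are assumptions made for illustration only.

```python
# Illustrative sketch: a majority-class baseline vs. a simple candidate model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset (roughly 70% / 30% classes).
X, y = make_classification(n_samples=2_000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: always predict the most common class (~70% accuracy by construction).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate: a very basic linear model.
candidate = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
candidate_acc = accuracy_score(y_test, candidate.predict(X_test))
print(f"baseline accuracy:  {baseline_acc:.3f}")
print(f"candidate accuracy: {candidate_acc:.3f}")
print("candidate beats baseline" if candidate_acc > baseline_acc else "candidate does not beat baseline")
```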
As an example, let's consider some common types of machine learning problems: Classification : Predicting between a positive and a negative class. Either the class with the most observations or a simple logistic regression model can be the baseline. Regression : Predicting the house prices in a city. The average house price for the last year or last month, a simple linear regression model, or the previous median house price in a neighborhood could be the baseline. Image classification : Building an image classifier to distinguish between cats and no cats in an image. If your classes are unbalanced: 70% cats and 30% no cats and if you always predict cats, your naive classifier has 70% accuracy and this can be your baseline. If your classes are balanced: 52% cats and 48% no cats, then a simple convolutional architecture can be the baseline (1 conv layer + 1 max pooling + 1 dense). Additionally, human accuracy at labelling can also be the baseline in an image classification scenario. Some questions to ask when comparing to a baseline: How does your model compare to a random guess? How does your model performance compare to applying a simple threshold? How does your model compare with always predicting the most common value? Note : In some cases, human parity might be too ambitious as a baseline, but this should be decided on a case by case basis. Human accuracy is one of the available options, but not the only one. Resources: \"How To Get Baseline Results And Why They Matter\" article \"Always start with a stupid model, no exceptions.\" article","title":"Is there a Well-Defined Baseline? Is the Model Performing Better than the Baseline?"},{"location":"machine-learning/ml-model-checklist/#are-machine-learning-performance-metrics-defined-for-both-training-and-scoring","text":"The methodology of translating the training metrics to scoring metrics should be well-defined and understood. Depending on the data type and model, the model metrics calculation might differ in production and in training. For example, the training procedure calculated metrics for a long period of time (a year, a decade) with different seasonal characteristics while the scoring procedure will calculate the metrics per a restricted time interval (for example a week, a month, a quarter). Well-defined ML performance metrics are essential in production so that a decrease or increase in model performance can be accurately detected. Things to consider: In forecasting, if you change the period of assessing the performance, from one month to a year for example, then you might get a different result. For example, if your model is predicting sales of a product per day and the RMSE (Root Mean Squared Error) is very low for the first month the model is in production. As the model is live for longer, the RMSE is increasing, becoming 10x the RMSE for the first year compared to the first month. In a classification scenario, the overall accuracy is good, but the model is performing poorly for some subgroups. For example, a classifier has an accuracy of 80% overall, but only 55% for the 20-30 age group. If this is a significant age group for the production data, then your accuracy might suffer greatly when in production. In scene classification scenario, the model is trying to identify a specific scene in a video, and the model has been trained and tested (80-20 split) on 50000 segments where half are segments containing the scene and half of the segments do not contain the scene. 
The accuracy on the training set is 85% and 84% on the test set. However, when an entire video is scored, scores are obtained on all segments, and we expect few segments to contain the scene. The accuracy for an entire video is not comparable with the training/test set procedure in this case, hence different metrics should be considered. If sampling techniques (over-sampling, under-sampling) are used to train model when classes are imbalanced, ensure the metrics used during training are comparable with the ones used in scoring. If the number of samples used for training and testing is small, the performance metrics might change significantly as new data is scored.","title":"Are Machine Learning Performance Metrics Defined for Both Training and Scoring?"},{"location":"machine-learning/ml-model-checklist/#is-the-model-benchmarked","text":"The trained model to be put into production is well benchmarked if machine learning performance metrics (such as accuracy, recall, RMSE or whatever is appropriate) are measured on the train and test set. Furthermore, the train and test set split should be well documented and reproducible.","title":"Is the Model Benchmarked?"},{"location":"machine-learning/ml-model-checklist/#can-ground-truth-be-obtained-or-inferred-in-production","text":"Without a reliable ground truth, the machine learning metrics cannot be calculated. It is important to identify if the ground truth can be obtained as the model is scoring new data by either manual or automatic means. If the ground truth cannot be obtained systematically, other proxies and methodology should be investigated in order to obtain some measure of model performance. One option is to use humans to manually label samples. One important aspect of human labelling is to take into account the human accuracy. If there are two different individuals labelling an image, the labels will likely be different for some samples. It is important to understand how the labels were obtained to assess the reliability of the ground truth (that is why we talk about human accuracy). For clarity, let's consider the following examples (by no means an exhaustive list): Forecasting : Forecasting scenarios are an example of machine learning problems where the ground truth could be obtained in most cases even though a delay might occur. For example, for a model predicting the sales of ice cream in a local shop, the ground truth will be obtained as the sales are happening, but it might appear in the system at a later time than as the model prediction. Recommender systems : For recommender system, obtaining the ground truth is a complex problem in most cases as there is no way of identifying the ideal recommendation. For a retail website for example, click/not click, buy/not buy or other user interaction with recommendation can be used as ground truth proxies. Object detection in images : For an object detection model, as new images are scored, there are no new labels being generated automatically. One option to obtain the ground truth for the new images is to use people to manually label the images. Human labelling is costly, time-consuming and not 100% accurate, so in most cases, only a subset of images can be labelled. 
These samples can be chosen at random or by using active learning techniques of selecting the most informative unlabeled samples.","title":"Can Ground Truth be Obtained or Inferred in Production?"},{"location":"machine-learning/ml-model-checklist/#has-the-data-distribution-of-training-testing-and-validation-sets-been-analyzed","text":"The data distribution of your training, test and validation (if applicable) dataset (including labels) should be analyzed to ensure they all come from the same distribution. If this is not the case, some options to consider are: re-shuffling, re-sampling, modifying the data, more samples need to be gathered or features removed from the dataset. Significant differences in the data distributions of the different datasets can greatly impact the performance of the model. Some potential questions to ask: How much does the training and test data represent the end result? Is the distribution of each individual feature consistent across all your datasets? (i.e. same representation of age groups, gender, race etc.) Is there any data lineage information? Where did the data come from? How was the data collected? Can collection and labelling be automated? Resources: \"Splitting into train, dev and test\" tutorial","title":"Has the Data Distribution of Training, Testing and Validation Sets Been Analyzed?"},{"location":"machine-learning/ml-model-checklist/#have-goals-and-hard-limits-for-performance-speed-of-prediction-and-costs-been-established-so-they-can-be-considered-if-trade-offs-need-to-be-made","text":"Some machine learning models achieve high ML performance, but they are costly and time-consuming to run. In those cases, a less performant and cheaper model could be preferred. Hence, it is important to calculate the model performance metrics (accuracy, precision, recall, RMSE etc), but also to gather data on how expensive it will be to run the model and how long it will take to run. Once this data is gathered, an informed decision should be made on what model to productionize. System metrics to consider: CPU/GPU/memory usage Cost per prediction Time taken to make a prediction","title":"Have Goals and Hard Limits for Performance, Speed of Prediction and Costs been Established, so they can be Considered if Trade-Offs Need to be Made?"},{"location":"machine-learning/ml-model-checklist/#how-will-the-model-be-integrated-into-other-systems-and-what-impact-will-it-have","text":"Machine Learning models do not exist in isolation, but rather they are part of a much larger system. These systems could be old, proprietary systems or new systems being developed as a results of the creation a new machine learning model. In both of those cases, it is important to understand where the actual model is going to fit in, what output is expected from the model and how that output is going to be used by the larger system. Additionally, it is essential to decide if the model will be used for batch and/or real-time inference as production paths might differ. Possible questions to assess model impact: Is there a human in the loop? How is feedback collected through the system? (for example how do we know if a prediction is wrong) Is there a fallback mechanism when things go wrong? Is the system transparent that there is a model making a prediction and what data is used to make this prediction? 
What is the cost of a wrong prediction?","title":"How Will the Model be Integrated into Other Systems, and what Impact will it Have?"},{"location":"machine-learning/ml-model-checklist/#how-will-incoming-data-quality-be-monitored","text":"As data systems become increasingly complex in the mainstream, it is especially vital to employ data quality monitoring, alerting and rectification protocols. Following data validation best practices can prevent insidious issues from creeping into machine learning models that, at best, reduce the usefulness of the model, and at worst, introduce harm. Data validation, reduces the risk of data downtime (increasing headroom) and technical debt and supports long-term success of machine learning models and other applications that rely on the data. Data validation best practices include: Employing automated data quality testing processes at each stage of the data pipeline Re-routing data that fails quality tests to a separate data store for diagnosis and resolution Employing end-to-end data observability on data freshness, distribution, volume, schema and lineage Note that data validation is distinct from data drift detection. Data validation detects errors in the data (ex. a datum is outside of the expected range), while data drift detection uncovers legitimate changes in the data that are truly representative of the phenomenon being modeled (ex. user preferences change). Data validation issues should trigger re-routing and rectification, while data drift should trigger adaptation or retraining of a model. Resources: \"Data Quality Fundamentals\" by Moses et al.","title":"How Will Incoming Data Quality be Monitored?"},{"location":"machine-learning/ml-model-checklist/#how-will-drift-in-data-characteristics-be-monitored","text":"Data drift detection uncovers legitimate changes in incoming data that are truly representative of the phenomenon being modeled,and are not erroneous (ex. user preferences change). It is imperative to understand if the new data in production will be significantly different from the data in the training phase. It is also important to check that the data distribution information can be obtained for any of the new data coming in. Drift monitoring can inform when changes are occurring and what their characteristics are (ex. abrupt vs gradual) and guide effective adaptation or retraining strategies to maintain performance. Possible questions to ask: What are some examples of drift, or deviation from the norm, that have been experience in the past or that might be expected? Is there a drift detection strategy in place? Does it align with expected types of changes? Are there warnings when anomalies in input data are occurring? Is there an adaptation strategy in place? Does it align with expected types of changes? Resources: \"Learning Under Concept Drift: A Review\" by Lu at al. Understanding dataset shift","title":"How Will Drift in Data Characteristics be Monitored?"},{"location":"machine-learning/ml-model-checklist/#how-will-performance-be-monitored","text":"It is important to define how the model will be monitored when it is in production and how that data is going to be used to make decisions. For example, when will a model need retraining as the performance has degraded and how to identify what are the underlying causes of this degradation could be part of this monitoring methodology. Ideally, model monitoring should be done automatically. 
However, if this is not possible, then there should be a manual periodical check of the model performance. Model monitoring should lead to: Ability to identify changes in model performance Warnings when anomalies in model output are occurring Retraining decisions and adaptation strategy","title":"How Will Performance be Monitored?"},{"location":"machine-learning/ml-model-checklist/#have-any-ethical-concerns-been-taken-into-account","text":"Every ML project goes through the Responsible AI process to ensure that it upholds Microsoft's 6 Responsible AI principles .","title":"Have any Ethical Concerns Been Taken into Account?"},{"location":"machine-learning/model-experimentation/","text":"Model Experimentation Overview Machine learning model experimentation involves uncertainty around the expected model results and future operationalization. To handle this uncertainty as much as possible, we propose a semi-structured process, balancing between engineering/research best practices and rapid model/data exploration. Model Experimentation Goals Performance : Find the best performing solution Operationalization : Keep an eye towards production, making sure that operationalization is feasible Code quality Maintain code and artifacts quality Reproducibility : Keep research active by allowing experiment tracking and reproducibility Collaboration : Foster the collaboration and joint work of multiple people on the team Model Experimentation Challenges Trial and error process : Difficult to plan and estimate durations and capacity. Quick and dirty : We want to fail fast and get a sense of what\u2019s working efficiently. Collaboration : How do we form a team-wide trial and error process and effective brainstorming. Code quality : How do we maintain the quality of non-production code during research. Operationalization : Switching between approaches might have a significant impact on operationalization (e.g. GPU/CPU, batch/online, parallel/sequential, runtime environments). Creating an experimentation framework which facilitates rapid experimentation , collaboration , experiment and model reproducibility , evaluation and defined APIs , and lets each team member focus on the model development and improvement, while trusting the framework to do the rest. The following tools and guidelines are aimed at achieving experimentation goals as well as addressing the aforementioned challenges. Tools and Guidelines for Successful Model Experimentation Virtual environments Source control and folder/package structure Experiment tracking Datasets and models abstractions Model evaluation Virtual Environments In languages like Python and R, it is always advised to employ virtual environments. Virtual environments facilitate reproducibility, collaboration and productization. Virtual environments allow us to be consistent across our local dev envs as well as with compute resources. These environments' configuration files can be used to build the code from source in an consistent way. For more details on why we need virtual environments visit this blog post . Which Virtual Environment Framework should I Choose All virtual environments frameworks create isolation, some also propose dependency management and additional features. Decision on which framework to use depends on the complexity of the development environment (dependencies and other required resources) and on the ease of use of the framework. 
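Whichever framework is selected, it also helps reproducibility to record the interpreter and key package versions at the start of each run, so results can be tied back to a concrete environment. A minimal sketch follows; the package list is illustrative only.

```python
import importlib.metadata
import platform

# Packages the environment configuration is expected to pin (illustrative list).
EXPECTED_PACKAGES = ["numpy", "pandas", "scikit-learn"]

def log_environment():
    """Print the Python version and the versions of the expected packages."""
    print("python", platform.python_version())
    for name in EXPECTED_PACKAGES:
        try:
            print(name, importlib.metadata.version(name))
        except importlib.metadata.PackageNotFoundError:
            print(name, "NOT INSTALLED - check the environment configuration file")

if __name__ == "__main__":
    log_environment()
```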
Types of Virtual Environments In ISE, we often choose from either venv , Conda or Poetry , depending on the project requirements and complexity. venv is included in Python, is the easiest to use, but lacks more advanced features like dependency management. Conda is a popular package, dependency and environment management framework. It supports multiple stacks (Python, R) and multiple versions of the same environment (e.g. multiple Python versions). Conda maintains its own package repository, therefore some packages might not be downloaded and managed directly through Conda . Poetry is a Python dependency management system which manages dependencies in a standard way using pyproject.toml files and lock files. Similar to Conda , Poetry 's dependency resolution process is sometimes slow (see FAQ ), but in cases where dependency issues are common or tricky, it provides a robust way to create reproducible and stable environments. Expected Outcomes for Virtual Environments Setup Documentation describing how to create the selected virtual environment and how to install dependencies. Environment configuration files if applicable (e.g. requirements.txt for venv , environment.yml for Conda or pyrpoject.toml for Poetry ). Virtual Environments Benefits Productization Collaboration Reproducibility Source Control and Folder or Package Structure Applied ML projects often contain source code, notebooks, devops scripts, documentation, scientific resources, datasets and more. We recommend coming up with an agreed folder structure to keep resources tidy. Consider deciding upon a generic folder structure for projects (e.g. which contains the folders data , src , docs and notebooks ), or adopt popular structures like the CookieCutter Data Science folder structure. Source control should be applied to allow collaboration, versioning, code reviews, traceability and backup. In data science projects, source control should be used for code, and the storing and versioning of other artifacts (e.g. data, scientific literature) should be decided upon depending on the scenario. Folder Structure and Source Control Expected Outcomes Defined folder structure for all users to use, pushed to the repo. .gitignore file determining which folders should be synced with git and which should be kept locally. For example, this one . Determine how notebooks are stored and versioned (e.g. strip output from Jupyter notebooks ) Source Control and Folder Structure Benefits Collaboration Reproducibility Code quality Experiment Tracking Experiment tracking tools allow data scientists and researchers to keep track of previous experiments for better understanding of the experimentation process and for the reproducibility of experiments or models. Types of Experiment Tracking Frameworks Experiment tracking frameworks differ by the set of features they provide for collecting experiment metadata, and comparing and analyzing experiments. In ISE, we mainly use MLFlow on Databricks or Azure ML Experimentation . Note that some experiment tracking frameworks require a deployment, while others are SaaS. Experiment Tracking Outcomes Decide on an experiment tracking framework Ensure it is accessible to all users Document set-up on local environments Define datasets and evaluation in a way which will allow the comparison of all experiments. Consistency across datasets and evaluation is paramount for experiment comparison . Ensure full reproducibility by assuring that all required details are tracked (i.e. 
dataset names and versions, parameters, code, environment) Experiment Tracking Benefits Model performance Reproducibility Collaboration Code quality Datasets and Models Abstractions By creating abstractions to building blocks (e.g., datasets, models, evaluators), we allow the easy introduction of new logic into the experimentation pipeline while keeping the agreed upon experimentation flow intact. These abstractions can be created using different mechanisms. For example, we can use Object-Oriented Programming (OOP) solutions like abstract classes: An example from scikit-learn describing the creation of new estimators compatible with the API . An example from PyTorch on extending the abstract Dataset class . Abstraction Outcomes Different building blocks have defined APIs allowing them to be replaced or extended. Replacing building blocks does not break the original experimentation flow. Mock building blocks are used for unit tests APIs/mocks are shared with the engineering teams for integration with other modules. Abstraction Benefits Collaboration Code quality Reproducibility Operationalization Model performance Model Evaluation When deciding on the evaluation of the ML model/process, consider the following checklist: Evaluation logic is approved by all stakeholders. Relationship between evaluation logic and business KPIs is analyzed and decided. Evaluation flow is applicable for all present and future models (i.e. does not assume some prediction structure or method-specific process). Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis. Evaluation Development Process Outcomes Evaluation strategy is agreed upon all stakeholders Research and discussion on various evaluation methods and metrics is documented. The code holding the logic and data structures for evaluation is reviewed and tested. Documentation on how to apply evaluation is reviewed. Performance metrics are automatically tracked into the experiment tracker. Evaluation Development Process Benefits Model performance Code quality Collaboration Reproducibility","title":"Model Experimentation"},{"location":"machine-learning/model-experimentation/#model-experimentation","text":"","title":"Model Experimentation"},{"location":"machine-learning/model-experimentation/#overview","text":"Machine learning model experimentation involves uncertainty around the expected model results and future operationalization. To handle this uncertainty as much as possible, we propose a semi-structured process, balancing between engineering/research best practices and rapid model/data exploration.","title":"Overview"},{"location":"machine-learning/model-experimentation/#model-experimentation-goals","text":"Performance : Find the best performing solution Operationalization : Keep an eye towards production, making sure that operationalization is feasible Code quality Maintain code and artifacts quality Reproducibility : Keep research active by allowing experiment tracking and reproducibility Collaboration : Foster the collaboration and joint work of multiple people on the team","title":"Model Experimentation Goals"},{"location":"machine-learning/model-experimentation/#model-experimentation-challenges","text":"Trial and error process : Difficult to plan and estimate durations and capacity. Quick and dirty : We want to fail fast and get a sense of what\u2019s working efficiently. Collaboration : How do we form a team-wide trial and error process and effective brainstorming. 
Code quality : How do we maintain the quality of non-production code during research. Operationalization : Switching between approaches might have a significant impact on operationalization (e.g. GPU/CPU, batch/online, parallel/sequential, runtime environments). Creating an experimentation framework which facilitates rapid experimentation , collaboration , experiment and model reproducibility , evaluation and defined APIs , and lets each team member focus on the model development and improvement, while trusting the framework to do the rest. The following tools and guidelines are aimed at achieving experimentation goals as well as addressing the aforementioned challenges.","title":"Model Experimentation Challenges"},{"location":"machine-learning/model-experimentation/#tools-and-guidelines-for-successful-model-experimentation","text":"Virtual environments Source control and folder/package structure Experiment tracking Datasets and models abstractions Model evaluation","title":"Tools and Guidelines for Successful Model Experimentation"},{"location":"machine-learning/model-experimentation/#virtual-environments","text":"In languages like Python and R, it is always advised to employ virtual environments. Virtual environments facilitate reproducibility, collaboration and productization. Virtual environments allow us to be consistent across our local dev envs as well as with compute resources. These environments' configuration files can be used to build the code from source in an consistent way. For more details on why we need virtual environments visit this blog post .","title":"Virtual Environments"},{"location":"machine-learning/model-experimentation/#which-virtual-environment-framework-should-i-choose","text":"All virtual environments frameworks create isolation, some also propose dependency management and additional features. Decision on which framework to use depends on the complexity of the development environment (dependencies and other required resources) and on the ease of use of the framework.","title":"Which Virtual Environment Framework should I Choose"},{"location":"machine-learning/model-experimentation/#types-of-virtual-environments","text":"In ISE, we often choose from either venv , Conda or Poetry , depending on the project requirements and complexity. venv is included in Python, is the easiest to use, but lacks more advanced features like dependency management. Conda is a popular package, dependency and environment management framework. It supports multiple stacks (Python, R) and multiple versions of the same environment (e.g. multiple Python versions). Conda maintains its own package repository, therefore some packages might not be downloaded and managed directly through Conda . Poetry is a Python dependency management system which manages dependencies in a standard way using pyproject.toml files and lock files. Similar to Conda , Poetry 's dependency resolution process is sometimes slow (see FAQ ), but in cases where dependency issues are common or tricky, it provides a robust way to create reproducible and stable environments.","title":"Types of Virtual Environments"},{"location":"machine-learning/model-experimentation/#expected-outcomes-for-virtual-environments-setup","text":"Documentation describing how to create the selected virtual environment and how to install dependencies. Environment configuration files if applicable (e.g. 
requirements.txt for venv , environment.yml for Conda or pyrpoject.toml for Poetry ).","title":"Expected Outcomes for Virtual Environments Setup"},{"location":"machine-learning/model-experimentation/#virtual-environments-benefits","text":"Productization Collaboration Reproducibility","title":"Virtual Environments Benefits"},{"location":"machine-learning/model-experimentation/#source-control-and-folder-or-package-structure","text":"Applied ML projects often contain source code, notebooks, devops scripts, documentation, scientific resources, datasets and more. We recommend coming up with an agreed folder structure to keep resources tidy. Consider deciding upon a generic folder structure for projects (e.g. which contains the folders data , src , docs and notebooks ), or adopt popular structures like the CookieCutter Data Science folder structure. Source control should be applied to allow collaboration, versioning, code reviews, traceability and backup. In data science projects, source control should be used for code, and the storing and versioning of other artifacts (e.g. data, scientific literature) should be decided upon depending on the scenario.","title":"Source Control and Folder or Package Structure"},{"location":"machine-learning/model-experimentation/#folder-structure-and-source-control-expected-outcomes","text":"Defined folder structure for all users to use, pushed to the repo. .gitignore file determining which folders should be synced with git and which should be kept locally. For example, this one . Determine how notebooks are stored and versioned (e.g. strip output from Jupyter notebooks )","title":"Folder Structure and Source Control Expected Outcomes"},{"location":"machine-learning/model-experimentation/#source-control-and-folder-structure-benefits","text":"Collaboration Reproducibility Code quality","title":"Source Control and Folder Structure Benefits"},{"location":"machine-learning/model-experimentation/#experiment-tracking","text":"Experiment tracking tools allow data scientists and researchers to keep track of previous experiments for better understanding of the experimentation process and for the reproducibility of experiments or models.","title":"Experiment Tracking"},{"location":"machine-learning/model-experimentation/#types-of-experiment-tracking-frameworks","text":"Experiment tracking frameworks differ by the set of features they provide for collecting experiment metadata, and comparing and analyzing experiments. In ISE, we mainly use MLFlow on Databricks or Azure ML Experimentation . Note that some experiment tracking frameworks require a deployment, while others are SaaS.","title":"Types of Experiment Tracking Frameworks"},{"location":"machine-learning/model-experimentation/#experiment-tracking-outcomes","text":"Decide on an experiment tracking framework Ensure it is accessible to all users Document set-up on local environments Define datasets and evaluation in a way which will allow the comparison of all experiments. Consistency across datasets and evaluation is paramount for experiment comparison . Ensure full reproducibility by assuring that all required details are tracked (i.e. 
dataset names and versions, parameters, code, environment)","title":"Experiment Tracking Outcomes"},{"location":"machine-learning/model-experimentation/#experiment-tracking-benefits","text":"Model performance Reproducibility Collaboration Code quality","title":"Experiment Tracking Benefits"},{"location":"machine-learning/model-experimentation/#datasets-and-models-abstractions","text":"By creating abstractions to building blocks (e.g., datasets, models, evaluators), we allow the easy introduction of new logic into the experimentation pipeline while keeping the agreed upon experimentation flow intact. These abstractions can be created using different mechanisms. For example, we can use Object-Oriented Programming (OOP) solutions like abstract classes: An example from scikit-learn describing the creation of new estimators compatible with the API . An example from PyTorch on extending the abstract Dataset class .","title":"Datasets and Models Abstractions"},{"location":"machine-learning/model-experimentation/#abstraction-outcomes","text":"Different building blocks have defined APIs allowing them to be replaced or extended. Replacing building blocks does not break the original experimentation flow. Mock building blocks are used for unit tests APIs/mocks are shared with the engineering teams for integration with other modules.","title":"Abstraction Outcomes"},{"location":"machine-learning/model-experimentation/#abstraction-benefits","text":"Collaboration Code quality Reproducibility Operationalization Model performance","title":"Abstraction Benefits"},{"location":"machine-learning/model-experimentation/#model-evaluation","text":"When deciding on the evaluation of the ML model/process, consider the following checklist: Evaluation logic is approved by all stakeholders. Relationship between evaluation logic and business KPIs is analyzed and decided. Evaluation flow is applicable for all present and future models (i.e. does not assume some prediction structure or method-specific process). Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis.","title":"Model Evaluation"},{"location":"machine-learning/model-experimentation/#evaluation-development-process-outcomes","text":"Evaluation strategy is agreed upon all stakeholders Research and discussion on various evaluation methods and metrics is documented. The code holding the logic and data structures for evaluation is reviewed and tested. Documentation on how to apply evaluation is reviewed. Performance metrics are automatically tracked into the experiment tracker.","title":"Evaluation Development Process Outcomes"},{"location":"machine-learning/model-experimentation/#evaluation-development-process-benefits","text":"Model performance Code quality Collaboration Reproducibility","title":"Evaluation Development Process Benefits"},{"location":"machine-learning/profiling-ml-and-mlops-code/","text":"Profiling Machine Learning and MLOps Code Data Science projects, especially the ones that involve Deep Learning techniques, usually are resource intensive. One model training iteration might be multiple hours long. Although large data volumes processing genuinely takes time, minor bugs and suboptimal implementation of some functional pieces might cause extra resources consumption. Profiling can be used to identify performance bottlenecks and see which functions are the costliest in the application code. 
Based on the outputs of the profiler, one can focus on largest and easiest-to-resolve inefficiencies and therefore achieve better code performance. Although profiling follows the same principles of any other software project, the purpose of this document is to provide profiling samples for the most common scenarios in MLOps/Data Science projects. Below are some common scenarios in MLOps/Data Science projects, along with suggestions on how to profile them. Generic Python profiling PyTorch model training profiling Azure Machine Learning pipeline profiling Generic Python Profiling Usually an MLOps/Data Science solution contains plain Python code serving different purposes (e.g. data processing) along with specialized model training code. Although many Machine Learning frameworks provide their own profiler, sometimes it is also useful to profile the whole solution. There are two types of profilers: deterministic (all events are tracked, e.g. cProfile ) and statistical (sampling with regular intervals, e.g., py-spy ). The sample below shows an example of a deterministic profiler. There are many options of generic deterministic Python code profiling. One of the default options for profiling used to be a built-in cProfile profiler. Using cProfile one can easily profile either a Python script or just a chunk of code. This profiling tool produces a file that can be either visualized using open source tools or analyzed using stats.Stats class. The latter option requires setting up filtering and sorting parameters for better analysis experience. Below you can find an example of using cProfile to profile a chunk of code. import cProfile # Start profiling profiler = cProfile . Profile () profiler . enable () # -- YOUR CODE GOES HERE --- # Stop profiling profiler . disable () # Write profiler results to an html file profiler . dump_stats ( \"profiler_results.prof\" ) You can also run cProfile outside of the Python script using the following command: python -m cProfile [ -o output_file ] [ -s sort_order ] ( -m module | myscript.py ) Note: one epoch of model training is usually enough for profiling. There's no need to run more epochs and produce additional cost. Refer to The Python Profilers for further details. PyTorch Model Training Profiling PyTorch 1.8 includes an updated PyTorch profiler that is supplied together with the PyTorch distribution and doesn't require any additional installation. Using PyTorch profiler one can record CPU side operations as well as CUDA kernel launches on GPU side. The profiler can visualize analysis results using TensorBoard plugin as well as provide suggestions on bottlenecks and potential code improvements. with torch . profiler . profile ( # Limit number of training steps included in profiling schedule = torch . profiler . schedule ( wait = 1 , warmup = 1 , active = 3 , repeat = 2 ), # Automatically saves profiling results to disk on_trace_ready = torch . profiler . tensorboard_trace_handler , with_stack = True ) as profiler : for step , data in enumerate ( trainloader , 0 ): # -- TRAINING STEP CODE GOES HERE --- profiler . step () The tensorboard_trace_handler can be used to generate result files for TensorBoard. Those can be visualized by installing TensorBoard. plugin and running TensorBoard on your log directory. pip install torch_tb_profiler tensorboard --logdir = # Navigate to `http://localhost:6006/#pytorch_profiler` Note: make sure to provide the right parameters to the torch.profiler.schedule . 
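As a rough guide to those parameters, based on the values used in the snippet above (adjust them to your own training loop):

```python
import torch

# With this schedule, one profiling cycle spans wait + warmup + active = 5 training steps:
#   wait=1   -> the first step of the cycle is skipped entirely (no tracing overhead),
#   warmup=1 -> tracing runs but the results are discarded while the profiler warms up,
#   active=3 -> these steps are recorded and appear in the report,
#   repeat=2 -> the cycle is executed twice in total, then profiling stops.
schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2)
```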
Usually you would need several steps of training to be profiled rather than the whole epoch. More information on PyTorch profiler : PyTorch Profiler Recipe Introducing PyTorch Profiler - the new and improved performance tool Azure Machine Learning Pipeline Profiling In our projects we often use Azure Machine Learning pipelines to train Machine Learning models. Most of the profilers can also be used in conjunction with Azure Machine Learning. For a profiler to be used with Azure Machine Learning, it should meet the following criteria: Turning the profiler on/off can be achieved by passing a parameter to the script ran by Azure Machine Learning The profiler produces a file as an output In general, a recipe for using profilers with Azure Machine Learning is the following: (Optional) If you're using profiling with an Azure Machine Learning pipeline, you might want to add --profile Boolean flag as a pipeline parameter Use one of the profilers described above or any other profiler that can produce a file as an output Inside of your Python script, create step output folder, e.g.: output_dir = \"./outputs/profiler_results\" os . makedirs ( output_dir , exist_ok = True ) Run your training pipeline Once the pipeline is completed, navigate to Azure ML portal and open details of the step that contains training code. The results can be found in the Outputs+logs tab, under outputs/profiler_results folder. You might want to download the results and visualize it locally. Note: it's not recommended to run profilers simultaneously. Profiles also consume resources, therefore a simultaneous run might significantly affect the results.","title":"Profiling Machine Learning and MLOps Code"},{"location":"machine-learning/profiling-ml-and-mlops-code/#profiling-machine-learning-and-mlops-code","text":"Data Science projects, especially the ones that involve Deep Learning techniques, usually are resource intensive. One model training iteration might be multiple hours long. Although large data volumes processing genuinely takes time, minor bugs and suboptimal implementation of some functional pieces might cause extra resources consumption. Profiling can be used to identify performance bottlenecks and see which functions are the costliest in the application code. Based on the outputs of the profiler, one can focus on largest and easiest-to-resolve inefficiencies and therefore achieve better code performance. Although profiling follows the same principles of any other software project, the purpose of this document is to provide profiling samples for the most common scenarios in MLOps/Data Science projects. Below are some common scenarios in MLOps/Data Science projects, along with suggestions on how to profile them. Generic Python profiling PyTorch model training profiling Azure Machine Learning pipeline profiling","title":"Profiling Machine Learning and MLOps Code"},{"location":"machine-learning/profiling-ml-and-mlops-code/#generic-python-profiling","text":"Usually an MLOps/Data Science solution contains plain Python code serving different purposes (e.g. data processing) along with specialized model training code. Although many Machine Learning frameworks provide their own profiler, sometimes it is also useful to profile the whole solution. There are two types of profilers: deterministic (all events are tracked, e.g. cProfile ) and statistical (sampling with regular intervals, e.g., py-spy ). The sample below shows an example of a deterministic profiler. 
There are many options of generic deterministic Python code profiling. One of the default options for profiling used to be a built-in cProfile profiler. Using cProfile one can easily profile either a Python script or just a chunk of code. This profiling tool produces a file that can be either visualized using open source tools or analyzed using stats.Stats class. The latter option requires setting up filtering and sorting parameters for better analysis experience. Below you can find an example of using cProfile to profile a chunk of code. import cProfile # Start profiling profiler = cProfile . Profile () profiler . enable () # -- YOUR CODE GOES HERE --- # Stop profiling profiler . disable () # Write profiler results to an html file profiler . dump_stats ( \"profiler_results.prof\" ) You can also run cProfile outside of the Python script using the following command: python -m cProfile [ -o output_file ] [ -s sort_order ] ( -m module | myscript.py ) Note: one epoch of model training is usually enough for profiling. There's no need to run more epochs and produce additional cost. Refer to The Python Profilers for further details.","title":"Generic Python Profiling"},{"location":"machine-learning/profiling-ml-and-mlops-code/#pytorch-model-training-profiling","text":"PyTorch 1.8 includes an updated PyTorch profiler that is supplied together with the PyTorch distribution and doesn't require any additional installation. Using PyTorch profiler one can record CPU side operations as well as CUDA kernel launches on GPU side. The profiler can visualize analysis results using TensorBoard plugin as well as provide suggestions on bottlenecks and potential code improvements. with torch . profiler . profile ( # Limit number of training steps included in profiling schedule = torch . profiler . schedule ( wait = 1 , warmup = 1 , active = 3 , repeat = 2 ), # Automatically saves profiling results to disk on_trace_ready = torch . profiler . tensorboard_trace_handler , with_stack = True ) as profiler : for step , data in enumerate ( trainloader , 0 ): # -- TRAINING STEP CODE GOES HERE --- profiler . step () The tensorboard_trace_handler can be used to generate result files for TensorBoard. Those can be visualized by installing TensorBoard. plugin and running TensorBoard on your log directory. pip install torch_tb_profiler tensorboard --logdir = # Navigate to `http://localhost:6006/#pytorch_profiler` Note: make sure to provide the right parameters to the torch.profiler.schedule . Usually you would need several steps of training to be profiled rather than the whole epoch. More information on PyTorch profiler : PyTorch Profiler Recipe Introducing PyTorch Profiler - the new and improved performance tool","title":"PyTorch Model Training Profiling"},{"location":"machine-learning/profiling-ml-and-mlops-code/#azure-machine-learning-pipeline-profiling","text":"In our projects we often use Azure Machine Learning pipelines to train Machine Learning models. Most of the profilers can also be used in conjunction with Azure Machine Learning. 
For a profiler to be used with Azure Machine Learning, it should meet the following criteria: Turning the profiler on/off can be achieved by passing a parameter to the script ran by Azure Machine Learning The profiler produces a file as an output In general, a recipe for using profilers with Azure Machine Learning is the following: (Optional) If you're using profiling with an Azure Machine Learning pipeline, you might want to add --profile Boolean flag as a pipeline parameter Use one of the profilers described above or any other profiler that can produce a file as an output Inside of your Python script, create step output folder, e.g.: output_dir = \"./outputs/profiler_results\" os . makedirs ( output_dir , exist_ok = True ) Run your training pipeline Once the pipeline is completed, navigate to Azure ML portal and open details of the step that contains training code. The results can be found in the Outputs+logs tab, under outputs/profiler_results folder. You might want to download the results and visualize it locally. Note: it's not recommended to run profilers simultaneously. Profiles also consume resources, therefore a simultaneous run might significantly affect the results.","title":"Azure Machine Learning Pipeline Profiling"},{"location":"machine-learning/proposed-ml-process/","text":"Proposed ML Process Introduction The objective of this document is to provide guidance to produce machine learning (ML) applications that are based on code, data and models that can be reproduced and reliably released to production environments. When developing ML applications, we consider the following approaches: Best practices in ML engineering: The ML application development should use engineering fundamentals to ensure high quality software deliverables. The ML application should be reliability released into production, leveraging automation as much as possible. The ML application can be deployed into production at any time. This makes the decision about when to release it a business decision rather than a technical one. Best practices in ML research: All artifacts, specifically data, code and ML models, should be versioned and managed using standard tools and workflows, in order to facilitate continuous research and development. While the model outputs can be non-deterministic and hard to reproduce, the process of releasing ML software into production should be reproducible. Responsible AI aspects are carefully analyzed and addressed. Cross-functional team: A cross-functional team consisting of different skill sets in data science, data engineering, development, operations, and industry domain specialists is required. ML process The proposed ML development process consists of: Data and problem understanding Responsible AI assessment Feasibility study Baseline model experimentation Model evaluation and experimentation Model operationalization * Unit and Integration testing * Deployment * Monitoring and Observability Version Control During all stages of the process, it is suggested that artifacts should be version-controlled . Typically, the process is iterative and versioned artifacts can assist in traceability and reviewing. Understanding the Problem Define the business problem for the ML project: Agree on the success criteria with the customer. Identify potential data sources and determine the availability of these sources. Define performance evaluation metrics on ground truth data Conduct a Responsible AI assessment to ensure development and deployment of the ML solution in a responsible manner. 
Conduct a feasibility study to assess whether the business problem is feasible to solve satisfactorily using ML with the available data. The objective of the feasibility study is to mitigate potential over-investment by ensuring sufficient evidence that ML is possible and would be the best solution. The study also provides initial indications of what the ML solution should look like. This ensures quality solutions supported by thorough consideration and evidence. Refer to feasibility study . Exploratory data analysis is performed and discussed with the team Typical output : Data exploration source code (Jupyter notebooks/scripts) and slides/docs Initial ML model code (Jupyter notebook or scripts) Initial solution architecture with initial data engineering requirements Data dictionary (if not yet available) List of assumptions Baseline Model Experimentation Data preparation: creating data source connectors, determining storage services to be used and potential versioning of raw datasets. Feature engineering: create new features from raw source data to increase the predictive power of the learning algorithm. The features should capture additional information that is not apparent in the original feature set. Split data into training, validation and test sets: creating training, validation, and test datasets with ground truth to develop ML models. This would entail joining or merging various feature engineered datasets. The training dataset is used to train the model to find the patterns between its features and labels (ground truth). The validation dataset is used to assess the model architecture, and the test data is used to confirm the prediction quality of the model. Initial code to create access data sources, transform raw data into features and model training as well as scoring. During this phase, experiment code (Jupyter notebooks or scripts) and accompanying utility code should be version-controlled using tools such as ADO (Azure DevOps). Typical output : Rough Jupyter notebooks or scripts in Python or R, initial results from baseline model. For more information on experimentation, refer to the experimentation section. Model Evaluation Compare the effectiveness of different algorithms on the given problem. Typical output : Evaluation flow is fully set up . Reproducible experiments for the different approaches experimented with. Model Operationalization Taking \"experimental\" code and preparing it, so it can be deployed. This includes data pre-processing, featurization code, training model code (if required to be trained using CI/CD) and model inference code. Typical output : Production-grade code (Preferably in the form of a package) for: Data preprocessing / post processing Serving a model Training a model CI/CD scripts. Reproducibility steps for the model in production. See more in the ML model checklist . Unit and Integration Testing Ensuring that production code behaves in the way we expect it to, and that its results match those we saw during the Model Evaluation and Experimentation phases. Refer to ML testing post for further details. Typical output : Test suite with unit and end-to-end tests is created and completes successfully. Deployment Responsible AI considerations such as bias and fairness analysis. Additionally, explainability/interpretability of the model should also be considered. It is recommended for a human-in-the-loop to verify the model and manually approve deployment to production. 
Getting the model into production where it can start adding value by serving predictions. Typical artifacts are APIs for accessing the model and integrating the model to the solution architecture. Additionally, certain scenarios may require training the model periodically in production. Reproducibility steps of the production model are available. Typical output : model readiness checklist is completed. Monitoring and Observability This is the final phase, where we ensure our model is doing what we expect it to in production. Read more about ML observability . Read more about Azure ML's offerings around ML models production monitoring . It is recommended to consider incorporating data drift monitoring process in the production solution. This will assist in detecting potential changes in new datasets presented for inference that may significantly impact model performance. For more info on detecting data drift with Azure ML see the Microsoft docs article on how to monitor datasets . Typical output : Logging and monitoring scripts and tools set up, permissions for users to access monitoring tools.","title":"Proposed ML Process"},{"location":"machine-learning/proposed-ml-process/#proposed-ml-process","text":"","title":"Proposed ML Process"},{"location":"machine-learning/proposed-ml-process/#introduction","text":"The objective of this document is to provide guidance to produce machine learning (ML) applications that are based on code, data and models that can be reproduced and reliably released to production environments. When developing ML applications, we consider the following approaches: Best practices in ML engineering: The ML application development should use engineering fundamentals to ensure high quality software deliverables. The ML application should be reliability released into production, leveraging automation as much as possible. The ML application can be deployed into production at any time. This makes the decision about when to release it a business decision rather than a technical one. Best practices in ML research: All artifacts, specifically data, code and ML models, should be versioned and managed using standard tools and workflows, in order to facilitate continuous research and development. While the model outputs can be non-deterministic and hard to reproduce, the process of releasing ML software into production should be reproducible. Responsible AI aspects are carefully analyzed and addressed. Cross-functional team: A cross-functional team consisting of different skill sets in data science, data engineering, development, operations, and industry domain specialists is required.","title":"Introduction"},{"location":"machine-learning/proposed-ml-process/#ml-process","text":"The proposed ML development process consists of: Data and problem understanding Responsible AI assessment Feasibility study Baseline model experimentation Model evaluation and experimentation Model operationalization * Unit and Integration testing * Deployment * Monitoring and Observability","title":"ML process"},{"location":"machine-learning/proposed-ml-process/#version-control","text":"During all stages of the process, it is suggested that artifacts should be version-controlled . Typically, the process is iterative and versioned artifacts can assist in traceability and reviewing.","title":"Version Control"},{"location":"machine-learning/proposed-ml-process/#understanding-the-problem","text":"Define the business problem for the ML project: Agree on the success criteria with the customer. 
Identify potential data sources and determine the availability of these sources. Define performance evaluation metrics on ground truth data Conduct a Responsible AI assessment to ensure development and deployment of the ML solution in a responsible manner. Conduct a feasibility study to assess whether the business problem is feasible to solve satisfactorily using ML with the available data. The objective of the feasibility study is to mitigate potential over-investment by ensuring sufficient evidence that ML is possible and would be the best solution. The study also provides initial indications of what the ML solution should look like. This ensures quality solutions supported by thorough consideration and evidence. Refer to feasibility study . Exploratory data analysis is performed and discussed with the team Typical output : Data exploration source code (Jupyter notebooks/scripts) and slides/docs Initial ML model code (Jupyter notebook or scripts) Initial solution architecture with initial data engineering requirements Data dictionary (if not yet available) List of assumptions","title":"Understanding the Problem"},{"location":"machine-learning/proposed-ml-process/#baseline-model-experimentation","text":"Data preparation: creating data source connectors, determining storage services to be used and potential versioning of raw datasets. Feature engineering: create new features from raw source data to increase the predictive power of the learning algorithm. The features should capture additional information that is not apparent in the original feature set. Split data into training, validation and test sets: creating training, validation, and test datasets with ground truth to develop ML models. This would entail joining or merging various feature engineered datasets. The training dataset is used to train the model to find the patterns between its features and labels (ground truth). The validation dataset is used to assess the model architecture, and the test data is used to confirm the prediction quality of the model. Initial code to create access data sources, transform raw data into features and model training as well as scoring. During this phase, experiment code (Jupyter notebooks or scripts) and accompanying utility code should be version-controlled using tools such as ADO (Azure DevOps). Typical output : Rough Jupyter notebooks or scripts in Python or R, initial results from baseline model. For more information on experimentation, refer to the experimentation section.","title":"Baseline Model Experimentation"},{"location":"machine-learning/proposed-ml-process/#model-evaluation","text":"Compare the effectiveness of different algorithms on the given problem. Typical output : Evaluation flow is fully set up . Reproducible experiments for the different approaches experimented with.","title":"Model Evaluation"},{"location":"machine-learning/proposed-ml-process/#model-operationalization","text":"Taking \"experimental\" code and preparing it, so it can be deployed. This includes data pre-processing, featurization code, training model code (if required to be trained using CI/CD) and model inference code. Typical output : Production-grade code (Preferably in the form of a package) for: Data preprocessing / post processing Serving a model Training a model CI/CD scripts. Reproducibility steps for the model in production. 
See more in the ML model checklist .","title":"Model Operationalization"},{"location":"machine-learning/proposed-ml-process/#unit-and-integration-testing","text":"Ensuring that production code behaves in the way we expect it to, and that its results match those we saw during the Model Evaluation and Experimentation phases. Refer to ML testing post for further details. Typical output : Test suite with unit and end-to-end tests is created and completes successfully.","title":"Unit and Integration Testing"},{"location":"machine-learning/proposed-ml-process/#deployment","text":"Responsible AI considerations such as bias and fairness analysis. Additionally, explainability/interpretability of the model should also be considered. It is recommended for a human-in-the-loop to verify the model and manually approve deployment to production. Getting the model into production where it can start adding value by serving predictions. Typical artifacts are APIs for accessing the model and integrating the model to the solution architecture. Additionally, certain scenarios may require training the model periodically in production. Reproducibility steps of the production model are available. Typical output : model readiness checklist is completed.","title":"Deployment"},{"location":"machine-learning/proposed-ml-process/#monitoring-and-observability","text":"This is the final phase, where we ensure our model is doing what we expect it to in production. Read more about ML observability . Read more about Azure ML's offerings around ML models production monitoring . It is recommended to consider incorporating data drift monitoring process in the production solution. This will assist in detecting potential changes in new datasets presented for inference that may significantly impact model performance. For more info on detecting data drift with Azure ML see the Microsoft docs article on how to monitor datasets . Typical output : Logging and monitoring scripts and tools set up, permissions for users to access monitoring tools.","title":"Monitoring and Observability"},{"location":"machine-learning/responsible-ai/","text":"Responsible AI in ISE Microsoft's Responsible AI principles Every ML project in ISE goes through a Responsible AI (RAI) assessment to ensure that it upholds Microsoft's 6 Responsible AI principles : Fairness Reliability & Safety Privacy & Security Inclusiveness Transparency Accountability Every project goes through the RAI process, whether we are building a new ML model from scratch, or putting an existing model in production. ISE's Responsible AI process The process begins as soon as we start a prospective project. We start to complete a Responsible AI review document, and an impact assessment, which provides a structured way to explore topics such as: Can the problem be addressed with a non-technical (e.g. social) solution? Can the problem be solved without AI? Would simpler technology suffice? Will the team have access to domain experts (e.g. doctors, refugees) in the field where the AI is applicable? Who are the stakeholders in this project? Who does the AI impact? Are there any vulnerable groups affected? What are the possible benefits and harms to each stakeholder? How can the technology be misused, and what can go wrong? Has the team analyzed the input data properly to make sure that the training data is suitable for machine learning? Is the training data an accurate representation of data that will be used as input in production? Is there a good representation of all users? 
Is there a fall-back mechanism (a human in the loop, or a way to revert decisions based on the model)? Does data used by the model for training or scoring contain PII? What measures have been taken to remove sensitive data? Does the model impact consequential decisions, like blocking people from getting jobs, loans, health care etc. or in the cases where it may, have appropriate ethical considerations been discussed? Have measures for re-training been considered? How can we address any concerns that arise, and how can we mitigate risk? At this point we research available tools and resources , such as InterpretML or Fairlearn , that we may use on the project. We may change the project scope or re-define the ML problem definition if necessary. The Responsible AI review documents remain living documents that we re-visit and update throughout project development, through the feasibility study , as the model is developed and prepared for production, and new information unfolds. The documents can be used and expanded once the model is deployed, and monitored in production.","title":"Responsible AI in ISE"},{"location":"machine-learning/responsible-ai/#responsible-ai-in-ise","text":"","title":"Responsible AI in ISE"},{"location":"machine-learning/responsible-ai/#microsofts-responsible-ai-principles","text":"Every ML project in ISE goes through a Responsible AI (RAI) assessment to ensure that it upholds Microsoft's 6 Responsible AI principles : Fairness Reliability & Safety Privacy & Security Inclusiveness Transparency Accountability Every project goes through the RAI process, whether we are building a new ML model from scratch, or putting an existing model in production.","title":"Microsoft's Responsible AI principles"},{"location":"machine-learning/responsible-ai/#ises-responsible-ai-process","text":"The process begins as soon as we start a prospective project. We start to complete a Responsible AI review document, and an impact assessment, which provides a structured way to explore topics such as: Can the problem be addressed with a non-technical (e.g. social) solution? Can the problem be solved without AI? Would simpler technology suffice? Will the team have access to domain experts (e.g. doctors, refugees) in the field where the AI is applicable? Who are the stakeholders in this project? Who does the AI impact? Are there any vulnerable groups affected? What are the possible benefits and harms to each stakeholder? How can the technology be misused, and what can go wrong? Has the team analyzed the input data properly to make sure that the training data is suitable for machine learning? Is the training data an accurate representation of data that will be used as input in production? Is there a good representation of all users? Is there a fall-back mechanism (a human in the loop, or a way to revert decisions based on the model)? Does data used by the model for training or scoring contain PII? What measures have been taken to remove sensitive data? Does the model impact consequential decisions, like blocking people from getting jobs, loans, health care etc. or in the cases where it may, have appropriate ethical considerations been discussed? Have measures for re-training been considered? How can we address any concerns that arise, and how can we mitigate risk? At this point we research available tools and resources , such as InterpretML or Fairlearn , that we may use on the project. We may change the project scope or re-define the ML problem definition if necessary. 
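To illustrate what using one of these tools can look like, below is a minimal sketch of a fairness check with Fairlearn's MetricFrame. The labels, predictions and group column are made-up placeholders; in a real project they would come from the model's evaluation data and the sensitive features identified in the RAI assessment.

```python
# Minimal Fairlearn sketch: compare metrics across a sensitive feature (illustrative only).
import pandas as pd
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical evaluation data: true labels, model predictions and a sensitive feature per record.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
sensitive = pd.Series(["group_a"] * 4 + ["group_b"] * 4, name="sensitive_group")

metric_frame = MetricFrame(
    metrics={"accuracy": accuracy_score, "recall": recall_score},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive,
)

print(metric_frame.overall)   # metrics over the whole evaluation set
print(metric_frame.by_group)  # the same metrics broken down per group
```

A large gap between groups in by_group is a signal to revisit the data, the features or the model before moving towards production.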
The Responsible AI review documents remain living documents that we re-visit and update throughout project development, through the feasibility study , as the model is developed and prepared for production, and new information unfolds. The documents can be used and expanded once the model is deployed, and monitored in production.","title":"ISE's Responsible AI process"},{"location":"machine-learning/testing-data-science-and-mlops-code/","text":"Testing Data Science and MLOps Code The purpose of this document is to provide samples of tests for the most common operations in MLOps/Data Science projects. Testing the code used for MLOps or data science projects follows the same principles of any other software project. Some scenarios might seem different or more difficult to test. The best way to approach this is to always have a test design session, where the focus is on the input/outputs, exceptions and testing the behavior of data transformations. Designing the tests first makes it easier to test as it forces a more modular style, where each function has one purpose, and extracting common functionality functions and modules. Below are some common operations in MLOps or Data Science projects, along with suggestions on how to test them. Saving and loading data Transforming data Model load or predict Data validation Model testing Saving and Loading Data Reading and writing to csv, reading images or loading audio files are common scenarios encountered in MLOps projects. Example: Verify that a Load Function Calls read_csv if the File Exists utils.py def load_data ( filename : str ) -> pd . DataFrame : if os . path . isfile ( filename ): df = pd . read_csv ( filename , index_col = 'ID' ) return df return None There's no need to test the read_csv function, or the isfile functions, we can leave testing them to the pandas and os developers. The only thing we need to test here is the logic in this function, i.e. that load_data loads the file if the file exists with the right index column, and doesn't load the file if it doesn't exist, and that it returns the expected results. One way to do this would be to provide a sample file and call the function, and verify that the output is None or a DataFrame . This requires separate files to be present, or not present, for the tests to run. This can cause the same test to run on one machine and then fail on a build server which is not a desired behavior. A much better way is to mock calls to isfile , and read_csv . Instead of calling the real function, we will return a predefined return value, or call a stub that doesn't have any side effects. This way no files are needed in the repository to execute the test, and the test will always work the same, independent of what machine it runs on. Note: Below we mock the specific os and pd functions referenced in the utils file, any others are left unaffected and would run as normal. test_utils.py import utils from mock import patch @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_calls_read_csv_if_exists ( mock_isfile , mock_read_csv ): # arrange # always return true for isfile utils . os . path . isfile . return_value = True filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is called with the correct parameters utils . pd . read_csv . assert_called_once_with ( filename , index_col = 'ID' ) Similarly, we can verify that it's called 0 or multiple times. 
In the example below where we verify that it's not called if the file doesn't exist @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_does_not_call_read_csv_if_not_exists ( mock_isfile , mock_read_csv ): # arrange # file doesn't exist utils . os . path . isfile . return_value = False filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is not called assert utils . pd . read_csv . call_count == 0 Example: Using the Same Sample Data for Multiple Tests If more than one test will use the same sample data, fixtures are a good way to reuse this sample data. The sample data can be the contents of a json file, or a csv, or a DataFrame, or even an image. Note: The sample data is still hard coded if possible, and does not need to be large. Only add as much sample data as required for the tests to make the tests readable. Use the fixture to return the sample data, and add this as a parameter to the tests where you want to use the sample data. import pytest @pytest . fixture def house_features_json (): return { 'area' : 25 , 'price' : 2500 , 'rooms' : np . nan } def test_clean_features_cleans_nan_values ( house_features_json ): cleaned_features = clean_features ( house_features_json ) assert cleaned_features [ 'rooms' ] == 0 def test_extract_features_extracts_price_per_area ( house_features_json ): extracted_features = extract_features ( house_features_json ) assert extracted_features [ 'price_per_area' ] == 100 Transforming Data For cleaning and transforming data, test fixed input and output, but try to limit each test to one verification. For example, create one test to verify the output shape of the data. def test_resize_image_generates_the_correct_size (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , 100 , 100 ) # assert resized_image . shape [: 2 ] = ( 100 , 100 ) and one to verify that any padding is made appropriately def test_resize_image_pads_correctly (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # Act resized_image = utils . resize_image ( original_image , 100 , 100 ) # Assert assert resized_image [ 0 ][ 0 ][ 0 ][ 0 ] == 0 assert resized_image [ 0 ][ 0 ][ 2 ][ 0 ] == 1 To test different inputs and expected outputs automatically, use parametrize @pytest . mark . parametrize ( 'orig_height, orig_width, expected_height, expected_width' , [ # smaller than target ( 10 , 10 , 20 , 20 ), # larger than target ( 20 , 20 , 10 , 10 ), # wider than target ( 10 , 20 , 10 , 10 ) ]) def test_resize_image_generates_the_correct_size ( orig_height , orig_width , expected_height , expected_width ): # Arrange original_image = np . ones (( orig_height , orig_width , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , expected_height , expected_width ) # assert resized_image . shape [: 2 ] = ( expected_height , expected_width ) Model Load or Predict When unit testing we should mock model load and model predictions similarly to mocking file access. There may be cases when you want to load your model to do smoke tests, or integration tests. Since these will often take a bit longer to run it's important to be able to separate them from unit tests so that the developers on the team can still run unit tests as part of their test driven development. One way to do this is using marks @pytest . mark . 
longrunning def test_integration_between_two_systems (): # this might take a while Run all tests that are not marked longrunning pytest -v -m \"not longrunning\" Basic Unit Tests for ML Models ML unit tests are not intended to check the accuracy or performance of a model. Unit tests for an ML model is for code quality checks - for example: Does the model accept the correct inputs and produce the correctly shaped outputs? Do the weights of the model update when running fit ? To do this, the ML model tests do not strictly follow best practices of standard Unit tests - not all outside calls are mocked. These tests are much closer to a narrow integration test . However, the benefits of having simple tests for the ML model help to stop a poorly configured model from spending hours in training, while still producing poor results. Examples of how to implement these tests (for Deep Learning models) include: Build a model and compare the shape of input layers to that of an example source of data. Then, compare the output layer shape to the expected output. Initialize the model and record the weights of each layer. Then, run a single epoch of training on a dummy data set, and compare the weights of the \"trained model\" - only check if the values have changed. Train the model on a dummy dataset for a single epoch, and then validate with dummy data - only validate that the prediction is formatted correctly, this model will not be accurate. Data Validation An important part of the unit testing is to include test cases for data validation. For example, no data supplied, images that are not in the expected format, data containing null values or outliers to make sure that the data processing pipeline is robust. Model Testing Apart from unit testing code, we can also test, debug and validate our models in different ways during the training process Some options to consider at this stage: Adversarial and Boundary tests to increase robustness Verifying accuracy for under-represented classes","title":"Testing Data Science and MLOps Code"},{"location":"machine-learning/testing-data-science-and-mlops-code/#testing-data-science-and-mlops-code","text":"The purpose of this document is to provide samples of tests for the most common operations in MLOps/Data Science projects. Testing the code used for MLOps or data science projects follows the same principles of any other software project. Some scenarios might seem different or more difficult to test. The best way to approach this is to always have a test design session, where the focus is on the input/outputs, exceptions and testing the behavior of data transformations. Designing the tests first makes it easier to test as it forces a more modular style, where each function has one purpose, and extracting common functionality functions and modules. Below are some common operations in MLOps or Data Science projects, along with suggestions on how to test them. Saving and loading data Transforming data Model load or predict Data validation Model testing","title":"Testing Data Science and MLOps Code"},{"location":"machine-learning/testing-data-science-and-mlops-code/#saving-and-loading-data","text":"Reading and writing to csv, reading images or loading audio files are common scenarios encountered in MLOps projects.","title":"Saving and Loading Data"},{"location":"machine-learning/testing-data-science-and-mlops-code/#example-verify-that-a-load-function-calls-read_csv-if-the-file-exists","text":"utils.py def load_data ( filename : str ) -> pd . DataFrame : if os . path . 
isfile ( filename ): df = pd . read_csv ( filename , index_col = 'ID' ) return df return None There's no need to test the read_csv function, or the isfile functions, we can leave testing them to the pandas and os developers. The only thing we need to test here is the logic in this function, i.e. that load_data loads the file if the file exists with the right index column, and doesn't load the file if it doesn't exist, and that it returns the expected results. One way to do this would be to provide a sample file and call the function, and verify that the output is None or a DataFrame . This requires separate files to be present, or not present, for the tests to run. This can cause the same test to run on one machine and then fail on a build server which is not a desired behavior. A much better way is to mock calls to isfile , and read_csv . Instead of calling the real function, we will return a predefined return value, or call a stub that doesn't have any side effects. This way no files are needed in the repository to execute the test, and the test will always work the same, independent of what machine it runs on. Note: Below we mock the specific os and pd functions referenced in the utils file, any others are left unaffected and would run as normal. test_utils.py import utils from mock import patch @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_calls_read_csv_if_exists ( mock_isfile , mock_read_csv ): # arrange # always return true for isfile utils . os . path . isfile . return_value = True filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is called with the correct parameters utils . pd . read_csv . assert_called_once_with ( filename , index_col = 'ID' ) Similarly, we can verify that it's called 0 or multiple times. In the example below where we verify that it's not called if the file doesn't exist @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_does_not_call_read_csv_if_not_exists ( mock_isfile , mock_read_csv ): # arrange # file doesn't exist utils . os . path . isfile . return_value = False filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is not called assert utils . pd . read_csv . call_count == 0","title":"Example: Verify that a Load Function Calls read_csv if the File Exists"},{"location":"machine-learning/testing-data-science-and-mlops-code/#example-using-the-same-sample-data-for-multiple-tests","text":"If more than one test will use the same sample data, fixtures are a good way to reuse this sample data. The sample data can be the contents of a json file, or a csv, or a DataFrame, or even an image. Note: The sample data is still hard coded if possible, and does not need to be large. Only add as much sample data as required for the tests to make the tests readable. Use the fixture to return the sample data, and add this as a parameter to the tests where you want to use the sample data. import pytest @pytest . fixture def house_features_json (): return { 'area' : 25 , 'price' : 2500 , 'rooms' : np . 
nan } def test_clean_features_cleans_nan_values ( house_features_json ): cleaned_features = clean_features ( house_features_json ) assert cleaned_features [ 'rooms' ] == 0 def test_extract_features_extracts_price_per_area ( house_features_json ): extracted_features = extract_features ( house_features_json ) assert extracted_features [ 'price_per_area' ] == 100","title":"Example: Using the Same Sample Data for Multiple Tests"},{"location":"machine-learning/testing-data-science-and-mlops-code/#transforming-data","text":"For cleaning and transforming data, test fixed input and output, but try to limit each test to one verification. For example, create one test to verify the output shape of the data. def test_resize_image_generates_the_correct_size (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , 100 , 100 ) # assert resized_image . shape [: 2 ] = ( 100 , 100 ) and one to verify that any padding is made appropriately def test_resize_image_pads_correctly (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # Act resized_image = utils . resize_image ( original_image , 100 , 100 ) # Assert assert resized_image [ 0 ][ 0 ][ 0 ][ 0 ] == 0 assert resized_image [ 0 ][ 0 ][ 2 ][ 0 ] == 1 To test different inputs and expected outputs automatically, use parametrize @pytest . mark . parametrize ( 'orig_height, orig_width, expected_height, expected_width' , [ # smaller than target ( 10 , 10 , 20 , 20 ), # larger than target ( 20 , 20 , 10 , 10 ), # wider than target ( 10 , 20 , 10 , 10 ) ]) def test_resize_image_generates_the_correct_size ( orig_height , orig_width , expected_height , expected_width ): # Arrange original_image = np . ones (( orig_height , orig_width , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , expected_height , expected_width ) # assert resized_image . shape [: 2 ] = ( expected_height , expected_width )","title":"Transforming Data"},{"location":"machine-learning/testing-data-science-and-mlops-code/#model-load-or-predict","text":"When unit testing we should mock model load and model predictions similarly to mocking file access. There may be cases when you want to load your model to do smoke tests, or integration tests. Since these will often take a bit longer to run it's important to be able to separate them from unit tests so that the developers on the team can still run unit tests as part of their test driven development. One way to do this is using marks @pytest . mark . longrunning def test_integration_between_two_systems (): # this might take a while Run all tests that are not marked longrunning pytest -v -m \"not longrunning\"","title":"Model Load or Predict"},{"location":"machine-learning/testing-data-science-and-mlops-code/#basic-unit-tests-for-ml-models","text":"ML unit tests are not intended to check the accuracy or performance of a model. Unit tests for an ML model is for code quality checks - for example: Does the model accept the correct inputs and produce the correctly shaped outputs? Do the weights of the model update when running fit ? To do this, the ML model tests do not strictly follow best practices of standard Unit tests - not all outside calls are mocked. These tests are much closer to a narrow integration test . However, the benefits of having simple tests for the ML model help to stop a poorly configured model from spending hours in training, while still producing poor results. 
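As a concrete illustration of this kind of check, here is a minimal pytest sketch in the spirit of the first two examples listed below. It assumes a small PyTorch model and random dummy data; build_model is a hypothetical stand-in for the project's real model-construction code.

```python
# Illustrative unit tests for an ML model: output-shape check and "weights changed" check.
import torch
from torch import nn


def build_model(n_features: int = 4, n_classes: int = 3) -> nn.Module:
    # Hypothetical stand-in for the project's real model-building function.
    return nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(), nn.Linear(8, n_classes))


def test_model_produces_correctly_shaped_output():
    model = build_model(n_features=4, n_classes=3)
    dummy_batch = torch.randn(10, 4)  # 10 samples, 4 features
    output = model(dummy_batch)
    assert output.shape == (10, 3)


def test_weights_change_after_one_training_step():
    model = build_model(n_features=4, n_classes=3)
    dummy_batch = torch.randn(10, 4)
    dummy_labels = torch.randint(0, 3, (10,))
    weights_before = [p.clone() for p in model.parameters()]

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = nn.CrossEntropyLoss()(model(dummy_batch), dummy_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    weights_after = list(model.parameters())
    # Only verify that values changed - this says nothing about model quality.
    assert any(not torch.equal(before, after) for before, after in zip(weights_before, weights_after))
```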
Examples of how to implement these tests (for Deep Learning models) include: Build a model and compare the shape of input layers to that of an example source of data. Then, compare the output layer shape to the expected output. Initialize the model and record the weights of each layer. Then, run a single epoch of training on a dummy data set, and compare the weights of the \"trained model\" - only check if the values have changed. Train the model on a dummy dataset for a single epoch, and then validate with dummy data - only validate that the prediction is formatted correctly, this model will not be accurate.","title":"Basic Unit Tests for ML Models"},{"location":"machine-learning/testing-data-science-and-mlops-code/#data-validation","text":"An important part of the unit testing is to include test cases for data validation. For example, no data supplied, images that are not in the expected format, data containing null values or outliers to make sure that the data processing pipeline is robust.","title":"Data Validation"},{"location":"machine-learning/testing-data-science-and-mlops-code/#model-testing","text":"Apart from unit testing code, we can also test, debug and validate our models in different ways during the training process Some options to consider at this stage: Adversarial and Boundary tests to increase robustness Verifying accuracy for under-represented classes","title":"Model Testing"},{"location":"machine-learning/tpm-considerations-for-ml-projects/","text":"TPM considerations for Machine Learning projects In this document, we explore some of the Program Management considerations for Machine Learning (ML) projects and suggest recommendations for Technical Program Managers (TPM) to effectively work with Data and Applied Machine Learning engineering teams. Determine the Need for Machine Learning in the Project In Artificial Intelligence (AI) projects, the ML component is generally a part of an overall business problem and NOT the problem itself. Determine the overall business problem first and then evaluate if ML can help address a part of the problem space. Few considerations for identifying the right fit for the project: Engage experts in human experience and employ techniques such as Design Thinking and Problem Formulation to understand the customer needs and human behavior first. Identify the right stakeholders from both business and technical leadership and invite them to these workshops. The outcome should be end-user scenarios and personas to determine the real needs of the users. Focus on System Design principles to identify the architectural components, entities, interfaces, constraints. Ask the right questions early and explore design alternatives with the engineering team. Think hard about the costs of ML and whether we are solving a repetitive problem at scale. Many a times, customer problems can be solved with data analytics, dashboards, or rule-based algorithms as the first phase of the project. Set Expectations for High Ambiguity in ML components ML projects can be plagued with a phenomenon we can call as the \" Death by Unknowns \". Unlike software engineering projects, ML focused projects can result in quick success early (aka sudden decrease in error rate), but this may flatten eventually. Few things to consider: Set clear expectations : Identify the performance metrics and discuss on a \"good enough\" prediction rate that will bring value to the business. 
An 80% \"good enough\" rate may save business costs and increase productivity but if going from 80 to 95% would require unimaginable cost and effort. Is it worth it? Can it be a progressive road map? Create a smaller team and undertake a feasibility analysis through techniques like EDA (Exploratory Data Analysis). A feasibility study is much cheaper to evaluate data quality, customer constraints and model feasibility. It allows a TPM to better understand customer use cases and current environment and can act as a fail-fast mechanism. Note that feasibility should be shorter (in weeks) else it misses the point of saving costs. As in any project, there will be new needs (additional data sources, technical constraints, hiring data labelers, business users time etc.). Incorporate Agile techniques to fail fast and minimize cost and schedule surprises. Notebooks != ML Production Notebooks are a great way to kick start Data Analytics and Applied Machine Learning efforts, however for a production releases, additional constraints should be considered: Understand the end-end flow of data management , how data will be made available (ingestion flows), what's the frequency, storage, retention of data. Plan user stories and design spikes around these flows to ensure a robust ML pipeline is developed. Engineering team should follow the same rigor in building ML projects as in any software engineering project. We at ISE (Industry Solutions Engineering) have built a good set of resources from our learnings in our ISE Engineering Playbook . Think about the how the model will be deployed, for example, are there technical constraints due to an edge device, or network constraints that will prevent updating the model. Understanding of the environment is critical, refer to the Model Production Checklist as a reference to determine model deployment choices. ML Focussed projects are not a \"one-shot\" release solution, they need to be nurtured, evolved, and improved over time. Plan for a continuous improvement lifecycle, the initial phases can be model feasibility and validation to get the good enough prediction rate, the later phases can be then be scaling and improving the models through feedback loops and fresh data sets. Garbage Data In -> Garbage Model Out Data quality is a major factor in affecting model performance and production roll-out, consider the following: Conduct a data exploration workshop and generate a report on data quality that includes missing values, duplicates, unlabeled data, expired or not valid data, incomplete data (e.g., only having male representation in a people dataset). Identify data source reliability to ensure data is coming from a production source. (e.g., are the images from a production or industrial camera or taken from an iPhone/Android phone.) Identify data acquisition constraints : Determine how the data is being obtained and the constraints around it. Some example may include legal, contractual, Privacy, regulation, ethics constraints. These can significantly slow down production roll out if not captured in the early phases of the project. Determine data volumes : Identify if we have enough data for sampling the required business use case and how will the data be improved over time. The thumb rule here is that data should be enough for generalization to avoid overfitting. Plan for Unique Roles in AI projects An ML Project has multiple stages, and each stage may require additional roles. 
For example, Design Research & Designers for Human Experience, a Data Engineer for Data Collection and Feature Engineering, a Data Labeler for labeling structured data, engineers for MLOps and model deployment, and the list can go on. As a TPM, factor in having these resources available at the right time to avoid any schedule risks. Feature Engineering and Hyperparameter Tuning Feature Engineering enables the transformation of data so that it becomes usable for an algorithm. Creating the right features is an art and may require experimentation as well as domain expertise. Allocate time for domain experts to help with improving and identifying the best features. For example, for a natural language processing engine for text extraction of financial documents, we may involve financial researchers and run a relevance judgment exercise and provide a feedback loop to evaluate model performance. Responsible AI Considerations Bias in machine learning could be the number one reason a model does not perform to its intended needs. Plan to incorporate Responsible AI principles from Day 1 to ensure fairness, security, privacy and transparency of the models. For example, for a person recognition algorithm, if the data source is only feeding a specific skin type, then production scenarios may not provide good results. PM Fundamentals Core to a TPM role are the fundamentals that include bringing clarity to the team, design thinking, driving the team to the right technical decisions, managing risk, managing stakeholders, backlog management, and project management. These are a TPM's superpowers . A TPM can complement the machine learning team by ensuring the problem and customer needs are understood, a holistic system design is evaluated, stakeholder expectations are managed, and customer objectives are driven. Here are some references that may help: The T in a TPM The TPM Don't M*ck up framework The mind of a TPM ML Learning Journey for a TPM","title":"TPM considerations for Machine Learning projects"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#tpm-considerations-for-machine-learning-projects","text":"In this document, we explore some of the Program Management considerations for Machine Learning (ML) projects and suggest recommendations for Technical Program Managers (TPM) to effectively work with Data and Applied Machine Learning engineering teams.","title":"TPM considerations for Machine Learning projects"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#determine-the-need-for-machine-learning-in-the-project","text":"In Artificial Intelligence (AI) projects, the ML component is generally a part of an overall business problem and NOT the problem itself. Determine the overall business problem first and then evaluate if ML can help address a part of the problem space. A few considerations for identifying the right fit for the project: Engage experts in human experience and employ techniques such as Design Thinking and Problem Formulation to understand the customer needs and human behavior first. Identify the right stakeholders from both business and technical leadership and invite them to these workshops. The outcome should be end-user scenarios and personas to determine the real needs of the users. Focus on System Design principles to identify the architectural components, entities, interfaces, and constraints. Ask the right questions early and explore design alternatives with the engineering team. Think hard about the costs of ML and whether we are solving a repetitive problem at scale.
Many a times, customer problems can be solved with data analytics, dashboards, or rule-based algorithms as the first phase of the project.","title":"Determine the Need for Machine Learning in the Project"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#set-expectations-for-high-ambiguity-in-ml-components","text":"ML projects can be plagued with a phenomenon we can call as the \" Death by Unknowns \". Unlike software engineering projects, ML focused projects can result in quick success early (aka sudden decrease in error rate), but this may flatten eventually. Few things to consider: Set clear expectations : Identify the performance metrics and discuss on a \"good enough\" prediction rate that will bring value to the business. An 80% \"good enough\" rate may save business costs and increase productivity but if going from 80 to 95% would require unimaginable cost and effort. Is it worth it? Can it be a progressive road map? Create a smaller team and undertake a feasibility analysis through techniques like EDA (Exploratory Data Analysis). A feasibility study is much cheaper to evaluate data quality, customer constraints and model feasibility. It allows a TPM to better understand customer use cases and current environment and can act as a fail-fast mechanism. Note that feasibility should be shorter (in weeks) else it misses the point of saving costs. As in any project, there will be new needs (additional data sources, technical constraints, hiring data labelers, business users time etc.). Incorporate Agile techniques to fail fast and minimize cost and schedule surprises.","title":"Set Expectations for High Ambiguity in ML components"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#notebooks-ml-production","text":"Notebooks are a great way to kick start Data Analytics and Applied Machine Learning efforts, however for a production releases, additional constraints should be considered: Understand the end-end flow of data management , how data will be made available (ingestion flows), what's the frequency, storage, retention of data. Plan user stories and design spikes around these flows to ensure a robust ML pipeline is developed. Engineering team should follow the same rigor in building ML projects as in any software engineering project. We at ISE (Industry Solutions Engineering) have built a good set of resources from our learnings in our ISE Engineering Playbook . Think about the how the model will be deployed, for example, are there technical constraints due to an edge device, or network constraints that will prevent updating the model. Understanding of the environment is critical, refer to the Model Production Checklist as a reference to determine model deployment choices. ML Focussed projects are not a \"one-shot\" release solution, they need to be nurtured, evolved, and improved over time. 
Plan for a continuous improvement lifecycle, the initial phases can be model feasibility and validation to get the good enough prediction rate, the later phases can be then be scaling and improving the models through feedback loops and fresh data sets.","title":"Notebooks != ML Production"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#garbage-data-in-garbage-model-out","text":"Data quality is a major factor in affecting model performance and production roll-out, consider the following: Conduct a data exploration workshop and generate a report on data quality that includes missing values, duplicates, unlabeled data, expired or not valid data, incomplete data (e.g., only having male representation in a people dataset). Identify data source reliability to ensure data is coming from a production source. (e.g., are the images from a production or industrial camera or taken from an iPhone/Android phone.) Identify data acquisition constraints : Determine how the data is being obtained and the constraints around it. Some example may include legal, contractual, Privacy, regulation, ethics constraints. These can significantly slow down production roll out if not captured in the early phases of the project. Determine data volumes : Identify if we have enough data for sampling the required business use case and how will the data be improved over time. The thumb rule here is that data should be enough for generalization to avoid overfitting.","title":"Garbage Data In -> Garbage Model Out"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#plan-for-unique-roles-in-ai-projects","text":"An ML Project has multiple stages, and each stage may require additional roles. For example, Design Research & Designers for Human Experience, Data Engineer for Data Collection, Feature Engineering, a Data Labeler for labeling structured data, engineers for MLOps and model deployment and the list can go on. As a TPM, factor in having these resources available at the right time to avoid any schedule risks.","title":"Plan for Unique Roles in AI projects"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#feature-engineering-and-hyperparameter-tuning","text":"Feature Engineering enables the transformation of data so that it becomes usable for an algorithm. Creating the right features is an art and may require experimentation as well as domain expertise. Allocate time for domain experts to help with improving and identifying the best features. For example, for a natural language processing engine for text extraction of financial documents, we may involve financial researchers and run a relevance judgment exercise and provide a feedback loop to evaluate model performance.","title":"Feature Engineering and Hyperparameter Tuning"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#responsible-ai-considerations","text":"Bias in machine learning could be the number one issue of a model not performing to its intended needs. Plan to incorporate Responsible AI principles from Day 1 to ensure fairness, security, privacy and transparency of the models. 
For example, for a person recognition algorithm, if the data source is only feeding a specific skin type, then production scenarios may not provide good results.","title":"Responsible AI Considerations"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#pm-fundamentals","text":"Core to a TPM role are the fundamentals that include bringing clarity to the team, design thinking, driving the team to the right technical decisions, managing risk, managing stakeholders, backlog management, project management. These are a TPM superpowers . A TPM can complement the machine learning team by ensuring the problem and customer needs are understood, a wholistic system design is evaluated, the stakeholder expectations and driving customer objectives. Here are some references that may help: The T in a TPM The TPM Don't M*ck up framework The mind of a TPM ML Learning Journey for a TPM","title":"PM Fundamentals"},{"location":"non-functional-requirements/accessibility/","text":"Accessibility Accessibility is a critical component of any successful project and ensures the solutions we build are usable and enjoyed by as many people as possible. While meeting accessibility compliance standards is required, accessibility is much broader than compliance alone. Accessibility is about using techniques like inclusive design to infuse different perspectives and the full range of human diversity into the products we build. By incorporating accessibility into your project from the initial envisioning through MVP and beyond, you are promoting a more inclusive environment for your team and helping close the \"Disability Divide\" that exists for many people living with disabilities. Getting Started If you are new to accessibility or are looking for an overview of accessibility fundamentals, Microsoft Learn offers a great training course that covers a broad range of topics from creating accessible content in Office to designing accessibility features in your own apps. You can learn more about the course or get started at Microsoft Learn: Accessibility Fundamentals . Inclusive Design Inclusive design is a methodology that embraces the full range of human diversity as a resource to help build better products and services. Inclusive design compliments accessibility going beyond accessibility compliance standards to ensure products are usable and enjoyed by all people. By leveraging the inclusive design methodology early in a project, you can expect a more inclusive and better solution for everyone. The Microsoft Inclusive Design website offers a variety of resources for incorporating inclusive design in your projects including inclusive design activities that can be used in envisioning and architecture design sessions. The Microsoft Inclusive Design methodology includes the following principles: Recognize Exclusion Designing for inclusivity not only opens up our products and services to more people, it also reflects how people really are. All humans grow and adapt to the world around them and we want our designs to reflect that. Solve for One, Extend to Many Everyone has abilities, and limits to those abilities. Designing for people with permanent disabilities actually results in designs that benefit people universally. Constraints are a beautiful thing. Learn from Diversity Human beings are the real experts in adapting to diversity. Inclusive design puts people in the center from the very start of the process, and those fresh, diverse perspectives are the key to true insight. 
Tools Accessibility Insights Accessibility Insights is a free, open-source solution for identifying accessibility issues in Windows, Android, and web applications. Accessibility Insights can identify a broad range of accessibility issues including problems with missing image alt tags, heading organization, tab order, color contrast, and many more. In addition, you can use Accessibility Insights to simulate color blindness to ensure your user interface is accessible to those that experience some form of color blindness. You can download Accessibility Insights here: https://accessibilityinsights.io/downloads/ Accessibility Linter Deque Systems are web accessibility experts that provide accessibility training and tools to many organizations including Microsoft. One of the many tools offered by Deque is the axe Accessibility Linter for VS Code . This VS Code extension use the axe-core rules engine to identify accessibility issues in HTML, Angular, React, Markdown, and Vue. Using an accessibility linter can help ensure accessibility issues get addressed early in the development lifecycle. Practices Accessibility Testing Accessibility testing is a specialized subset of software testing and includes automated tools and manual testing processes that vary from project to project. In addition to tools like Accessibility Insights discussed earlier, there are many other solutions for accessibility testing. The W3C provides a comprehensive list of evaluation and testing tools on their website at https://www.w3.org/WAI/ER/tools/ . If you are looking to add automated testing to your Azure Pipelines, you may want to consider the Accessibility Testing extension built by Drew Lewis, a former Microsoft employee. It's important to keep in mind that automated tooling alone is not enough - make sure to augment your automated tests with manual ones. Accessibility Insights (linked above) can guide users through some manual testing steps. Code and Documentation Basics Before you get to testing, you can make some small changes in how you write code and documentation. Document! Beyond text documentation, this also means code comments, clear variable and file naming, and pipeline or script outputs that clearly report success or failure and give details. Avoid small case for variable and file names, hashtags, neologisms, etc. Use camelCase, snake_case, or other methods of creating separation between words. Introduce abbreviations by spelling the full term out, then the abbreviation in parentheses. Use headers effectively to break up content by topic. Don't use more than one h1 per page, and don't skip levels (e.g. use an h3 directly under an h1). Avoid using formatting to make something look like a header when it's not. Use descriptive link text. Avoid attaching a link to phrases like \"Read more\" and ensure that the text directly states what it links to. Link text should be able to stand on its own. When including images or diagrams, add alt text. This should never just be \"Image\" or \"Diagram\" (or similar). In your description, highlight the purpose of the image or diagram in the page and what it is intended to convey. Prefer tabs to spaces when possible. This allows users to default to their preferred tab width, so users with a range of vision can all take in code easily. 
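As a small illustration of several of the points above (descriptive snake_case names, an abbreviation spelled out before it is used, and script output that states success or failure explicitly), here is a hypothetical Python snippet; the file path and report name are made up for the example.

```python
# Illustrative sketch only: naming and output practices from the list above.
from pathlib import Path

# Service Level Agreement (SLA) summary produced by a nightly pipeline (hypothetical path).
sla_summary_path = Path("reports/sla_summary.csv")


def main() -> int:
    """Report clearly whether the expected summary file exists."""
    if not sla_summary_path.exists():
        print(f"FAILURE: expected SLA summary not found at {sla_summary_path}")
        return 1
    print(f"SUCCESS: SLA summary found at {sla_summary_path}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```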
Resources Microsoft Accessibility Technology & Tools Web Content Accessibility Guidelines (WCAG) Accessibility Guidelines and Requirements | Microsoft Style Guide Google Developer Style Guide: Write Accessible Documentation","title":"Accessibility"},{"location":"non-functional-requirements/accessibility/#accessibility","text":"Accessibility is a critical component of any successful project and ensures the solutions we build are usable and enjoyed by as many people as possible. While meeting accessibility compliance standards is required, accessibility is much broader than compliance alone. Accessibility is about using techniques like inclusive design to infuse different perspectives and the full range of human diversity into the products we build. By incorporating accessibility into your project from the initial envisioning through MVP and beyond, you are promoting a more inclusive environment for your team and helping close the \"Disability Divide\" that exists for many people living with disabilities.","title":"Accessibility"},{"location":"non-functional-requirements/accessibility/#getting-started","text":"If you are new to accessibility or are looking for an overview of accessibility fundamentals, Microsoft Learn offers a great training course that covers a broad range of topics from creating accessible content in Office to designing accessibility features in your own apps. You can learn more about the course or get started at Microsoft Learn: Accessibility Fundamentals .","title":"Getting Started"},{"location":"non-functional-requirements/accessibility/#inclusive-design","text":"Inclusive design is a methodology that embraces the full range of human diversity as a resource to help build better products and services. Inclusive design compliments accessibility going beyond accessibility compliance standards to ensure products are usable and enjoyed by all people. By leveraging the inclusive design methodology early in a project, you can expect a more inclusive and better solution for everyone. The Microsoft Inclusive Design website offers a variety of resources for incorporating inclusive design in your projects including inclusive design activities that can be used in envisioning and architecture design sessions. The Microsoft Inclusive Design methodology includes the following principles:","title":"Inclusive Design"},{"location":"non-functional-requirements/accessibility/#recognize-exclusion","text":"Designing for inclusivity not only opens up our products and services to more people, it also reflects how people really are. All humans grow and adapt to the world around them and we want our designs to reflect that.","title":"Recognize Exclusion"},{"location":"non-functional-requirements/accessibility/#solve-for-one-extend-to-many","text":"Everyone has abilities, and limits to those abilities. Designing for people with permanent disabilities actually results in designs that benefit people universally. Constraints are a beautiful thing.","title":"Solve for One, Extend to Many"},{"location":"non-functional-requirements/accessibility/#learn-from-diversity","text":"Human beings are the real experts in adapting to diversity. 
Inclusive design puts people in the center from the very start of the process, and those fresh, diverse perspectives are the key to true insight.","title":"Learn from Diversity"},{"location":"non-functional-requirements/accessibility/#tools","text":"","title":"Tools"},{"location":"non-functional-requirements/accessibility/#accessibility-insights","text":"Accessibility Insights is a free, open-source solution for identifying accessibility issues in Windows, Android, and web applications. Accessibility Insights can identify a broad range of accessibility issues including problems with missing image alt tags, heading organization, tab order, color contrast, and many more. In addition, you can use Accessibility Insights to simulate color blindness to ensure your user interface is accessible to those that experience some form of color blindness. You can download Accessibility Insights here: https://accessibilityinsights.io/downloads/","title":"Accessibility Insights"},{"location":"non-functional-requirements/accessibility/#accessibility-linter","text":"Deque Systems are web accessibility experts that provide accessibility training and tools to many organizations including Microsoft. One of the many tools offered by Deque is the axe Accessibility Linter for VS Code . This VS Code extension use the axe-core rules engine to identify accessibility issues in HTML, Angular, React, Markdown, and Vue. Using an accessibility linter can help ensure accessibility issues get addressed early in the development lifecycle.","title":"Accessibility Linter"},{"location":"non-functional-requirements/accessibility/#practices","text":"","title":"Practices"},{"location":"non-functional-requirements/accessibility/#accessibility-testing","text":"Accessibility testing is a specialized subset of software testing and includes automated tools and manual testing processes that vary from project to project. In addition to tools like Accessibility Insights discussed earlier, there are many other solutions for accessibility testing. The W3C provides a comprehensive list of evaluation and testing tools on their website at https://www.w3.org/WAI/ER/tools/ . If you are looking to add automated testing to your Azure Pipelines, you may want to consider the Accessibility Testing extension built by Drew Lewis, a former Microsoft employee. It's important to keep in mind that automated tooling alone is not enough - make sure to augment your automated tests with manual ones. Accessibility Insights (linked above) can guide users through some manual testing steps.","title":"Accessibility Testing"},{"location":"non-functional-requirements/accessibility/#code-and-documentation-basics","text":"Before you get to testing, you can make some small changes in how you write code and documentation. Document! Beyond text documentation, this also means code comments, clear variable and file naming, and pipeline or script outputs that clearly report success or failure and give details. Avoid small case for variable and file names, hashtags, neologisms, etc. Use camelCase, snake_case, or other methods of creating separation between words. Introduce abbreviations by spelling the full term out, then the abbreviation in parentheses. Use headers effectively to break up content by topic. Don't use more than one h1 per page, and don't skip levels (e.g. use an h3 directly under an h1). Avoid using formatting to make something look like a header when it's not. Use descriptive link text. 
Avoid attaching a link to phrases like \"Read more\" and ensure that the text directly states what it links to. Link text should be able to stand on its own. When including images or diagrams, add alt text. This should never just be \"Image\" or \"Diagram\" (or similar). In your description, highlight the purpose of the image or diagram in the page and what it is intended to convey. Prefer tabs to spaces when possible. This allows users to default to their preferred tab width, so users with a range of vision can all take in code easily.","title":"Code and Documentation Basics"},{"location":"non-functional-requirements/accessibility/#resources","text":"Microsoft Accessibility Technology & Tools Web Content Accessibility Guidelines (WCAG) Accessibility Guidelines and Requirements | Microsoft Style Guide Google Developer Style Guide: Write Accessible Documentation","title":"Resources"},{"location":"non-functional-requirements/availability/","text":"Availability Availability refers to the degree to which a system is operational and accessible when needed for use. It is a critical non-functional requirement that ensures users can rely on the system to perform its intended functions without unexpected downtime. High availability is vital for maintaining user trust and satisfaction, especially in industries where service interruptions can lead to significant financial losses or even jeopardize safety. Achieving high availability often involves strategies like redundancy, failover mechanisms, and robust maintenance practices to minimize both planned and unplanned outages. In essence, availability ensures that the system is there when users need it, which is fundamental for any service-oriented or mission-critical application. Characteristics Uptime: This is the proportion of time the system is operational and accessible. It's often measured as a percentage over a specific period (e.g., 99.99% uptime). Redundancy: Implementing backup components or systems that can take over in case of a failure. This ensures continuous operation even if one part fails. Fault Tolerance: The system's ability to continue operating correctly even when part of it fails. This typically involves designing systems that can handle failures gracefully without significant impact on availability. Failover Mechanisms: Automatic switching to a standby system or component when the primary one fails. This minimizes downtime and maintains availability. Scalability: The system's capacity to handle increasing loads without compromising availability. This often involves scaling resources up or out to meet demand. Maintenance and Monitoring: Regular maintenance and real-time monitoring help to detect issues early and address them before they cause downtime. Proactive maintenance schedules and monitoring tools are crucial for maintaining high availability. Recovery Time Objective (RTO) and Recovery Point Objective (RPO): RTO is the maximum acceptable time to restore service after an outage, while RPO is the maximum acceptable amount of data loss measured in time. These metrics guide the design of disaster recovery plans to ensure availability. Service Level Agreements (SLAs): Formal agreements that specify the expected level of service availability and the penalties or compensations if these levels are not met. SLAs help set clear expectations and accountability. Implementations Implementing availability involves various strategies and technologies designed to ensure that a system remains operational and accessible. 
Here are some examples: Redundant Systems: Deploying duplicate hardware and software systems that can take over if the primary system fails. For instance, using multiple servers in different geographic locations ensures that if one server goes down, another can handle the load. Load Balancing: Distributing incoming network traffic across multiple servers so that no single server becomes a bottleneck. This not only improves performance but also enhances availability by ensuring that if one server fails, the others can take over the traffic. Failover Mechanisms: Implementing automatic failover processes that switch operations to a backup system when a failure is detected. For example, in a database system, using a hot standby database that immediately takes over if the primary database fails. Clustering: Using a group of servers (a cluster) that work together to provide a service. If one server in the cluster fails, others can pick up the load without interrupting the service. This is commonly used in web hosting and database management. Geographic Distribution: Placing copies of data and services in multiple, geographically dispersed data centers. This approach not only improves access speed for users around the world but also protects against regional failures due to natural disasters or other localized issues. Data Replication: Continuously copying and synchronizing data across multiple locations. Techniques like database replication and distributed file systems ensure that data is always available even if one site goes down. Disaster Recovery Plans: Developing and regularly testing comprehensive disaster recovery plans that include steps for restoring services and data in case of a catastrophic failure. These plans often include off-site backups and detailed procedures for quickly bringing systems back online. Real-Time Monitoring and Alerts: Implementing monitoring tools that constantly check the health of the system and send alerts if something goes wrong. This enables quick response to potential issues before they lead to significant downtime. Scheduled Maintenance Windows: Planning and communicating scheduled maintenance periods during off-peak hours to minimize the impact on users. Systems can be designed to perform maintenance tasks without taking the entire service offline. High Availability Software Architectures: Designing software with high availability in mind, using principles like microservices architecture, which isolates different functions of an application. This isolation ensures that a failure in one component doesn\u2019t bring down the entire system. Resources Recommendations for highly available multi-region design Recommendations for using availability zones and regions","title":"Availability"},{"location":"non-functional-requirements/availability/#availability","text":"Availability refers to the degree to which a system is operational and accessible when needed for use. It is a critical non-functional requirement that ensures users can rely on the system to perform its intended functions without unexpected downtime. High availability is vital for maintaining user trust and satisfaction, especially in industries where service interruptions can lead to significant financial losses or even jeopardize safety. Achieving high availability often involves strategies like redundancy, failover mechanisms, and robust maintenance practices to minimize both planned and unplanned outages. 
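To make the failover and health-monitoring ideas in the availability implementations above more concrete, here is a minimal, illustrative Python sketch of client-side failover between a primary and a standby endpoint. The URLs, paths, and timeouts are hypothetical placeholders, and the requests library is assumed to be available; in production this responsibility more commonly sits in a load balancer or traffic manager rather than in client code.

```python
"""Illustrative client-side failover: try the primary endpoint, then a standby."""
import requests

# Ordered list of equivalent service endpoints (primary first, then standbys).
# These hostnames are placeholders for this sketch.
ENDPOINTS = [
    "https://primary.example.com",
    "https://standby.example.com",
]


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint's health probe answers with HTTP 200."""
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


def get_orders() -> dict:
    """Call the first healthy endpoint; raise if none are available."""
    for base_url in ENDPOINTS:
        if is_healthy(base_url):
            response = requests.get(f"{base_url}/api/orders", timeout=5)
            response.raise_for_status()
            return response.json()
    raise RuntimeError("No healthy endpoint available - escalate and trigger failover alerting")
```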
In essence, availability ensures that the system is there when users need it, which is fundamental for any service-oriented or mission-critical application.","title":"Availability"},{"location":"non-functional-requirements/availability/#characteristics","text":"Uptime: This is the proportion of time the system is operational and accessible. It's often measured as a percentage over a specific period (e.g., 99.99% uptime). Redundancy: Implementing backup components or systems that can take over in case of a failure. This ensures continuous operation even if one part fails. Fault Tolerance: The system's ability to continue operating correctly even when part of it fails. This typically involves designing systems that can handle failures gracefully without significant impact on availability. Failover Mechanisms: Automatic switching to a standby system or component when the primary one fails. This minimizes downtime and maintains availability. Scalability: The system's capacity to handle increasing loads without compromising availability. This often involves scaling resources up or out to meet demand. Maintenance and Monitoring: Regular maintenance and real-time monitoring help to detect issues early and address them before they cause downtime. Proactive maintenance schedules and monitoring tools are crucial for maintaining high availability. Recovery Time Objective (RTO) and Recovery Point Objective (RPO): RTO is the maximum acceptable time to restore service after an outage, while RPO is the maximum acceptable amount of data loss measured in time. These metrics guide the design of disaster recovery plans to ensure availability. Service Level Agreements (SLAs): Formal agreements that specify the expected level of service availability and the penalties or compensations if these levels are not met. SLAs help set clear expectations and accountability.","title":"Characteristics"},{"location":"non-functional-requirements/availability/#implementations","text":"Implementing availability involves various strategies and technologies designed to ensure that a system remains operational and accessible. Here are some examples: Redundant Systems: Deploying duplicate hardware and software systems that can take over if the primary system fails. For instance, using multiple servers in different geographic locations ensures that if one server goes down, another can handle the load. Load Balancing: Distributing incoming network traffic across multiple servers so that no single server becomes a bottleneck. This not only improves performance but also enhances availability by ensuring that if one server fails, the others can take over the traffic. Failover Mechanisms: Implementing automatic failover processes that switch operations to a backup system when a failure is detected. For example, in a database system, using a hot standby database that immediately takes over if the primary database fails. Clustering: Using a group of servers (a cluster) that work together to provide a service. If one server in the cluster fails, others can pick up the load without interrupting the service. This is commonly used in web hosting and database management. Geographic Distribution: Placing copies of data and services in multiple, geographically dispersed data centers. This approach not only improves access speed for users around the world but also protects against regional failures due to natural disasters or other localized issues. Data Replication: Continuously copying and synchronizing data across multiple locations. 
Techniques like database replication and distributed file systems ensure that data is always available even if one site goes down. Disaster Recovery Plans: Developing and regularly testing comprehensive disaster recovery plans that include steps for restoring services and data in case of a catastrophic failure. These plans often include off-site backups and detailed procedures for quickly bringing systems back online. Real-Time Monitoring and Alerts: Implementing monitoring tools that constantly check the health of the system and send alerts if something goes wrong. This enables quick response to potential issues before they lead to significant downtime. Scheduled Maintenance Windows: Planning and communicating scheduled maintenance periods during off-peak hours to minimize the impact on users. Systems can be designed to perform maintenance tasks without taking the entire service offline. High Availability Software Architectures: Designing software with high availability in mind, using principles like microservices architecture, which isolates different functions of an application. This isolation ensures that a failure in one component doesn\u2019t bring down the entire system.","title":"Implementations"},{"location":"non-functional-requirements/availability/#resources","text":"Recommendations for highly available multi-region design Recommendations for using availability zones and regions","title":"Resources"},{"location":"non-functional-requirements/capacity/","text":"Capacity Capacity defines the maximum load or volume that a system can handle while maintaining specified performance criteria. This attribute is crucial for ensuring that the system can support the anticipated number of users, transactions, or data volume without degradation in performance. Characteristics Maximum Load: Capacity defines the upper limit of user activity or workload that the system can handle without performance degradation. This includes peak loads during high-demand periods. Scalability: The system's capacity should be scalable, meaning it can be expanded or upgraded to accommodate increased workload or data volume as the organization grows. Resource Management: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are critical for maintaining capacity. Performance Criteria: Capacity is defined within specific performance criteria, such as response time, throughput, and transaction processing rates, ensuring that the system maintains acceptable performance levels under load. Load Balancing: Systems with high capacity often employ load balancing techniques to distribute workload evenly across servers or resources, optimizing performance and avoiding overload. Failover and Redundancy: Capacity planning may include provisions for failover mechanisms and redundancy to ensure continuity of service and minimal downtime in case of hardware failures or traffic spikes. Monitoring and Testing: Continuous monitoring and periodic load testing are essential to verify that the system's capacity meets expected levels and to identify potential bottlenecks or performance issues proactively. Load testing is one of the critical methods used to ensure that the system can handle expected loads. Capacity Planning: Effective capacity management involves forecasting future needs based on growth projections and historical usage patterns, allowing for timely upgrades or adjustments to infrastructure and resources. 
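As a concrete illustration of the Monitoring and Testing point above, the following sketch uses Locust, one possible Python load-testing tool, to generate a repeatable load profile against a system under test. The endpoints, task weights, and wait times are hypothetical and would be tuned to match real usage patterns.

```python
# locustfile.py - minimal load-test sketch (endpoints and wait times are placeholders).
from locust import HttpUser, task, between


class ApiUser(HttpUser):
    # Each simulated user pauses 1-5 seconds between requests.
    wait_time = between(1, 5)

    @task(3)  # weighted: browsing happens roughly 3x more often than searching
    def list_products(self):
        self.client.get("/api/products")

    @task(1)
    def search(self):
        self.client.get("/api/products?category=books")
```

Run it with, for example, `locust -f locustfile.py --host https://test.example.com` (the host is a placeholder) and compare the observed throughput and response times against your capacity targets.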
Implementations Capacity is typically implemented through a combination of architectural design, infrastructure planning, and performance optimization strategies. For example: Scalable Architecture: Designing the system with scalability in mind allows it to handle increased load by adding resources (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using distributed systems, microservices architecture, and load balancing mechanisms to distribute workload across multiple servers or instances. It is also important to plan for scalability with a forward-looking approach, typically anticipating the needs for at least the next 6 months, to ensure the system can accommodate future growth and demand. Resource Allocation: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are crucial. This can include techniques like resource pooling, where resources are shared among multiple users or tasks to optimize utilization. Caching: Utilizing caching mechanisms (e.g., in-memory caching, content delivery networks) to store frequently accessed data or computations can reduce the load on backend services and improve response times, thereby enhancing overall capacity. Database Optimization: Ensure that data is modeled efficiently to support optimal performance and scalability. Optimizing database queries, indexing frequently accessed data, and using database scaling techniques (e.g., sharding, replication) can improve the system's ability to handle large volumes of data and concurrent transactions. Load Balancing: Implementing load balancers to evenly distribute incoming traffic across multiple servers or instances helps prevent overload on any single component and ensures efficient resource utilization. Auto-scaling: Leveraging auto-scaling capabilities provided by cloud platforms allows the system to automatically adjust its capacity based on real-time demand. This ensures that additional resources are provisioned during peak periods and scaled down during low traffic times, optimizing cost and performance. Performance Monitoring and Tuning: Continuous monitoring of system performance metrics (e.g., CPU usage, memory utilization, response times) helps identify bottlenecks and areas for optimization. Tuning configurations, optimizing code, and conducting performance testing are essential to maintain and improve system capacity over time. High Availability and Fault Tolerance: Implementing strategies such as redundant servers, failover mechanisms, and disaster recovery plans ensures that the system remains available and operational even in the event of hardware failures or other disruptions. Capacity Planning: Conducting thorough capacity planning based on anticipated growth, usage patterns, and business requirements helps forecast resource needs and proactively scale the system to meet future demands. Resources Performance Testing","title":"Capacity"},{"location":"non-functional-requirements/capacity/#capacity","text":"Capacity defines the maximum load or volume that a system can handle while maintaining specified performance criteria. 
This attribute is crucial for ensuring that the system can support the anticipated number of users, transactions, or data volume without degradation in performance.","title":"Capacity"},{"location":"non-functional-requirements/capacity/#characteristics","text":"Maximum Load: Capacity defines the upper limit of user activity or workload that the system can handle without performance degradation. This includes peak loads during high-demand periods. Scalability: The system's capacity should be scalable, meaning it can be expanded or upgraded to accommodate increased workload or data volume as the organization grows. Resource Management: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are critical for maintaining capacity. Performance Criteria: Capacity is defined within specific performance criteria, such as response time, throughput, and transaction processing rates, ensuring that the system maintains acceptable performance levels under load. Load Balancing: Systems with high capacity often employ load balancing techniques to distribute workload evenly across servers or resources, optimizing performance and avoiding overload. Failover and Redundancy: Capacity planning may include provisions for failover mechanisms and redundancy to ensure continuity of service and minimal downtime in case of hardware failures or traffic spikes. Monitoring and Testing: Continuous monitoring and periodic load testing are essential to verify that the system's capacity meets expected levels and to identify potential bottlenecks or performance issues proactively. Load testing is one of the critical methods used to ensure that the system can handle expected loads. Capacity Planning: Effective capacity management involves forecasting future needs based on growth projections and historical usage patterns, allowing for timely upgrades or adjustments to infrastructure and resources.","title":"Characteristics"},{"location":"non-functional-requirements/capacity/#implementations","text":"Capacity is typically implemented through a combination of architectural design, infrastructure planning, and performance optimization strategies. For example: Scalable Architecture: Designing the system with scalability in mind allows it to handle increased load by adding resources (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using distributed systems, microservices architecture, and load balancing mechanisms to distribute workload across multiple servers or instances. It is also important to plan for scalability with a forward-looking approach, typically anticipating the needs for at least the next 6 months, to ensure the system can accommodate future growth and demand. Resource Allocation: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are crucial. This can include techniques like resource pooling, where resources are shared among multiple users or tasks to optimize utilization. Caching: Utilizing caching mechanisms (e.g., in-memory caching, content delivery networks) to store frequently accessed data or computations can reduce the load on backend services and improve response times, thereby enhancing overall capacity. Database Optimization: Ensure that data is modeled efficiently to support optimal performance and scalability. 
Optimizing database queries, indexing frequently accessed data, and using database scaling techniques (e.g., sharding, replication) can improve the system's ability to handle large volumes of data and concurrent transactions. Load Balancing: Implementing load balancers to evenly distribute incoming traffic across multiple servers or instances helps prevent overload on any single component and ensures efficient resource utilization. Auto-scaling: Leveraging auto-scaling capabilities provided by cloud platforms allows the system to automatically adjust its capacity based on real-time demand. This ensures that additional resources are provisioned during peak periods and scaled down during low traffic times, optimizing cost and performance. Performance Monitoring and Tuning: Continuous monitoring of system performance metrics (e.g., CPU usage, memory utilization, response times) helps identify bottlenecks and areas for optimization. Tuning configurations, optimizing code, and conducting performance testing are essential to maintain and improve system capacity over time. High Availability and Fault Tolerance: Implementing strategies such as redundant servers, failover mechanisms, and disaster recovery plans ensures that the system remains available and operational even in the event of hardware failures or other disruptions. Capacity Planning: Conducting thorough capacity planning based on anticipated growth, usage patterns, and business requirements helps forecast resource needs and proactively scale the system to meet future demands.","title":"Implementations"},{"location":"non-functional-requirements/capacity/#resources","text":"Performance Testing","title":"Resources"},{"location":"non-functional-requirements/compliance/","text":"Compliance Compliance refers to the adherence to regulatory standards, legal requirements, and organizational policies that govern the handling of data, security practices, and operational procedures. It ensures that the software solution meets specific industry regulations (such as GDPR, HIPAA, PCI-DSS) and internal governance frameworks. Characteristics Regulatory Adherence: Compliance requires the software system to adhere to specific regulatory frameworks relevant to its industry or geographic region. This includes laws and regulations related to data protection, privacy, security, financial transactions, healthcare, and more. Data Privacy: Ensuring that the system handles sensitive data in accordance with privacy laws and regulations, such as implementing encryption, access controls, data anonymization, and secure data storage practices. This includes proper management of Personally Identifiable Information (PII) and encapsulation of secrets to prevent unauthorized access and ensure compliance with data protection standards. Security Standards: Compliance mandates adherence to security standards and best practices to protect against unauthorized access, data breaches, and cyber threats. This involves implementing measures such as firewalls, intrusion detection systems, secure authentication mechanisms, and regular security audits. Auditability: The system must be designed and operated in a way that allows for comprehensive auditing and logging of activities. This ensures that compliance with regulations can be verified through audit trails and compliance reports. Documentation: Comprehensive documentation of policies, procedures, and controls related to compliance requirements is essential. 
This includes documenting data handling processes, security measures, incident response plans, and compliance assessments. Risk Management: Implementing risk assessment and management practices to identify, assess, and mitigate risks associated with non-compliance. This involves conducting risk assessments regularly and implementing controls to manage identified risks effectively. Change Management: Compliance requires robust change management processes to ensure that any updates or modifications to the software system do not compromise regulatory compliance. This includes testing changes thoroughly and obtaining necessary approvals. Implementations Implementing compliance involves a systematic approach that integrates regulatory requirements, organizational policies, and best practices into the development, deployment, and operation phases. Here are common strategies and practices used to implement compliance: Compliance Framework Selection: Choosing and adopting a compliance framework or standards (e.g., ISO 27001, NIST Cybersecurity Framework) that aligns with the organization's compliance obligations and provides guidelines for implementing controls. Privacy by Design: Integrating privacy considerations into the software design and development process. This includes conducting privacy impact assessments, implementing data minimization techniques, and ensuring user consent mechanisms are in place where required. Audit and Monitoring: Establishing mechanisms for continuous monitoring, auditing, and logging of activities within the software system to ensure compliance with regulatory requirements. This includes maintaining audit trails, generating compliance reports, and conducting regular security assessments. Documentation and Record Keeping: Maintaining comprehensive documentation of compliance efforts, including policies, procedures, audit reports, risk assessments, and compliance certifications. Resources General Data Protection Regulation (GDPR) Purview Compliance Manager","title":"Compliance"},{"location":"non-functional-requirements/compliance/#compliance","text":"Compliance refers to the adherence to regulatory standards, legal requirements, and organizational policies that govern the handling of data, security practices, and operational procedures. It ensures that the software solution meets specific industry regulations (such as GDPR, HIPAA, PCI-DSS) and internal governance frameworks.","title":"Compliance"},{"location":"non-functional-requirements/compliance/#characteristics","text":"Regulatory Adherence: Compliance requires the software system to adhere to specific regulatory frameworks relevant to its industry or geographic region. This includes laws and regulations related to data protection, privacy, security, financial transactions, healthcare, and more. Data Privacy: Ensuring that the system handles sensitive data in accordance with privacy laws and regulations, such as implementing encryption, access controls, data anonymization, and secure data storage practices. This includes proper management of Personally Identifiable Information (PII) and encapsulation of secrets to prevent unauthorized access and ensure compliance with data protection standards. Security Standards: Compliance mandates adherence to security standards and best practices to protect against unauthorized access, data breaches, and cyber threats. This involves implementing measures such as firewalls, intrusion detection systems, secure authentication mechanisms, and regular security audits. 
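To illustrate the Data Privacy point above (proper management of PII), here is a minimal, illustrative Python sketch of a logging filter that masks email addresses before log records reach any sink. The regex, logger name, and placeholder text are simplified assumptions, not a complete compliance control; a real deployment would cover more PII categories, handle formatted log arguments, and be reviewed against the applicable regulations.

```python
"""Mask email addresses in log records before they are written (illustrative only)."""
import logging
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class PiiRedactionFilter(logging.Filter):
    """Replace anything that looks like an email address with a placeholder."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Simplified: only the message text is redacted, not record.args.
        record.msg = EMAIL_PATTERN.sub("<redacted-email>", str(record.msg))
        return True  # keep the (now redacted) record


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")
logger.addFilter(PiiRedactionFilter())

logger.info("Refund issued for customer jane.doe@example.com")
# Emitted message text: Refund issued for customer <redacted-email>
```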
Auditability: The system must be designed and operated in a way that allows for comprehensive auditing and logging of activities. This ensures that compliance with regulations can be verified through audit trails and compliance reports. Documentation: Comprehensive documentation of policies, procedures, and controls related to compliance requirements is essential. This includes documenting data handling processes, security measures, incident response plans, and compliance assessments. Risk Management: Implementing risk assessment and management practices to identify, assess, and mitigate risks associated with non-compliance. This involves conducting risk assessments regularly and implementing controls to manage identified risks effectively. Change Management: Compliance requires robust change management processes to ensure that any updates or modifications to the software system do not compromise regulatory compliance. This includes testing changes thoroughly and obtaining necessary approvals.","title":"Characteristics"},{"location":"non-functional-requirements/compliance/#implementations","text":"Implementing compliance involves a systematic approach that integrates regulatory requirements, organizational policies, and best practices into the development, deployment, and operation phases. Here are common strategies and practices used to implement compliance: Compliance Framework Selection: Choosing and adopting a compliance framework or standards (e.g., ISO 27001, NIST Cybersecurity Framework) that aligns with the organization's compliance obligations and provides guidelines for implementing controls. Privacy by Design: Integrating privacy considerations into the software design and development process. This includes conducting privacy impact assessments, implementing data minimization techniques, and ensuring user consent mechanisms are in place where required. Audit and Monitoring: Establishing mechanisms for continuous monitoring, auditing, and logging of activities within the software system to ensure compliance with regulatory requirements. This includes maintaining audit trails, generating compliance reports, and conducting regular security assessments. Documentation and Record Keeping: Maintaining comprehensive documentation of compliance efforts, including policies, procedures, audit reports, risk assessments, and compliance certifications.","title":"Implementations"},{"location":"non-functional-requirements/compliance/#resources","text":"General Data Protection Regulation (GDPR) Purview Compliance Manager","title":"Resources"},{"location":"non-functional-requirements/data-integrity/","text":"Data Integrity Data Integrity is the maintenance and assurance of the quality of data over its entire lifecycle. This includes the many facets of data quality such as, but not limited to, consistency, accuracy, and reliability. The benefits of this NFR are significant, as it ensures that data is trustworthy and reliable for decision-making, analysis, and reporting. Characteristics Accuracy: Data should be correct and free from errors or inconsistencies. Are the column data types correct? Are numeric values rounded off correctly? Completeness: All required data should be present and not missing any essential components. Consistency: Data should be consistent across different databases, applications, or time periods. Validity: Data should conform to defined rules, constraints, or standards. Invalid data should be rejected or flagged for correction. 
Reliability: Data should be trustworthy and dependable for decision-making and analysis. Timeliness: Data should be up-to-date and reflect the most current information available. Security: Data should be protected from unauthorized access, alteration, or deletion to maintain its integrity. Auditability: Changes to data should be tracked and logged, allowing for accountability and traceability. Transparency: Processes for data collection, storage, and manipulation should be transparent and understandable. Redundancy: Data should have backups or redundancy measures in place to prevent loss or corruption. Compliance: Data handling practices should comply with relevant regulations, standards, and industry best practices. Uniqueness: Data should be unique and not duplicated within the same dataset. Referential integrity: Does every row that depends on a dimension in the fact table actually have its associated dimension? (i.e., foreign keys without a primary) For example, let's say the dimension is \"city\"- then if we have a fact table referencing Seattle, and then delete the Seattle dimension, we need to go delete Seattle from the facts Orderliness: Data should be organized in a logical and consistent manner, making it easy to search, retrieve, and analyze. Implementations Data validation: Implement validation rules at the data entry points to ensure that only accurate and valid data is accepted into the system. This includes checks for data type, format, range, and consistency. Data logging and auditing: Implement logging mechanisms to record all data-related activities, including data modifications, access attempts, and system events. Regularly review audit logs to detect any unauthorized or suspicious activities. Data quality monitoring: Establish data quality monitoring processes to continuously evaluate the accuracy, completeness, and consistency of data. Implement automated checks and alerts to identify and address data quality issues in real-time. Database constraints: Utilize database constraints such as primary keys, foreign keys, unique constraints, and check constraints to enforce data integrity rules at the database level. Regular data backups: Implement regular backups of data to prevent loss in case of system failures, errors, or security breaches. Ensure that backup procedures are automated, monitored, and regularly tested. Resources Great Expectations : A framework to build data validations and test the quality of your data.","title":"Data Integrity"},{"location":"non-functional-requirements/data-integrity/#data-integrity","text":"Data Integrity is the maintenance and assurance of the quality of data over its entire lifecycle. This includes the many facets of data quality such as, but not limited to, consistency, accuracy, and reliability. The benefits of this NFR are significant, as it ensures that data is trustworthy and reliable for decision-making, analysis, and reporting.","title":"Data Integrity"},{"location":"non-functional-requirements/data-integrity/#characteristics","text":"Accuracy: Data should be correct and free from errors or inconsistencies. Are the column data types correct? Are numeric values rounded off correctly? Completeness: All required data should be present and not missing any essential components. Consistency: Data should be consistent across different databases, applications, or time periods. Validity: Data should conform to defined rules, constraints, or standards. Invalid data should be rejected or flagged for correction. 
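As one possible way to apply the data-validation-at-entry idea from the implementations above, the sketch below uses pydantic (an assumed dependency) to reject invalid records before they are loaded. The field names and constraints are hypothetical.

```python
"""Reject invalid records at the point of entry (illustrative pydantic sketch)."""
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Order(BaseModel):
    order_id: int
    city: str = Field(min_length=1)   # completeness: must not be empty
    quantity: int = Field(gt=0)       # validity: positive quantities only
    unit_price: float = Field(ge=0)   # accuracy: no negative prices


def ingest(raw: dict) -> Optional[Order]:
    try:
        return Order(**raw)           # types are checked and coerced here
    except ValidationError as exc:
        # Flag the bad record for correction instead of silently loading it.
        print(f"Rejected record {raw.get('order_id')}: {exc}")
        return None


ingest({"order_id": 1, "city": "Seattle", "quantity": 3, "unit_price": 9.99})  # accepted
ingest({"order_id": 2, "city": "", "quantity": -1, "unit_price": 9.99})        # rejected
```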
Reliability: Data should be trustworthy and dependable for decision-making and analysis. Timeliness: Data should be up-to-date and reflect the most current information available. Security: Data should be protected from unauthorized access, alteration, or deletion to maintain its integrity. Auditability: Changes to data should be tracked and logged, allowing for accountability and traceability. Transparency: Processes for data collection, storage, and manipulation should be transparent and understandable. Redundancy: Data should have backups or redundancy measures in place to prevent loss or corruption. Compliance: Data handling practices should comply with relevant regulations, standards, and industry best practices. Uniqueness: Data should be unique and not duplicated within the same dataset. Referential integrity: Does every row that depends on a dimension in the fact table actually have its associated dimension? (i.e., foreign keys without a primary) For example, let's say the dimension is \"city\"- then if we have a fact table referencing Seattle, and then delete the Seattle dimension, we need to go delete Seattle from the facts Orderliness: Data should be organized in a logical and consistent manner, making it easy to search, retrieve, and analyze.","title":"Characteristics"},{"location":"non-functional-requirements/data-integrity/#implementations","text":"Data validation: Implement validation rules at the data entry points to ensure that only accurate and valid data is accepted into the system. This includes checks for data type, format, range, and consistency. Data logging and auditing: Implement logging mechanisms to record all data-related activities, including data modifications, access attempts, and system events. Regularly review audit logs to detect any unauthorized or suspicious activities. Data quality monitoring: Establish data quality monitoring processes to continuously evaluate the accuracy, completeness, and consistency of data. Implement automated checks and alerts to identify and address data quality issues in real-time. Database constraints: Utilize database constraints such as primary keys, foreign keys, unique constraints, and check constraints to enforce data integrity rules at the database level. Regular data backups: Implement regular backups of data to prevent loss in case of system failures, errors, or security breaches. Ensure that backup procedures are automated, monitored, and regularly tested.","title":"Implementations"},{"location":"non-functional-requirements/data-integrity/#resources","text":"Great Expectations : A framework to build data validations and test the quality of your data.","title":"Resources"},{"location":"non-functional-requirements/disaster-recovery/","text":"Disaster Recovery and Continuity Disaster Recovery (DR) focuses on the processes and technologies required to restore IT systems and data after a catastrophic event, such as a natural disaster, cyber attack, or hardware failure. It involves regular backups, failover procedures, and recovery plans that enable a swift return to normal operations. Business Continuity (BC), on the other hand, encompasses a broader scope, ensuring that essential business functions can continue during and after a disaster. This includes not only IT systems but also processes, personnel, and physical infrastructure. Together, DR and BC strategies are vital for minimizing downtime, protecting data integrity, and maintaining customer trust and operational stability. 
They ensure that an organization can quickly recover from disruptions and continue providing critical services, safeguarding both its reputation and financial health. Characteristics Recovery Time Objective (RTO) : This defines the maximum acceptable amount of time it should take to restore a system after a disaster. RTO sets the target for how quickly systems and applications must be back online to minimize impact on the business. Recovery Point Objective (RPO) : This specifies the maximum acceptable amount of data loss measured in time. RPO determines how frequently data backups should occur to ensure that data loss remains within acceptable limits. Backup and Restore Procedures : Effective DR involves robust backup procedures, including regular, automated backups of critical data and systems. These backups must be stored securely, often in off-site or cloud locations, and tested regularly to ensure they can be restored as needed. Failover Mechanisms : These are automated processes that switch operations to a standby system or site in the event of a failure. Failover mechanisms ensure continuity of service by redirecting workloads to backup systems without significant downtime. Redundancy : DR plans often include redundant systems and infrastructure to eliminate single points of failure. This can involve duplicate hardware, network paths, and data storage locations. Disaster Recovery Plan (DRP) : A comprehensive DRP outlines the specific steps, roles, and responsibilities involved in responding to a disaster. It includes detailed procedures for data recovery, system restoration, and communication protocols. Testing and Drills : Regular testing and simulation drills are essential to validate the effectiveness of the DR plan. This helps identify potential weaknesses and ensures that staff are familiar with the recovery procedures. Communication Plan : Effective DR includes a clear communication strategy for notifying stakeholders, including employees, customers, and partners, about the status of recovery efforts and expected timelines for restoration. Scalability : The DR plan should be scalable to accommodate changes in the business environment, such as growth in data volume or expansion to new geographic locations. This ensures that the recovery strategy remains effective as the organization evolves. Compliance and Regulatory Requirements : DR plans must adhere to relevant industry standards and regulatory requirements, ensuring that recovery processes meet legal and compliance obligations. Cost Considerations : Balancing the costs associated with implementing and maintaining DR capabilities against the potential losses from downtime and data loss is crucial. Effective DR planning considers cost-efficiency while ensuring robust protection. Implementations Implementing disaster recovery (DR) involves a combination of strategies, technologies, and practices designed to restore systems and data quickly and effectively after a catastrophic event. Here are some examples: Cloud Backups : Store backup copies of data in the cloud, ensuring they are accessible from anywhere and providing geographic redundancy. Disaster Recovery as a Service (DRaaS) : Utilize DRaaS providers that offer comprehensive disaster recovery solutions, including automated failover to cloud-based systems. Failover and Redundancy : Hot Site : Maintain a fully operational, geographically separate duplicate of your primary site that can take over immediately in case of a disaster. 
Cold Site : Have an alternate site with necessary infrastructure but without active systems or data, ready to be brought online when needed. Warm Site : A compromise between hot and cold sites, with partially prepared systems that require some setup before use. Virtualization and Snapshots : Virtual Machine (VM) Snapshots : Regularly take snapshots of virtual machines, allowing for quick rollback to a known good state. VM Replication : Continuously replicate VMs to a secondary location, ensuring up-to-date copies are ready to take over if the primary site fails. Automated Failover Systems : High Availability Clusters : Implement clusters of servers that automatically detect failures and shift workloads to healthy nodes without manual intervention. Load Balancers : Use load balancers to distribute traffic across multiple servers, ensuring continuous service availability even if one server fails. Data Replication : Ensure that data is simultaneously written to primary and secondary locations, maintaining real-time consistency between sites. Regular Testing and Drills : Conduct regular simulation drills to test the effectiveness of the DR plan and to ensure that all team members are familiar with their roles. Comprehensive Documentation : Develop run books with step-by-step instructions for executing the DR plan, tailored to specific scenarios and systems. Resources Azure Site Recovery","title":"Disaster Recovery and Continuity"},{"location":"non-functional-requirements/disaster-recovery/#disaster-recovery-and-continuity","text":"Disaster Recovery (DR) focuses on the processes and technologies required to restore IT systems and data after a catastrophic event, such as a natural disaster, cyber attack, or hardware failure. It involves regular backups, failover procedures, and recovery plans that enable a swift return to normal operations. Business Continuity (BC), on the other hand, encompasses a broader scope, ensuring that essential business functions can continue during and after a disaster. This includes not only IT systems but also processes, personnel, and physical infrastructure. Together, DR and BC strategies are vital for minimizing downtime, protecting data integrity, and maintaining customer trust and operational stability. They ensure that an organization can quickly recover from disruptions and continue providing critical services, safeguarding both its reputation and financial health.","title":"Disaster Recovery and Continuity"},{"location":"non-functional-requirements/disaster-recovery/#characteristics","text":"Recovery Time Objective (RTO) : This defines the maximum acceptable amount of time it should take to restore a system after a disaster. RTO sets the target for how quickly systems and applications must be back online to minimize impact on the business. Recovery Point Objective (RPO) : This specifies the maximum acceptable amount of data loss measured in time. RPO determines how frequently data backups should occur to ensure that data loss remains within acceptable limits. Backup and Restore Procedures : Effective DR involves robust backup procedures, including regular, automated backups of critical data and systems. These backups must be stored securely, often in off-site or cloud locations, and tested regularly to ensure they can be restored as needed. Failover Mechanisms : These are automated processes that switch operations to a standby system or site in the event of a failure. 
Failover mechanisms ensure continuity of service by redirecting workloads to backup systems without significant downtime. Redundancy : DR plans often include redundant systems and infrastructure to eliminate single points of failure. This can involve duplicate hardware, network paths, and data storage locations. Disaster Recovery Plan (DRP) : A comprehensive DRP outlines the specific steps, roles, and responsibilities involved in responding to a disaster. It includes detailed procedures for data recovery, system restoration, and communication protocols. Testing and Drills : Regular testing and simulation drills are essential to validate the effectiveness of the DR plan. This helps identify potential weaknesses and ensures that staff are familiar with the recovery procedures. Communication Plan : Effective DR includes a clear communication strategy for notifying stakeholders, including employees, customers, and partners, about the status of recovery efforts and expected timelines for restoration. Scalability : The DR plan should be scalable to accommodate changes in the business environment, such as growth in data volume or expansion to new geographic locations. This ensures that the recovery strategy remains effective as the organization evolves. Compliance and Regulatory Requirements : DR plans must adhere to relevant industry standards and regulatory requirements, ensuring that recovery processes meet legal and compliance obligations. Cost Considerations : Balancing the costs associated with implementing and maintaining DR capabilities against the potential losses from downtime and data loss is crucial. Effective DR planning considers cost-efficiency while ensuring robust protection.","title":"Characteristics"},{"location":"non-functional-requirements/disaster-recovery/#implementations","text":"Implementing disaster recovery (DR) involves a combination of strategies, technologies, and practices designed to restore systems and data quickly and effectively after a catastrophic event. Here are some examples: Cloud Backups : Store backup copies of data in the cloud, ensuring they are accessible from anywhere and providing geographic redundancy. Disaster Recovery as a Service (DRaaS) : Utilize DRaaS providers that offer comprehensive disaster recovery solutions, including automated failover to cloud-based systems. Failover and Redundancy : Hot Site : Maintain a fully operational, geographically separate duplicate of your primary site that can take over immediately in case of a disaster. Cold Site : Have an alternate site with necessary infrastructure but without active systems or data, ready to be brought online when needed. Warm Site : A compromise between hot and cold sites, with partially prepared systems that require some setup before use. Virtualization and Snapshots : Virtual Machine (VM) Snapshots : Regularly take snapshots of virtual machines, allowing for quick rollback to a known good state. VM Replication : Continuously replicate VMs to a secondary location, ensuring up-to-date copies are ready to take over if the primary site fails. Automated Failover Systems : High Availability Clusters : Implement clusters of servers that automatically detect failures and shift workloads to healthy nodes without manual intervention. Load Balancers : Use load balancers to distribute traffic across multiple servers, ensuring continuous service availability even if one server fails. 
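To make the backup-and-verify idea above concrete, here is a minimal, illustrative Python sketch that copies a file to a timestamped backup location and confirms the copy with a checksum. The paths are hypothetical; in practice the backup target would be off-site or cloud storage, and restores would also be rehearsed regularly as part of DR testing.

```python
"""Copy a file to a timestamped backup and verify it by checksum (illustrative)."""
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir with a UTC timestamp and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.stem}.{stamp}{source.suffix}"
    shutil.copy2(source, target)
    if sha256(source) != sha256(target):   # verify, don't just copy
        raise RuntimeError(f"Backup verification failed for {target}")
    return target


# Example usage (paths are hypothetical placeholders):
backup(Path("data/orders.db"), Path("/mnt/backups/orders"))
```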
Data Replication : Ensure that data is simultaneously written to primary and secondary locations, maintaining real-time consistency between sites. Regular Testing and Drills : Conduct regular simulation drills to test the effectiveness of the DR plan and to ensure that all team members are familiar with their roles. Comprehensive Documentation : Develop run books with step-by-step instructions for executing the DR plan, tailored to specific scenarios and systems.","title":"Implementations"},{"location":"non-functional-requirements/disaster-recovery/#resources","text":"Azure Site Recovery","title":"Resources"},{"location":"non-functional-requirements/internationalization/","text":"Internationalization and Localization Internationalization (i18n) and Localization (l10n) refer to the design and adaptation of software systems to support multiple languages, cultures, and regions, ensuring usability and compliance with local preferences and regulations. Characteristics Main Characteristics of Internationalization Text Externalization: Moving all user-facing text to external resource files to facilitate easy translation. Unicode Support: Using Unicode or another character encoding that supports all necessary scripts and characters. Date and Time Formatting: Designing the system to handle various date and time formats. Number and Currency Formatting: Ensuring that numbers and currencies can be displayed according to local conventions. Locale-Sensitive Data Processing: Adapting data processing to respect locale-specific rules, such as sorting and case conversion. Bidirectional Text Support: Supporting both left-to-right (LTR) and right-to-left (RTL) text orientations where necessary. Main Characteristics of Localization Translation: Converting text and UI elements to the target language. Cultural Adaptation: Adapting content and design elements to align with local cultural norms and expectations. Legal and Regulatory Compliance: Ensuring that the application meets local legal requirements, such as privacy laws and accessibility standards. Testing in Context: Testing the localized version of the application in its intended locale to ensure proper functionality and usability. Localized User Interfaces: Adjusting the layout and design to accommodate text expansion or contraction and to suit cultural preferences. Help and Documentation: Providing user assistance and documentation in the target language and context. Implementations Resource Bundles: Using resource bundles to store locale-specific text and data. Translation Management Systems: Employing tools and platforms to manage translations and streamline the localization workflow. Locale-Aware Libraries: Leveraging libraries and frameworks that provide built-in support for handling locale-specific data. Automated Testing: Implementing automated tests to verify that the software behaves correctly in different locales. Continuous Localization: Integrating localization processes into the continuous integration/continuous deployment (CI/CD) pipeline to keep translations up-to-date. Coordinated Universal Time: When dealing with times, it is essential to always use UTC for internal storage and processing. Using UTC helps avoid issues related to time zone differences, daylight saving time changes, and other regional time adjustments. 
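A minimal sketch of the Coordinated Universal Time guidance above, using only the Python standard library (zoneinfo requires Python 3.9 or later); the display time zone chosen here is an arbitrary example:

```python
"""Store timestamps in UTC; convert to a local zone only for display (illustrative)."""
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Internal representation: always timezone-aware UTC.
created_at = datetime.now(timezone.utc)

# Display-time conversion for a user in a specific zone (example zone only).
local_view = created_at.astimezone(ZoneInfo("Europe/Paris"))

print(created_at.isoformat())  # e.g. 2024-09-27T14:03:00+00:00
print(local_view.isoformat())  # e.g. 2024-09-27T16:03:00+02:00
```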
Consistent Internal Representation: Store numbers and currency values in a consistent internal representation, such as a standardized numeric format or a base currency, and apply locale-specific formatting only when displaying data to the user. This prevents errors during calculations and data processing.","title":"Internationalization and Localization"},{"location":"non-functional-requirements/internationalization/#internationalization-and-localization","text":"Internationalization (i18n) and Localization (l10n) refer to the design and adaptation of software systems to support multiple languages, cultures, and regions, ensuring usability and compliance with local preferences and regulations.","title":"Internationalization and Localization"},{"location":"non-functional-requirements/internationalization/#characteristics","text":"","title":"Characteristics"},{"location":"non-functional-requirements/internationalization/#main-characteristics-of-internationalization","text":"Text Externalization: Moving all user-facing text to external resource files to facilitate easy translation. Unicode Support: Using Unicode or another character encoding that supports all necessary scripts and characters. Date and Time Formatting: Designing the system to handle various date and time formats. Number and Currency Formatting: Ensuring that numbers and currencies can be displayed according to local conventions. Locale-Sensitive Data Processing: Adapting data processing to respect locale-specific rules, such as sorting and case conversion. Bidirectional Text Support: Supporting both left-to-right (LTR) and right-to-left (RTL) text orientations where necessary.","title":"Main Characteristics of Internationalization"},{"location":"non-functional-requirements/internationalization/#main-characteristics-of-localization","text":"Translation: Converting text and UI elements to the target language. Cultural Adaptation: Adapting content and design elements to align with local cultural norms and expectations. Legal and Regulatory Compliance: Ensuring that the application meets local legal requirements, such as privacy laws and accessibility standards. Testing in Context: Testing the localized version of the application in its intended locale to ensure proper functionality and usability. Localized User Interfaces: Adjusting the layout and design to accommodate text expansion or contraction and to suit cultural preferences. Help and Documentation: Providing user assistance and documentation in the target language and context.","title":"Main Characteristics of Localization"},{"location":"non-functional-requirements/internationalization/#implementations","text":"Resource Bundles: Using resource bundles to store locale-specific text and data. Translation Management Systems: Employing tools and platforms to manage translations and streamline the localization workflow. Locale-Aware Libraries: Leveraging libraries and frameworks that provide built-in support for handling locale-specific data. Automated Testing: Implementing automated tests to verify that the software behaves correctly in different locales. Continuous Localization: Integrating localization processes into the continuous integration/continuous deployment (CI/CD) pipeline to keep translations up-to-date. Coordinated Universal Time: When dealing with times, it is essential to always use UTC for internal storage and processing. Using UTC helps avoid issues related to time zone differences, daylight saving time changes, and other regional time adjustments. 
Consistent Internal Representation: Store numbers and currency values in a consistent internal representation, such as a standardized numeric format or a base currency, and apply locale-specific formatting only when displaying data to the user. This prevents errors during calculations and data processing.","title":"Implementations"},{"location":"non-functional-requirements/interoperability/","text":"Interoperability Interoperability refers to the ability of different software components or systems to seamlessly exchange and use information. It involves ensuring that the software can integrate effectively with other systems, regardless of their operating platforms, programming languages, or data formats. Characteristics Standardization: Adherence to industry standards, protocols, and specifications that enable consistent and compatible interactions between different software components or systems. Compatibility: The ability of systems to work together without requiring extensive modifications or adaptations, ensuring that data and operations can be shared effectively. Interface Definition: Well-defined interfaces and APIs that facilitate communication and data exchange between systems, abstracting complexities and promoting ease of integration. Data Format Consistency: Consistent handling and interpretation of data formats, ensuring that information exchanged between systems remains accurate and meaningful. Platform Agnosticism: Capability to operate across different hardware platforms, operating systems, and environments without dependency on specific technologies or configurations. Implementations An interoperable solution facilitates seamless communication and data exchange between heterogeneous systems. Here are some of the implementations: Providing RESTful APIs. Using data formats and standards such as JSON schemas. Utilizing libraries and frameworks that provide cross-platform support and abstraction layers for common functionalities. Adhering to industry standards (e.g., ISO, IEEE) and governance frameworks that define interoperability requirements, protocols, and best practices for seamless integration.","title":"Interoperability"},{"location":"non-functional-requirements/interoperability/#interoperability","text":"Interoperability refers to the ability of different software components or systems to seamlessly exchange and use information. It involves ensuring that the software can integrate effectively with other systems, regardless of their operating platforms, programming languages, or data formats.","title":"Interoperability"},{"location":"non-functional-requirements/interoperability/#characteristics","text":"Standardization: Adherence to industry standards, protocols, and specifications that enable consistent and compatible interactions between different software components or systems. Compatibility: The ability of systems to work together without requiring extensive modifications or adaptations, ensuring that data and operations can be shared effectively. Interface Definition: Well-defined interfaces and APIs that facilitate communication and data exchange between systems, abstracting complexities and promoting ease of integration. Data Format Consistency: Consistent handling and interpretation of data formats, ensuring that information exchanged between systems remains accurate and meaningful. 
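To illustrate the JSON-schema item in the interoperability implementations above, here is a minimal Python sketch using the jsonschema package (an assumed choice); the schema and message are hypothetical.

```python
"""Validate an incoming message against a shared JSON Schema (illustrative)."""
from jsonschema import ValidationError, validate

# A contract both producer and consumer agree on (hypothetical schema).
ORDER_SCHEMA = {
    "type": "object",
    "required": ["orderId", "currency", "amount"],
    "properties": {
        "orderId": {"type": "string"},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "amount": {"type": "number", "minimum": 0},
    },
}

message = {"orderId": "A-1001", "currency": "USD", "amount": 42.5}

try:
    validate(instance=message, schema=ORDER_SCHEMA)  # raises on incompatible messages
except ValidationError as exc:
    print(f"Rejected incompatible message: {exc.message}")
```

Publishing such a schema alongside the API gives producers and consumers a shared, machine-checkable contract regardless of the languages or platforms involved.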
Platform Agnosticism: Capability to operate across different hardware platforms, operating systems, and environments without dependency on specific technologies or configurations.","title":"Characteristics"},{"location":"non-functional-requirements/interoperability/#implementations","text":"An interoperable solution facilitates seamless communication and data exchange between heterogeneous systems. Here are some of the implementations: Providing RESTful APIs. Using data formats and standards such as JSON schemas. Utilizing libraries and frameworks that provide cross-platform support and abstraction layers for common functionalities. Adhering to industry standards (e.g., ISO, IEEE) and governance frameworks that define interoperability requirements, protocols, and best practices for seamless integration.","title":"Implementations"},{"location":"non-functional-requirements/maintainability/","text":"Maintainability Maintainability is the ease with which a software system can be modified, updated, extended, or repaired over time. It impacts the long-term viability and sustainability of a software system. A maintainable system is one that is easy to understand, has clear and modular code, is well-documented, and has a low risk of introducing errors when changes are made. Characteristics Modularity: The software is divided into discrete, independent modules or components, each with a clear and specific functionality. This makes it easier to modify or replace individual parts without affecting the entire system. Readability: Code is written clearly and concisely, following consistent naming conventions, coding standards, and documentation practices. Readable code is easier for developers to understand, troubleshoot, and enhance. Testability: The software is designed to support thorough testing, with components that can be tested independently. This includes unit tests, integration tests, and automated testing frameworks that facilitate ongoing validation of the software's behavior. Documentation: Comprehensive and up-to-date documentation is provided, docstrings, design documents, user manuals, and API references. Good documentation helps developers understand the system's structure, functionality, and dependencies. Simplicity: The design and implementation of the software are kept as simple as possible, avoiding unnecessary complexity. Simple systems are easier to understand, maintain, and extend. Consistency: Consistent use of design patterns, coding practices, language best practices, and architectural principles throughout the software. Consistency reduces the learning curve for new developers and helps maintain uniform quality across the codebase. Configurability: The software allows configuration through external files or settings rather than hard-coded values. This makes it easier to adapt the software to different environments or requirements without changing the code. Dependency Management: Proper management of dependencies ensures that external libraries or components can be updated or replaced without major disruptions. This includes using dependency injection, version control, and modular design. Additionally, version management for your own code will ensure consistent and reliable releases. Error Handling and Logging: Robust error handling and logging mechanisms are in place to facilitate debugging and maintenance. This includes meaningful error messages, exception handling, and comprehensive logging of system events and errors. 
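Before moving on to the implementations, here is a minimal, illustrative Python sketch that ties together the Testability and Configurability characteristics above: the retry limit comes from configuration rather than a hard-coded value, and the behavior is verified by a small pytest unit test. The function names and environment variable are hypothetical.

```python
"""Configurable, testable retry helper (illustrative; names are hypothetical)."""
import os


def max_retries(default: int = 3) -> int:
    # Configurability: the limit comes from the environment, not a hard-coded constant.
    return int(os.environ.get("APP_MAX_RETRIES", default))


def should_retry(attempt: int, limit=None) -> bool:
    """Return True while another attempt is allowed."""
    limit = max_retries() if limit is None else limit
    return attempt < limit


# test_retry.py - Testability: the behavior is verified in isolation with pytest.
def test_should_retry_respects_explicit_limit():
    assert should_retry(attempt=1, limit=2)
    assert not should_retry(attempt=2, limit=2)
```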
Implementations Implementing maintainability in software systems involves adopting practices, tools, and methodologies that facilitate efficient modification, extension, and troubleshooting of the software over its lifecycle. Consistent Naming Conventions: Use meaningful and consistent names for variables, functions, classes, and other entities. Code Formatting: Follow consistent code formatting rules to enhance readability. Code Reviews: Conduct regular code reviews to ensure adherence to standards and to share knowledge among team members. External Documentation: Maintain up-to-date documentation, including design documents, user manuals, and API references . There are tools to assist with that like Swagger or Postman. README Files: Provide README files in repositories to guide new developers on setup, usage, and contribution guidelines. Automated Testing: Provide unit test, end-to-end tests, smoke and integration tests as well as continuous integration practices. Code Refactoring: Regularly refactor code to improve its structure, readability, and maintainability without changing its external behavior. Implementing pre-commit hooks in the pipelines to automate the monitoring of code refactoring tasks, like forcing coding standards, run static code analysis, linting, etc.","title":"Maintainability"},{"location":"non-functional-requirements/maintainability/#maintainability","text":"Maintainability is the ease with which a software system can be modified, updated, extended, or repaired over time. It impacts the long-term viability and sustainability of a software system. A maintainable system is one that is easy to understand, has clear and modular code, is well-documented, and has a low risk of introducing errors when changes are made.","title":"Maintainability"},{"location":"non-functional-requirements/maintainability/#characteristics","text":"Modularity: The software is divided into discrete, independent modules or components, each with a clear and specific functionality. This makes it easier to modify or replace individual parts without affecting the entire system. Readability: Code is written clearly and concisely, following consistent naming conventions, coding standards, and documentation practices. Readable code is easier for developers to understand, troubleshoot, and enhance. Testability: The software is designed to support thorough testing, with components that can be tested independently. This includes unit tests, integration tests, and automated testing frameworks that facilitate ongoing validation of the software's behavior. Documentation: Comprehensive and up-to-date documentation is provided, docstrings, design documents, user manuals, and API references. Good documentation helps developers understand the system's structure, functionality, and dependencies. Simplicity: The design and implementation of the software are kept as simple as possible, avoiding unnecessary complexity. Simple systems are easier to understand, maintain, and extend. Consistency: Consistent use of design patterns, coding practices, language best practices, and architectural principles throughout the software. Consistency reduces the learning curve for new developers and helps maintain uniform quality across the codebase. Configurability: The software allows configuration through external files or settings rather than hard-coded values. This makes it easier to adapt the software to different environments or requirements without changing the code. 
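A minimal sketch of the configurability point above, assuming a hypothetical config.json file and APP_* environment variables: defaults live in code, and external settings override them without a code change.

```python
import json
import os
from pathlib import Path

# Hypothetical defaults; the file name and variable names are illustrative.
DEFAULTS = {"log_level": "INFO", "request_timeout_seconds": 30}

def load_config(path: str = "config.json") -> dict:
    config = dict(DEFAULTS)
    config_file = Path(path)
    if config_file.exists():
        config.update(json.loads(config_file.read_text()))
    # Environment variables win over the file, which wins over the defaults.
    if "APP_LOG_LEVEL" in os.environ:
        config["log_level"] = os.environ["APP_LOG_LEVEL"]
    if "APP_REQUEST_TIMEOUT_SECONDS" in os.environ:
        config["request_timeout_seconds"] = int(os.environ["APP_REQUEST_TIMEOUT_SECONDS"])
    return config

print(load_config())
```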
Dependency Management: Proper management of dependencies ensures that external libraries or components can be updated or replaced without major disruptions. This includes using dependency injection, version control, and modular design. Additionally, version management for your own code will ensure consistent and reliable releases. Error Handling and Logging: Robust error handling and logging mechanisms are in place to facilitate debugging and maintenance. This includes meaningful error messages, exception handling, and comprehensive logging of system events and errors.","title":"Characteristics"},{"location":"non-functional-requirements/maintainability/#implementations","text":"Implementing maintainability in software systems involves adopting practices, tools, and methodologies that facilitate efficient modification, extension, and troubleshooting of the software over its lifecycle. Consistent Naming Conventions: Use meaningful and consistent names for variables, functions, classes, and other entities. Code Formatting: Follow consistent code formatting rules to enhance readability. Code Reviews: Conduct regular code reviews to ensure adherence to standards and to share knowledge among team members. External Documentation: Maintain up-to-date documentation, including design documents, user manuals, and API references . There are tools to assist with that like Swagger or Postman. README Files: Provide README files in repositories to guide new developers on setup, usage, and contribution guidelines. Automated Testing: Provide unit test, end-to-end tests, smoke and integration tests as well as continuous integration practices. Code Refactoring: Regularly refactor code to improve its structure, readability, and maintainability without changing its external behavior. Implementing pre-commit hooks in the pipelines to automate the monitoring of code refactoring tasks, like forcing coding standards, run static code analysis, linting, etc.","title":"Implementations"},{"location":"non-functional-requirements/performance/","text":"Performance Performance refers to the responsiveness, efficiency, and speed with which a system completes tasks and processes user requests. It encompasses several key metrics such as response time, throughput, latency, and resource utilization. Characteristics Response Time: The time taken by the system to respond to user interactions or requests. Lower response times indicate better performance and user responsiveness. Throughput: The rate at which the system can process and handle a certain volume of transactions or requests within a given time frame. Higher throughput signifies greater processing capacity and efficiency. Latency: The delay or time lag experienced between initiating a request and receiving a response. Low latency is crucial for real-time applications to ensure timely interactions. Scalability: The system's ability to handle increasing workload or user demand by scaling resources (horizontal or vertical scaling) without impacting performance negatively. Concurrency: The system's capability to handle multiple concurrent users or tasks efficiently without significant degradation in performance. This involves managing resources such as CPU, memory, and network bandwidth effectively. Resource Utilization: Efficient utilization of hardware resources (e.g., CPU, memory, disk) to maximize performance without unnecessary overhead or bottlenecks. 
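Since response time and latency are the first characteristics to measure, here is a small Python sketch that samples wall-clock latencies for a handler and summarizes them. The handler and the percentile summary are illustrative; production measurements would normally come from the monitoring stack rather than ad-hoc timing code.

```python
import statistics
import time

def measure_response_times(handler, requests, runs_per_request=5):
    """Collect wall-clock latencies for a request handler and summarize them."""
    samples = []
    for payload in requests:
        for _ in range(runs_per_request):
            start = time.perf_counter()
            handler(payload)
            samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples) * 1000,
        "p95_ms": samples[int(len(samples) * 0.95) - 1] * 1000,
        "max_ms": samples[-1] * 1000,
    }

# Stand-in handler; in practice this would be an HTTP call or service method.
def handle(payload):
    return sum(range(10_000))

print(measure_response_times(handle, ["a", "b", "c"]))
```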
Stability: Consistency and reliability of performance over time and under varying conditions, ensuring predictable behavior and minimal downtime. Fault Tolerance: The system's ability to continue operating or recover gracefully from failures or disruptions without significant impact on performance or user experience. Load Handling: How well the system manages and distributes workload during peak usage periods to maintain optimal performance levels. Implementations Implementing performance involves a combination of architectural decisions, coding practices, infrastructure setup, and optimization techniques. For example: Efficient Algorithms and Data Structures: Choosing algorithms and data structures that are optimized for the specific tasks and operations performed by the system can significantly improve performance. This includes selecting algorithms with lower time complexity (e.g., O(1), O(log n)) for critical operations. Code Optimization: Writing efficient and optimized code reduces execution time and resource consumption. Techniques such as minimizing loops, reducing unnecessary computations, and using appropriate data types can improve performance. Concurrency: Implementing concurrency models such as threads and async-await techniques optimizes task execution by allowing the system to handle multiple operations simultaneously. Parallel Programming: Enables tasks to be divided into smaller subtasks that can execute concurrently on multi-core processors. This method improves computational efficiency and accelerates the completion of tasks. Caching: Implementing caching mechanisms (e.g., in-memory caching, content delivery networks) to store and retrieve frequently accessed data or computations reduces the need to fetch data from slower storage systems, thereby improving response time and overall system performance. Database Optimization: Optimizing database queries, indexing frequently accessed data, denormalizing data where appropriate, and using database scaling techniques (e.g., sharding, replication) can enhance database performance and reduce latency. Network Optimization: Minimizing network latency by optimizing network protocols, reducing the number of network requests, compressing data where feasible, and leveraging content delivery networks (CDNs) for static content delivery. Load Balancing: Distributing incoming traffic evenly across multiple servers or instances using load balancers ensures optimal resource utilization and prevents overload on any single component, improving overall system performance and availability. Scalable Architecture: Designing the system with scalability in mind allows it to handle increased workload by adding resources dynamically (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using microservices architecture, containerization (e.g., Docker), and orchestration tools (e.g., Kubernetes) for efficient resource management. Performance Testing: Performing rigorous performance tests to pinpoint bottlenecks, measure critical metrics like response time and throughput, and validate system performance across varying load scenarios. Continuous Monitoring: Implementing ongoing monitoring of performance metrics to identify performance degradation. Resources Automated Testing","title":"Performance"},{"location":"non-functional-requirements/performance/#performance","text":"Performance refers to the responsiveness, efficiency, and speed with which a system completes tasks and processes user requests. 
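As one concrete example of the caching technique listed above, the sketch below memoizes a slow lookup with Python's built-in functools.lru_cache. The get_exchange_rate function, its simulated latency, and the rates are invented for illustration.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def get_exchange_rate(currency: str) -> float:
    """Pretend to fetch a rate from a slow downstream service."""
    time.sleep(0.5)                 # simulated network / database latency
    return {"USD": 1.0, "EUR": 0.92}.get(currency, 1.0)

start = time.perf_counter()
get_exchange_rate("EUR")            # miss: pays the full latency
first_call = time.perf_counter() - start

start = time.perf_counter()
get_exchange_rate("EUR")            # hit: served from the in-process cache
second_call = time.perf_counter() - start

print(f"first call {first_call:.3f}s, cached call {second_call:.6f}s")
print(get_exchange_rate.cache_info())
```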
It encompasses several key metrics such as response time, throughput, latency, and resource utilization.","title":"Performance"},{"location":"non-functional-requirements/performance/#characteristics","text":"Response Time: The time taken by the system to respond to user interactions or requests. Lower response times indicate better performance and user responsiveness. Throughput: The rate at which the system can process and handle a certain volume of transactions or requests within a given time frame. Higher throughput signifies greater processing capacity and efficiency. Latency: The delay or time lag experienced between initiating a request and receiving a response. Low latency is crucial for real-time applications to ensure timely interactions. Scalability: The system's ability to handle increasing workload or user demand by scaling resources (horizontal or vertical scaling) without impacting performance negatively. Concurrency: The system's capability to handle multiple concurrent users or tasks efficiently without significant degradation in performance. This involves managing resources such as CPU, memory, and network bandwidth effectively. Resource Utilization: Efficient utilization of hardware resources (e.g., CPU, memory, disk) to maximize performance without unnecessary overhead or bottlenecks. Stability: Consistency and reliability of performance over time and under varying conditions, ensuring predictable behavior and minimal downtime. Fault Tolerance: The system's ability to continue operating or recover gracefully from failures or disruptions without significant impact on performance or user experience. Load Handling: How well the system manages and distributes workload during peak usage periods to maintain optimal performance levels.","title":"Characteristics"},{"location":"non-functional-requirements/performance/#implementations","text":"Implementing performance involves a combination of architectural decisions, coding practices, infrastructure setup, and optimization techniques. For example: Efficient Algorithms and Data Structures: Choosing algorithms and data structures that are optimized for the specific tasks and operations performed by the system can significantly improve performance. This includes selecting algorithms with lower time complexity (e.g., O(1), O(log n)) for critical operations. Code Optimization: Writing efficient and optimized code reduces execution time and resource consumption. Techniques such as minimizing loops, reducing unnecessary computations, and using appropriate data types can improve performance. Concurrency: Implementing concurrency models such as threads and async-await techniques optimizes task execution by allowing the system to handle multiple operations simultaneously. Parallel Programming: Enables tasks to be divided into smaller subtasks that can execute concurrently on multi-core processors. This method improves computational efficiency and accelerates the completion of tasks. Caching: Implementing caching mechanisms (e.g., in-memory caching, content delivery networks) to store and retrieve frequently accessed data or computations reduces the need to fetch data from slower storage systems, thereby improving response time and overall system performance. Database Optimization: Optimizing database queries, indexing frequently accessed data, denormalizing data where appropriate, and using database scaling techniques (e.g., sharding, replication) can enhance database performance and reduce latency. 
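To illustrate the concurrency point in the implementation list above, this short asyncio sketch overlaps three I/O-bound calls instead of running them one after another. The service names and the simulated latency are assumptions for the example.

```python
import asyncio

async def fetch(resource: str) -> str:
    """Stand-in for an I/O-bound call (HTTP request, database query, etc.)."""
    await asyncio.sleep(1)          # simulated network latency
    return f"result from {resource}"

async def main():
    # Run sequentially this would take ~3 seconds; gathered, it takes ~1 second
    # because the awaits overlap while each request is waiting on I/O.
    results = await asyncio.gather(
        fetch("orders-service"),
        fetch("inventory-service"),
        fetch("pricing-service"),
    )
    print(results)

asyncio.run(main())
```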
Network Optimization: Minimizing network latency by optimizing network protocols, reducing the number of network requests, compressing data where feasible, and leveraging content delivery networks (CDNs) for static content delivery. Load Balancing: Distributing incoming traffic evenly across multiple servers or instances using load balancers ensures optimal resource utilization and prevents overload on any single component, improving overall system performance and availability. Scalable Architecture: Designing the system with scalability in mind allows it to handle increased workload by adding resources dynamically (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using microservices architecture, containerization (e.g., Docker), and orchestration tools (e.g., Kubernetes) for efficient resource management. Performance Testing: Performing rigorous performance tests to pinpoint bottlenecks, measure critical metrics like response time and throughput, and validate system performance across varying load scenarios. Continuous Monitoring: Implementing ongoing monitoring of performance metrics to identify performance degradation.","title":"Implementations"},{"location":"non-functional-requirements/performance/#resources","text":"Automated Testing","title":"Resources"},{"location":"non-functional-requirements/portability/","text":"Portability Portability refers to the ease with which software can be transferred and used in different environments or platforms without requiring significant modification. This includes moving the software across various hardware, operating systems, cloud services, or development frameworks while maintaining its functionality, performance, and usability. Characteristics Platform Independence: The ability of the software to run on different operating systems, hardware architectures, and devices without requiring major changes. Minimal Modification: The need for minimal code changes or reconfiguration when moving the software to a different environment. Standard Compliance: Adherence to industry standards and protocols to ensure compatibility across different systems and platforms. Environment Abstraction: Use of abstraction layers or frameworks that isolate the software from specific platform details, making it easier to adapt to different environments. Configuration Flexibility: Ease of modifying configuration settings to suit different environments without altering the core software code. Dependency Management: Efficient handling of external dependencies, ensuring that required libraries, tools, and services are available or can be easily obtained in the new environment. Packaging and Distribution: Efficient packaging methods, such as containerization (e.g., Docker), that encapsulate the software and its dependencies to facilitate deployment in diverse environments. Modular Design: Designing the software in a modular way, where components can be independently developed, tested, and deployed, enhancing the ease of porting parts of the system. Implementations Containerization Docker: Packaging applications and their dependencies into containers, ensuring consistent behavior across different environments. Kubernetes: Orchestrating containerized applications for deployment across various cloud providers and on-premises infrastructures. Virtual Machines Java Virtual Machine (JVM): Writing software in Java or other JVM languages to run on any system with a compatible JVM. 
VirtualBox or VMware: Using virtual machines to create consistent runtime environments regardless of the underlying hardware. Platform-Agnostic Languages Python, JavaScript, and Go: Utilizing programming languages known for their cross-platform capabilities to ensure code runs on multiple operating systems with little to no modification. However, it's important to select a programming language that aligns with the project's requirements and team expertise. Standardized Interfaces and Protocols APIs: Designing APIs with standardized protocols (e.g., REST, GraphQL) to facilitate interaction between different systems. Data Interchange Formats: Using common data formats like JSON, XML, or Protocol Buffers to ensure data can be exchanged and understood across different systems. Other Practices Debugging and Troubleshooting: Local debugging provides direct access to debugging tools and logs, making it easier to diagnose and resolve issues quickly. CI/CD Integration: Implementing a CI/CD pipeline to automate the building, testing, and packaging of the solution enhances portability by ensuring consistent and reliable deployments across various platforms and environments.","title":"Portability"},{"location":"non-functional-requirements/portability/#portability","text":"Portability refers to the ease with which software can be transferred and used in different environments or platforms without requiring significant modification. This includes moving the software across various hardware, operating systems, cloud services, or development frameworks while maintaining its functionality, performance, and usability.","title":"Portability"},{"location":"non-functional-requirements/portability/#characteristics","text":"Platform Independence: The ability of the software to run on different operating systems, hardware architectures, and devices without requiring major changes. Minimal Modification: The need for minimal code changes or reconfiguration when moving the software to a different environment. Standard Compliance: Adherence to industry standards and protocols to ensure compatibility across different systems and platforms. Environment Abstraction: Use of abstraction layers or frameworks that isolate the software from specific platform details, making it easier to adapt to different environments. Configuration Flexibility: Ease of modifying configuration settings to suit different environments without altering the core software code. Dependency Management: Efficient handling of external dependencies, ensuring that required libraries, tools, and services are available or can be easily obtained in the new environment. Packaging and Distribution: Efficient packaging methods, such as containerization (e.g., Docker), that encapsulate the software and its dependencies to facilitate deployment in diverse environments. Modular Design: Designing the software in a modular way, where components can be independently developed, tested, and deployed, enhancing the ease of porting parts of the system.","title":"Characteristics"},{"location":"non-functional-requirements/portability/#implementations","text":"","title":"Implementations"},{"location":"non-functional-requirements/portability/#containerization","text":"Docker: Packaging applications and their dependencies into containers, ensuring consistent behavior across different environments. 
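As a small example of the data-interchange idea above, the sketch below serializes a record to JSON so that any consumer, regardless of language or platform, can reconstruct it. The SensorReading type and its field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class SensorReading:
    device_id: str
    temperature_c: float
    recorded_at: str  # ISO 8601 keeps timestamps unambiguous across systems

# Serialize to a language-neutral format before sending it to another system.
reading = SensorReading(device_id="dev-01", temperature_c=21.5,
                        recorded_at="2024-09-27T10:15:00Z")
wire_payload = json.dumps(asdict(reading))

# Any consumer that understands JSON can rebuild the data, whatever its
# language or platform.
received = SensorReading(**json.loads(wire_payload))
print(wire_payload)
print(received)
```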
Kubernetes: Orchestrating containerized applications for deployment across various cloud providers and on-premises infrastructures.","title":"Containerization"},{"location":"non-functional-requirements/portability/#virtual-machines","text":"Java Virtual Machine (JVM): Writing software in Java or other JVM languages to run on any system with a compatible JVM. VirtualBox or VMware: Using virtual machines to create consistent runtime environments regardless of the underlying hardware.","title":"Virtual Machines"},{"location":"non-functional-requirements/portability/#platform-agnostic-languages","text":"Python, JavaScript, and Go: Utilizing programming languages known for their cross-platform capabilities to ensure code runs on multiple operating systems with little to no modification. However, it's important to select a programming language that aligns with the project's requirements and team expertise.","title":"Platform-Agnostic Languages"},{"location":"non-functional-requirements/portability/#standardized-interfaces-and-protocols","text":"APIs: Designing APIs with standardized protocols (e.g., REST, GraphQL) to facilitate interaction between different systems. Data Interchange Formats: Using common data formats like JSON, XML, or Protocol Buffers to ensure data can be exchanged and understood across different systems.","title":"Standardized Interfaces and Protocols"},{"location":"non-functional-requirements/portability/#other-practices","text":"Debugging and Troubleshooting: Local debugging provides direct access to debugging tools and logs, making it easier to diagnose and resolve issues quickly. CI/CD Integration: Implementing a CI/CD pipeline to automate the building, testing, and packaging of the solution enhances portability by ensuring consistent and reliable deployments across various platforms and environments.","title":"Other Practices"},{"location":"non-functional-requirements/reliability/","text":"Reliability All the other ISE Engineering Fundamentals work towards a more reliable infrastructure. Automated integration and deployment ensures code is properly tested, and helps remove human error, while slow releases build confidence in the code. Observability helps more quickly pinpoint errors when they arise to get back to a stable state, and so on. However, there are some additional steps we can take, that don't neatly fit into the previous categories, to help ensure a more reliable solution. We'll explore these below. Remove \"Foot-Guns\" Prevent your dev team from shooting themselves in the foot. People make mistakes; any mistake made in production is not the fault of that person, it's the collective fault of the system to not prevent that mistake from happening. Check out the below list for some common tooling to remove these foot guns: In Kubernetes, leverage Admission Controllers to prevent \"bad things\" from happening. You can create custom controllers using the Webhook Admission controller. Gatekeeper is a pre-built Webhook Admission controller, leveraging OPA underneath the hood, with support for some out-of-the-box protections If a user ever makes a mistake, don't ask: \"how could somebody possibly do that?\", do ask: \"how can we prevent this from happening in the future?\" Autoscaling Whenever possible, leverage autoscaling for your deployments. Vertical autoscaling can scale your VMs by tuning parameters like CPU, disk, and RAM, while horizontal autoscaling can tune the number of running images backing your deployments. 
Autoscaling can help your system respond to inorganic growth in traffic, and prevent failing requests due to resource starvation. Note: In environments like K8s, both horizontal and vertical autoscaling are offered as a native solution. The VMs backing each Pod however, may also need autoscaling to handle an increase in the number of Pods. It should also be noted that the parameters that affect autoscaling can be difficult to tune. Typical metrics like CPU or RAM utilization, or request rate may not be enough. Sometimes you might want to consider custom metrics, like cache eviction rate. Load shedding & DOS Protection Often we think of Denial of Service [DOS] attacks as an act from a malicious actor, so we place some load shedding at the gates to our system and call it a day. In reality, many DOS attacks are unintentional, and self-inflicted. A bad deployment that takes down a Cache results in hammering downstream services. Polling from a distributed system synchronizes and results in a thundering herd . A misconfiguration results in an error which triggers clients to retry uncontrollably. Requests append to a stored object until it is so big that future reads crash the server. The list goes on. Follow these steps to protect yourself: Add a jitter (random) to any action that occurs from a non-user triggered flow (ie: add a random duration to the sleep in a cron, or job that continuously polls a downstream service). Implement exponential backoff retry policies in your client code Add load shedding to your servers (yes, your internal microservices too). This can be configured easily when leveraging a sidecar like envoy. Be careful when deserializing user requests, and use buffer limits. ie: HTTP/gRPC Servers can set limits on how much data will get read from the socket. Set alerts for utilization, servers restarting, or going offline to detect when your system may be failing. These types of errors can result in Cascading Failures, where a non-critical portion of your system takes down the entire service. Plan accordingly, and make sure to put extra thought into how your system might degrade during failures. Backup Data Data gets lost, corrupted, or accidentally deleted. It happens. Take data backups to help get your system back up online as soon as possible. It can happen in the application stack, with code deleting or corrupting data, or at the storage layer by losing the volumes, or losing encryption keys. Consider things like: How long will it take to restore data. How much data loss can you tolerate. How long will it take you to notice there is data loss. Look into the difference between snapshot and incremental backups. A good policy might be to take incremental backups on a period of N, and a snapshot backup on a period of M (where N < M). Target Uptime & Failing Gracefully It's a known fact that systems cannot target 100% uptime. There are too many factors in today's software systems to achieve this, many outside of our control. Even a service that never gets updated and is 100% bug free will fail. Upstream DNS servers have issues all the time. Hardware breaks. Power outages, backup generators fail. The world is chaotic. Good services target some number of \"9's\" of uptime. ie: 99.99% uptime means that the system has a \"budget\" of 4 minutes and 22 seconds of uptime each month. Some months might achieve 100% uptime, which means that budget gets rolled over to the next month. What uptime means is different for everybody, and up to the service to define. 
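A minimal Python sketch of the jitter and exponential backoff advice above; the delay parameters and the flaky_call dependency are illustrative, and many HTTP clients and resilience libraries provide equivalent policies out of the box.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with exponential backoff plus full jitter.

    The random jitter spreads retries out so that many clients failing at the
    same moment do not synchronize into a thundering herd.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))   # full jitter

# Illustrative flaky dependency that succeeds on the third attempt.
attempts = {"count": 0}
def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("downstream service unavailable")
    return "ok"

print(call_with_backoff(flaky_call))
```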
A good practice is to use any leftover budget at the end of the period (ie: year, quarter), to intentionally take that service down, and ensure that the rest of your systems fail as expected. Often times other engineers and services come to rely on that additional achieved availability, and it can be healthy to ensure that systems fail gracefully. We can build graceful failure (or graceful degradation) into our software stack by anticipating failures. Some tactics include: Failover to healthy services Leader Election can be used to keep healthy services on standby in case the leader experiences issues. Entire cluster failover can redirect traffic to another region or availability zone. Propagate downstream failures of dependent services up the stack via health checks, so that your ingress points can re-route to healthy services. Circuit breakers can bail early on requests vs. propagating errors throughout the system. Consider using a well-known, tested library such as Polly (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns. Practice None of the above recommendations will work if they are not tested . Your backups are meaningless if you don't know how to mount them. Your cluster failover and other mitigations will regress over time if they are not tested. Here are some tips to test the above: Maintain Playbooks No software service is complete without playbooks to navigate the developers through unfamiliar territory. Playbooks should be thorough and cover all known failure scenarios and mitigations. Run Maintenance Exercises Take the time to fabricate scenarios, and run a D&D style campaign to solve your issues. This can be as elaborate as spinning up a new environment and injecting errors, or as simple as asking the \"players\" to navigate to a dashboard and describing would they would see in the fabricated scenario (small amounts of imagination required). The playbooks should easily navigate the user to the correct solution/mitigation. If not, update your playbooks. Chaos Testing Leverage automated chaos testing to see how things break. You can read this playbook's article on fault injection testing for more information on developing a hypothesis-driven suite of automated chaos test. The following list of chaos testing tools as well as this section in the article linked above have more details on available platforms and tooling for this purpose: Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An Openshift-specific chaos tool, maintained by Redhat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Many services meshes, like Linkerd , offer fault injection tooling through the use of their sidecars. Chaos Mesh Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing. Analyze All Failures Writing up a post-mortem is a great way to document the root causes, and action items for your failures. They're also a great way to track recurring issues, and create a strong case for prioritizing fixes. 
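As a quick sanity check on the budget figure above, the snippet below computes the allowed downtime corresponding to a 99.99% availability target over a month; the 30-day month and the observed downtime value are assumptions for illustration.

```python
# For a 99.99% availability target over a 30-day month, the allowed downtime
# budget works out to roughly 4 minutes 19 seconds (closer to 4 minutes
# 23 seconds if you use the average month length over a year).
slo = 0.9999
month_seconds = 30 * 24 * 60 * 60

budget_seconds = (1 - slo) * month_seconds
print(f"Downtime budget: {budget_seconds / 60:.0f} min {budget_seconds % 60:.0f} s per month")

# Subtract observed downtime to see how much budget is left to "spend" on
# intentional failure exercises at the end of the period.
observed_downtime_seconds = 90
print(f"Remaining budget: {budget_seconds - observed_downtime_seconds:.0f} s")
```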
This can even be tied into your regular Agile restrospectives .","title":"Reliability"},{"location":"non-functional-requirements/reliability/#reliability","text":"All the other ISE Engineering Fundamentals work towards a more reliable infrastructure. Automated integration and deployment ensures code is properly tested, and helps remove human error, while slow releases build confidence in the code. Observability helps more quickly pinpoint errors when they arise to get back to a stable state, and so on. However, there are some additional steps we can take, that don't neatly fit into the previous categories, to help ensure a more reliable solution. We'll explore these below.","title":"Reliability"},{"location":"non-functional-requirements/reliability/#remove-foot-guns","text":"Prevent your dev team from shooting themselves in the foot. People make mistakes; any mistake made in production is not the fault of that person, it's the collective fault of the system to not prevent that mistake from happening. Check out the below list for some common tooling to remove these foot guns: In Kubernetes, leverage Admission Controllers to prevent \"bad things\" from happening. You can create custom controllers using the Webhook Admission controller. Gatekeeper is a pre-built Webhook Admission controller, leveraging OPA underneath the hood, with support for some out-of-the-box protections If a user ever makes a mistake, don't ask: \"how could somebody possibly do that?\", do ask: \"how can we prevent this from happening in the future?\"","title":"Remove \"Foot-Guns\""},{"location":"non-functional-requirements/reliability/#autoscaling","text":"Whenever possible, leverage autoscaling for your deployments. Vertical autoscaling can scale your VMs by tuning parameters like CPU, disk, and RAM, while horizontal autoscaling can tune the number of running images backing your deployments. Autoscaling can help your system respond to inorganic growth in traffic, and prevent failing requests due to resource starvation. Note: In environments like K8s, both horizontal and vertical autoscaling are offered as a native solution. The VMs backing each Pod however, may also need autoscaling to handle an increase in the number of Pods. It should also be noted that the parameters that affect autoscaling can be difficult to tune. Typical metrics like CPU or RAM utilization, or request rate may not be enough. Sometimes you might want to consider custom metrics, like cache eviction rate.","title":"Autoscaling"},{"location":"non-functional-requirements/reliability/#load-shedding-dos-protection","text":"Often we think of Denial of Service [DOS] attacks as an act from a malicious actor, so we place some load shedding at the gates to our system and call it a day. In reality, many DOS attacks are unintentional, and self-inflicted. A bad deployment that takes down a Cache results in hammering downstream services. Polling from a distributed system synchronizes and results in a thundering herd . A misconfiguration results in an error which triggers clients to retry uncontrollably. Requests append to a stored object until it is so big that future reads crash the server. The list goes on. Follow these steps to protect yourself: Add a jitter (random) to any action that occurs from a non-user triggered flow (ie: add a random duration to the sleep in a cron, or job that continuously polls a downstream service). 
Implement exponential backoff retry policies in your client code Add load shedding to your servers (yes, your internal microservices too). This can be configured easily when leveraging a sidecar like envoy. Be careful when deserializing user requests, and use buffer limits. ie: HTTP/gRPC Servers can set limits on how much data will get read from the socket. Set alerts for utilization, servers restarting, or going offline to detect when your system may be failing. These types of errors can result in Cascading Failures, where a non-critical portion of your system takes down the entire service. Plan accordingly, and make sure to put extra thought into how your system might degrade during failures.","title":"Load shedding & DOS Protection"},{"location":"non-functional-requirements/reliability/#backup-data","text":"Data gets lost, corrupted, or accidentally deleted. It happens. Take data backups to help get your system back up online as soon as possible. It can happen in the application stack, with code deleting or corrupting data, or at the storage layer by losing the volumes, or losing encryption keys. Consider things like: How long will it take to restore data. How much data loss can you tolerate. How long will it take you to notice there is data loss. Look into the difference between snapshot and incremental backups. A good policy might be to take incremental backups on a period of N, and a snapshot backup on a period of M (where N < M).","title":"Backup Data"},{"location":"non-functional-requirements/reliability/#target-uptime-failing-gracefully","text":"It's a known fact that systems cannot target 100% uptime. There are too many factors in today's software systems to achieve this, many outside of our control. Even a service that never gets updated and is 100% bug free will fail. Upstream DNS servers have issues all the time. Hardware breaks. Power outages, backup generators fail. The world is chaotic. Good services target some number of \"9's\" of uptime. ie: 99.99% uptime means that the system has a \"budget\" of 4 minutes and 22 seconds of uptime each month. Some months might achieve 100% uptime, which means that budget gets rolled over to the next month. What uptime means is different for everybody, and up to the service to define. A good practice is to use any leftover budget at the end of the period (ie: year, quarter), to intentionally take that service down, and ensure that the rest of your systems fail as expected. Often times other engineers and services come to rely on that additional achieved availability, and it can be healthy to ensure that systems fail gracefully. We can build graceful failure (or graceful degradation) into our software stack by anticipating failures. Some tactics include: Failover to healthy services Leader Election can be used to keep healthy services on standby in case the leader experiences issues. Entire cluster failover can redirect traffic to another region or availability zone. Propagate downstream failures of dependent services up the stack via health checks, so that your ingress points can re-route to healthy services. Circuit breakers can bail early on requests vs. propagating errors throughout the system. 
Consider using a well-known, tested library such as Polly (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns.","title":"Target Uptime & Failing Gracefully"},{"location":"non-functional-requirements/reliability/#practice","text":"None of the above recommendations will work if they are not tested . Your backups are meaningless if you don't know how to mount them. Your cluster failover and other mitigations will regress over time if they are not tested. Here are some tips to test the above:","title":"Practice"},{"location":"non-functional-requirements/reliability/#maintain-playbooks","text":"No software service is complete without playbooks to navigate the developers through unfamiliar territory. Playbooks should be thorough and cover all known failure scenarios and mitigations.","title":"Maintain Playbooks"},{"location":"non-functional-requirements/reliability/#run-maintenance-exercises","text":"Take the time to fabricate scenarios, and run a D&D style campaign to solve your issues. This can be as elaborate as spinning up a new environment and injecting errors, or as simple as asking the \"players\" to navigate to a dashboard and describing would they would see in the fabricated scenario (small amounts of imagination required). The playbooks should easily navigate the user to the correct solution/mitigation. If not, update your playbooks.","title":"Run Maintenance Exercises"},{"location":"non-functional-requirements/reliability/#chaos-testing","text":"Leverage automated chaos testing to see how things break. You can read this playbook's article on fault injection testing for more information on developing a hypothesis-driven suite of automated chaos test. The following list of chaos testing tools as well as this section in the article linked above have more details on available platforms and tooling for this purpose: Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An Openshift-specific chaos tool, maintained by Redhat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Many services meshes, like Linkerd , offer fault injection tooling through the use of their sidecars. Chaos Mesh Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing.","title":"Chaos Testing"},{"location":"non-functional-requirements/reliability/#analyze-all-failures","text":"Writing up a post-mortem is a great way to document the root causes, and action items for your failures. They're also a great way to track recurring issues, and create a strong case for prioritizing fixes. This can even be tied into your regular Agile restrospectives .","title":"Analyze All Failures"},{"location":"non-functional-requirements/scalability/","text":"Scalability Scalability is the capability of a system to handle larger volumes, or its potential to accommodate additional growth. For example, a system is considered scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added. 
An example of this is a system that can handle a growing number of requests when more memory is added to it. Characteristics Elasticity: The system should be able to scale up or down based on demand, and be able to automatically provision or de-provision resources as needed. Latency: The system should be able to maintain low latency even under high load, and be able to handle a large number of concurrent requests without slowing down. Examples Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability.","title":"Scalability"},{"location":"non-functional-requirements/scalability/#scalability","text":"Scalability is the capability of a system to handle larger volumes, or its potential to accommodate additional growth. For example, a system is considered scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added. An example of this is a system that can handle a growing number of requests when more memory is added to it.","title":"Scalability"},{"location":"non-functional-requirements/scalability/#characteristics","text":"Elasticity: The system should be able to scale up or down based on demand, and be able to automatically provision or de-provision resources as needed. Latency: The system should be able to maintain low latency even under high load, and be able to handle a large number of concurrent requests without slowing down.","title":"Characteristics"},{"location":"non-functional-requirements/scalability/#examples","text":"Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability.","title":"Examples"},{"location":"non-functional-requirements/usability/","text":"Usability Usability is a topic that is often used interchangeably with user experience (UX), but they are not the same thing. 
Usability is a subset of UX, focusing specifically on the ease of use and effectiveness of a product, i.e., it is the ease with which users can learn and use a product to achieve their goals. Usability is a key factor in determining the success of a product, as it directly impacts user satisfaction, productivity, and overall experience. A system that is difficult to use or understand can lead to frustration, errors, and ultimately, abandonment by users. Closely coupled with usability and UX is the concept of accessibility . Characteristics The main three characteristics of usability are: - Effectiveness: Users should be able to accomplish their goals with the product. - Efficiency: Users should be able to perform tasks quickly and with minimal effort. Oftentimes this is measured in terms of time on task or number of clicks. - Satisfaction: Users should find the product enjoyable and satisfying to use. Additional characteristics include: - Learnability: Users should be able to easily and quickly learn how to use the product. In other words, the system should be intuitive and require minimal training. - Memorability: Users should be able to remember how to use the product after a period of not using it. - Errors: Users should encounter a minimal number of errors when completing a task, and recover easily from any errors that do occur. - Simplicity: The system should be simple and straightforward to use, with minimal complexity and cognitive load. - Comprehensibility: Users should be able to understand the system and its features easily, with clear instructions and feedback. Implementations One way of implementing usability in a user interface is by basing your design decisions on usability testing results. Usability testing's goal is to identify any usability issues, gather feedback, and make improvements to the product. It can be conducted at various stages of the design and development process, from wireframes and prototypes to the final product. These evaluations can collect two key metrics: quantitative data and qualitative data . Quantitative data can be collected through observing the facts of what actually happened. Qualitative data can be collected through interviews, observations, and other methods that provide insights into user behavior and preferences. There are several methods for conducting usability testing, including, but not limited to: - Focus groups - Wireframes - Prototyping - Surveys/Questionnaires - Interviews - Think-aloud protocol Examples One example of usability in action is the design of a website. A website that is easy to navigate, with clear labels, intuitive menus, and a logical flow of information, is more likely to be successful than one that is cluttered, confusing, and difficult to use. The latter website is likely to have a low rate of user engagement, high bounce rates , and low conversion rates, as users will quickly become frustrated and abandon the site. Resources GeeksForGeeks: What is Usability? Usability.gov Human-computer Interaction (HCI) Jakob Nielsen's 10 Usability Heuristics for User Interface Design","title":"Usability"},{"location":"non-functional-requirements/usability/#usability","text":"Usability is a topic that is often used interchangeably with user experience (UX), but they are not the same thing. Usability is a subset of UX, focusing specifically on the ease of use and effectiveness of a product, i.e., it is the ease with which users can learn and use a product to achieve their goals. 
Usability is a key factor in determining the success of a product, as it directly impacts user satisfaction, productivity, and overall experience. A system that is difficult to use or understand can lead to frustration, errors, and ultimately, abandonment by users. Closely coupled with usability and UX is the concept of accessibility .","title":"Usability"},{"location":"non-functional-requirements/usability/#characteristics","text":"The main three characteristics of usability are: - Effectiveness: Users should be able to accomplish their goals with the product. - Efficiency: Users should be able to perform tasks quickly and with minimal effort. Oftentimes this is measured in terms of time on task or number of clicks. - Satisfaction: Users should find the product enjoyable and satisfying to use. Additional characteristics include: - Learnability: Users should be able to easily and quickly learn how to use the product. In other words, the system should be intuitive and require minimal training. - Memorability: Users should be able to remember how to use the product after a period of not using it. - Errors: Users should encounter a minimal number of errors when completing a task, and recover easily from any errors that do occur. - Simplicity: The system should be simple and straightforward to use, with minimal complexity and cognitive load. - Comprehensibility: Users should be able to understand the system and its features easily, with clear instructions and feedback.","title":"Characteristics"},{"location":"non-functional-requirements/usability/#implementations","text":"One way of implementing usability in a user interface is by basing your design decisions on usability testing results. Usability testing's goal is to identify any usability issues, gather feedback, and make improvements to the product. It can be conducted at various stages of the design and development process, from wireframes and prototypes to the final product. These evaluations can collect two key metrics: quantitative data and qualitative data . Quantitative data can be collected through observing the facts of what actually happened. Qualitative data can be collected through interviews, observations, and other methods that provide insights into user behavior and preferences. There are several methods for conducting usability testing, including, but not limited to: - Focus groups - Wireframes - Prototyping - Surveys/Questionnaires - Interviews - Think-aloud protocol","title":"Implementations"},{"location":"non-functional-requirements/usability/#examples","text":"One example of usability in action is the design of a website. A website that is easy to navigate, with clear labels, intuitive menus, and a logical flow of information, is more likely to be successful than one that is cluttered, confusing, and difficult to use. The latter website is likely to have a low rate of user engagement, high bounce rates , and low conversion rates, as users will quickly become frustrated and abandon the site.","title":"Examples"},{"location":"non-functional-requirements/usability/#resources","text":"GeeksForGeeks: What is Usability? Usability.gov Human-computer Interaction (HCI) Jakob Nielsen's 10 Usability Heuristics for User Interface Design","title":"Resources"},{"location":"observability/","text":"Observability Building observable systems enables development teams at ISE to measure how well the application is behaving. Observability serves the following goals: Provide holistic view of the application health . 
Help measure business performance for the customer. Measure operational performance of the system. Identify and diagnose failures to get to the problem fast. Pillars of Observability Logs Metrics Tracing Logs vs Metrics vs Traces Insights Dashboards and Reporting Tools, Patterns and Recommended Practices Tooling and Patterns Observability As Code Recommended Practices Diagnostics tools OpenTelemetry Facets of Observability Observability for Microservices Observability in Machine Learning Observability of CI/CD Pipelines Observability in Azure Databricks Recipes Resources Non-Functional Requirements Guidance","title":"Observability"},{"location":"observability/#observability","text":"Building observable systems enables development teams at ISE to measure how well the application is behaving. Observability serves the following goals: Provide holistic view of the application health . Help measure business performance for the customer. Measure operational performance of the system. Identify and diagnose failures to get to the problem fast.","title":"Observability"},{"location":"observability/#pillars-of-observability","text":"Logs Metrics Tracing Logs vs Metrics vs Traces","title":"Pillars of Observability"},{"location":"observability/#insights","text":"Dashboards and Reporting","title":"Insights"},{"location":"observability/#tools-patterns-and-recommended-practices","text":"Tooling and Patterns Observability As Code Recommended Practices Diagnostics tools OpenTelemetry","title":"Tools, Patterns and Recommended Practices"},{"location":"observability/#facets-of-observability","text":"Observability for Microservices Observability in Machine Learning Observability of CI/CD Pipelines Observability in Azure Databricks Recipes","title":"Facets of Observability"},{"location":"observability/#resources","text":"Non-Functional Requirements Guidance","title":"Resources"},{"location":"observability/alerting/","text":"Guidance for Alerting One of the goals of building highly observable systems is to provide valuable insight into the behavior of the application. Observable systems allow problems to be identified and surfaced through alerts before end users are impacted. Best Practices The foremost thing to do before creating alerts is to implement observability. Without monitoring systems in place, it becomes next to impossible to know what activities need to be monitored and when to alert the teams. Identify what the application's minimum viable service quality needs to be. It is not what you intend to deliver, but what is acceptable to the customer. These Service Level Objectives (SLOs) are a metric for measurement of the application's performance. SLOs are defined with respect to the end users. The alerts must watch for visible impact to the user. For example, alerting on request rate, latency and errors. Use automated, scriptable tools to mimic end-to-end important code paths relatable to activities in the application. Create alert policies on user impacting events or metric rate of change. Alert fatigue is real. Engineers are recommended to pay attention to their monitoring system so that accurate alerts and thresholds can be defined. Establish a primary channel for alerts that need immediate attention and tag the right team/person(s) based on the nature of the incident. Not every single alert needs to be sent to the primary on-call channel. Establish a secondary channel for items that need to be looked into and do not affect the users, yet. For example, storage that is nearing its capacity threshold.
These items will be what the engineering services will look to regularly to monitor the health of the system. Ensure that proper alerting is set up for failures in dependent services like Redis cache, Service Bus etc. For example, if Redis cache is throwing 10 exceptions in the last 60 seconds, proper alerts are recommended to be created so that these failures are surfaced and action can be taken. It is important to learn from each incident and continually improve the process. After every incident has been triaged, conduct a post mortem of the scenario . Scenarios and situations that were not initially considered will occur, and the post-mortem workflow is a great way to highlight that to improve the monitoring/alerting of the system. Configuring an alert to detect that incident scenario is a good idea to see if the event occurs again.","title":"Guidance for Alerting"},{"location":"observability/alerting/#guidance-for-alerting","text":"One of the goals of building highly observable systems is to provide valuable insight into the behavior of the application. Observable systems allow problems to be identified and surfaced through alerts before end users are impacted.","title":"Guidance for Alerting"},{"location":"observability/alerting/#best-practices","text":"The foremost thing to do before creating alerts is to implement observability. Without monitoring systems in place, it becomes next to impossible to know what activities need to be monitored and when to alert the teams. Identify what the application's minimum viable service quality needs to be. It is not what you intend to deliver, but what is acceptable to the customer. These Service Level Objectives (SLOs) are a metric for measurement of the application's performance. SLOs are defined with respect to the end users. The alerts must watch for visible impact to the user. For example, alerting on request rate, latency and errors. Use automated, scriptable tools to mimic end-to-end important code paths relatable to activities in the application. Create alert policies on user impacting events or metric rate of change. Alert fatigue is real. Engineers are recommended to pay attention to their monitoring system so that accurate alerts and thresholds can be defined. Establish a primary channel for alerts that need immediate attention and tag the right team/person(s) based on the nature of the incident. Not every single alert needs to be sent to the primary on-call channel. Establish a secondary channel for items that need to be looked into and do not affect the users, yet. For example, storage that is nearing its capacity threshold.
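For illustration only, here is a rough sketch of the kind of windowed error-rate check that an alert policy encodes; in practice the monitoring platform evaluates rules like this, and the window size and threshold here are invented values.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
ERROR_RATE_THRESHOLD = 0.05   # alert when >5% of requests fail in the window

events = deque()              # (timestamp, succeeded) pairs

def record(succeeded: bool, now=None):
    now = now or datetime.now(timezone.utc)
    events.append((now, succeeded))
    # Drop events that have fallen out of the evaluation window.
    while events and now - events[0][0] > WINDOW:
        events.popleft()

def should_alert() -> bool:
    if not events:
        return False
    failures = sum(1 for _, ok in events if not ok)
    return failures / len(events) > ERROR_RATE_THRESHOLD

for ok in [True, True, False, True, False, False, True, False]:
    record(ok)
print("page the on-call channel" if should_alert() else "within SLO")
```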
Configuring an alert to detect that incident scenario is a good idea to see if the event occurs again.","title":"Best Practices"},{"location":"observability/best-practices/","text":"Recommended Practices Correlation Id : Include unique identifier at the start of the interaction to tie down aggregated data from various system components and provide a holistic view. Read more guidelines about using correlation id . Ensure health of the services are monitored and provide insights into system's performance and behavior. Ensure dependent services are monitored properly. Errors and exceptions in dependent services like Redis cache, Service bus, etc. should be logged and alerted. Also, metrics related to dependent services should be captured and logged. - Additionally, failures in dependent services should be propagated up each level of the stack by the health check. Faults, crashes, and failures are logged as discrete events. This helps engineers identify problem area(s) during failures. Ensure logging configuration (eg: setting logging to \"verbose\") can be controlled without code changes. Ensure that metrics around latency and duration are collected and can be aggregated. Start small and add where there is customer impact. Avoiding metric fatigue is very crucial to collecting actionable data. It is important that every data that is collected contains relevant and rich context. Personally Identifiable Information or any other customer sensitive information should never be logged. Special attention should be paid to any local privacy data regulations and collected data must adhere to those. (ex: GDPR) Health checks : Appropriate health checks should added to determine if service is healthy and ready to serve traffic. On a kubernetes platform different types of probes e.g. Liveness, Readiness, Startup etc. can be used to determine health and readiness of the deployed service. Read more here to understand what to watch out for while designing and building an observable system.","title":"Recommended Practices"},{"location":"observability/best-practices/#recommended-practices","text":"Correlation Id : Include unique identifier at the start of the interaction to tie down aggregated data from various system components and provide a holistic view. Read more guidelines about using correlation id . Ensure health of the services are monitored and provide insights into system's performance and behavior. Ensure dependent services are monitored properly. Errors and exceptions in dependent services like Redis cache, Service bus, etc. should be logged and alerted. Also, metrics related to dependent services should be captured and logged. - Additionally, failures in dependent services should be propagated up each level of the stack by the health check. Faults, crashes, and failures are logged as discrete events. This helps engineers identify problem area(s) during failures. Ensure logging configuration (eg: setting logging to \"verbose\") can be controlled without code changes. Ensure that metrics around latency and duration are collected and can be aggregated. Start small and add where there is customer impact. Avoiding metric fatigue is very crucial to collecting actionable data. It is important that every data that is collected contains relevant and rich context. Personally Identifiable Information or any other customer sensitive information should never be logged. Special attention should be paid to any local privacy data regulations and collected data must adhere to those. 
(ex: GDPR) Health checks : Appropriate health checks should added to determine if service is healthy and ready to serve traffic. On a kubernetes platform different types of probes e.g. Liveness, Readiness, Startup etc. can be used to determine health and readiness of the deployed service. Read more here to understand what to watch out for while designing and building an observable system.","title":"Recommended Practices"},{"location":"observability/correlation-id/","text":"Correlation IDs The Need In a distributed system architecture (microservice architecture), it is highly difficult to understand a single end to end customer transaction flow through the various components. Here are some the general challenges - It becomes challenging to understand the end-to-end behavior of a client request entering the application. Aggregation: Consolidating logs from multiple components and making sense out of these logs is difficult, if not impossible. Cyclic dependencies on services, course of events and asynchronous requests are not easily deciphered. While troubleshooting a request, the diagnostic context of the logs are very important to get to the root of the problem. Solution A Correlation ID is a unique identifier that is added to the very first interaction (incoming request) to identify the context and is passed to all components that are involved in the transaction flow. Correlation ID becomes the glue that binds the transaction together and helps to draw an overall picture of events. Note: Before implementing your own Correlation ID, investigate if your telemetry tool of choice provides an auto-generated Correlation ID and that it serves the purposes of your application. For instance, Application Insights offers dependency auto-collection for some application frameworks Recommended Practices Assign each external request a Correlation ID that binds the message to a transaction. The Correlation ID for a transaction must be assigned as early as you can. Propagate Correlation ID to all downstream components/services. All components/services of the transaction use this Correlation ID in their logs. For an HTTP Request, Correlation ID is typically passed in the header. Add it to an outgoing response where possible. Based on the use case, there can be additional correlation IDs that may be needed. For instance, tracking logs based on both Session ID and User ID may be required. While adding multiple correlation ID, remember to propagate them through the components. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the \"Correlation-id\", called TraceId. Use Cases Log Correlation Log correlation is the ability to track disparate events through different parts of the application. Having a Correlation ID provides more context making it easy to build rules for reporting and analysis. Secondary Reporting/Observer Systems Using Correlation ID helps secondary systems to correlate data without application context. Some examples - generating metrics based on tracing data, integrating runtime/system diagnostics etc. For example, feeding AppInsights data and correlating it to infrastructure issues. 
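To make log correlation concrete, here is a minimal Python sketch of the practices above: the incoming correlation ID is read from a request header (or generated as early as possible when absent), attached to every log record, and propagated to the downstream call. The header name x-correlation-id, the logger name, and the downstream URL are illustrative assumptions, not part of the original guidance:

```python
import logging
import uuid

import requests  # assumed to be available for the outbound call

CORRELATION_HEADER = "x-correlation-id"  # illustrative header name


class CorrelationIdFilter(logging.Filter):
    """Injects the current correlation ID into every log record."""

    def __init__(self, correlation_id: str):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = self.correlation_id
        return True


def handle_request(incoming_headers: dict) -> None:
    # Reuse the caller's correlation ID, or create one at the very first interaction.
    correlation_id = incoming_headers.get(CORRELATION_HEADER, str(uuid.uuid4()))

    # Logging is configured inside the handler only to keep the sketch self-contained.
    logger = logging.getLogger("payment-service")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(correlation_id)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(CorrelationIdFilter(correlation_id))
    logger.setLevel(logging.INFO)

    logger.info("processing request")

    # Propagate the same ID to the downstream component so its logs can be correlated.
    requests.get("https://downstream.example.com/api/orders",
                 headers={CORRELATION_HEADER: correlation_id})
```

In a real service the filter would typically be installed once at startup and the ID carried in a context variable rather than re-configured per request.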
Troubleshooting Errors For troubleshooting an errors, Correlation ID is a great starting point to trace the workflow of a transaction.","title":"Correlation IDs"},{"location":"observability/correlation-id/#correlation-ids","text":"","title":"Correlation IDs"},{"location":"observability/correlation-id/#the-need","text":"In a distributed system architecture (microservice architecture), it is highly difficult to understand a single end to end customer transaction flow through the various components. Here are some the general challenges - It becomes challenging to understand the end-to-end behavior of a client request entering the application. Aggregation: Consolidating logs from multiple components and making sense out of these logs is difficult, if not impossible. Cyclic dependencies on services, course of events and asynchronous requests are not easily deciphered. While troubleshooting a request, the diagnostic context of the logs are very important to get to the root of the problem.","title":"The Need"},{"location":"observability/correlation-id/#solution","text":"A Correlation ID is a unique identifier that is added to the very first interaction (incoming request) to identify the context and is passed to all components that are involved in the transaction flow. Correlation ID becomes the glue that binds the transaction together and helps to draw an overall picture of events. Note: Before implementing your own Correlation ID, investigate if your telemetry tool of choice provides an auto-generated Correlation ID and that it serves the purposes of your application. For instance, Application Insights offers dependency auto-collection for some application frameworks","title":"Solution"},{"location":"observability/correlation-id/#recommended-practices","text":"Assign each external request a Correlation ID that binds the message to a transaction. The Correlation ID for a transaction must be assigned as early as you can. Propagate Correlation ID to all downstream components/services. All components/services of the transaction use this Correlation ID in their logs. For an HTTP Request, Correlation ID is typically passed in the header. Add it to an outgoing response where possible. Based on the use case, there can be additional correlation IDs that may be needed. For instance, tracking logs based on both Session ID and User ID may be required. While adding multiple correlation ID, remember to propagate them through the components. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the \"Correlation-id\", called TraceId.","title":"Recommended Practices"},{"location":"observability/correlation-id/#use-cases","text":"","title":"Use Cases"},{"location":"observability/correlation-id/#log-correlation","text":"Log correlation is the ability to track disparate events through different parts of the application. Having a Correlation ID provides more context making it easy to build rules for reporting and analysis.","title":"Log Correlation"},{"location":"observability/correlation-id/#secondary-reportingobserver-systems","text":"Using Correlation ID helps secondary systems to correlate data without application context. Some examples - generating metrics based on tracing data, integrating runtime/system diagnostics etc. 
For example, feeding AppInsights data and correlating it to infrastructure issues.","title":"Secondary Reporting/Observer Systems"},{"location":"observability/correlation-id/#troubleshooting-errors","text":"For troubleshooting an errors, Correlation ID is a great starting point to trace the workflow of a transaction.","title":"Troubleshooting Errors"},{"location":"observability/diagnostic-tools/","text":"Diagnostic tools Besides Logging , Tracing and Metrics , there are additional tools to help diagnose issues when applications do not behave as expected. In some scenarios, analyzing the memory consumption and drilling down into why a specific process takes longer than expected may require additional measures. In these cases, platform or programming language specific diagnostic tools come into play and are useful to debug a memory leak, profile the CPU usage, or the cause of delays in multi-threading. Profilers and Memory Analyzers There are two types of diagnostics tools you may want to use: profilers and memory analyzers. Profiling Profiling is a technique where you take small snapshots of all the threads in a running application to see the stack trace of each thread for a specified duration. This tool can help you identify where you are spending CPU time during the execution of your application. There are two main techniques to achieve this: CPU-Sampling and Instrumentation. CPU-Sampling is a non-invasive method which takes snapshots of all the stacks at a set interval. It is the most common technique for profiling and doesn't require any modification to your code. Instrumentation is the other technique where you insert a small piece of code at the beginning and end of each function which is going to signal back to the profiler about the time spent in the function, the function name, parameters and others. This way you modify the code of your running application. There are two effects to this: your code may run a little bit more slowly, but on the other hand you have a more accurate view of every function and class that has been executed so far in your application. When to use Sampling vs Instrumentation? Not all programming languages support instrumentation. Instrumentation is mostly supported for compiled languages like .NET and Java, and some languages interpreted at runtime like Python and Javascript. Keep in mind that enabling instrumentation can require to modify your build pipeline, i.e. by adding special parameters to the command line argument. You should normally start with Sampling because it doesn't require to modify your binaries, it doesn't affect your process performance, and can be quicker to start with. Once you have your profiling data, there are multiple ways to visualize this information depending of the format you saved it. As an example for .NET (dotnet-trace), there are three available formats to save these traces: Chromium, NetTrace and SpeedScope. Select the output format depending on the tool you are going to use. SpeedScope is an online web application you can use to visualize and analyze traces, and you only need a modern browser. Be careful with online tools, as dumps/traces might contain confidential information that you don't want to share outside of your organization. Memory Analyzers Memory analyzers and memory dumps are another set of diagnostic tools you can use to identify issues in your process. Normally these types of tools take the whole memory the process is using at a point in time and saves it in a file which can be analyzed. 
When using these types of tools, you want to stress your process as much as possible to amplify whatever deficiency you may have in terms of memory management. The memory dump should then be taken when the process is in this stressed state. In some scenarios we recommend to take more than one memory dump during the reproduction of a problem. For example, if you suspect a memory leak and you are running a test for 30 min, it is useful to take at least 3 dumps at different intervals (i.e. 10, 20 & 30 min) to compare them with each other. There are multiple ways to take a memory dump depending the operating system you are using. Also, each operating system has it own debugger which is able to load this memory dump, and explore the state of the process at the time the memory dump was taken. The most common debuggers are: Windows - WinDbg and WinDgbNext (included in the Windows SDK), Visual Studio can also load a memory dump for a .NET Framework and .NET Core process Linux - GDB is the GNU Debugger Mac OS - LLDB Debugger There are a range of developer platform specific diagnostic tools which can be used: .NET Core diagnostic tools , GitHub repository Java diagnostic tools - version specific Python debugging and profiling - version specific Node.js Diagnostics working group Environment for Profiling To create an application profile as close to production as possible, the environment in which the application is intended to run in production has to be considered and it might be necessary to perform a snapshot of the application state under load . Diagnostics in Containers For monolithic applications, diagnostics tools can be installed and run on the VM hosting them. Most scalable applications are developed as microservices and have complex interactions which require to install the tools in the containers running the process or to leverage a sidecar container (see sidecar pattern ). Some platforms expose endpoints to interact with the application and return a dump. Resources .NET Core diagnostics in containers Experimental tool dotnet-monitor , What's new , GItHub repository Spring Boot actuator endpoints","title":"Diagnostic tools"},{"location":"observability/diagnostic-tools/#diagnostic-tools","text":"Besides Logging , Tracing and Metrics , there are additional tools to help diagnose issues when applications do not behave as expected. In some scenarios, analyzing the memory consumption and drilling down into why a specific process takes longer than expected may require additional measures. In these cases, platform or programming language specific diagnostic tools come into play and are useful to debug a memory leak, profile the CPU usage, or the cause of delays in multi-threading.","title":"Diagnostic tools"},{"location":"observability/diagnostic-tools/#profilers-and-memory-analyzers","text":"There are two types of diagnostics tools you may want to use: profilers and memory analyzers.","title":"Profilers and Memory Analyzers"},{"location":"observability/diagnostic-tools/#profiling","text":"Profiling is a technique where you take small snapshots of all the threads in a running application to see the stack trace of each thread for a specified duration. This tool can help you identify where you are spending CPU time during the execution of your application. There are two main techniques to achieve this: CPU-Sampling and Instrumentation. CPU-Sampling is a non-invasive method which takes snapshots of all the stacks at a set interval. 
It is the most common technique for profiling and doesn't require any modification to your code. Instrumentation is the other technique where you insert a small piece of code at the beginning and end of each function which is going to signal back to the profiler about the time spent in the function, the function name, parameters and others. This way you modify the code of your running application. There are two effects to this: your code may run a little bit more slowly, but on the other hand you have a more accurate view of every function and class that has been executed so far in your application.","title":"Profiling"},{"location":"observability/diagnostic-tools/#when-to-use-sampling-vs-instrumentation","text":"Not all programming languages support instrumentation. Instrumentation is mostly supported for compiled languages like .NET and Java, and some languages interpreted at runtime like Python and Javascript. Keep in mind that enabling instrumentation can require to modify your build pipeline, i.e. by adding special parameters to the command line argument. You should normally start with Sampling because it doesn't require to modify your binaries, it doesn't affect your process performance, and can be quicker to start with. Once you have your profiling data, there are multiple ways to visualize this information depending of the format you saved it. As an example for .NET (dotnet-trace), there are three available formats to save these traces: Chromium, NetTrace and SpeedScope. Select the output format depending on the tool you are going to use. SpeedScope is an online web application you can use to visualize and analyze traces, and you only need a modern browser. Be careful with online tools, as dumps/traces might contain confidential information that you don't want to share outside of your organization.","title":"When to use Sampling vs Instrumentation?"},{"location":"observability/diagnostic-tools/#memory-analyzers","text":"Memory analyzers and memory dumps are another set of diagnostic tools you can use to identify issues in your process. Normally these types of tools take the whole memory the process is using at a point in time and saves it in a file which can be analyzed. When using these types of tools, you want to stress your process as much as possible to amplify whatever deficiency you may have in terms of memory management. The memory dump should then be taken when the process is in this stressed state. In some scenarios we recommend to take more than one memory dump during the reproduction of a problem. For example, if you suspect a memory leak and you are running a test for 30 min, it is useful to take at least 3 dumps at different intervals (i.e. 10, 20 & 30 min) to compare them with each other. There are multiple ways to take a memory dump depending the operating system you are using. Also, each operating system has it own debugger which is able to load this memory dump, and explore the state of the process at the time the memory dump was taken. 
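For Python processes specifically, a lightweight complement to full memory dumps is the standard-library tracemalloc module, which can take in-process snapshots at intervals and compare them to spot growing allocation sites. A minimal sketch, where the workload function is a placeholder for the code path under suspicion:

```python
import tracemalloc


def suspect_workload() -> list:
    # Placeholder for the code path suspected of leaking memory.
    return ["payload" * 100 for _ in range(10_000)]


tracemalloc.start()

baseline = tracemalloc.take_snapshot()
retained = [suspect_workload() for _ in range(3)]  # keep references so growth is visible
current = tracemalloc.take_snapshot()

# Show the top allocation sites that grew between the two snapshots,
# mirroring the "take several dumps and compare them" approach above.
for stat in current.compare_to(baseline, "lineno")[:5]:
    print(stat)
```

For leaks in native code or other runtimes, a full dump loaded into a platform debugger is still needed, as listed next.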
The most common debuggers are: Windows - WinDbg and WinDgbNext (included in the Windows SDK), Visual Studio can also load a memory dump for a .NET Framework and .NET Core process Linux - GDB is the GNU Debugger Mac OS - LLDB Debugger There are a range of developer platform specific diagnostic tools which can be used: .NET Core diagnostic tools , GitHub repository Java diagnostic tools - version specific Python debugging and profiling - version specific Node.js Diagnostics working group","title":"Memory Analyzers"},{"location":"observability/diagnostic-tools/#environment-for-profiling","text":"To create an application profile as close to production as possible, the environment in which the application is intended to run in production has to be considered and it might be necessary to perform a snapshot of the application state under load .","title":"Environment for Profiling"},{"location":"observability/diagnostic-tools/#diagnostics-in-containers","text":"For monolithic applications, diagnostics tools can be installed and run on the VM hosting them. Most scalable applications are developed as microservices and have complex interactions which require to install the tools in the containers running the process or to leverage a sidecar container (see sidecar pattern ). Some platforms expose endpoints to interact with the application and return a dump.","title":"Diagnostics in Containers"},{"location":"observability/diagnostic-tools/#resources","text":".NET Core diagnostics in containers Experimental tool dotnet-monitor , What's new , GItHub repository Spring Boot actuator endpoints","title":"Resources"},{"location":"observability/log-vs-metric-vs-trace/","text":"Logs vs Metrics vs Traces Overview Metrics The purpose of metrics is to inform observers about the health & operations regarding a component or system. A metric represents a point in time measure of a particular source, and data-wise tends to be very small. The compact size allows for efficient collection even at scale in large systems. Metrics also lend themselves very well to pre-aggregation within the component before collection, reducing computation cost for processing & storing large numbers of metric time series in a central system. Due to how efficiently metrics are processed & stored, it lends itself very well for use in automated alerting, as metrics are an excellent source for the health data for all components in the system. Logs Log data inform observers about the discrete events that occurred within a component or a set of components. Just about every software component log information about its activities over time. This rich data tends to be much larger than metric data and can cause processing issues, especially if components are logging too verbosely. Therefore, using log data to understand the health of an extensive system tends to be avoided and depends on metrics for that data. Once metric telemetry highlights potential problem sources, filtered log data for those sources can be used to understand what occurred. Traces Where logging provides an overview to a discrete, event-triggered log, tracing encompasses a much wider, continuous view of an application. The goal of tracing is to following a program\u2019s flow and data progression. In many instances, tracing represents a single user\u2019s journey through an entire app stack. Its purpose isn\u2019t reactive, but instead focused on optimization. By tracing through a stack, developers can identify bottlenecks and focus on improving performance. 
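In practice a trace is built by wrapping interesting operations in spans. A minimal sketch with the OpenTelemetry Python SDK (assuming the opentelemetry-api and opentelemetry-sdk packages are installed; the service, span, and attribute names are illustrative), where each with-block produces one span and the two spans together form a single trace:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer that prints finished spans to stdout for demonstration.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")
    with tracer.start_as_current_span("charge-card"):
        # The nested span becomes a child of "process-order" within the same trace.
        pass
```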
A distributed trace is defined as a collection of spans. A span is the smallest unit in a trace and represents a piece of the workflow in a distributed landscape. It can be an HTTP request, call to a database, or execution of a message from a queue. When a problem does occur, tracing allows you to see how you got there: Which function. The function\u2019s duration. Parameters passed. How deep into the function the user could get. Usage Guidance When to use metric or log data to track a particular piece of telemetry can be summarized with the following points: Use metrics to track the occurrence of an event, counting of items, the time taken to perform an action or to report the current value of a resource (CPU, memory, etc.) Use logs to track detailed information about an event also monitored by a metric, particularly errors, warnings or other exceptional situations. A trace provides visibility into how a request is processed across multiple services in a microservices environment. Every trace needs to have a unique identifier associated with it.","title":"Logs vs Metrics vs Traces"},{"location":"observability/log-vs-metric-vs-trace/#logs-vs-metrics-vs-traces","text":"","title":"Logs vs Metrics vs Traces"},{"location":"observability/log-vs-metric-vs-trace/#overview","text":"","title":"Overview"},{"location":"observability/log-vs-metric-vs-trace/#metrics","text":"The purpose of metrics is to inform observers about the health & operations regarding a component or system. A metric represents a point in time measure of a particular source, and data-wise tends to be very small. The compact size allows for efficient collection even at scale in large systems. Metrics also lend themselves very well to pre-aggregation within the component before collection, reducing computation cost for processing & storing large numbers of metric time series in a central system. Due to how efficiently metrics are processed & stored, it lends itself very well for use in automated alerting, as metrics are an excellent source for the health data for all components in the system.","title":"Metrics"},{"location":"observability/log-vs-metric-vs-trace/#logs","text":"Log data inform observers about the discrete events that occurred within a component or a set of components. Just about every software component log information about its activities over time. This rich data tends to be much larger than metric data and can cause processing issues, especially if components are logging too verbosely. Therefore, using log data to understand the health of an extensive system tends to be avoided and depends on metrics for that data. Once metric telemetry highlights potential problem sources, filtered log data for those sources can be used to understand what occurred.","title":"Logs"},{"location":"observability/log-vs-metric-vs-trace/#traces","text":"Where logging provides an overview to a discrete, event-triggered log, tracing encompasses a much wider, continuous view of an application. The goal of tracing is to following a program\u2019s flow and data progression. In many instances, tracing represents a single user\u2019s journey through an entire app stack. Its purpose isn\u2019t reactive, but instead focused on optimization. By tracing through a stack, developers can identify bottlenecks and focus on improving performance. A distributed trace is defined as a collection of spans. A span is the smallest unit in a trace and represents a piece of the workflow in a distributed landscape. 
It can be an HTTP request, call to a database, or execution of a message from a queue. When a problem does occur, tracing allows you to see how you got there: Which function. The function\u2019s duration. Parameters passed. How deep into the function the user could get.","title":"Traces"},{"location":"observability/log-vs-metric-vs-trace/#usage-guidance","text":"When to use metric or log data to track a particular piece of telemetry can be summarized with the following points: Use metrics to track the occurrence of an event, counting of items, the time taken to perform an action or to report the current value of a resource (CPU, memory, etc.) Use logs to track detailed information about an event also monitored by a metric, particularly errors, warnings or other exceptional situations. A trace provides visibility into how a request is processed across multiple services in a microservices environment. Every trace needs to have a unique identifier associated with it.","title":"Usage Guidance"},{"location":"observability/logs-privacy/","text":"Guidance for Privacy Overview To ensure the privacy of your system users, as well as comply with several regulations like GDPR, some types of data shouldn\u2019t exist in logs. This includes customer's sensitive, Personal Identifiable Information (PII), and any other data that wasn't legally sanctioned. Recommended Practices Separate components and minimize the parts of the system that log sensitive data. Keep sensitive data out of URLs, since request URLs are typically logged by proxies and web servers. Avoid using PII data for system debugging as much as possible. For example, use ids instead of usernames. Use Structured Logging and include a deny-list for sensitive properties. Put an extra effort on spotting logging statements with sensitive data during code review, as it is common for reviewers to skip reading logging statements. This can be added as an additional checkbox if you're using Pull Request Templates. Include mechanisms to detect sensitive data in logs, on your organizational pipelines for QA or Automated Testing. Tools and Implementation Methods Use these tools and methods for sensitive data de-identification in logs. Application Insights Application Insights offers telemetry interception in some of the SDKs, that can be done by implementing the ITelemetryProcessor interface. ITelemetryProcessor processes the telemetry information before it is sent to Application Insights, and can be useful in many situations, such as filtering and modifications. Below is an example of intercepting 'trace' typed telemetry: using Microsoft.ApplicationInsights.DataContracts ; namespace Example { using Microsoft.ApplicationInsights.Channel ; using Microsoft.ApplicationInsights.Extensibility ; internal class RedactTelemetryInitializer : ITelemetryInitializer { public void Initialize ( ITelemetry telemetry ) { var requestTelemetry = telemetry as TraceTelemetry ; if ( requestTelemetry == null ) return ; # redact emails from the message parameter requestTelemetry . Message = Regex . Replace ( requestTelemetry . Message , @\"[^@\\s]+@[^@\\s]+\\.[^@\\s]+\" , \"[email removed]\" ); } } } Elastic Stack Elastic Stack (formerly \"ELK stack\") allows logs interception by Logstash's filter-plugins . Using some of the existing plugins, like 'mutate', 'alter' and 'prune' might be sufficient for most cases of deidentifying and redacting PIIs. For a more robust and customized use-case, a 'ruby' plugin can be used, executing arbitrary Ruby code. 
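The same kind of redaction can also be applied inside the application before log records ever leave the process, which keeps sensitive values out of every downstream sink. A minimal Python sketch using a logging filter with an email-masking regex (the pattern mirrors the one in the example above; the logger name is illustrative, and a real deny-list would cover more kinds of fields):

```python
import logging
import re

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # same idea as the regex above


class RedactEmailsFilter(logging.Filter):
    """Masks email addresses in log messages before they reach any handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_PATTERN.sub("[email removed]", str(record.msg))
        return True


logger = logging.getLogger("orders")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactEmailsFilter())
logger.setLevel(logging.INFO)

logger.info("confirmation sent to jane.doe@example.com")  # emitted as "[email removed]"
```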
Filter plugins also exists in some Logstash alternatives, like Fluentd and Fluent Bit . Presidio Presidio offers data protection and anonymization API. It provides fast identification and anonymization modules for private entities in text. Presidio allows using predefined or custom PII recognizers, leveraging Named Entity Recognition, regular expressions, rule based logic and checksum with relevant context in multiple languages. It can be used alongside the log interception methods mentioned above to help and ensure sensitive data is properly managed and governed. Presidio is containerized for REST HTTP API and also can be installed as a python package, to be called from python code. Instead of handling the anonymization in the application code, both APIs can be used using external calls. Elastic Stack, for example, can handle PII redaction using the 'ruby' filter plugin to call Presidio in REST HTTP API, or by calling a python script consuming Presidio as a package: logstash.conf input { ... } filter { ruby { code => 'require \"open3\" message = event.get(\"message\") # Call a python script triggering Presidio analyzer and anonymizer, and printing the result. cmd = \"python /path/to/presidio/anonymization/script.py \\\" #{ message } \\\" \" # Fetch the script' s stdout stdin , stdout , stderr = Open3 . popen3 ( cmd ) # Override message with the anonymized text. event . set ( \"message\" , stdout . read ) filter_matched ( event ) ' } } output { ... }","title":"Guidance for Privacy"},{"location":"observability/logs-privacy/#guidance-for-privacy","text":"","title":"Guidance for Privacy"},{"location":"observability/logs-privacy/#overview","text":"To ensure the privacy of your system users, as well as comply with several regulations like GDPR, some types of data shouldn\u2019t exist in logs. This includes customer's sensitive, Personal Identifiable Information (PII), and any other data that wasn't legally sanctioned.","title":"Overview"},{"location":"observability/logs-privacy/#recommended-practices","text":"Separate components and minimize the parts of the system that log sensitive data. Keep sensitive data out of URLs, since request URLs are typically logged by proxies and web servers. Avoid using PII data for system debugging as much as possible. For example, use ids instead of usernames. Use Structured Logging and include a deny-list for sensitive properties. Put an extra effort on spotting logging statements with sensitive data during code review, as it is common for reviewers to skip reading logging statements. This can be added as an additional checkbox if you're using Pull Request Templates. Include mechanisms to detect sensitive data in logs, on your organizational pipelines for QA or Automated Testing.","title":"Recommended Practices"},{"location":"observability/logs-privacy/#tools-and-implementation-methods","text":"Use these tools and methods for sensitive data de-identification in logs.","title":"Tools and Implementation Methods"},{"location":"observability/logs-privacy/#application-insights","text":"Application Insights offers telemetry interception in some of the SDKs, that can be done by implementing the ITelemetryProcessor interface. ITelemetryProcessor processes the telemetry information before it is sent to Application Insights, and can be useful in many situations, such as filtering and modifications. 
Below is an example of intercepting 'trace' typed telemetry: using Microsoft.ApplicationInsights.DataContracts ; namespace Example { using Microsoft.ApplicationInsights.Channel ; using Microsoft.ApplicationInsights.Extensibility ; internal class RedactTelemetryInitializer : ITelemetryInitializer { public void Initialize ( ITelemetry telemetry ) { var requestTelemetry = telemetry as TraceTelemetry ; if ( requestTelemetry == null ) return ; # redact emails from the message parameter requestTelemetry . Message = Regex . Replace ( requestTelemetry . Message , @\"[^@\\s]+@[^@\\s]+\\.[^@\\s]+\" , \"[email removed]\" ); } } }","title":"Application Insights"},{"location":"observability/logs-privacy/#elastic-stack","text":"Elastic Stack (formerly \"ELK stack\") allows logs interception by Logstash's filter-plugins . Using some of the existing plugins, like 'mutate', 'alter' and 'prune' might be sufficient for most cases of deidentifying and redacting PIIs. For a more robust and customized use-case, a 'ruby' plugin can be used, executing arbitrary Ruby code. Filter plugins also exists in some Logstash alternatives, like Fluentd and Fluent Bit .","title":"Elastic Stack"},{"location":"observability/logs-privacy/#presidio","text":"Presidio offers data protection and anonymization API. It provides fast identification and anonymization modules for private entities in text. Presidio allows using predefined or custom PII recognizers, leveraging Named Entity Recognition, regular expressions, rule based logic and checksum with relevant context in multiple languages. It can be used alongside the log interception methods mentioned above to help and ensure sensitive data is properly managed and governed. Presidio is containerized for REST HTTP API and also can be installed as a python package, to be called from python code. Instead of handling the anonymization in the application code, both APIs can be used using external calls. Elastic Stack, for example, can handle PII redaction using the 'ruby' filter plugin to call Presidio in REST HTTP API, or by calling a python script consuming Presidio as a package: logstash.conf input { ... } filter { ruby { code => 'require \"open3\" message = event.get(\"message\") # Call a python script triggering Presidio analyzer and anonymizer, and printing the result. cmd = \"python /path/to/presidio/anonymization/script.py \\\" #{ message } \\\" \" # Fetch the script' s stdout stdin , stdout , stderr = Open3 . popen3 ( cmd ) # Override message with the anonymized text. event . set ( \"message\" , stdout . read ) filter_matched ( event ) ' } } output { ... }","title":"Presidio"},{"location":"observability/microservices/","text":"Observability in Microservices Microservices is a very popular software architecture, where the application is arranged as a collection of loosely coupled services. Some of those services can be written in different languages by different teams. Motivations We need to consider special cases when creating a microservice architecture from the perspective of observability. We want to capture the interactions when making requests between those microservices and correlate them. Imagine we have a microservice that accesses a database to retrieve some data as part of a request. This microservice is going to be called by someone else as part of an incoming http request or an internal process being executed. What happens if a problem occurs during the retrieval of the data (or the update of the data)? 
How can we associate, or correlate, that this particular call failed in the destination microservice? This is a common issue. When calling other microservices, depending on the technology stack we use, we can accidentally hide errors and exceptions that might happen on the other side. If we are using a simple REST interface, the other microservice can return a 500 HTTP status code and we don't have any idea what happen inside that microservice. More important, we don't have any way to associate our Correlation Id to whatever happens inside that microservice. Therefore, is so important to have a plan in place to be able to extend your traceability and monitoring efforts, especially when using a microservice architecture. How to Extend Your Tracing Information Between Microservices The W3C consortium is working on a Trace Context definition that can be applied when using HTTP as the protocol in a microservice architecture. But let's explain how we can implement this functionality in our software. The main idea behind this is to propagate the correlation information between HTTP request so other pieces of software can read this information and correctly correlate telemetry across microservices. The way to propagate this information is to use HTTP Headers for the Correlation Id, parent Correlation Id, etc. When you are in the scope of a HTTP Request, your tracing system should already have created four properties that you can use to send across your microservices. RequestId:0HLQV2BC3VP2T:00000001, SpanId:da13aa3c6fd9c146, TraceId:f11a03e3f078414fa7c0a0ce568c8b5c, ParentId:5076c17d0a604244 This is an example of the four properties you can find which identify the current request. RequestId is the unique id that represent the current HTTP Request. SpanId is the default automatically generated span. You can have more than one Span that scope different functionality inside your software. TraceId represent the id for current log trace. ParentId is the parent span id, that in some case can be the same or something different. Example Now we are going to explore an example with 3 microservices that calls to each other in a row. This image is the summary of what is needed in each microservice to propagate the trace-id from A to C. The root caller is A and that is why it doesn't have a parent-id, only have a new trace-id. Next, A calls B using HTTP. To propagate the correlation information as part of the request, we are using two new headers based on the W3C Correlation specification, trace-id and parent-id. In this example because A is the root caller, A only sends its own trace-id to microservice B. When microservice B receives the incoming HTTP request, it checks the contents of these two headers. It reads the content of the trace-id header and sets its own parent-id to this trace-id (as shown in the green rectangle inside's B). In addition, it creates a new trace-id to signal that is a new scope for the telemetry. During the execution of microservice B, it also calls microservice C and repeats the pattern. As part of the request it includes the two headers and propagates trace-id and parent-id as well. Finally, microservice C, reads the value for the incoming trace-id and sets as his own parent-id, but also creates a new trace-id that will use to send telemetry about his own operations. Summary A number of Application Monitoring (APM) technology products already supports most of this Correlation Propagation. The most popular is OpenZipkin/B3-Propagation . 
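Where an APM SDK is not available, the pattern in the example above can be applied by hand. A minimal Python sketch of microservice B's role (the trace-id and parent-id header names follow the simplified convention described above; the downstream URL is a placeholder):

```python
import uuid

import requests  # assumed to be available for the call to microservice C


def handle_incoming_request(incoming_headers: dict) -> None:
    # B sets its parent-id to the trace-id it received from A.
    parent_id = incoming_headers.get("trace-id")

    # B creates a new trace-id for the telemetry emitted in its own scope.
    trace_id = uuid.uuid4().hex

    # ... B's own work happens here, logged together with trace_id and parent_id ...

    # When B calls C it propagates both values, and C repeats the pattern.
    requests.get(
        "https://microservice-c.example.com/api/items",
        headers={"trace-id": trace_id, "parent-id": parent_id or ""},
    )
```

Hand-rolled headers like these are fragile across teams and technology stacks, which is why standardizing on the W3C format and an SDK is preferable.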
W3C already proposed a recommendation for the W3C Trace Context , where you can see what SDK and frameworks already support this functionality. It's important to correctly implement the propagation specially when there are different teams that used different technology stacks in the same project. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Observability in Microservices"},{"location":"observability/microservices/#observability-in-microservices","text":"Microservices is a very popular software architecture, where the application is arranged as a collection of loosely coupled services. Some of those services can be written in different languages by different teams.","title":"Observability in Microservices"},{"location":"observability/microservices/#motivations","text":"We need to consider special cases when creating a microservice architecture from the perspective of observability. We want to capture the interactions when making requests between those microservices and correlate them. Imagine we have a microservice that accesses a database to retrieve some data as part of a request. This microservice is going to be called by someone else as part of an incoming http request or an internal process being executed. What happens if a problem occurs during the retrieval of the data (or the update of the data)? How can we associate, or correlate, that this particular call failed in the destination microservice? This is a common issue. When calling other microservices, depending on the technology stack we use, we can accidentally hide errors and exceptions that might happen on the other side. If we are using a simple REST interface, the other microservice can return a 500 HTTP status code and we don't have any idea what happen inside that microservice. More important, we don't have any way to associate our Correlation Id to whatever happens inside that microservice. Therefore, is so important to have a plan in place to be able to extend your traceability and monitoring efforts, especially when using a microservice architecture.","title":"Motivations"},{"location":"observability/microservices/#how-to-extend-your-tracing-information-between-microservices","text":"The W3C consortium is working on a Trace Context definition that can be applied when using HTTP as the protocol in a microservice architecture. But let's explain how we can implement this functionality in our software. The main idea behind this is to propagate the correlation information between HTTP request so other pieces of software can read this information and correctly correlate telemetry across microservices. The way to propagate this information is to use HTTP Headers for the Correlation Id, parent Correlation Id, etc. When you are in the scope of a HTTP Request, your tracing system should already have created four properties that you can use to send across your microservices. RequestId:0HLQV2BC3VP2T:00000001, SpanId:da13aa3c6fd9c146, TraceId:f11a03e3f078414fa7c0a0ce568c8b5c, ParentId:5076c17d0a604244 This is an example of the four properties you can find which identify the current request. RequestId is the unique id that represent the current HTTP Request. SpanId is the default automatically generated span. 
You can have more than one Span that scope different functionality inside your software. TraceId represent the id for current log trace. ParentId is the parent span id, that in some case can be the same or something different.","title":"How to Extend Your Tracing Information Between Microservices"},{"location":"observability/microservices/#example","text":"Now we are going to explore an example with 3 microservices that calls to each other in a row. This image is the summary of what is needed in each microservice to propagate the trace-id from A to C. The root caller is A and that is why it doesn't have a parent-id, only have a new trace-id. Next, A calls B using HTTP. To propagate the correlation information as part of the request, we are using two new headers based on the W3C Correlation specification, trace-id and parent-id. In this example because A is the root caller, A only sends its own trace-id to microservice B. When microservice B receives the incoming HTTP request, it checks the contents of these two headers. It reads the content of the trace-id header and sets its own parent-id to this trace-id (as shown in the green rectangle inside's B). In addition, it creates a new trace-id to signal that is a new scope for the telemetry. During the execution of microservice B, it also calls microservice C and repeats the pattern. As part of the request it includes the two headers and propagates trace-id and parent-id as well. Finally, microservice C, reads the value for the incoming trace-id and sets as his own parent-id, but also creates a new trace-id that will use to send telemetry about his own operations.","title":"Example"},{"location":"observability/microservices/#summary","text":"A number of Application Monitoring (APM) technology products already supports most of this Correlation Propagation. The most popular is OpenZipkin/B3-Propagation . W3C already proposed a recommendation for the W3C Trace Context , where you can see what SDK and frameworks already support this functionality. It's important to correctly implement the propagation specially when there are different teams that used different technology stacks in the same project. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Summary"},{"location":"observability/ml-observability/","text":"Observability in Machine Learning Development process of software system with machine learning component is more complex than traditional software. We need to monitor changes and variations in three dimensions: the code, the model and the data. We can distinguish two stages of such system lifespan: experimentation and production that require different approaches to observability as discussed below: Model Experimentation and Tuning Experimentation is a process of finding suitable machine learning model and its parameters via training and evaluating such models with one or more datasets. When developing and tuning machine learning models, the data scientists are interested in observing and comparing selected performance metrics for various model parameters. They also need a reliable way to reproduce a training process, such that a given dataset and given parameters produces the same models. 
There are many model metric evaluation solutions available, both open source (like MLFlow) and proprietary (like Azure Machine Learning Service), and of which some serve different purposes. To capture model metrics, there are a.o. the following options available: Azure Machine Learning Service SDK Azure Machine Learning service provides an SDK for Python, R and C# to capture your evaluation metrics to an Azure Machine Learning service (AML) Experiment. Experiments are viewed in the AML dashboard. Reproducibility is achieved by storing code or notebook snapshot together with viewed metric. You can create versioned Datasets within Azure Machine Learning service. MLFlow (for Databricks) MLFlow is open source framework, and can be hosted on Azure Databricks as its remote tracking server (it currently is the only solution that offers first-party integration with Databricks). You can use the MLFlow SDK tracking component to capture your evaluation metrics or any parameter you would like and track it at experimentation board in Azure Databricks. Source code and dataset version are also saved with log snapshot to provide reproducibility. TensorBoard TensorBoard is a popular tool amongst data scientist to visualize specific metrics of Deep Learning runs, especially of TensorFlow runs. TensorBoard is not an MLOps tool like AML/MLFlow, and therefore does not offer extensive logging capabilities. It is meant to be transient; and can therefore be used as an addition to an end-to-end MLOps tool like AML, but not as a complete MLOps tool. Application Insights Application Insights can be used as an alternative sink to capture model metrics, and can therefore offer more extensive options as metrics can be transferred to e.g. a PowerBI dashboard. It also enables log querying. However, this solution means that a custom application needs to be written to send logs to AppInsights (using for example the OpenCensus Python SDK), which would mean extra effort of creating/maintaining custom code. An extensive comparison of the four tools can be found as follows: Azure ML MLFlow TensorBoard Application Insights Metrics support Values, images, matrices, logs Values, images, matrices and plots as files Metrics relevant to DL research phase Values, images, matrices, logs Customizabile Basic Basic Very basic High Metrics accessible AML portal, AML SDK MLFlow UI, Tracking service API Tensorboard UI, history object Application Insights Logs accessible Rolling logs written to .txt files in blob storage, accessible via blob or AML portal. Not query-able Rolling logs are not stored Rolling logs are not stored Application Insights in Azure Portal. Query-able with KQL Ease of use and set up Very straightforward, only one portal More moving parts due to remote tracking server A bit over process overhead. Also depending on ML framework More moving parts as a custom app needs to be maintained Shareability Across people with access to AML workspace Across people with access to remote tracking server Across people with access to same directory Across people with access to AppInsights Model in Production The trained model can be deployed to production as container. Azure Machine Learning service provides SDK to deploy model as Azure Container Instance and publishes REST endpoint. You can monitor it using microservice observability methods( for more details -refer to Recipes section). MLFLow is an alternative way to deploy ML model as a service. 
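Whichever tracking tool is chosen, capturing evaluation metrics during experimentation looks broadly similar. A minimal sketch with the MLFlow tracking SDK mentioned above (the run name, parameters, metric names, and values are placeholders; on Azure Databricks the tracking URI would point at the workspace's tracking server):

```python
import mlflow

# Group the parameters and metrics of one training attempt into a single run.
with mlflow.start_run(run_name="baseline-model"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # ... train and evaluate the model here ...

    mlflow.log_metric("accuracy", 0.87)
    mlflow.log_metric("f1_score", 0.83)
```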
Training and Re-Training To automatically retrain the model you can use AML Pipelines or Azure Databricks. When re-training with AML Pipelines you can monitor information of each run, including the output, logs, and various metrics in the Azure portal experiment dashboard, or manually extract it using the AML SDK Model Performance Over Time: Data Drift We re-train machine learning models to improve their performance and make models better aligned with data changing over time. However, in some cases model performance may degrade. This may happen in case data change dramatically and do not exhibit the patterns we observed during model development anymore. This effect is called data drift. Azure Machine Learning Service has preview feature to observe and report data drift. This article describes it in detail. Data Versioning It is recommended practice to add version to all datasets. You can create a versioned Azure ML Dataset for this purpose, or manually version it if using other systems.","title":"Observability in Machine Learning"},{"location":"observability/ml-observability/#observability-in-machine-learning","text":"Development process of software system with machine learning component is more complex than traditional software. We need to monitor changes and variations in three dimensions: the code, the model and the data. We can distinguish two stages of such system lifespan: experimentation and production that require different approaches to observability as discussed below:","title":"Observability in Machine Learning"},{"location":"observability/ml-observability/#model-experimentation-and-tuning","text":"Experimentation is a process of finding suitable machine learning model and its parameters via training and evaluating such models with one or more datasets. When developing and tuning machine learning models, the data scientists are interested in observing and comparing selected performance metrics for various model parameters. They also need a reliable way to reproduce a training process, such that a given dataset and given parameters produces the same models. There are many model metric evaluation solutions available, both open source (like MLFlow) and proprietary (like Azure Machine Learning Service), and of which some serve different purposes. To capture model metrics, there are a.o. the following options available: Azure Machine Learning Service SDK Azure Machine Learning service provides an SDK for Python, R and C# to capture your evaluation metrics to an Azure Machine Learning service (AML) Experiment. Experiments are viewed in the AML dashboard. Reproducibility is achieved by storing code or notebook snapshot together with viewed metric. You can create versioned Datasets within Azure Machine Learning service. MLFlow (for Databricks) MLFlow is open source framework, and can be hosted on Azure Databricks as its remote tracking server (it currently is the only solution that offers first-party integration with Databricks). You can use the MLFlow SDK tracking component to capture your evaluation metrics or any parameter you would like and track it at experimentation board in Azure Databricks. Source code and dataset version are also saved with log snapshot to provide reproducibility. TensorBoard TensorBoard is a popular tool amongst data scientist to visualize specific metrics of Deep Learning runs, especially of TensorFlow runs. TensorBoard is not an MLOps tool like AML/MLFlow, and therefore does not offer extensive logging capabilities. 
It is meant to be transient; and can therefore be used as an addition to an end-to-end MLOps tool like AML, but not as a complete MLOps tool. Application Insights Application Insights can be used as an alternative sink to capture model metrics, and can therefore offer more extensive options as metrics can be transferred to e.g. a PowerBI dashboard. It also enables log querying. However, this solution means that a custom application needs to be written to send logs to AppInsights (using for example the OpenCensus Python SDK), which would mean extra effort of creating/maintaining custom code. An extensive comparison of the four tools can be found as follows: Azure ML MLFlow TensorBoard Application Insights Metrics support Values, images, matrices, logs Values, images, matrices and plots as files Metrics relevant to DL research phase Values, images, matrices, logs Customizabile Basic Basic Very basic High Metrics accessible AML portal, AML SDK MLFlow UI, Tracking service API Tensorboard UI, history object Application Insights Logs accessible Rolling logs written to .txt files in blob storage, accessible via blob or AML portal. Not query-able Rolling logs are not stored Rolling logs are not stored Application Insights in Azure Portal. Query-able with KQL Ease of use and set up Very straightforward, only one portal More moving parts due to remote tracking server A bit over process overhead. Also depending on ML framework More moving parts as a custom app needs to be maintained Shareability Across people with access to AML workspace Across people with access to remote tracking server Across people with access to same directory Across people with access to AppInsights","title":"Model Experimentation and Tuning"},{"location":"observability/ml-observability/#model-in-production","text":"The trained model can be deployed to production as container. Azure Machine Learning service provides SDK to deploy model as Azure Container Instance and publishes REST endpoint. You can monitor it using microservice observability methods( for more details -refer to Recipes section). MLFLow is an alternative way to deploy ML model as a service.","title":"Model in Production"},{"location":"observability/ml-observability/#training-and-re-training","text":"To automatically retrain the model you can use AML Pipelines or Azure Databricks. When re-training with AML Pipelines you can monitor information of each run, including the output, logs, and various metrics in the Azure portal experiment dashboard, or manually extract it using the AML SDK","title":"Training and Re-Training"},{"location":"observability/ml-observability/#model-performance-over-time-data-drift","text":"We re-train machine learning models to improve their performance and make models better aligned with data changing over time. However, in some cases model performance may degrade. This may happen in case data change dramatically and do not exhibit the patterns we observed during model development anymore. This effect is called data drift. Azure Machine Learning Service has preview feature to observe and report data drift. This article describes it in detail.","title":"Model Performance Over Time: Data Drift"},{"location":"observability/ml-observability/#data-versioning","text":"It is recommended practice to add version to all datasets. 
You can create a versioned Azure ML Dataset for this purpose, or manually version it if using other systems.","title":"Data Versioning"},{"location":"observability/observability-as-code/","text":"Observability as Code As much as possible, configuration and management of observability assets such as cloud resource provisioning, monitoring alerts and dashboards must be managed as code. Observability as Code is achieved using any one of Terraform / Ansible / ARM Templates Examples of Observability as Code Dashboards as Code - Monitoring Dashboards can be created as JSON or XML templates. This template is source control maintained and any changes to the dashboards can be reviewed. Automation can be built for enabling the dashboard. More about how to do this in Azure . Grafana dashboard can also be configured as code which eventually can be source-controlled to be used in automation and pipelines. Alerts as Code - Alerts can be created within Azure by using Terraform or ARM templates. Such alerts can be source-controlled and be deployed as part of pipelines (Azure DevOps pipelines, Jenkins, GitHub Actions etc.). Few references of how to do this are: Terraform Monitor Metric Alert . Alerts can also be created based on log analytics query and can be defined as code using Terraform Monitor Scheduled Query Rules Alert . Automating Log Analytics Queries - There are several use cases where automation of log analytics queries may be needed. Example, Automatic Report Generation, Running custom queries programmatically for analysis, debugging etc. For these use cases to work, log queries should be source-controlled and automation can be built using log analytics REST or azure cli . Why It makes configuration repeatable and automatable. It also avoids manual configuration of monitoring alerts and dashboards from scratch across environments. Configured dashboards help troubleshoot errors during integration and deployment (CI/CD) We can audit changes and roll them back if there are any issues. Identify actionable insights from the generated metrics data across all environments, not just production. Configuration and management of observability assets like alert threshold, duration, configuration values using IAC help us in avoiding configuration mistakes, errors or overlooks during deployment. When practicing observability as code, the changes can be reviewed by the team similar to other code contributions.","title":"Observability as Code"},{"location":"observability/observability-as-code/#observability-as-code","text":"As much as possible, configuration and management of observability assets such as cloud resource provisioning, monitoring alerts and dashboards must be managed as code. Observability as Code is achieved using any one of Terraform / Ansible / ARM Templates","title":"Observability as Code"},{"location":"observability/observability-as-code/#examples-of-observability-as-code","text":"Dashboards as Code - Monitoring Dashboards can be created as JSON or XML templates. This template is source control maintained and any changes to the dashboards can be reviewed. Automation can be built for enabling the dashboard. More about how to do this in Azure . Grafana dashboard can also be configured as code which eventually can be source-controlled to be used in automation and pipelines. Alerts as Code - Alerts can be created within Azure by using Terraform or ARM templates. Such alerts can be source-controlled and be deployed as part of pipelines (Azure DevOps pipelines, Jenkins, GitHub Actions etc.). 
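For the "Automating Log Analytics Queries" example above, here is a minimal sketch using the azure-monitor-query client library. It assumes azure-monitor-query and azure-identity are installed and the identity has read access to the workspace; the workspace ID and KQL file path are placeholders.

```python
# Running a source-controlled KQL query against a Log Analytics workspace.
# Assumes azure-monitor-query and azure-identity are installed; the workspace id
# and query file path are placeholders.
from datetime import timedelta
from pathlib import Path

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

workspace_id = "<log-analytics-workspace-id>"
kql = Path("queries/error_rate.kql").read_text()  # query lives in the repo, reviewed like code

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(workspace_id, kql, timespan=timedelta(hours=24))

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```

For the alert rules themselves, defining them as code (for example with Terraform) is usually the better fit; some references follow.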
Few references of how to do this are: Terraform Monitor Metric Alert . Alerts can also be created based on log analytics query and can be defined as code using Terraform Monitor Scheduled Query Rules Alert . Automating Log Analytics Queries - There are several use cases where automation of log analytics queries may be needed. Example, Automatic Report Generation, Running custom queries programmatically for analysis, debugging etc. For these use cases to work, log queries should be source-controlled and automation can be built using log analytics REST or azure cli .","title":"Examples of Observability as Code"},{"location":"observability/observability-as-code/#why","text":"It makes configuration repeatable and automatable. It also avoids manual configuration of monitoring alerts and dashboards from scratch across environments. Configured dashboards help troubleshoot errors during integration and deployment (CI/CD) We can audit changes and roll them back if there are any issues. Identify actionable insights from the generated metrics data across all environments, not just production. Configuration and management of observability assets like alert threshold, duration, configuration values using IAC help us in avoiding configuration mistakes, errors or overlooks during deployment. When practicing observability as code, the changes can be reviewed by the team similar to other code contributions.","title":"Why"},{"location":"observability/observability-databricks/","text":"Observability for Azure Databricks Overview Azure Databricks is an Apache Spark\u2013based analytics service that makes it easy to rapidly develop and deploy big data analytics. Monitoring and troubleshooting performance issues is critical when operating production Azure Databricks workloads. It is important to log adequate information from Azure Databricks so that it is helpful to monitor and troubleshoot performance issues. Spark is designed to run on a cluster - a cluster is a set of Virtual Machines (VMs). Spark can horizontally scale with bigger workloads needed more VMs. Azure Databricks can scale in and out as needed. Approaches to Observability Azure Diagnostic Logs Azure Diagnostic Logging is provided out-of-the-box by Azure Databricks, providing visibility into actions performed against DBFS, Clusters, Accounts, Jobs, Notebooks, SSH, Workspace, Secrets, SQL Permissions, and Instance Pools. These logs are enabled using Azure Portal or CLI and can be configured to be delivered to one of these Azure resources. Log Analytics Workspace Blob Storage Event Hub Cluster Event Logs Cluster Event logs provide a quick overview into important Cluster lifecycle events. The log are structured - Timestamp, Event Type and Details. Unfortunately, there is no native way to export logs to Log Analytics. Logs will have to be delivered to Log Analytics either using REST API or polled in the dbfs using Azure Functions. VM Performance Metrics (OMS) Log Analytics Agent provides insights into the performance counters from the Cluster VMs and helps to understand the Cluster Utilization patters. Leveraging Linux OMX Agent to onboard VMs into Log Analytics, helps provide insights into the VM metrics, performance, inventory and syslog metrics. It is important to note that Linux OMS Agent is not specific to Azure Databricks. Application Logging Of all the logs collected, this is perhaps the most important one. Spark Monitoring library collects metrics about the driver, executors, JVM, HDFS, cache shuffling, DAGs, and much more. 
This library provides helpful insights to fine-tune Spark jobs. It allows monitoring and tracing each layer within Spark workloads, including performance and resource usage on the host and JVM, as well as Spark metrics and application-level logging. The library also includes ready-made Grafana dashboards that are a great starting point for building an Azure Databricks dashboard. Logs via REST API Azure Databricks also provides REST API support. If any specific log data is required, it can be collected using REST API calls. NSG Flow Logs Network security group (NSG) flow logs are a feature of Azure Network Watcher that allows you to log information about IP traffic flowing through an NSG. Flow data is sent to Azure Storage accounts, from where you can access it as well as export it to any visualization tool, SIEM, or IDS of your choice. This log information is not specific to NSG Flow logs. This data can be used to identify unknown or undesired traffic and monitor traffic levels and/or bandwidth consumption. This is possible only with VNET-injected workspaces. Platform Logs Platform logs can be used to review provisioning/de-provisioning operations. They can be used to review activity in the Databricks managed resource group and help discover operations performed at the subscription level (like provisioning of a VM, disk, etc.). These logs can be enabled via Azure Monitor > Activity Logs and shipped to Log Analytics. Ganglia Metrics Ganglia metrics is a Cluster Utilization UI and is available on Azure Databricks. It is great for viewing live metrics of interactive clusters. Ganglia metrics is available by default and takes a snapshot of usage every 15 minutes. Historical metrics are stored as .png files, making it impossible to analyze the data.","title":"Observability for Azure Databricks"},{"location":"observability/observability-databricks/#observability-for-azure-databricks","text":"","title":"Observability for Azure Databricks"},{"location":"observability/observability-databricks/#overview","text":"Azure Databricks is an Apache Spark\u2013based analytics service that makes it easy to rapidly develop and deploy big data analytics. Monitoring and troubleshooting performance issues is critical when operating production Azure Databricks workloads. It is important to log adequate information from Azure Databricks so that it can be used to monitor and troubleshoot performance issues. Spark is designed to run on a cluster - a cluster is a set of Virtual Machines (VMs). Spark scales horizontally, with bigger workloads needing more VMs. Azure Databricks can scale in and out as needed.","title":"Overview"},{"location":"observability/observability-databricks/#approaches-to-observability","text":"","title":"Approaches to Observability"},{"location":"observability/observability-databricks/#azure-diagnostic-logs","text":"Azure Diagnostic Logging is provided out-of-the-box by Azure Databricks, providing visibility into actions performed against DBFS, Clusters, Accounts, Jobs, Notebooks, SSH, Workspace, Secrets, SQL Permissions, and Instance Pools. These logs are enabled using the Azure Portal or CLI and can be configured to be delivered to one of these Azure resources. Log Analytics Workspace Blob Storage Event Hub","title":"Azure Diagnostic Logs"},{"location":"observability/observability-databricks/#cluster-event-logs","text":"Cluster Event logs provide a quick overview into important Cluster lifecycle events. The logs are structured - Timestamp, Event Type and Details.
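Those cluster events can be pulled programmatically and forwarded to a sink of your choice. A hedged sketch using the Databricks REST API is shown below; the workspace host, token, cluster ID and the exact endpoint shape should be verified against the Databricks REST API reference.

```python
# Polling cluster events from the Databricks REST API so they can be forwarded to
# Log Analytics or another sink. Host, token and cluster id are placeholders; verify
# the endpoint and payload shape against the Databricks REST API reference.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # personal access token or AAD token

resp = requests.post(
    f"{host}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "<cluster-id>", "limit": 50},
    timeout=30,
)
resp.raise_for_status()

for event in resp.json().get("events", []):
    # Each event carries a timestamp, type and details, matching the structure noted above.
    print(event.get("timestamp"), event.get("type"), event.get("details", {}))
```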
Unfortunately, there is no native way to export logs to Log Analytics. Logs have to be delivered to Log Analytics either by using the REST API or by polling them in DBFS using Azure Functions.","title":"Cluster Event Logs"},{"location":"observability/observability-databricks/#vm-performance-metrics-oms","text":"The Log Analytics Agent provides insights into the performance counters from the Cluster VMs and helps to understand the Cluster Utilization patterns. Leveraging the Linux OMS Agent to onboard VMs into Log Analytics helps provide insights into the VM metrics, performance, inventory and syslog metrics. It is important to note that the Linux OMS Agent is not specific to Azure Databricks.","title":"VM Performance Metrics (OMS)"},{"location":"observability/observability-databricks/#application-logging","text":"Of all the logs collected, this is perhaps the most important one. The Spark Monitoring library collects metrics about the driver, executors, JVM, HDFS, cache shuffling, DAGs, and much more. This library provides helpful insights to fine-tune Spark jobs. It allows monitoring and tracing each layer within Spark workloads, including performance and resource usage on the host and JVM, as well as Spark metrics and application-level logging. The library also includes ready-made Grafana dashboards that are a great starting point for building an Azure Databricks dashboard.","title":"Application Logging"},{"location":"observability/observability-databricks/#logs-via-rest-api","text":"Azure Databricks also provides REST API support. If any specific log data is required, it can be collected using REST API calls.","title":"Logs via REST API"},{"location":"observability/observability-databricks/#nsg-flow-logs","text":"Network security group (NSG) flow logs are a feature of Azure Network Watcher that allows you to log information about IP traffic flowing through an NSG. Flow data is sent to Azure Storage accounts, from where you can access it as well as export it to any visualization tool, SIEM, or IDS of your choice. This log information is not specific to NSG Flow logs. This data can be used to identify unknown or undesired traffic and monitor traffic levels and/or bandwidth consumption. This is possible only with VNET-injected workspaces.","title":"NSG Flow Logs"},{"location":"observability/observability-databricks/#platform-logs","text":"Platform logs can be used to review provisioning/de-provisioning operations. They can be used to review activity in the Databricks managed resource group and help discover operations performed at the subscription level (like provisioning of a VM, disk, etc.). These logs can be enabled via Azure Monitor > Activity Logs and shipped to Log Analytics.","title":"Platform Logs"},{"location":"observability/observability-databricks/#ganglia-metrics","text":"Ganglia metrics is a Cluster Utilization UI and is available on Azure Databricks. It is great for viewing live metrics of interactive clusters. Ganglia metrics is available by default and takes a snapshot of usage every 15 minutes. Historical metrics are stored as .png files, making it impossible to analyze the data.","title":"Ganglia Metrics"},{"location":"observability/observability-pipelines/","text":"Observability of CI/CD Pipelines With the increasing complexity of delivery pipelines, it is very important to consider Observability in the context of building and releasing applications. Benefits Having proper instrumentation during build time helps gain insights into the various stages of the build and release process.
Helps developers understand where the pipeline performance bottlenecks are, based on the data collected. This helps in having data-driven conversations around identifying latency between jobs, performance issues, artifact upload/download times providing valuable insights into agents availability and capacity. Helps to identify trends in failures, thus allowing developers to quickly do root cause analysis. Helps to provide an organization-wide view of pipeline health to easily identify trends. Points to Consider It is important to identify the Key Performance Indicators (KPIs) for evaluating a successful CI/CD pipeline. Where needed, additional tracing can be added to better record KPI metrics. For example, adding pipeline build tags to identify a 'Release Candidate' vs. 'Non-Release Candidate' helps in evaluating the end-to-end release process timeline. Depending on the tooling used (Azure DevOps, Jenkins etc.,), basic reporting on the pipelines is available out-of-the-box. It is important to evaluate these reports against the KPIs to understand if a custom reporting solution for their pipelines is needed. If required, custom dashboards can be built using third-party tools like Grafana or Power BI Dashboards.","title":"Observability of CI/CD Pipelines"},{"location":"observability/observability-pipelines/#observability-of-cicd-pipelines","text":"With increasing complexity to delivery pipelines, it is very important to consider Observability in the context of build and release of applications.","title":"Observability of CI/CD Pipelines"},{"location":"observability/observability-pipelines/#benefits","text":"Having proper instrumentation during build time helps gain insights into the various stages of the build and release process. Helps developers understand where the pipeline performance bottlenecks are, based on the data collected. This helps in having data-driven conversations around identifying latency between jobs, performance issues, artifact upload/download times providing valuable insights into agents availability and capacity. Helps to identify trends in failures, thus allowing developers to quickly do root cause analysis. Helps to provide an organization-wide view of pipeline health to easily identify trends.","title":"Benefits"},{"location":"observability/observability-pipelines/#points-to-consider","text":"It is important to identify the Key Performance Indicators (KPIs) for evaluating a successful CI/CD pipeline. Where needed, additional tracing can be added to better record KPI metrics. For example, adding pipeline build tags to identify a 'Release Candidate' vs. 'Non-Release Candidate' helps in evaluating the end-to-end release process timeline. Depending on the tooling used (Azure DevOps, Jenkins etc.,), basic reporting on the pipelines is available out-of-the-box. It is important to evaluate these reports against the KPIs to understand if a custom reporting solution for their pipelines is needed. If required, custom dashboards can be built using third-party tools like Grafana or Power BI Dashboards.","title":"Points to Consider"},{"location":"observability/pitfalls/","text":"Things to Watch for when Building Observable Systems Observability as an Afterthought One of the design goals when building a system should be to enable monitoring of the system. This helps planning and thinking application availability, logging and metrics at the time of design and development. Observability also acts as a great debugging tool providing developers a bird's eye view of the system. 
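Relating to the pipeline KPIs and custom reporting discussed above: when out-of-the-box reports are not enough, run data can be pulled from the Azure DevOps REST API and aggregated into whatever KPIs the team has agreed on. The sketch below is hedged: the organization, project, pipeline ID, PAT environment variable and api-version are placeholders and should be checked against the Azure DevOps REST reference.

```python
# Computing simple pipeline KPIs (success rate, average duration) from the Azure DevOps
# "Runs" REST API. Organization, project, pipeline id and api-version are placeholders;
# verify them against the Azure DevOps REST API reference for your organization.
import os
from datetime import datetime

import requests

org, project, pipeline_id = "my-org", "my-project", 42  # illustrative
url = f"https://dev.azure.com/{org}/{project}/_apis/pipelines/{pipeline_id}/runs?api-version=7.1"

resp = requests.get(url, auth=("", os.environ["AZDO_PAT"]), timeout=30)
resp.raise_for_status()
runs = resp.json()["value"]

def to_dt(ts: str) -> datetime:
    # AzDO timestamps look like 2024-09-27T12:34:56.789Z; drop fractional seconds for simplicity.
    return datetime.strptime(ts.split(".")[0].rstrip("Z"), "%Y-%m-%dT%H:%M:%S")

finished = [r for r in runs if r.get("finishedDate")]
succeeded = [r for r in finished if r.get("result") == "succeeded"]
durations = [(to_dt(r["finishedDate"]) - to_dt(r["createdDate"])).total_seconds() / 60
             for r in finished]

print(f"success rate: {len(succeeded) / max(len(finished), 1):.0%}")
print(f"average duration: {sum(durations) / max(len(durations), 1):.1f} min")
```

Returning to the pitfalls of treating observability as an afterthought: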
By leaving instrumentation and the logging of metrics until the end, development teams lose valuable insights during development. Metric Fatigue It is recommended to collect and measure what you need and not what you can. Don't attempt to monitor everything. If the data is not actionable, it is useless and becomes noise. On the contrary, it is sometimes very difficult to forecast every possible scenario that could go wrong. There must be a balance between collecting what is needed vs. logging every single activity in the system. A general rule of thumb is to follow these principles: rules that catch incidents must be simple, relevant and reliable; any data that is collected but not aggregated or alerted on must be reviewed to check whether it is still required. Context All data logged must contain rich context, which is useful for getting an overall view of the system and makes it easy to trace back errors/failures during troubleshooting. While logging data, care must also be taken to avoid data silos. Personally Identifiable Information As a general rule, do not log any customer-sensitive or Personally Identifiable Information (PII). Ensure any pertinent privacy regulations regarding PII are followed (e.g. GDPR). Read more here on how to keep sensitive data out of logs.","title":"Things to Watch for when Building Observable Systems"},{"location":"observability/pitfalls/#things-to-watch-for-when-building-observable-systems","text":"","title":"Things to Watch for when Building Observable Systems"},{"location":"observability/pitfalls/#observability-as-an-afterthought","text":"One of the design goals when building a system should be to enable monitoring of the system. This helps in planning and thinking about application availability, logging and metrics at design and development time. Observability also acts as a great debugging tool, providing developers a bird's eye view of the system. By leaving instrumentation and the logging of metrics until the end, development teams lose valuable insights during development.","title":"Observability as an Afterthought"},{"location":"observability/pitfalls/#metric-fatigue","text":"It is recommended to collect and measure what you need and not what you can. Don't attempt to monitor everything. If the data is not actionable, it is useless and becomes noise. On the contrary, it is sometimes very difficult to forecast every possible scenario that could go wrong. There must be a balance between collecting what is needed vs. logging every single activity in the system. A general rule of thumb is to follow these principles: rules that catch incidents must be simple, relevant and reliable; any data that is collected but not aggregated or alerted on must be reviewed to check whether it is still required.","title":"Metric Fatigue"},{"location":"observability/pitfalls/#context","text":"All data logged must contain rich context, which is useful for getting an overall view of the system and makes it easy to trace back errors/failures during troubleshooting. While logging data, care must also be taken to avoid data silos.","title":"Context"},{"location":"observability/pitfalls/#personally-identifiable-information","text":"As a general rule, do not log any customer-sensitive or Personally Identifiable Information (PII). Ensure any pertinent privacy regulations regarding PII are followed (e.g. GDPR).
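To make the PII guidance concrete, below is a stdlib-only sketch of a logging filter that masks an obvious pattern (email addresses) before records are emitted. The regex and logger names are illustrative, and a single pattern is no substitute for agreeing the PII categories with your privacy/compliance owners.

```python
# A logging filter that scrubs email addresses from log messages before they are emitted.
# The pattern is illustrative only; real projects should define PII categories with their
# privacy/compliance owners.
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class PiiRedactionFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub("<redacted-email>", str(record.msg))
        return True  # keep the record, just scrubbed

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.addFilter(PiiRedactionFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Order 1234 placed by jane.doe@example.com")  # emitted with the address masked
```

Note that arguments passed via %-formatting (record.args) would also need scrubbing in a real implementation.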
Read more here on how to keep sensitive data out of logs.","title":"Personally Identifiable Information"},{"location":"observability/profiling/","text":"Profiling Overview Profiling is a form of runtime analysis that measures various components of the runtime such as, memory allocation, garbage collection, threads and locks, call stacks, or frequency and duration of specific functions. It can be used to see which functions are the most costly in your binary, allowing you to focus your effort on removing the largest inefficiencies as quickly as possible. It can help you find deadlocks, memory leaks, or inefficient memory allocation, and help inform decisions around resource allocation (ie: CPU or RAM). How to Profile your Applications Profiling is somewhat language dependent, so start off by searching for \"profile $language\" (some common tools are listed below). Additionally, Linux Perf is a good fallback, since a lot of languages have bindings in C/C++. Profiling does incur some cost, as it requires inspecting the call stack, and sometimes pausing the application all together (ie: to trigger a full GC in Java). It is recommended to continuously profile your services, say for 10s every 10 minutes. Consider the cost when deciding on tuning these parameters. Different tools visualize profiles differently. Common CPU profiles might use a directed graph or a flame graph. Unfortunately, each profiler tool typically uses its own format for storing profiles, and comes with its own visualization. Tools (Java, Go, Python, Ruby, eBPF) Pyroscope continuous profiling out of the box. (Java and Go) Flame - profiling containers in Kubernetes (Java, Python, Go) Datadog Continuous profiler (Go) profefe , which builds pprof to provide continuous profiling (Java) Eclipse Memory Analyzer","title":"Profiling"},{"location":"observability/profiling/#profiling","text":"","title":"Profiling"},{"location":"observability/profiling/#overview","text":"Profiling is a form of runtime analysis that measures various components of the runtime such as, memory allocation, garbage collection, threads and locks, call stacks, or frequency and duration of specific functions. It can be used to see which functions are the most costly in your binary, allowing you to focus your effort on removing the largest inefficiencies as quickly as possible. It can help you find deadlocks, memory leaks, or inefficient memory allocation, and help inform decisions around resource allocation (ie: CPU or RAM).","title":"Overview"},{"location":"observability/profiling/#how-to-profile-your-applications","text":"Profiling is somewhat language dependent, so start off by searching for \"profile $language\" (some common tools are listed below). Additionally, Linux Perf is a good fallback, since a lot of languages have bindings in C/C++. Profiling does incur some cost, as it requires inspecting the call stack, and sometimes pausing the application all together (ie: to trigger a full GC in Java). It is recommended to continuously profile your services, say for 10s every 10 minutes. Consider the cost when deciding on tuning these parameters. Different tools visualize profiles differently. Common CPU profiles might use a directed graph or a flame graph. Unfortunately, each profiler tool typically uses its own format for storing profiles, and comes with its own visualization.","title":"How to Profile your Applications"},{"location":"observability/profiling/#tools","text":"(Java, Go, Python, Ruby, eBPF) Pyroscope continuous profiling out of the box. 
(Java and Go) Flame - profiling containers in Kubernetes (Java, Python, Go) Datadog Continuous profiler (Go) profefe , which builds pprof to provide continuous profiling (Java) Eclipse Memory Analyzer","title":"Tools"},{"location":"observability/recipes-observability/","text":"Recipes Application Insights/ASP.NET GitHub Repo , Article . Application Insights/ASP.NET Core with Distributed Trace Context Propagation to Kafka GitHub Repo . Example: OpenTelemetry Over a Message Oriented Architecture in Java with Jaeger, Prometheus and Azure Monitor GitHub Repo Example: Setup Azure Monitor Dashboards and Alerts with Terraform GitHub Repo On-premises Application Insights On-premise Application Insights is a service that is compatible with Azure App Insights, but stores the data in an in-house database like PostgreSQL or object storage like Azurite . On-premises Application Insights is useful as a drop-in replacement for Azure Application Insights in scenarios where a solution must be cloud deployable but must also support on-premises disconnected deployment scenarios. On-premises Application Insights is also useful for testing telemetry integration. Issues related to telemetry can be hard to catch since often these integrations are excluded from unit-test or integration test flows due to it being non-trivial to use a live Azure Application Insights resource for testing, e.g. managing the lifetime of the resource, having to ignore old telemetry for assertions, if a new resource is used it can take a while for the telemetry to show up, etc. The On-premise Application Insights service can be used to make it easier to integrate with an Azure Application Insights compatible API endpoint during local development or continuous integration without having to spin up a resource in Azure. Additionally, the service simplifies integration testing of asynchronous workflows such as web workers since integration tests can now be written to assert against telemetry logged to the service, e.g. assert that no exceptions were logged, assert that some number of events of a specific type were logged within a certain time-frame, etc. Azure DevOps Pipelines Reporting with Power BI The Azure DevOps Pipelines Report contains a Power BI template for monitoring project, pipeline, and pipeline run data from an Azure DevOps (AzDO) organization. This dashboard recipe provides observability for AzDO pipelines by displaying various metrics (i.e. average runtime, run outcome statistics, etc.) in a table. Additionally, the second page of the template visualizes pipeline success and failure trends using Power BI charts. Documentation and setup information can be found in the project README. Python Logger Class for Application Insights using OpenCensus This repository contains \"AppLogger\" class which is a python logger class for Application Insights using Opencensus. It also contains sample code that shows the usage of \"AppLogger\". 
GitHub Repo Java OpenTelemetry Examples This GitHub Repo contains a set of fully-functional, working examples of using the OpenTelemetry Java APIs and SDK.","title":"Recipes"},{"location":"observability/recipes-observability/#recipes","text":"","title":"Recipes"},{"location":"observability/recipes-observability/#application-insightsaspnet","text":"GitHub Repo , Article .","title":"Application Insights/ASP.NET"},{"location":"observability/recipes-observability/#application-insightsaspnet-core-with-distributed-trace-context-propagation-to-kafka","text":"GitHub Repo .","title":"Application Insights/ASP.NET Core with Distributed Trace Context Propagation to Kafka"},{"location":"observability/recipes-observability/#example-opentelemetry-over-a-message-oriented-architecture-in-java-with-jaeger-prometheus-and-azure-monitor","text":"GitHub Repo","title":"Example: OpenTelemetry Over a Message Oriented Architecture in Java with Jaeger, Prometheus and Azure Monitor"},{"location":"observability/recipes-observability/#example-setup-azure-monitor-dashboards-and-alerts-with-terraform","text":"GitHub Repo","title":"Example: Setup Azure Monitor Dashboards and Alerts with Terraform"},{"location":"observability/recipes-observability/#on-premises-application-insights","text":"On-premise Application Insights is a service that is compatible with Azure App Insights, but stores the data in an in-house database like PostgreSQL or object storage like Azurite . On-premises Application Insights is useful as a drop-in replacement for Azure Application Insights in scenarios where a solution must be cloud deployable but must also support on-premises disconnected deployment scenarios. On-premises Application Insights is also useful for testing telemetry integration. Issues related to telemetry can be hard to catch since often these integrations are excluded from unit-test or integration test flows due to it being non-trivial to use a live Azure Application Insights resource for testing, e.g. managing the lifetime of the resource, having to ignore old telemetry for assertions, if a new resource is used it can take a while for the telemetry to show up, etc. The On-premise Application Insights service can be used to make it easier to integrate with an Azure Application Insights compatible API endpoint during local development or continuous integration without having to spin up a resource in Azure. Additionally, the service simplifies integration testing of asynchronous workflows such as web workers since integration tests can now be written to assert against telemetry logged to the service, e.g. assert that no exceptions were logged, assert that some number of events of a specific type were logged within a certain time-frame, etc.","title":"On-premises Application Insights"},{"location":"observability/recipes-observability/#azure-devops-pipelines-reporting-with-power-bi","text":"The Azure DevOps Pipelines Report contains a Power BI template for monitoring project, pipeline, and pipeline run data from an Azure DevOps (AzDO) organization. This dashboard recipe provides observability for AzDO pipelines by displaying various metrics (i.e. average runtime, run outcome statistics, etc.) in a table. Additionally, the second page of the template visualizes pipeline success and failure trends using Power BI charts. 
Documentation and setup information can be found in the project README.","title":"Azure DevOps Pipelines Reporting with Power BI"},{"location":"observability/recipes-observability/#python-logger-class-for-application-insights-using-opencensus","text":"This repository contains \"AppLogger\" class which is a python logger class for Application Insights using Opencensus. It also contains sample code that shows the usage of \"AppLogger\". GitHub Repo","title":"Python Logger Class for Application Insights using OpenCensus"},{"location":"observability/recipes-observability/#java-opentelemetry-examples","text":"This GitHub Repo contains a set of fully-functional, working examples of using the OpenTelemetry Java APIs and SDK.","title":"Java OpenTelemetry Examples"},{"location":"observability/pillars/dashboard/","text":"Dashboard Overview Dashboard is a form of data visualization that provides \"at a glance\" view of Key Performance Indicators(KPIs) of observable system. Dashboard connects multiple data sources allowing creation of visual representation of data insights which otherwise are difficult to understand. Dashboard can be used to: show trends identify patterns(user, usage, search etc) measure efficiency easily identify data outliers and correlations view health state or performance of the system give an outlook of the KPI that is important to a business/process Best Practices Common questions to ask yourself when building dashboard would be: Where did my user spend most of their time at? What is my user searching? How do I better help my team with alerts and troubleshooting? Is my system healthy for the past one day/week/month/quarter? Here are principles to consider when building dashboards: Separate a dashboard in multiple sections for simplicity. Adding page jump or anchor(#section) is also a plus if applicable. Add multiple and simple charts. Build simple chart, have more of them rather than a complicated all in one chart. Identify goals or KPI measurement. Identifying goals or KPI helps in defining what needs to be achieved. Here are some examples - server downtime, mean time to address error, service level agreement. Ask questions that can help reach the defined goal or KPI. This may sound counter-intuitive, the more questions asked while constructing dashboard the better the outcome will be. Questions like location, internet service provider, time of day the users make requests to server would be a good start. Validate the questions. This is often done with stakeholders, sponsors, leads or project managers. Observe the dashboard that is built. Is the data reflecting what the stakeholders set out to answer? Always remember this process takes time. Building dashboard is easy, building an observable dashboard to show pattern is hard. Recommended Tools Azure Monitor Workbooks - Supporting markdown, Azure Workbooks is tightly integrated with Azure services making this highly customizable without extra tool. Create dashboard using log query - Dashboard can be created using log query on Log Analytics data. Building dashboards using Application Insights - Dashboards can be created using Application Insights as well. Power Bi - Power Bi is one of the easier tools to create dashboards from data sources and reports. Grafana - Getting started with Grafana. Grafana is a popular open source tool for dashboarding and visualization. Azure Monitor as Grafana data source - This provides a step by step integration of Azure Monitor to Grafana. 
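Relating to the AppLogger recipe above, here is a minimal sketch of sending application logs to Application Insights with the OpenCensus Python SDK; it assumes the opencensus-ext-azure package and a valid connection string, and the logger name and custom dimensions are illustrative.

```python
# Sending application logs to Application Insights with the OpenCensus Python SDK.
# Assumes opencensus-ext-azure is installed; the connection string is a placeholder.
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>"
))

# custom_dimensions become queryable properties on the trace record in Application Insights.
logger.info(
    "payment processed",
    extra={"custom_dimensions": {"service_version": "1.4.2", "region": "westeurope"}},
)
```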
A brief comparison of the various tools is also available. Dashboard Samples and Recipes Azure Workbooks Performance analysis - A measurement of how the system performs. A workbook template is available in the gallery. Failure analysis - A report about system failures with details. A workbook template is available in the gallery. Application Performance Index (Apdex) - This is a way to measure user satisfaction. It classifies performance into three zones based on a baseline performance threshold T. The template for Apdex is available in the Azure Workbooks gallery as well. Application Insights User retention analysis User navigation patterns analysis User session analysis For other tools, these can be used as a reference to recreate if a template is not readily available. Grafana with Azure Monitor as Data Source Azure Kubernetes Service - Cluster & Namespace Metrics - Container Insights metrics for Kubernetes clusters. Cluster utilization, namespace utilization, Node cpu & memory, Node disk usage & disk io, node network & kubelet docker operation metrics Azure Kubernetes Service - Container Level & Pod Metrics - This contains Container level and Pod Metrics like CPU and Memory, which are missing in the above dashboard. Summary In order to build an observable dashboard, the goal is to make use of collected metrics, logs and traces to give insight into how the system performs, how users behave, and to identify patterns. There are a lot of tools and templates out there. Whatever the choice is, a good dashboard is always one that helps you answer questions about the system and its users, keeps track of KPIs and goals, and allows informed business decisions to be made.","title":"Dashboard"},{"location":"observability/pillars/dashboard/#dashboard","text":"","title":"Dashboard"},{"location":"observability/pillars/dashboard/#overview","text":"A dashboard is a form of data visualization that provides an \"at a glance\" view of the Key Performance Indicators (KPIs) of an observable system. A dashboard connects multiple data sources, allowing the creation of visual representations of data insights which otherwise are difficult to understand. A dashboard can be used to: show trends identify patterns (user, usage, search, etc.) measure efficiency easily identify data outliers and correlations view the health state or performance of the system give an outlook on the KPIs that are important to a business/process","title":"Overview"},{"location":"observability/pillars/dashboard/#best-practices","text":"Common questions to ask yourself when building a dashboard are: Where do my users spend most of their time? What are my users searching for? How do I better help my team with alerts and troubleshooting? Has my system been healthy over the past day/week/month/quarter? Here are principles to consider when building dashboards: Separate a dashboard into multiple sections for simplicity. Adding a page jump or anchor (#section) is also a plus if applicable. Add multiple, simple charts. Build simple charts and have more of them, rather than one complicated all-in-one chart. Identify goals or KPI measurements. Identifying goals or KPIs helps in defining what needs to be achieved. Some examples: server downtime, mean time to address errors, service level agreement. Ask questions that can help reach the defined goal or KPI. This may sound counter-intuitive, but the more questions asked while constructing the dashboard, the better the outcome will be. Questions about location, internet service provider, or the time of day users make requests to the server would be a good start. Validate the questions.
This is often done with stakeholders, sponsors, leads or project managers. Observe the dashboard that is built. Is the data reflecting what the stakeholders set out to answer? Always remember this process takes time. Building dashboard is easy, building an observable dashboard to show pattern is hard.","title":"Best Practices"},{"location":"observability/pillars/dashboard/#recommended-tools","text":"Azure Monitor Workbooks - Supporting markdown, Azure Workbooks is tightly integrated with Azure services making this highly customizable without extra tool. Create dashboard using log query - Dashboard can be created using log query on Log Analytics data. Building dashboards using Application Insights - Dashboards can be created using Application Insights as well. Power Bi - Power Bi is one of the easier tools to create dashboards from data sources and reports. Grafana - Getting started with Grafana. Grafana is a popular open source tool for dashboarding and visualization. Azure Monitor as Grafana data source - This provides a step by step integration of Azure Monitor to Grafana. Brief comparison of various tools","title":"Recommended Tools"},{"location":"observability/pillars/dashboard/#dashboard-samples-and-recipes","text":"","title":"Dashboard Samples and Recipes"},{"location":"observability/pillars/dashboard/#azure-workbooks","text":"Performance analysis - A measurement on how the system performs. Workbook template available in gallery. Failure analysis - A report about system failure with details. Workbook template available in gallery. Application Performance Index( Apdex ) - This is a way to measure user satisfaction. It classifies performance into three zones based on a baseline performance threshold T. The template for Appdex is available in Azure Workbooks gallery as well.","title":"Azure Workbooks"},{"location":"observability/pillars/dashboard/#application-insights","text":"User retention analysis User navigation patterns analysis User session analysis For other tools, these can be used as a reference to recreate if a template is not readily available.","title":"Application Insights"},{"location":"observability/pillars/dashboard/#grafana-with-azure-monitor-as-data-source","text":"Azure Kubernetes Service - Cluster & Namespace Metrics - Container Insights metrics for Kubernetes clusters. Cluster utilization, namespace utilization, Node cpu & memory, Node disk usage & disk io, node network & kubelet docker operation metrics Azure Kubernetes Service - Container Level & Pod Metrics - This contains Container level and Pod Metrics like CPU and Memory which are missing in the above dashboard.","title":"Grafana with Azure Monitor as Data Source"},{"location":"observability/pillars/dashboard/#summary","text":"In order to build an observable dashboard, the goal is to make use of collected metrics, logs, traces to give an insight on how the system performs, user behaves and identify patterns. There are a lot of tools and templates out there. Whichever the choice is, a good dashboard is always a dashboard that can help you answer questions about the system and user, keep track of the KPI and goal while also allowing informed business decisions to be made.","title":"Summary"},{"location":"observability/pillars/logging/","text":"Logging Overview Logs are discrete events with the goal of helping engineers identify problem area(s) during failures. Collection Methods When it comes to log collection methods, two of the standard techniques are a direct-write, or an agent-based approach. 
Directly written log events are handled in-process by the particular component, usually utilizing a provided library. Azure Monitor has direct send capabilities, but it's not recommended for serious/production use. This approach has some advantages: There is no external process to configure or monitor. No log file management (rolling, expiring) is needed to prevent out-of-disk-space issues. The potential trade-offs of this approach: Potentially higher memory usage if the particular library is using a memory-backed buffer. In the event of an extended service outage, log data may get dropped or truncated due to buffer constraints. Multiple component processes will manage & emit logs individually, which can be more complex to manage for the outbound load. Agent-based log collection relies on an external process running on the host machine, with the particular component emitting log data to stdout or to a file. Writing log data to stdout is the preferred practice when running applications within a container environment like Kubernetes. The container runtime redirects the output to files, which can then be processed by an agent. Azure Monitor, Grafana Loki, Elastic's Logstash and Fluent Bit are examples of log shipping agents. There are several advantages when using an agent to collect & ship log files: Centralized configuration. Collecting multiple sources of data with a single process. Local pre-processing & filtering of log data before sending it to a central service. Utilizing disk space as a data buffer during a service disruption. This approach isn't without trade-offs: It requires exclusive CPU & memory resources for the processing of log data. It requires persistent disk space for buffering. Best Practices Pay attention to logging levels. Logging too much will increase costs and decrease application throughput. Ensure logging configuration can be modified without code changes. Ideally, make it changeable without application restarts. If available, take advantage of logging levels per category, allowing granular logging configuration. Check for log levels before logging, thus avoiding allocations and string manipulation costs. Ensure service versions are included in logs to be able to identify problematic releases. Log a raised exception only once. In your handlers, only catch expected exceptions that you can handle gracefully (even with a specific return code). If you want to log and rethrow, leave it to the top level exception handler. Do the minimal amount of cleanup work needed, then throw to maintain the original stack trace. Don\u2019t log a warning or stack trace for expected exceptions (e.g. properly expected 404 or 403 HTTP statuses). Fine-tune logging levels in production (>= warning, for instance). During a new release, the verbosity can be increased to facilitate bug identification. If using sampling, implement this at the service level rather than defining it in the logging system. This way we have control over what gets logged. An additional benefit is a reduced number of roundtrips. Only include failures from health checks and non-business-driven requests. Ensure a downstream system malfunction won't cause repetitive logs to be stored. Don't reinvent the wheel; use existing tools to collect and analyze the data. Ensure personally identifiable information policies and restrictions are followed. Ensure errors and exceptions in dependent services are captured and logged.
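A minimal, stdlib-only sketch of a few of these practices: a per-category log level, a level check before doing expensive work, and the service version stamped onto every record. The logger names, version string and format are illustrative.

```python
# Stdlib-only sketch: per-category levels, a level check before expensive work,
# and the service version attached to every record. Names and values are illustrative.
import logging

SERVICE_VERSION = "2.3.1"  # typically injected at build or deploy time

old_factory = logging.getLogRecordFactory()

def record_factory(*args, **kwargs):
    record = old_factory(*args, **kwargs)
    record.service_version = SERVICE_VERSION  # every record carries the release version
    return record

logging.setLogRecordFactory(record_factory)
logging.basicConfig(
    level=logging.WARNING,  # conservative default in production
    format="%(asctime)s %(levelname)s %(name)s version=%(service_version)s %(message)s",
)
logging.getLogger("checkout").setLevel(logging.INFO)  # more verbose for one category only

log = logging.getLogger("checkout")
if log.isEnabledFor(logging.DEBUG):
    # Only build expensive diagnostic payloads when they will actually be emitted.
    log.debug("cart snapshot: %s", "...expensive serialization...")
log.info("order accepted")
```

The last practice above, capturing errors from dependencies, deserves particular attention.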
For example, if an application uses Redis cache, Service Bus or any other service, any errors/exceptions raised while accessing these services should be captured and logged. If there's Sufficient Log Data, is there a Need for Instrumenting Metrics? Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. Both have a valuable part to play in creating observable systems. Having Problems Identifying What to Log? At application startup : Unrecoverable errors from startup. Warnings if application still runnable, but not as expected (i.e. not providing blob connection string, thus resorting to local files. Another example is if there's a need to fail back to a secondary service or a known good state, because it didn\u2019t get an answer from a primary dependency.) Information about the service\u2019s state at startup (build #, configs loaded, etc.) Per incoming request : Basic information for each incoming request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload size, record counts, etc. (whatever you need to learn something from the aggregate data) Warning for any unexpected exceptions, caught only at the top controller/interceptor and logged with or alongside the request info, with stack trace. Return a 500. This code doesn\u2019t know what happened. Per outgoing request : Basic information for each outgoing request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload sizes, record counts returned, etc. Report perceived availability and latency of dependencies and including slicing/clustering data that could help with later analysis. Recommended Tools Azure Monitor - Umbrella of services including system metrics, log analytics and more. Grafana Loki - An open source log aggregation platform, built on the learnings from the Prometheus Community for highly efficient collection & storage of log data at scale. The Elastic Stack - An open source log analytics tech stack utilizing Logstash, Beats, Elastic search and Kibana. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Logging"},{"location":"observability/pillars/logging/#logging","text":"","title":"Logging"},{"location":"observability/pillars/logging/#overview","text":"Logs are discrete events with the goal of helping engineers identify problem area(s) during failures.","title":"Overview"},{"location":"observability/pillars/logging/#collection-methods","text":"When it comes to log collection methods, two of the standard techniques are a direct-write, or an agent-based approach. Directly written log events are handled in-process of the particular component, usually utilizing a provided library. Azure Monitor has direct send capabilities, but it's not recommended for serious/production use. This approach has some advantages: There is no external process to configure or monitor No log file management (rolling, expiring) to prevent out of disk space issues. The potential trade-offs of this approach: Potentially higher memory usage if the particular library is using a memory backed buffer. In the event of an extended service outage, log data may get dropped or truncated due to buffer constraints. 
Multiple component process logging will manage & emit logs individually, which can be more complex to manage for the outbound load. Agent-based log collection relies on an external process running on the host machine, with the particular component emitting log data stdout or file. Writing log data to stdout is the preferred practice when running applications within a container environment like Kubernetes. The container runtime redirects the output to files, which can then be processed by an agent. Azure Monitor , Grafana Loki Elastic's Logstash and Fluent Bit are examples of log shipping agents. There are several advantages when using an agent to collect & ship log files: Centralized configuration. Collecting multiple sources of data with a single process. Local pre-processing & filtering of log data before sending it to a central service. Utilizing disk space as a data buffer during a service disruption. This approach isn't without trade-offs: Required exclusive CPU & memory resources for the processing of log data. Persistent disk space for buffering.","title":"Collection Methods"},{"location":"observability/pillars/logging/#best-practices","text":"Pay attention to logging levels. Logging too much will increase costs and decrease application throughput. Ensure logging configuration can be modified without code changes. Ideally, make it changeable without application restarts. If available, take advantage of logging levels per category allowing granular logging configuration. Check for log levels before logging, thus avoiding allocations and string manipulation costs. Ensure service versions are included in logs to be able to identify problematic releases. Log a raised exception only once. In your handlers, only catch expected exceptions that you can handle gracefully (even with a specific return code). If you want to log and rethrow, leave it to the top level exception handler. Do the minimal amount of cleanup work needed then throw to maintain the original stack trace. Don\u2019t log a warning or stack trace for expected exceptions (eg: properly expected 404, 403 HTTP statuses). Fine tune logging levels in production (>= warning for instance). During a new release the verbosity can be increased to facilitate bug identification. If using sampling, implement this at the service level rather than defining it in the logging system. This way we have control over what gets logged. An additional benefit is reduced number of roundtrips. Only include failures from health checks and non-business driven requests. Ensure a downstream system malfunction won't cause repetitive logs being stored. Don't reinvent the wheel, use existing tools to collect and analyze the data. Ensure personal identifiable information policies and restrictions are followed. Ensure errors and exceptions in dependent services are captured and logged. For example, if an application uses Redis cache, Service Bus or any other service, any errors/exceptions raised while accessing these services should be captured and logged.","title":"Best Practices"},{"location":"observability/pillars/logging/#if-theres-sufficient-log-data-is-there-a-need-for-instrumenting-metrics","text":"Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. 
Both have a valuable part to play in creating observable systems.","title":"If there's Sufficient Log Data, is there a Need for Instrumenting Metrics?"},{"location":"observability/pillars/logging/#having-problems-identifying-what-to-log","text":"At application startup : Unrecoverable errors from startup. Warnings if application still runnable, but not as expected (i.e. not providing blob connection string, thus resorting to local files. Another example is if there's a need to fail back to a secondary service or a known good state, because it didn\u2019t get an answer from a primary dependency.) Information about the service\u2019s state at startup (build #, configs loaded, etc.) Per incoming request : Basic information for each incoming request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload size, record counts, etc. (whatever you need to learn something from the aggregate data) Warning for any unexpected exceptions, caught only at the top controller/interceptor and logged with or alongside the request info, with stack trace. Return a 500. This code doesn\u2019t know what happened. Per outgoing request : Basic information for each outgoing request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload sizes, record counts returned, etc. Report perceived availability and latency of dependencies and including slicing/clustering data that could help with later analysis.","title":"Having Problems Identifying What to Log?"},{"location":"observability/pillars/logging/#recommended-tools","text":"Azure Monitor - Umbrella of services including system metrics, log analytics and more. Grafana Loki - An open source log aggregation platform, built on the learnings from the Prometheus Community for highly efficient collection & storage of log data at scale. The Elastic Stack - An open source log analytics tech stack utilizing Logstash, Beats, Elastic search and Kibana. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Recommended Tools"},{"location":"observability/pillars/metrics/","text":"Metrics Overview Metrics provide a near real-time stream of data, informing operators and stakeholders about the functions the system is performing as well as its health. Unlike logging and tracing, metric data tends to be more efficient to transmit and store. Collection Methods Metric collection approaches fall into two broad categories: push metrics & pull metrics. Push metrics means that the originating component sends the data to a remote service or agent. Azure Monitor and Etsy's statsd are examples of push metrics. Some strengths with push metrics include: Only require network egress to the remote target. Originating component controls the frequency of measurement. Simplified configuration as the component only needs to know the destination of where to send data. Some trade-offs with this approach: At scale, it is much more difficult to control data transmission rates, which can cause service throttling or dropping of values. Determining if every component, particularly in a dynamic scale environment, is healthy and sending data is difficult. In the case of pull metrics, each originating component publishes an endpoint for the metric agent to connect to and gather measurements. 
Prometheus and its ecosystem of tools are an example of pull style metrics. Benefits experienced using a pull metrics setup may involve: Singular configuration for determining what is measured and the frequency of measurement for the local environment. Every measurement target has a meta metric related to if the collection is successful or not, which can be used as a general health check. Support for routing, filtering and processing of metrics before sending them onto a globally central metrics store. Items of concern to some may include: Configuring & managing data sources can lead to a complex configuration. Prometheus has tooling to auto-discover and configure data sources in some environments, such as Kubernetes, but there are always exceptions to this, which lead to configuration complexity. Network configuration may add further complexity if firewalls and other ACLs need to be managed to allow connectivity. Best Practices When Should I use Metrics Instead of Logs? Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. Both have a valuable part to play in creating observable systems. What Should be Tracked? System critical measurements that relate to the application/machine health, which are usually excellent alert candidates. Work with your engineering and devops peers to identify the metrics, but they may include: CPU and memory utilization. Request rate. Queue length. Unexpected exception count. Dependent service metrics like response time for Redis cache, Sql server or Service bus. Important business-related measurements, which drive reporting to stakeholders. Consult with the various stakeholders of the component, but some examples may include: Jobs performed. User Session length. Games played. Site visits. Dimension Labels Modern metric systems today usually define a single time series metric as the combination of the name of the metric and its dictionary of dimension labels. Labels are an excellent way to distinguish one instance of a metric, from another while still allowing for aggregation and other operations to be performed on the set for analysis. Some common labels used in metrics may include: Container Name Host name Code Version Kubernetes cluster name Azure Region Note : Since dimension labels are used for aggregations and grouping operations, do not use unique strings or those with high cardinality as the value of a label. The value of the label is significantly diminished for reporting and in many cases has a negative performance hit on the metric system used to track it. Recommended Tools Azure Monitor - Umbrella of services including system metrics, log analytics and more. Prometheus - A real-time monitoring & alerting application. It's exposition format for exposing time-series is the basis for OpenMetrics's standard format. Thanos - Open source, highly available Prometheus setup with long term storage capabilities. Cortex - Horizontally scalable, highly available, multi-tenant, long term Prometheus. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Metrics"},{"location":"observability/pillars/metrics/#metrics","text":"","title":"Metrics"},{"location":"observability/pillars/metrics/#overview","text":"Metrics provide a near real-time stream of data, informing operators and stakeholders about the functions the system is performing as well as its health. 
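To make the pull-metrics and dimension-label guidance concrete, here is a minimal sketch using the prometheus_client package; the metric names, labels and port are illustrative, and the endpoint it exposes is what a Prometheus server would scrape.

```python
# Exposing pull-style metrics with prometheus_client. Prometheus scrapes the /metrics
# endpoint; label values are deliberately low-cardinality, as recommended above.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["service", "code_version", "status"]
)
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["service"])

start_http_server(8000)  # scrape target: http://<host>:8000/metrics

while True:
    with LATENCY.labels(service="orders").time():
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real request handling
    status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(service="orders", code_version="1.4.2", status=status).inc()
```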
Unlike logging and tracing, metric data tends to be more efficient to transmit and store.","title":"Overview"},{"location":"observability/pillars/metrics/#collection-methods","text":"Metric collection approaches fall into two broad categories: push metrics & pull metrics. Push metrics means that the originating component sends the data to a remote service or agent. Azure Monitor and Etsy's statsd are examples of push metrics. Some strengths with push metrics include: Only require network egress to the remote target. Originating component controls the frequency of measurement. Simplified configuration as the component only needs to know the destination of where to send data. Some trade-offs with this approach: At scale, it is much more difficult to control data transmission rates, which can cause service throttling or dropping of values. Determining if every component, particularly in a dynamic scale environment, is healthy and sending data is difficult. In the case of pull metrics, each originating component publishes an endpoint for the metric agent to connect to and gather measurements. Prometheus and its ecosystem of tools are an example of pull style metrics. Benefits experienced using a pull metrics setup may involve: Singular configuration for determining what is measured and the frequency of measurement for the local environment. Every measurement target has a meta metric related to if the collection is successful or not, which can be used as a general health check. Support for routing, filtering and processing of metrics before sending them onto a globally central metrics store. Items of concern to some may include: Configuring & managing data sources can lead to a complex configuration. Prometheus has tooling to auto-discover and configure data sources in some environments, such as Kubernetes, but there are always exceptions to this, which lead to configuration complexity. Network configuration may add further complexity if firewalls and other ACLs need to be managed to allow connectivity.","title":"Collection Methods"},{"location":"observability/pillars/metrics/#best-practices","text":"","title":"Best Practices"},{"location":"observability/pillars/metrics/#when-should-i-use-metrics-instead-of-logs","text":"Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. Both have a valuable part to play in creating observable systems.","title":"When Should I use Metrics Instead of Logs?"},{"location":"observability/pillars/metrics/#what-should-be-tracked","text":"System critical measurements that relate to the application/machine health, which are usually excellent alert candidates. Work with your engineering and devops peers to identify the metrics, but they may include: CPU and memory utilization. Request rate. Queue length. Unexpected exception count. Dependent service metrics like response time for Redis cache, Sql server or Service bus. Important business-related measurements, which drive reporting to stakeholders. Consult with the various stakeholders of the component, but some examples may include: Jobs performed. User Session length. Games played. Site visits.","title":"What Should be Tracked?"},{"location":"observability/pillars/metrics/#dimension-labels","text":"Modern metric systems today usually define a single time series metric as the combination of the name of the metric and its dictionary of dimension labels. 
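As a small, hedged illustration of that combination (again using the Python Prometheus client purely as an example; the metric and label names are assumptions):

```python
# Sketch: one metric name plus a dictionary of dimension labels defines each time series.
from prometheus_client import Counter

http_requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests processed",
    ["code", "region", "version"],  # dimension labels
)

# Each distinct label combination below is a separate time series under the same name.
http_requests_total.labels(code="200", region="westeurope", version="1.4.2").inc()
http_requests_total.labels(code="500", region="westeurope", version="1.4.2").inc()
```

Each distinct combination of label values produces its own time series under the same metric name, which is why low-cardinality values matter.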
Labels are an excellent way to distinguish one instance of a metric, from another while still allowing for aggregation and other operations to be performed on the set for analysis. Some common labels used in metrics may include: Container Name Host name Code Version Kubernetes cluster name Azure Region Note : Since dimension labels are used for aggregations and grouping operations, do not use unique strings or those with high cardinality as the value of a label. The value of the label is significantly diminished for reporting and in many cases has a negative performance hit on the metric system used to track it.","title":"Dimension Labels"},{"location":"observability/pillars/metrics/#recommended-tools","text":"Azure Monitor - Umbrella of services including system metrics, log analytics and more. Prometheus - A real-time monitoring & alerting application. It's exposition format for exposing time-series is the basis for OpenMetrics's standard format. Thanos - Open source, highly available Prometheus setup with long term storage capabilities. Cortex - Horizontally scalable, highly available, multi-tenant, long term Prometheus. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Recommended Tools"},{"location":"observability/pillars/tracing/","text":"Tracing Overview Produces the information required to observe series of correlated operations in a distributed system. Once collected they show the path, measurements and faults in an end-to-end transaction. Best Practices Ensure that at least key business transactions are traced. Include in each trace necessary information to identify software releases (i.e. service name, version). This is important to correlate deployments and system degradation. Ensure dependencies are included in trace (databases, I/O). If costs are a concern use sampling, avoiding throwing away errors, unexpected behavior and critical information. Don't reinvent the wheel, use existing tools to collect and analyze the data. Ensure personal identifiable information policies and restrictions are followed. Recommended Tools Azure Monitor - Umbrella of services including system metrics, log analytics and more. Jaeger Tracing - Open source, end-to-end distributed tracing. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Tracing"},{"location":"observability/pillars/tracing/#tracing","text":"","title":"Tracing"},{"location":"observability/pillars/tracing/#overview","text":"Produces the information required to observe series of correlated operations in a distributed system. Once collected they show the path, measurements and faults in an end-to-end transaction.","title":"Overview"},{"location":"observability/pillars/tracing/#best-practices","text":"Ensure that at least key business transactions are traced. Include in each trace necessary information to identify software releases (i.e. service name, version). This is important to correlate deployments and system degradation. Ensure dependencies are included in trace (databases, I/O). 
If costs are a concern use sampling, avoiding throwing away errors, unexpected behavior and critical information. Don't reinvent the wheel, use existing tools to collect and analyze the data. Ensure personal identifiable information policies and restrictions are followed.","title":"Best Practices"},{"location":"observability/pillars/tracing/#recommended-tools","text":"Azure Monitor - Umbrella of services including system metrics, log analytics and more. Jaeger Tracing - Open source, end-to-end distributed tracing. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Recommended Tools"},{"location":"observability/tools/","text":"Tools and Patterns There are a number of modern tools to make systems observable. While identifying and/or creating tools that work for your system, here are a few things to consider to help guide the choices. Must be simple to integrate and easy to use. It must be possible to aggregate and visualize data. Tools must provide real-time data. Must be able to guide users to the problem area with suitable, adequate end-to-end context. Choices Loki OpenTelemetry Kubernetes Dashboards Prometheus Service Mesh Leveraging a Service Mesh that follows the Sidecar Pattern quickly sets up a go-to set of metrics, and traces (although traces need to be propagated from incoming requests to outgoing requests manually). A sidecar works by intercepting all incoming and outgoing traffic to your image. It then adds trace headers to each request and emits a standard set of logs and metrics. These metrics are extremely powerful for observability, allowing every service, whether client-side or server-side, to leverage a unified set of metrics, including: Latency Bytes Request Rate Error Rate In a microservice architecture, pinpointing the root cause of a spike in 500's can be non-trivial, but with the added observability from a sidecar you can quickly determine which service in your service mesh resulted in the spike in errors. Service Mesh's have a large surface area for configuration, and can seem like a daunting undertaking to deploy. However, most services (including Linkerd) offer a sane set of defaults, and can be deployed via the happy path to quickly land these observability wins.","title":"Tools and Patterns"},{"location":"observability/tools/#tools-and-patterns","text":"There are a number of modern tools to make systems observable. While identifying and/or creating tools that work for your system, here are a few things to consider to help guide the choices. Must be simple to integrate and easy to use. It must be possible to aggregate and visualize data. Tools must provide real-time data. 
Must be able to guide users to the problem area with suitable, adequate end-to-end context.","title":"Tools and Patterns"},{"location":"observability/tools/#choices","text":"Loki OpenTelemetry Kubernetes Dashboards Prometheus","title":"Choices"},{"location":"observability/tools/#service-mesh","text":"Leveraging a Service Mesh that follows the Sidecar Pattern quickly sets up a go-to set of metrics, and traces (although traces need to be propagated from incoming requests to outgoing requests manually). A sidecar works by intercepting all incoming and outgoing traffic to your image. It then adds trace headers to each request and emits a standard set of logs and metrics. These metrics are extremely powerful for observability, allowing every service, whether client-side or server-side, to leverage a unified set of metrics, including: Latency Bytes Request Rate Error Rate In a microservice architecture, pinpointing the root cause of a spike in 500's can be non-trivial, but with the added observability from a sidecar you can quickly determine which service in your service mesh resulted in the spike in errors. Service Mesh's have a large surface area for configuration, and can seem like a daunting undertaking to deploy. However, most services (including Linkerd) offer a sane set of defaults, and can be deployed via the happy path to quickly land these observability wins.","title":"Service Mesh"},{"location":"observability/tools/KubernetesDashboards/","text":"Kubernetes UI Dashboards This document covers the options and benefits of various Kubernetes UI Dashboards which are useful tools for monitoring and debugging your application on Kubernetes Clusters. It allows the management of applications running in the cluster, debug them and manage the cluster all through these dashboards. Overview and Background There are times when not all solutions can be run locally. This limitation could be due to a cloud service which does not offer a robust or efficient way to locally debug the environment. In these cases, it is necessary to use other tools which provide the capabilities to monitor your application with Kubernetes. Advantages and Use Cases Allows the ability to view, manage and monitor the operational aspects of the Kubernetes Cluster. Benefits of using a UI dashboard includes the following: see an overview of the cluster deploy applications onto the cluster troubleshoot applications running on the cluster view, create, modify, and delete Kubernetes resources view basic resource metrics including resource usage for Kubernetes objects view and access logs live view of the pods state (e.g. started, terminating, etc) Different dashboards may provide different functionalities, and the use case to choose a particular dashboard will depend on the requirements. For example, many dashboards provide a way to only monitor your applications on Kubernetes but do not provide a way to manage them. Open Source Dashboards There are currently several UI dashboards available to monitor your applications or manage them with Kubernetes. For example: Octant Prometheus and Grafana Kube Prometheus Stack Chart : provides an easy way to operate end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator. 
K8Dash kube-ops-view : a tool to visualize node occupancy & utilization Lens : Client side desktop tool Thanos and Cortex : Multi-cluster implementations Resources Alternatives to Kubernetes Dashboard","title":"Kubernetes UI Dashboards"},{"location":"observability/tools/KubernetesDashboards/#kubernetes-ui-dashboards","text":"This document covers the options and benefits of various Kubernetes UI Dashboards which are useful tools for monitoring and debugging your application on Kubernetes Clusters. It allows the management of applications running in the cluster, debug them and manage the cluster all through these dashboards.","title":"Kubernetes UI Dashboards"},{"location":"observability/tools/KubernetesDashboards/#overview-and-background","text":"There are times when not all solutions can be run locally. This limitation could be due to a cloud service which does not offer a robust or efficient way to locally debug the environment. In these cases, it is necessary to use other tools which provide the capabilities to monitor your application with Kubernetes.","title":"Overview and Background"},{"location":"observability/tools/KubernetesDashboards/#advantages-and-use-cases","text":"Allows the ability to view, manage and monitor the operational aspects of the Kubernetes Cluster. Benefits of using a UI dashboard includes the following: see an overview of the cluster deploy applications onto the cluster troubleshoot applications running on the cluster view, create, modify, and delete Kubernetes resources view basic resource metrics including resource usage for Kubernetes objects view and access logs live view of the pods state (e.g. started, terminating, etc) Different dashboards may provide different functionalities, and the use case to choose a particular dashboard will depend on the requirements. For example, many dashboards provide a way to only monitor your applications on Kubernetes but do not provide a way to manage them.","title":"Advantages and Use Cases"},{"location":"observability/tools/KubernetesDashboards/#open-source-dashboards","text":"There are currently several UI dashboards available to monitor your applications or manage them with Kubernetes. For example: Octant Prometheus and Grafana Kube Prometheus Stack Chart : provides an easy way to operate end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator. K8Dash kube-ops-view : a tool to visualize node occupancy & utilization Lens : Client side desktop tool Thanos and Cortex : Multi-cluster implementations","title":"Open Source Dashboards"},{"location":"observability/tools/KubernetesDashboards/#resources","text":"Alternatives to Kubernetes Dashboard","title":"Resources"},{"location":"observability/tools/OpenTelemetry/","text":"Open Telemetry Building observable systems enable one to measure how well or bad the application is behaving and WHY it is behaving either way. Adopting open-source standards related to implementing telemetry and tracing features built on top of the OpenTelemetry framework helps decouple vendor-specific implementations while maintaining an extensible, standard, and portable open-source solution. OpenTelemetry is an open-source observability standard that defines how to generate, collect and describe telemetry in distributed systems. 
OpenTelemetry also provides a single-point distribution of a set of APIs, SDKs, and instrumentation libraries that implements the open-source standard, which can collect, process, and orchestrate telemetry data (signals) like traces, metrics, and logs. It supports multiple popular languages (Java, .NET, Python, JavaScript, Golang, Erlang, etc.). OpenTelemetry follows a vendor-agnostic and standards-based approach for collecting and managing telemetry data. An important point to note is that OpenTelemetry does not have its own backend; all telemetry collected by the OpenTelemetry Collector must be sent to a backend like Prometheus, Jaeger, Zipkin, Azure Monitor, etc. OpenTelemetry is also the second most active CNCF project, after Kubernetes. The two main problems OpenTelemetry solves are: first, vendor neutrality for tracing, monitoring, and logging APIs; and second, an out-of-the-box cross-platform context propagation implementation for end-to-end distributed tracing over heterogeneous components. Open Telemetry Core Concepts Open Telemetry Implementation Patterns A detailed explanation of OpenTelemetry concepts is out of the scope of this repo. There is plenty of available information about how the SDK and the automatic instrumentation are configured and how the Exporters, Tracers, Context, and Span hierarchy work. See the Reference section for valuable OpenTelemetry resources. However, understanding the core implementation patterns will help you know which approach better fits the scenario you are trying to solve. There are three main patterns, as follows: Automatic telemetry: Support for automatic instrumentation is available for some languages. OpenTelemetry automatic instrumentation (100% codeless) is typically done through library hooks or monkey-patching library code. Automatic instrumentation will intercept all interactions and dependencies and automatically send the telemetry to the configured exporters. More information about this concept can be found in the OpenTelemetry instrumentation doc . Manual tracing: This must be done by coding using the OpenTelemetry SDK, managing the tracer objects to obtain Spans, and forming instrumented OpenTelemetry Scopes to identify the code segments to be manually traced. Also, by using the @WithSpan annotations (method decorations in C# and Java ) to mark whole methods that will be automatically traced. Hybrid approach: Most production-ready scenarios will require a mix of both techniques, using the automatic instrumentation to collect automatic telemetry and the OpenTelemetry SDK to identify code segments that are important to instrument manually. When considering production-ready scenarios, the hybrid approach is the way to go as it allows for thorough coverage of the whole solution. It provides automatic context propagation and event correlation out of the box. Collector The collector is a separate process that is designed to be a \u2018sink\u2019 for telemetry data emitted by many processes, which can then export that data to backend systems. The collector has two different deployment strategies \u2013 either running as an agent alongside a service or as a gateway which is a remote application. In general, using both is recommended: the agent would be deployed with your service and run as a separate process or in a sidecar; meanwhile, the collector would be deployed separately, as its own application running in a container or virtual machine.
Each agent would forward telemetry data to the collector, which could then export it to a variety of backend systems such as Lightstep, Jaeger, or Prometheus. The agent can be also replaced with the automatic instrumentation if supported. The automatic instrumentation provides the collector capabilities of retrieving, processing and exporting the telemetry. Regardless of how you choose to instrument or deploy OpenTelemetry, exporters provide powerful options for reporting telemetry data. You can directly export from your service, you can proxy through the collector, or you can aggregate into standalone collectors \u2013 or even a mix of these. Instrumentation Libraries A library that enables observability for another library is called an instrumentation library. OpenTelemetry libraries are language specific, currently there is good support for Java, Python, Javascript, dotnet and golang. Support for automatic instrumentation is available for some libraries which make using OpenTelemetry easy and trivial. In case automatic instrumentation is not available, manual instrumentation can be configured by using the OpenTelemetry SDK. Integration of OpenTelemetry OpenTelemetry can be used to collect, process and export data into multiple backends, some popular integrations supported with OpenTelemetry are: Zipkin Prometheus Jaeger New Relic Azure Monitor AWS X-Ray Datadog Kafka Lightstep Splunk GCP Monitor Why use OpenTelemetry The main reason to use OpenTelemetry is that it offers an open-source standard for implementing distributed telemetry (context propagation) over heterogeneous systems. There is no need to reinvent the wheel to implement end-to-end business flow transactions monitoring when using OpenTelemetry. It enables tracing, metrics, and logging telemetry through a set of single-distribution multi-language libraries and tools that allow for a plug-and-play telemetry architecture that includes the concept of agents and collectors. Moreover, avoiding any proprietary lock down and achieving vendor-agnostic neutrality for tracing, monitoring, and logging APIs AND backends allow maximum portability and extensibility patterns. Another good reason to use OpenTelemetry would be whether the stack uses OpenCensus or OpenTracing. As OpenCensus and OpenTracing have carved the way for OpenTelemetry, it makes sense to introduce OpenTelemetry where OpenCensus or OpenTracing is used as it still has backward compatibility. Apart from adding custom attributes, sampling, collecting data for metrics and traces, OpenTelemetry is governed by specifications and backed up by big players in the Observability landscape like Microsoft, Splunk, AppDynamics, etc. OpenTelemetry will likely become a de-facto open-source standard for enabling metrics and tracing when all features become GA. Current Status of OpenTelemetry Project OpenTelemetry is a project which emerged from merging of OpenCensus and OpenTracing in 2019. Although OpenCensus and OpenTracing are frozen and no new features are being developed for them, OpenTelemetry has backward compatibility with OpenCensus and OpenTracing. Some features of OpenTelemetry are still in beta, feature support for different languages is being tracked here: Feature Status of OpenTelemetry . Status of OpenTelemetry project can be tracked here . From the website: Our goal is to provide a generally available, production quality release for the tracing data source across most OpenTelemetry components in the first half of 2021. 
Several components have already reached this milestone! We expect metrics to reach the same status in the second half of 2021 and are targeting logs in 2022. What to Watch Out for As OpenTelemetry is a very recent project (first GA version of some features released in 2020), many features are still in beta hence due diligence needs to be done before using such features in production. Also, OpenTelemetry supports many popular languages but features in all languages are not at par. Some languages offer more features as compared to other languages. It also needs to be called out as some features are not in GA, there may be some incompatibility issues with the tooling. That being said, OpenTelemetry is one of the most active projects of CNCF , so it is expected that many more features would reach GA soon. January 2022 UPDATE Apart from the logging specification and implementation that are still marked as draft or beta, all other specifications and implementations regarding tracing and metrics are marked as stable or feature-freeze. Many libraries are still on active development whatsoever, so thorough analysis has to be made depending on the language on a feature basis. Integration Options with Azure Monitor Using the Azure Monitor OpenTelemetry Exporter Library This scenario uses the OpenTelemetry SDK as the core instrumentation library. Basically this means you will instrument your application using the OpenTelemetry libraries, but you will additionally use the Azure Monitor OpenTelemetry Exporter and then added it as an additional exporter with the OpenTelemetry SDK. In this way, the OpenTelemetry traces your application creates will be pushed to your Azure Monitor Instance. Using the Application Insights Agent Jar File - Java Only Java OpenTelemetry instrumentation provides another way to integrate with Azure Monitor, by using Applications Insights Java Agent jar . When configuring this option, the Applications Insights Agent file is added when executing the application. The applicationinsights.json configuration file must be also be added as part of the applications artifacts. Pay close attention to the preview section, where the \"openTelemetryApiSupport\": true, property is set to true, enabling the agent to intercept OpenTelemetry telemetry created in the application code pushing it to Azure Monitor. OpenTelemetry Java Agent instrumentation supports many libraries and frameworks and application servers . Application Insights Java Agent enhances this list. Therefore, the main difference between running the OpenTelemetry Java Agent vs. the Application Insights Java Agent is demonstrated in the amount of traces getting logged in Azure Monitor. When running with Application Insights Java agent there's more telemetry getting pushed to Azure Monitor. On the other hand, when running the solution using the Application Insights agent mode, it is essential to highlight that nothing gets logged on Jaeger (or any other OpenTelemetry exporter). All traces will be pushed exclusively to Azure Monitor. However, both manual instrumentation done via the OpenTelemetry SDK and all automatic traces, dependencies, performance counters, and metrics being instrumented by the Application Insights agent are sent to Azure Monitor. Although there is a rich amount of additional data automatically instrumented by the Application Insights agent, it can be deduced that it is not necessarily OpenTelemetry compliant. Only the traces logged by the manual instrumentation using the OpenTelemetry SDK are. 
OpenTelemetry vs Application Insights Agents Compared Highlight OpenTelemetry Agent App Insights Agent Automatic Telemetry Y Y Manual OpenTelemetry Y Y Plug and Play Exports Y N Multiple Exports Y N Full Open Telemetry layout (decoupling agents, collectors and exporters) Y N Enriched out of the box telemetry N Y Unified telemetry backend N Y Summary As you may have guessed, there is no \"one size fits all\" approach when implementing OpenTelemetry with Azure Monitor as a backend. At the time of this writing, if you want to have the flexibility of having different OpenTelemetry backends, you should definitively go with the OpenTelemetry Agent, even though you'd sacrifice all automating tracing flowing to Azure Monitor. On the other hand, if you want to get the best of Azure Monitor and still want to instrument your code with the OpenTelemetry SDK, you should use the Application Insights Agent and manually instrument your code with the OpenTelemetry SDK to get the best of both worlds. Either way, instrumenting your code with OpenTelemetry seems the right approach as the ecosystem will only get bigger, better, and more robust. Advanced topics Use the Azure OpenTelemetry Tracing plugin library for Java to enable distributed tracing across Azure components through OpenTelemetry. Manual Trace Context Propagation The trace context is stored in Thread-local storage. When the application flow involves multiple threads (eg. multithreaded work-queue, asynchronous processing) then the traces won't get combined into one end-to-end trace chain with automatic context propagation . To achieve that you need to manually propagate the trace context ( example in Java ) by storing the trace headers along with the work-queue item. Telemetry Testing Mission critical telemetry data should be covered by testing. You can cover telemetry by tests by mocking the telemetry collector web server. In automated testing environment the telemetry instrumentation can be configured to use OTLP exporter and point the OTLP exporter endpoint to the collector web server. Using mocking servers libraries (eg. MockServer or WireMock) can help verify the telemetry data pushed to the collector. Resources OpenTelemetry Official Site Getting Started with dotnet and OpenTelemetry Using OpenTelemetry Collector OpenTelemetry Java SDK Manual Instrumentation OpenTelemetry Instrumentation Agent for Java Application Insights Java Agent Azure Monitor OpenTelemetry Exporter client library for Java Azure OpenTelemetry Tracing plugin library for Java Application Insights Agent's OpenTelemetry configuration","title":"Open Telemetry"},{"location":"observability/tools/OpenTelemetry/#open-telemetry","text":"Building observable systems enable one to measure how well or bad the application is behaving and WHY it is behaving either way. Adopting open-source standards related to implementing telemetry and tracing features built on top of the OpenTelemetry framework helps decouple vendor-specific implementations while maintaining an extensible, standard, and portable open-source solution. OpenTelemetry is an open-source observability standard that defines how to generate, collect and describe telemetry in distributed systems. OpenTelemetry also provides a single-point distribution of a set of APIs, SDKs, and instrumentation libraries that implements the open-source standard, which can collect, process, and orchestrate telemetry data (signals) like traces, metrics, and logs. 
It supports multiple popular languages (Java, .NET, Python, JavaScript, Golang, Erlang, etc.). Open telemetry follows a vendor-agnostic and standards-based approach for collecting and managing telemetry data. An important point to note is that OpenTelemetry does not have its own backend; all telemetry collected by OpenTelemetry Collector must be sent to a backend like Prometheus, Jaeger, Zipkin, Azure Monitor, etc. Open telemetry is also the 2nd most active CNCF project only after Kubernetes. The main two Problems OpenTelemetry solves are: First, vendor neutrality for tracing, monitoring, and logging APIs and second, out-of-the-box cross-platform context propagation implementation for end-to-end distributed tracing over heterogeneous components.","title":"Open Telemetry"},{"location":"observability/tools/OpenTelemetry/#open-telemetry-core-concepts","text":"","title":"Open Telemetry Core Concepts"},{"location":"observability/tools/OpenTelemetry/#open-telemetry-implementation-patterns","text":"A detailed explanation of OpenTelemetry concepts is out of the scope of this repo. There is plenty of available information about how the SDK and the automatic instrumentation are configured and how the Exporters, Tracers, Context, and Span's hierarchy work. See the Reference section for valuable OpenTelemetry resources. However, understanding the core implementation patterns will help you know what approach better fits the scenario you are trying to solve. These are three main patterns as follows: Automatic telemetry: Support for automatic instrumentation is available for some languages. OpenTelemetry automatic instrumentation (100% codeless) is typically done through library hooks or monkey-patching library code. Automatic instrumentation will intercept all interactions and dependencies and automatically send the telemetry to the configured exporters. More information about this concept can be found in the OpenTelemetry instrumentation doc . Manual tracing: This must be done by coding using the OpenTelemetry SDK, managing the tracer objects to obtain Spans, and forming instrumented OpenTelemetry Scopes to identify the code segments to be manually traced. Also, by using the @WithSpan annotations (method decorations in C# and Java ) to mark whole methods that will be automatically traced. Hybrid approach: Most Production-ready scenarios will require a mix of both techniques, using the automatic instrumentation to collect automatic telemetry and the OpenTelemetry SDK to identify code segments that are important to instrument manually. When considering production-ready scenarios, the hybrid approach is the way to go as it allows for a throughout cover over the whole solution. It provides automatic context propagation and events correlation out of the box.","title":"Open Telemetry Implementation Patterns"},{"location":"observability/tools/OpenTelemetry/#collector","text":"The collector is a separate process that is designed to be a \u2018sink\u2019 for telemetry data emitted by many processes, which can then export that data to backend systems. The collector has two different deployment strategies \u2013 either running as an agent alongside a service or as a gateway which is a remote application. In general, using both is recommended: the agent would be deployed with your service and run as a separate process or in a sidecar; meanwhile, the collector would be deployed separately, as its own application running in a container or virtual machine. 
Each agent would forward telemetry data to the collector, which could then export it to a variety of backend systems such as Lightstep, Jaeger, or Prometheus. The agent can be also replaced with the automatic instrumentation if supported. The automatic instrumentation provides the collector capabilities of retrieving, processing and exporting the telemetry. Regardless of how you choose to instrument or deploy OpenTelemetry, exporters provide powerful options for reporting telemetry data. You can directly export from your service, you can proxy through the collector, or you can aggregate into standalone collectors \u2013 or even a mix of these.","title":"Collector"},{"location":"observability/tools/OpenTelemetry/#instrumentation-libraries","text":"A library that enables observability for another library is called an instrumentation library. OpenTelemetry libraries are language specific, currently there is good support for Java, Python, Javascript, dotnet and golang. Support for automatic instrumentation is available for some libraries which make using OpenTelemetry easy and trivial. In case automatic instrumentation is not available, manual instrumentation can be configured by using the OpenTelemetry SDK.","title":"Instrumentation Libraries"},{"location":"observability/tools/OpenTelemetry/#integration-of-opentelemetry","text":"OpenTelemetry can be used to collect, process and export data into multiple backends, some popular integrations supported with OpenTelemetry are: Zipkin Prometheus Jaeger New Relic Azure Monitor AWS X-Ray Datadog Kafka Lightstep Splunk GCP Monitor","title":"Integration of OpenTelemetry"},{"location":"observability/tools/OpenTelemetry/#why-use-opentelemetry","text":"The main reason to use OpenTelemetry is that it offers an open-source standard for implementing distributed telemetry (context propagation) over heterogeneous systems. There is no need to reinvent the wheel to implement end-to-end business flow transactions monitoring when using OpenTelemetry. It enables tracing, metrics, and logging telemetry through a set of single-distribution multi-language libraries and tools that allow for a plug-and-play telemetry architecture that includes the concept of agents and collectors. Moreover, avoiding any proprietary lock down and achieving vendor-agnostic neutrality for tracing, monitoring, and logging APIs AND backends allow maximum portability and extensibility patterns. Another good reason to use OpenTelemetry would be whether the stack uses OpenCensus or OpenTracing. As OpenCensus and OpenTracing have carved the way for OpenTelemetry, it makes sense to introduce OpenTelemetry where OpenCensus or OpenTracing is used as it still has backward compatibility. Apart from adding custom attributes, sampling, collecting data for metrics and traces, OpenTelemetry is governed by specifications and backed up by big players in the Observability landscape like Microsoft, Splunk, AppDynamics, etc. OpenTelemetry will likely become a de-facto open-source standard for enabling metrics and tracing when all features become GA.","title":"Why use OpenTelemetry"},{"location":"observability/tools/OpenTelemetry/#current-status-of-opentelemetry-project","text":"OpenTelemetry is a project which emerged from merging of OpenCensus and OpenTracing in 2019. Although OpenCensus and OpenTracing are frozen and no new features are being developed for them, OpenTelemetry has backward compatibility with OpenCensus and OpenTracing. 
Some features of OpenTelemetry are still in beta, feature support for different languages is being tracked here: Feature Status of OpenTelemetry . Status of OpenTelemetry project can be tracked here . From the website: Our goal is to provide a generally available, production quality release for the tracing data source across most OpenTelemetry components in the first half of 2021. Several components have already reached this milestone! We expect metrics to reach the same status in the second half of 2021 and are targeting logs in 2022.","title":"Current Status of OpenTelemetry Project"},{"location":"observability/tools/OpenTelemetry/#what-to-watch-out-for","text":"As OpenTelemetry is a very recent project (first GA version of some features released in 2020), many features are still in beta hence due diligence needs to be done before using such features in production. Also, OpenTelemetry supports many popular languages but features in all languages are not at par. Some languages offer more features as compared to other languages. It also needs to be called out as some features are not in GA, there may be some incompatibility issues with the tooling. That being said, OpenTelemetry is one of the most active projects of CNCF , so it is expected that many more features would reach GA soon.","title":"What to Watch Out for"},{"location":"observability/tools/OpenTelemetry/#january-2022-update","text":"Apart from the logging specification and implementation that are still marked as draft or beta, all other specifications and implementations regarding tracing and metrics are marked as stable or feature-freeze. Many libraries are still on active development whatsoever, so thorough analysis has to be made depending on the language on a feature basis.","title":"January 2022 UPDATE"},{"location":"observability/tools/OpenTelemetry/#integration-options-with-azure-monitor","text":"","title":"Integration Options with Azure Monitor"},{"location":"observability/tools/OpenTelemetry/#using-the-azure-monitor-opentelemetry-exporter-library","text":"This scenario uses the OpenTelemetry SDK as the core instrumentation library. Basically this means you will instrument your application using the OpenTelemetry libraries, but you will additionally use the Azure Monitor OpenTelemetry Exporter and then added it as an additional exporter with the OpenTelemetry SDK. In this way, the OpenTelemetry traces your application creates will be pushed to your Azure Monitor Instance.","title":"Using the Azure Monitor OpenTelemetry Exporter Library"},{"location":"observability/tools/OpenTelemetry/#using-the-application-insights-agent-jar-file-java-only","text":"Java OpenTelemetry instrumentation provides another way to integrate with Azure Monitor, by using Applications Insights Java Agent jar . When configuring this option, the Applications Insights Agent file is added when executing the application. The applicationinsights.json configuration file must be also be added as part of the applications artifacts. Pay close attention to the preview section, where the \"openTelemetryApiSupport\": true, property is set to true, enabling the agent to intercept OpenTelemetry telemetry created in the application code pushing it to Azure Monitor. OpenTelemetry Java Agent instrumentation supports many libraries and frameworks and application servers . Application Insights Java Agent enhances this list. Therefore, the main difference between running the OpenTelemetry Java Agent vs. 
the Application Insights Java Agent shows up in the amount of trace data logged in Azure Monitor. When running with the Application Insights Java agent, more telemetry gets pushed to Azure Monitor. On the other hand, when running the solution using the Application Insights agent mode, it is essential to highlight that nothing gets logged on Jaeger (or any other OpenTelemetry exporter). All traces will be pushed exclusively to Azure Monitor. However, both manual instrumentation done via the OpenTelemetry SDK and all automatic traces, dependencies, performance counters, and metrics being instrumented by the Application Insights agent are sent to Azure Monitor. Although there is a rich amount of additional data automatically instrumented by the Application Insights agent, that data is not necessarily OpenTelemetry compliant. Only the traces logged by the manual instrumentation using the OpenTelemetry SDK are.","title":"Using the Application Insights Agent Jar File - Java Only"},{"location":"observability/tools/OpenTelemetry/#opentelemetry-vs-application-insights-agents-compared","text":"Highlight OpenTelemetry Agent App Insights Agent Automatic Telemetry Y Y Manual OpenTelemetry Y Y Plug and Play Exports Y N Multiple Exports Y N Full Open Telemetry layout (decoupling agents, collectors and exporters) Y N Enriched out of the box telemetry N Y Unified telemetry backend N Y","title":"OpenTelemetry vs Application Insights Agents Compared"},{"location":"observability/tools/OpenTelemetry/#summary","text":"As you may have guessed, there is no \"one size fits all\" approach when implementing OpenTelemetry with Azure Monitor as a backend. At the time of this writing, if you want the flexibility of having different OpenTelemetry backends, you should definitely go with the OpenTelemetry Agent, even though you'd sacrifice the automatic tracing flowing to Azure Monitor. On the other hand, if you want to get the best of Azure Monitor and still want to instrument your code with the OpenTelemetry SDK, you should use the Application Insights Agent and manually instrument your code with the OpenTelemetry SDK to get the best of both worlds. Either way, instrumenting your code with OpenTelemetry seems the right approach as the ecosystem will only get bigger, better, and more robust.","title":"Summary"},{"location":"observability/tools/OpenTelemetry/#advanced-topics","text":"Use the Azure OpenTelemetry Tracing plugin library for Java to enable distributed tracing across Azure components through OpenTelemetry.","title":"Advanced topics"},{"location":"observability/tools/OpenTelemetry/#manual-trace-context-propagation","text":"The trace context is stored in thread-local storage. When the application flow involves multiple threads (e.g. a multithreaded work queue or asynchronous processing), the traces won't get combined into one end-to-end trace chain with automatic context propagation . To achieve that, you need to manually propagate the trace context ( example in Java ) by storing the trace headers along with the work-queue item.","title":"Manual Trace Context Propagation"},{"location":"observability/tools/OpenTelemetry/#telemetry-testing","text":"Mission-critical telemetry data should be covered by testing. You can test telemetry by mocking the telemetry collector web server. In an automated testing environment, the telemetry instrumentation can be configured to use the OTLP exporter and point the OTLP exporter endpoint to the collector web server.
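As a rough sketch of that wiring in Python (the package names and endpoint value assume the standard OpenTelemetry Python SDK and OTLP/HTTP exporter; adapt to your language and test harness):

```python
# Sketch: route spans to an OTLP endpoint that a mocked collector listens on during tests.
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http are installed.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "orders-api"}))
provider.add_span_processor(
    BatchSpanProcessor(
        # Point the exporter at the mock collector started by the test harness.
        OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("place-order"):
    pass  # exercise the code under test; the test then asserts on what the mock received
```

The test harness can then assert on the payloads received by the mocked collector endpoint.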
Using mocking servers libraries (eg. MockServer or WireMock) can help verify the telemetry data pushed to the collector.","title":"Telemetry Testing"},{"location":"observability/tools/OpenTelemetry/#resources","text":"OpenTelemetry Official Site Getting Started with dotnet and OpenTelemetry Using OpenTelemetry Collector OpenTelemetry Java SDK Manual Instrumentation OpenTelemetry Instrumentation Agent for Java Application Insights Java Agent Azure Monitor OpenTelemetry Exporter client library for Java Azure OpenTelemetry Tracing plugin library for Java Application Insights Agent's OpenTelemetry configuration","title":"Resources"},{"location":"observability/tools/Prometheus/","text":"Prometheus Overview Originally built at SoundCloud, Prometheus is an open-source monitoring and alerting toolkit based on time series metrics data. It has become a de facto standard metrics solution in the Cloud Native world and widely used with Kubernetes. The core of Prometheus is a server that scrapes and stores metrics. There are other numerous optional features and components like an Alert-manager and client libraries for programming languages to extend the functionalities of Prometheus beyond the basics. The client libraries offer four metric types : Counter , Gauge , Histogram , and Summary . Why Prometheus? Prometheus is a time series database and allow for events or measurements to be tracked, monitored, and aggregated over time. Prometheus is a pull-based tool. One of the biggest advantages of Prometheus over other monitoring tools is that Prometheus actively scrapes targets in order to retrieve metrics from them. Prometheus also supports the push model for pushing metrics. Prometheus allows for control over how to scrape, and how often to scrape them. Through the Prometheus server, there can be multiple scrape configurations, allowing for multiple rates for different targets. Similar to Grafana , visualization for the time series can be directly done through the Prometheus Web UI. The Web UI provides the ability to easily filter and have an overview of what is taking place with your different targets. Prometheus provides a powerful functional query language called PromQL (Prometheus Query Language) that lets the user aggregate time series data in real time. Integration with Other Tools The Prometheus client libraries allow you to add instrumentation to your code and expose internal metrics via an HTTP endpoint. The official Prometheus client libraries currently are Go , Java or Scala , Python and Ruby . Unofficial third-party libraries include: .NET/C# , Node.js , and C++ . Prometheus' metrics format is supported by a wide array of tools and services including: Azure Monitor Stackdriver Datadog CloudWatch New Relic Flagger Grafana GitLab etc... There are numerous exporters which are used in exporting existing metrics from third-party databases, hardware, CI/CD tools, messaging systems, APIs and other monitoring systems. In addition to client libraries and exporters, there is a significant number of integration points for service discovery, remote storage, alerts and management. Resources Prometheus Docs Prometheus Best Practices Grafana with Prometheus","title":"Prometheus"},{"location":"observability/tools/Prometheus/#prometheus","text":"","title":"Prometheus"},{"location":"observability/tools/Prometheus/#overview","text":"Originally built at SoundCloud, Prometheus is an open-source monitoring and alerting toolkit based on time series metrics data. 
It has become a de facto standard metrics solution in the Cloud Native world and is widely used with Kubernetes. The core of Prometheus is a server that scrapes and stores metrics. There are numerous other optional features and components like an Alert-manager and client libraries for programming languages to extend the functionalities of Prometheus beyond the basics. The client libraries offer four metric types : Counter , Gauge , Histogram , and Summary .","title":"Overview"},{"location":"observability/tools/Prometheus/#why-prometheus","text":"Prometheus is a time series database and allows events or measurements to be tracked, monitored, and aggregated over time. Prometheus is a pull-based tool. One of the biggest advantages of Prometheus over other monitoring tools is that Prometheus actively scrapes targets in order to retrieve metrics from them. Prometheus also supports the push model for pushing metrics. Prometheus allows for control over how to scrape targets and how often to scrape them. Through the Prometheus server, there can be multiple scrape configurations, allowing for multiple rates for different targets. Similar to Grafana , visualization for the time series can be directly done through the Prometheus Web UI. The Web UI provides the ability to easily filter and have an overview of what is taking place with your different targets. Prometheus provides a powerful functional query language called PromQL (Prometheus Query Language) that lets the user aggregate time series data in real time.","title":"Why Prometheus?"},{"location":"observability/tools/Prometheus/#integration-with-other-tools","text":"The Prometheus client libraries allow you to add instrumentation to your code and expose internal metrics via an HTTP endpoint. The official Prometheus client libraries currently are Go , Java or Scala , Python and Ruby . Unofficial third-party libraries include: .NET/C# , Node.js , and C++ . Prometheus' metrics format is supported by a wide array of tools and services including: Azure Monitor Stackdriver Datadog CloudWatch New Relic Flagger Grafana GitLab etc... There are numerous exporters which are used in exporting existing metrics from third-party databases, hardware, CI/CD tools, messaging systems, APIs and other monitoring systems. In addition to client libraries and exporters, there is a significant number of integration points for service discovery, remote storage, alerts and management.","title":"Integration with Other Tools"},{"location":"observability/tools/Prometheus/#resources","text":"Prometheus Docs Prometheus Best Practices Grafana with Prometheus","title":"Resources"},{"location":"observability/tools/loki/","text":"Loki Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system, created by Grafana Labs inspired by the learnings from Prometheus. Loki is commonly referred to as 'Prometheus, but for logs', which makes total sense. Both tools follow the same architecture: an agent collecting data from each of the components of the software system, a server which stores that data (logs, in Loki's case), and the Grafana dashboard, which accesses the Loki server to build its visualizations and queries. That being said, Loki has three main components: Promtail It is the agent portion of Loki. It can be used to grab logs from several places, like /var/log/ , for example. Promtail is configured through a YAML file called config-promtail.yml . This file describes all the paths and log sources that will be aggregated on the Loki Server.
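Purely as an illustration of the application side of that pipeline, a hypothetical service might write its logs to a file path that the Promtail configuration watches (the path and logger names below are assumptions, not part of the original guidance):

```python
# Sketch: an application writing logs to a file path that a Promtail scrape config watches.
import logging

handler = logging.FileHandler("/var/log/orders-api/app.log")  # assumed path; must match the Promtail config
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("orders-api")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("order placed order_id=12345 region=westeurope")
```

Promtail tails files like this one and ships the entries to the Loki Server described next.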
Loki Server Loki Server is responsible for receiving and storing all the logs received from the different systems. The Loki Server is also responsible for serving the queries made from Grafana, for example. Grafana Dashboards Grafana Dashboards are responsible for creating the visualizations and performing queries. In the end, this is a web page that people with the right access can log into to view, query and create alerts for the aggregated logs. Why use Loki The main reason to use Loki instead of other log aggregation tools is that Loki optimizes the necessary storage. It does that by following the same pattern as Prometheus: it indexes the labels and stores the log content itself in chunks, using less space than just storing the raw logs. Resources Loki Official Site Inserting logs into Loki Adding Loki Source to Grafana Loki Best Practices","title":"Loki"},{"location":"observability/tools/loki/#loki","text":"Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system, created by Grafana Labs inspired by the learnings from Prometheus. Loki is commonly referred to as 'Prometheus, but for logs', which makes total sense. Both tools follow the same architecture: an agent collecting data from each of the components of the software system, a server which stores that data (logs, in Loki's case), and the Grafana dashboard, which accesses the Loki server to build its visualizations and queries. That being said, Loki has three main components:","title":"Loki"},{"location":"observability/tools/loki/#promtail","text":"It is the agent portion of Loki. It can be used to grab logs from several places, like /var/log/ , for example. Promtail is configured through a YAML file called config-promtail.yml . This file describes all the paths and log sources that will be aggregated on the Loki Server.","title":"Promtail"},{"location":"observability/tools/loki/#loki-server","text":"Loki Server is responsible for receiving and storing all the logs received from the different systems. The Loki Server is also responsible for serving the queries made from Grafana, for example.","title":"Loki Server"},{"location":"observability/tools/loki/#grafana-dashboards","text":"Grafana Dashboards are responsible for creating the visualizations and performing queries. In the end, this is a web page that people with the right access can log into to view, query and create alerts for the aggregated logs.","title":"Grafana Dashboards"},{"location":"observability/tools/loki/#why-use-loki","text":"The main reason to use Loki instead of other log aggregation tools is that Loki optimizes the necessary storage. It does that by following the same pattern as Prometheus: it indexes the labels and stores the log content itself in chunks, using less space than just storing the raw logs.","title":"Why use Loki"},{"location":"observability/tools/loki/#resources","text":"Loki Official Site Inserting logs into Loki Adding Loki Source to Grafana Loki Best Practices","title":"Resources"},{"location":"privacy/","text":"Privacy fundamentals This part of the engineering playbook focuses on privacy design guidelines and principles. Private data handling and protection requires both the proper design of software, systems and databases, as well as the implementation of organizational processes and procedures. In general, developers working on ISE projects should adhere to Microsoft's recommended standard practices and regulations on Privacy and Data Handling.
The playbook currently contains two main parts: Privacy and Data : Best practices for properly handling sensitive and private data. Privacy frameworks : A list of frameworks which could be applied in private data scenarios.","title":"Privacy fundamentals"},{"location":"privacy/#privacy-fundamentals","text":"This part of the engineering playbook focuses on privacy design guidelines and principles. Private data handling and protection requires both the proper design of software, systems and databases, as well as the implementation of organizational processes and procedures. In general, developers working on ISE projects should adhere to Microsoft's recommended standard practices and regulations on Privacy and Data Handling. The playbook currently contains two main parts: Privacy and Data : Best practices for properly handling sensitive and private data. Privacy frameworks : A list of frameworks which could be applied in private data scenarios.","title":"Privacy fundamentals"},{"location":"privacy/data-handling/","text":"Privacy and Data Goal The goal of this section is to briefly describe best practices in privacy fundamentals for data heavy projects or portions of a project that may contain data. What it is not : This document is not a checklist for how customers or readers should handle data in their environment, and does not override Microsoft's or the customers' policies for data handling, data protection and information security. Introduction Microsoft runs on trust. Our customers trust ISE to adhere to the highest standards when handling their data. Protecting our customers' data is a joint responsibility between Microsoft and the customers; both have the responsibility to help projects follow the guidelines outlined on this page. Developers working on ISE projects should implement best practices and guidance on handling data throughout the project phases. This page is not meant to suggest how customers should handle data in their environment. It does not override : Microsoft's Information Security Policy Limited Data Protection Addendum Professional Services Data Protection Addendum 5 W's of Data Handling When working on an engagement it is important to address the following 5 W 's: Who \u2013 gets access to and with whom will we share the data and/or models developed with the data? What \u2013 data is shared with us and under what expectations and understanding. Customers need to be explicit about how the data they share applies to the overarching effort. The understanding shouldn't be vague and we shouldn't have access to broad set of data if not necessary. Where \u2013 will the data be stored and what legal jurisdiction will preside over that data. This is particularly important in countries like Germany, where different privacy laws apply but also important when it comes to responding to legal subpoenas for the data. When \u2013 will the access to data be provided and for how long? It is important to not leave straggling access to data once the engagement is completed, and define a priori the data retention policies. Why \u2013 have you given access to the data? This is particularly important to clarify the purpose and any restrictions on usage beyond the intended purpose. Please use the above guidelines to ensure the data is used only for intended purposes and thereby gain trust. It is important to be aware of data handling best practices and ensure the required clarity is provided to adhere to the above 5Ws. 
Handling Data in ISE Engagements Data should never leave customer-controlled environments and contractors and/or other members in the engagement should never have access to complete customer data sets but use limited customer data sets using the following prioritized approaches: Contractors or engagement partners do not work directly with production data, data will be copied before processing per the guidelines below. Always apply data minimization principles to minimize the blast radius of errors, only work with the minimal data set required to achieve the goals. Generate synthetic data to support engagement work. If synthetic data is not possible to achieve project goals, request anonymized data in which the likelihood that unique individuals can be re-identified is minimal. Select a suitably diverse, limited data set, again, follow the Principles of Data Minimization and attempt to work with the fewest rows possible to achieve the goals. Before work begins on data, ensure OS patches are up to date and permissions are properly set with no open internet access. Developers working on ISE projects will work with our customers to define the data needed for each engagement. If there is a need to access production data, ISE needs to review the need with their lead and work with the customer to put audits in place verifying what data was accessed. Production data must only be shared with approved members of the engagement team and must not be processed/transferred outside of the customer controlled environment. Customers should provide ISE with a copy of the requested data in a location managed by the customer. The customer should consider turning any logging capabilities on so they can clearly identify who has access and what they do with that access. ISE should notify the customer when they are done with the data and suggest the customer destroy copies of the data if they are no longer needed. Our Guiding Principles when Handling Data in an Engagement Never directly access production data. Explicitly state the intended purpose of data that can be used for engagement. Only share copies of the production data with the approved members of the engagement team. The entire team should work together to ensure that there are no dead copies of data. When the data is no longer needed, the team should promptly work to clean up engagement copies of data. Do not send any copies of the production data outside the customer-controlled environment. Only use the minimal subset of the data needed for the purpose of the engagement. Questions to Consider when Working with Data What data do we need? What is the legal basis for processing this data? If we are the processor based on a contract obligation, what is our responsibility listed in the contract? Does the contract need to be amended? How can we contain data proliferation? What security controls are in place to protect this data? What is the data breach protocol? How does this data benefit the data subject? What is the lifespan of this data? Do we need to keep this data linked to a data subject? Can we turn this data into Not in a Position to Identify (NPI) data to be used later on? How is the system architected so data subject rights can be fulfilled? (e.g. manually, automated) If personal data is involved, have we engaged the privacy and legal teams for this project?
Summary It is important to only pull in data that is needed for the problem at hand, when this is put in practice we find that we only maintain data that is adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed. This is particularly important for personal data. Once you have personal data there are many rules and regulations that apply, some examples of these might be HIPAA, GDPR, CCPA. The customer should be aware of and surface any applicable regulations that apply to their data. Furthermore the seven principles of privacy by design should be reviewed and considered when handling any type of sensitive data. Resources Microsoft Trust Center Tools for responsible AI - Protect Data Protection Resources FAQ and White Papers Microsoft Compliance Offerings Accountability Readiness Checklists Privacy by Design The 7 Foundational Principles","title":"Privacy and Data"},{"location":"privacy/data-handling/#privacy-and-data","text":"","title":"Privacy and Data"},{"location":"privacy/data-handling/#goal","text":"The goal of this section is to briefly describe best practices in privacy fundamentals for data heavy projects or portions of a project that may contain data. What it is not : This document is not a checklist for how customers or readers should handle data in their environment, and does not override Microsoft's or the customers' policies for data handling, data protection and information security.","title":"Goal"},{"location":"privacy/data-handling/#introduction","text":"Microsoft runs on trust. Our customers trust ISE to adhere to the highest standards when handling their data. Protecting our customers' data is a joint responsibility between Microsoft and the customers; both have the responsibility to help projects follow the guidelines outlined on this page. Developers working on ISE projects should implement best practices and guidance on handling data throughout the project phases. This page is not meant to suggest how customers should handle data in their environment. It does not override : Microsoft's Information Security Policy Limited Data Protection Addendum Professional Services Data Protection Addendum","title":"Introduction"},{"location":"privacy/data-handling/#5-ws-of-data-handling","text":"When working on an engagement it is important to address the following 5 W 's: Who \u2013 gets access to and with whom will we share the data and/or models developed with the data? What \u2013 data is shared with us and under what expectations and understanding. Customers need to be explicit about how the data they share applies to the overarching effort. The understanding shouldn't be vague and we shouldn't have access to broad set of data if not necessary. Where \u2013 will the data be stored and what legal jurisdiction will preside over that data. This is particularly important in countries like Germany, where different privacy laws apply but also important when it comes to responding to legal subpoenas for the data. When \u2013 will the access to data be provided and for how long? It is important to not leave straggling access to data once the engagement is completed, and define a priori the data retention policies. Why \u2013 have you given access to the data? This is particularly important to clarify the purpose and any restrictions on usage beyond the intended purpose. Please use the above guidelines to ensure the data is used only for intended purposes and thereby gain trust. 
It is important to be aware of data handling best practices and ensure the required clarity is provided to adhere to the above 5Ws.","title":"5 W's of Data Handling"},{"location":"privacy/data-handling/#handling-data-in-ise-engagements","text":"Data should never leave customer-controlled environments and contractors and/or other members in the engagement should never have access to complete customer data sets but use limited customer data sets using the following prioritized approaches: Contractors or engagement partners do not work directly with production data, data will be copied before processing per the guidelines below. Always apply data minimization principles to minimize the blast radius of errors, only work with the minimal data set required to achieve the goals. Generate synthetic data to support engagement work. If synthetic data is not possible to achieve project goals, request anonymized data in which the likelihood that unique individuals can be re-identified is minimal. Select a suitably diverse, limited data set, again, follow the Principles of Data Minimization and attempt to work with the fewest rows possible to achieve the goals. Before work begins on data, ensure OS patches are up to date and permissions are properly set with no open internet access. Developers working on ISE projects will work with our customers to define the data needed for each engagement. If there is a need to access production data, ISE needs to review the need with their lead and work with the customer to put audits in place verifying what data was accessed. Production data must only be shared with approved members of the engagement team and must not be processed/transferred outside of the customer controlled environment. Customers should provide ISE with a copy of the requested data in a location managed by the customer. The customer should consider turning any logging capabilities on so they can clearly identify who has access and what they do with that access. ISE should notify the customer when they are done with the data and suggest the customer destroy copies of the data if they are no longer needed.","title":"Handling Data in ISE Engagements"},{"location":"privacy/data-handling/#our-guiding-principles-when-handling-data-in-an-engagement","text":"Never directly access production data. Explicitly state the intended purpose of data that can be used for engagement. Only share copies of the production data with the approved members of the engagement team. The entire team should work together to ensure that there are no dead copies of data. When the data is no longer needed, the team should promptly work to clean up engagement copies of data. Do not send any copies of the production data outside the customer-controlled environment. Only use the minimal subset of the data needed for the purpose of the engagement.","title":"Our Guiding Principles when Handling Data in an Engagement"},{"location":"privacy/data-handling/#questions-to-consider-when-working-with-data","text":"What data do we need? What is the legal basis for processing this data? If we are the processor based on contract obligation what is our responsibility listed in the contract? Does the contract need to be amended? How can we contain data proliferation? What security controls are in place to protect this data? What is the data breech protocol? How does this data benefit the data subject? What is the lifespan of this data? Do we need to keep this data linked to a data subject? 
Can we turn this data into Not in a Position to Identify (NPI) data to be used later on? How is the system architected so data subject rights can be fulfilled? (ex manually, automated) If personal data is involved have engaged privacy and legal teams for this project?","title":"Questions to Consider when Working with Data"},{"location":"privacy/data-handling/#summary","text":"It is important to only pull in data that is needed for the problem at hand, when this is put in practice we find that we only maintain data that is adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed. This is particularly important for personal data. Once you have personal data there are many rules and regulations that apply, some examples of these might be HIPAA, GDPR, CCPA. The customer should be aware of and surface any applicable regulations that apply to their data. Furthermore the seven principles of privacy by design should be reviewed and considered when handling any type of sensitive data.","title":"Summary"},{"location":"privacy/data-handling/#resources","text":"Microsoft Trust Center Tools for responsible AI - Protect Data Protection Resources FAQ and White Papers Microsoft Compliance Offerings Accountability Readiness Checklists Privacy by Design The 7 Foundational Principles","title":"Resources"},{"location":"privacy/privacy-frameworks/","text":"Privacy Related frameworks The following tools/frameworks could be leveraged when data analysis or model development needs to take place on private data. Note that the use of such frameworks still requires the solution to adhere to privacy regulations and others, and additional safeguards should be applied. Typical Scenarios for Leveraging a Privacy Framework Sharing data or results while preserving data subjects' privacy Performing analysis or statistical modeling on private data Developing privacy preserving ML models and data pipelines Privacy Frameworks Protecting private data involves the entire data lifecycle, from acquisition, through storage, processing, analysis, modeling and usage in reports or machine learning models. Proper safeguards and restrictions should be applied in each of these phases. In this section we provide a non-exhaustive list of privacy frameworks which can be leveraged for protecting and preserving privacy. We focus on four main use cases in the data lifecycle: Obtaining non-sensitive data Establishing trusted research and modeling environments Creating privacy preserving data and ML pipelines Data loss prevention Obtaining Non-Sensitive Data In many scenarios, analysts, researchers and data scientists require access to a non-sensitive version or sample of the private data. In this section we focus on two approaches for obtaining non-sensitive data. Note: These two approaches do not guarantee that the outcome would not include private data, and additional measures should be applied. Data De-Identification De-identification is the process of applying a set of transformations to a dataset, in order to lower the risk of unintended disclosure of personal data. De-identification involves the removal or substitution of both direct identifiers (such as name, or social security number) or quasi-identifiers, which can be used for re-identification using additional external information. De-identification can be applied to different types of data, such as structured data, images and text. 
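As a minimal illustration of the de-identification idea just described, the sketch below uses Presidio, one of the open-source tools listed in the table that follows, to detect and replace direct identifiers in free text. It assumes the `presidio-analyzer` and `presidio-anonymizer` packages (plus a spaCy language model) are installed; the exact entities detected depend on the recognizers configured.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is John Smith and my phone number is 212-555-1234."

# Detect PII entities (names, phone numbers, etc.) in the text.
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# Replace the detected entities with placeholders such as <PERSON>.
anonymizer = AnonymizerEngine()
result = anonymizer.anonymize(text=text, analyzer_results=findings)

print(result.text)  # e.g. "My name is <PERSON> and my phone number is <PHONE_NUMBER>."
```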
However, de-identification of non-structured data often involves statistical approaches which might result in undetected PII (Personal Identifiable Information) or non-private information being redacted or replaced. Here we outline several de-identification solutions available as open source: Solution Notes Presidio Presidio helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more in unstructured text and images. It's useful when high customization is required, for example to detect custom PII entities or languages. Link to repo , link to docs , link to demo . FHIR tools for anonymization FHIR Tools for Anonymization is an open-source project that helps anonymize healthcare FHIR data (FHIR=Fast Healthcare Interoperability Resources, a standard for exchanging Electric Health Records), on-premises or in the cloud, for secondary usage such as research, public health, and more. Link . Works with FHIR format (Stu3 and R4), allows different strategies for anonymization (date shift, crypto-hash, encrypt, substitute, perturb, generalize) ARX Anonymization using statistical models, specifically k-anonymity, \u2113-diversity, t-closeness and \u03b4-presence. Useful for validating the anonymization of aggregated data. Links: Repo , Website . Written in Java. k-Anonymity GitHub repo with examples on how to produce k-anonymous datasets. k-anonymity protects the privacy of individual persons by pooling their attributes into groups of at least k people. repo Synthetic Data Generation A synthetic dataset is a repository of data generated from actual data and has the same statistical properties as the real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. The potential benefit of such synthetic datasets is for sensitive applications \u2013 medical classifications or financial modelling, where getting hands on a high-quality labelled dataset is often prohibitive. When determining the best method for creating synthetic data, it is essential first to consider what type of synthetic data you aim to have. There are two broad categories to choose from, each with different benefits and drawbacks: Fully synthetic: This data does not contain any original data, which means that re-identification of any single unit is almost impossible, and all variables are still fully available. Partially synthetic: Only sensitive data is replaced with synthetic data, which requires a heavy dependency on the imputation model. This leads to decreased model dependence but does mean that some disclosure is possible due to the actual values within the dataset. Solution Notes Synthea Synthea was developed with numerous data sources collected on the internet, including US Census Bureau demographics, Centers for Disease Control and Prevention prevalence and incidence rates, and National Institutes of Health reports. The source code and disease models include annotations and citations for all data, statistics, and treatments. These models of diseases and treatments interact appropriately with the health record. PII dataset generator A synthetic data generator developed on top of Fake Name Generator which takes a text file with templates (e.g. my name is PERSON ) and creates a list of Input Samples which contain fake PII entities instead of placeholders. 
CheckList CheckList provides a framework for perturbation techniques to evaluate specific behavioral capabilities of NLP models systematically Mimesis Mimesis a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages. Faker Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you. Plaitpy The idea behind plait.py is that it should be easy to model fake data that has an interesting shape. Currently, many fake data generators model their data as a collection of IID variables; with plait.py we can stitch together those variables into a more coherent model. Trusted Research and Modeling Environments Trusted Research Environments Trusted Research Environments (TREs) enable organizations to create secure workspaces for analysts, data scientists and researchers who require access to sensitive data. TREs enforce a secure boundary around distinct workspaces to enable information governance controls. Each workspace is accessible by a set of authorized users, prevents the exfiltration of sensitive data, and has access to one or more datasets provided by the data platform. We highlight several alternatives for Trusted Research Environments: Solution Notes Azure Trusted Research Environment An Open Source TRE for Azure. Aridhia DRE Eyes-Off Machine Learning In certain situations, Data Scientists may need to train models on data they are not allowed to see. In these cases, an \"eyes-off\" approach is recommended. An eyes-off approach provides a data scientist with an environment in which scripts can be run on the data but direct access to samples is not allowed. When using Azure ML, tools such as the Identity Based Data Access can enable this scenario, alongside proper role assignment for users. During the processing within the eyes-off environment, only certain outputs (e.g. logs) are allowed to be extracted back to the user. For example, a user would be able to submit a script which trains a model and inspect the model's performance, but would not be able to see on which samples the model predicted the wrong output. In addition to the eyes-off environment, this approach usually entails providing access to an \"eyes-on\" dataset, which is a representative, cleansed, sample set of data for model design purposes. The Eyes-on dataset is often a de-identified subset of the private dataset, or a synthetic dataset generated based on the characteristics of the private dataset. Private Data Sharing Platforms Various tools and systems allow different parties to share data with 3rd parties while protecting private entities, and securely process data while reducing the likelihood of data exfiltration. These tools include Secure Multi Party Computation (SMPC) systems, Homomorphic Encryption systems, Confidential Computing , private data analysis frameworks such as PySift among others. Privacy Preserving Data Pipelines and ML Even when our data is secure, private entities can still be extracted when the data is consumed. Privacy preserving data pipelines and ML models focus on minimizing the risk of private data exfiltration during data querying or model predictions. 
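Tying back to the fake-data generators in the table above (such as Faker), the sketch below produces a small, fully synthetic table with no real individuals behind it. The field names and row count are illustrative assumptions.

```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # make the synthetic output reproducible

# Generate a small, fully synthetic "customer" table.
synthetic_customers = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
    }
    for _ in range(5)
]

for row in synthetic_customers:
    print(row)
```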
Differential Privacy Differential privacy (DP) is a system that enables one to extract meaningful insights from datasets about subgroups of people, while also providing strong guarantees with regards to protecting any given individual's privacy. This is typically achieved by adding a small statistical noise to every individual's information, thereby introducing uncertainty in the data. However, the insights gleaned still accurately represent what we intend to learn about the population in the aggregate. This approach is known to be robust to re-identification attacks and data reconstruction by adversaries who possess auxiliary information. For a more comprehensive overview, check out Differential privacy: A primer for a non-technical audience . DP has been widely adopted in various scenarios such as learning from census data, user telemetry data analysis, audience engagement to advertisements, and health data insights where PII protection is of paramount importance. However, DP is less suitable for small datasets. Tools that implement DP include SmartNoise , Tensorflow Privacy among some others. Homomorphic Encryption Homomorphic Encryption (HE) is a form of encryption allowing one to perform calculations on encrypted data without decrypting it first. The result of the computation F is in an encrypted form, which on decrypting gives us the same result if computation F was done on raw unencrypted data. ( source ) Homomorphic Encryption frameworks: Solution Notes Microsoft SEAL Secure Cloud Storage and Computation, ML Modeling. A widely used open-source library from Microsoft that supports the BFV and the CKKS schemes. Palisade A widely used open-source library from a consortium of DARPA-funded defense contractors that supports multiple homomorphic encryption schemes such as BGV, BFV, CKKS, TFHE and FHEW, among others, with multiparty support. Link to repo PySift Private deep learning. PySyft decouples private data from model training, using Federated Learning, Differential Privacy, and Encrypted Computation (like Multi-Party Computation (MPC) and Homomorphic Encryption (HE)) within the main Deep Learning frameworks like PyTorch and TensorFlow. A list of additional OSS tools can be found here . Federated Learning Federated learning is a Machine Learning technique which allows the training of ML models in a decentralized way without having to share the actual data. Instead of sending data to the processing engine of the model, the approach is to distribute the model to the different data owners and perform training in a distributed fashion. Federated learning frameworks: Solution Notes TensorFlow Federated Learning OSS federated learning system built on top of TensorFlow FATE An OSS federated learning system with different options for deployment and different algorithms adapted for federated learning IBM Federated Learning A Python based federated learning framework focused on enterprise environments. Data Loss Prevention Organizations have sensitive information under their control such as financial data, proprietary data, credit card numbers, health records, or social security numbers. To help protect this sensitive data and reduce risk, they need a way to prevent their users from inappropriately sharing it with people who shouldn't have it. This practice is called data loss prevention (DLP) . Below we focus on two aspects of DLP: Sensitive data classification and Access management. 
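As a minimal, illustrative sketch of the differential-privacy idea described above (and not of any particular library such as SmartNoise), the snippet below releases a noisy count using the Laplace mechanism. The toy dataset, epsilon value, and sensitivity are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1
    (one person can change the count by at most 1)."""
    true_count = len(values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy example: how many patients in a cohort are over 65?
ages = [67, 45, 71, 80, 52, 66, 58, 90]
over_65 = [a for a in ages if a > 65]

print("True count:", len(over_65))
print("DP count (epsilon=0.5):", round(dp_count(over_65, epsilon=0.5), 2))
```

Smaller epsilon values add more noise and give stronger privacy guarantees at the cost of accuracy; dedicated libraries also track the cumulative privacy budget across repeated queries.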
Sensitive Data Classification Sensitive data classification is an important aspect of DLP, as it allows organizations to track, monitor, secure and identify sensitive and private data. Furthermore, different sensitivity levels can be applied to different data items, facilitating proper governance and cataloging. There are typically four levels data classification levels: Public Internal Confidential Restricted / Highly confidential Tools for data classification on Azure: Solution Notes Microsoft Information Protection (MIP) A suite for DLP, sensitive data classification, cataloging and more. Azure Purview A unified data governance service, which includes the classification and cataloging of sensitive data. Azure Purview leverages the MIP technology for data classification and more. Data Discovery & Classification for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Basic capabilities for discovering, classifying, labeling, and reporting the sensitive data in Azure SQL and Synapse databases. Data Discovery & Classification for SQL Server Capabilities for discovering, classifying, labeling & reporting the sensitive data in SQL Server databases. Often, tools used for de-identification can also serve as sensitive data classifiers. Refer to de-identification tools for such tools. Additional resources: Example guidelines for data classification Learn about sensitivity levels Access Management Access control is an important component of privacy by design and falls into overall data lifecycle protection. Successful access control will restrict access only to authorized individuals that should have access to data. Once data is secure in an environment, it is important to review who should access this data and for what purpose. Access control may be audited with a comprehensive logging strategy which may include the integration of activity logs that can provide insight into operations performed on resources in a subscription. OWASP Access Control Cheat Sheet","title":"Privacy Related frameworks"},{"location":"privacy/privacy-frameworks/#privacy-related-frameworks","text":"The following tools/frameworks could be leveraged when data analysis or model development needs to take place on private data. Note that the use of such frameworks still requires the solution to adhere to privacy regulations and others, and additional safeguards should be applied.","title":"Privacy Related frameworks"},{"location":"privacy/privacy-frameworks/#typical-scenarios-for-leveraging-a-privacy-framework","text":"Sharing data or results while preserving data subjects' privacy Performing analysis or statistical modeling on private data Developing privacy preserving ML models and data pipelines","title":"Typical Scenarios for Leveraging a Privacy Framework"},{"location":"privacy/privacy-frameworks/#privacy-frameworks","text":"Protecting private data involves the entire data lifecycle, from acquisition, through storage, processing, analysis, modeling and usage in reports or machine learning models. Proper safeguards and restrictions should be applied in each of these phases. In this section we provide a non-exhaustive list of privacy frameworks which can be leveraged for protecting and preserving privacy. 
We focus on four main use cases in the data lifecycle: Obtaining non-sensitive data Establishing trusted research and modeling environments Creating privacy preserving data and ML pipelines Data loss prevention","title":"Privacy Frameworks"},{"location":"privacy/privacy-frameworks/#obtaining-non-sensitive-data","text":"In many scenarios, analysts, researchers and data scientists require access to a non-sensitive version or sample of the private data. In this section we focus on two approaches for obtaining non-sensitive data. Note: These two approaches do not guarantee that the outcome would not include private data, and additional measures should be applied.","title":"Obtaining Non-Sensitive Data"},{"location":"privacy/privacy-frameworks/#data-de-identification","text":"De-identification is the process of applying a set of transformations to a dataset, in order to lower the risk of unintended disclosure of personal data. De-identification involves the removal or substitution of both direct identifiers (such as name, or social security number) or quasi-identifiers, which can be used for re-identification using additional external information. De-identification can be applied to different types of data, such as structured data, images and text. However, de-identification of non-structured data often involves statistical approaches which might result in undetected PII (Personal Identifiable Information) or non-private information being redacted or replaced. Here we outline several de-identification solutions available as open source: Solution Notes Presidio Presidio helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more in unstructured text and images. It's useful when high customization is required, for example to detect custom PII entities or languages. Link to repo , link to docs , link to demo . FHIR tools for anonymization FHIR Tools for Anonymization is an open-source project that helps anonymize healthcare FHIR data (FHIR=Fast Healthcare Interoperability Resources, a standard for exchanging Electric Health Records), on-premises or in the cloud, for secondary usage such as research, public health, and more. Link . Works with FHIR format (Stu3 and R4), allows different strategies for anonymization (date shift, crypto-hash, encrypt, substitute, perturb, generalize) ARX Anonymization using statistical models, specifically k-anonymity, \u2113-diversity, t-closeness and \u03b4-presence. Useful for validating the anonymization of aggregated data. Links: Repo , Website . Written in Java. k-Anonymity GitHub repo with examples on how to produce k-anonymous datasets. k-anonymity protects the privacy of individual persons by pooling their attributes into groups of at least k people. repo","title":"Data De-Identification"},{"location":"privacy/privacy-frameworks/#synthetic-data-generation","text":"A synthetic dataset is a repository of data generated from actual data and has the same statistical properties as the real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. The potential benefit of such synthetic datasets is for sensitive applications \u2013 medical classifications or financial modelling, where getting hands on a high-quality labelled dataset is often prohibitive. 
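To complement the de-identification table above, the sketch below shows what the k-anonymity property mentioned there means in practice: every combination of quasi-identifiers should describe at least k people. It is a simple pandas check under assumed column names, not a replacement for a full tool such as ARX.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """Return True if every combination of quasi-identifier values
    appears at least k times in the dataset."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Toy dataset with generalized quasi-identifiers (age band, ZIP prefix).
records = pd.DataFrame(
    {
        "age_band": ["30-40", "30-40", "30-40", "40-50", "40-50", "40-50"],
        "zip_prefix": ["981", "981", "981", "983", "983", "983"],
        "diagnosis": ["A", "B", "A", "C", "A", "B"],  # sensitive attribute, not a quasi-identifier
    }
)

print(is_k_anonymous(records, ["age_band", "zip_prefix"], k=3))  # True
print(is_k_anonymous(records, ["age_band", "zip_prefix"], k=4))  # False
```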
When determining the best method for creating synthetic data, it is essential first to consider what type of synthetic data you aim to have. There are two broad categories to choose from, each with different benefits and drawbacks: Fully synthetic: This data does not contain any original data, which means that re-identification of any single unit is almost impossible, and all variables are still fully available. Partially synthetic: Only sensitive data is replaced with synthetic data, which requires a heavy dependency on the imputation model. This leads to decreased model dependence but does mean that some disclosure is possible due to the actual values within the dataset. Solution Notes Synthea Synthea was developed with numerous data sources collected on the internet, including US Census Bureau demographics, Centers for Disease Control and Prevention prevalence and incidence rates, and National Institutes of Health reports. The source code and disease models include annotations and citations for all data, statistics, and treatments. These models of diseases and treatments interact appropriately with the health record. PII dataset generator A synthetic data generator developed on top of Fake Name Generator which takes a text file with templates (e.g. my name is PERSON ) and creates a list of Input Samples which contain fake PII entities instead of placeholders. CheckList CheckList provides a framework for perturbation techniques to evaluate specific behavioral capabilities of NLP models systematically Mimesis Mimesis a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages. Faker Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you. Plaitpy The idea behind plait.py is that it should be easy to model fake data that has an interesting shape. Currently, many fake data generators model their data as a collection of IID variables; with plait.py we can stitch together those variables into a more coherent model.","title":"Synthetic Data Generation"},{"location":"privacy/privacy-frameworks/#trusted-research-and-modeling-environments","text":"","title":"Trusted Research and Modeling Environments"},{"location":"privacy/privacy-frameworks/#trusted-research-environments","text":"Trusted Research Environments (TREs) enable organizations to create secure workspaces for analysts, data scientists and researchers who require access to sensitive data. TREs enforce a secure boundary around distinct workspaces to enable information governance controls. Each workspace is accessible by a set of authorized users, prevents the exfiltration of sensitive data, and has access to one or more datasets provided by the data platform. We highlight several alternatives for Trusted Research Environments: Solution Notes Azure Trusted Research Environment An Open Source TRE for Azure. Aridhia DRE","title":"Trusted Research Environments"},{"location":"privacy/privacy-frameworks/#eyes-off-machine-learning","text":"In certain situations, Data Scientists may need to train models on data they are not allowed to see. In these cases, an \"eyes-off\" approach is recommended. An eyes-off approach provides a data scientist with an environment in which scripts can be run on the data but direct access to samples is not allowed. 
When using Azure ML, tools such as the Identity Based Data Access can enable this scenario, alongside proper role assignment for users. During the processing within the eyes-off environment, only certain outputs (e.g. logs) are allowed to be extracted back to the user. For example, a user would be able to submit a script which trains a model and inspect the model's performance, but would not be able to see on which samples the model predicted the wrong output. In addition to the eyes-off environment, this approach usually entails providing access to an \"eyes-on\" dataset, which is a representative, cleansed, sample set of data for model design purposes. The Eyes-on dataset is often a de-identified subset of the private dataset, or a synthetic dataset generated based on the characteristics of the private dataset.","title":"Eyes-Off Machine Learning"},{"location":"privacy/privacy-frameworks/#private-data-sharing-platforms","text":"Various tools and systems allow different parties to share data with 3rd parties while protecting private entities, and securely process data while reducing the likelihood of data exfiltration. These tools include Secure Multi Party Computation (SMPC) systems, Homomorphic Encryption systems, Confidential Computing , private data analysis frameworks such as PySift among others.","title":"Private Data Sharing Platforms"},{"location":"privacy/privacy-frameworks/#privacy-preserving-data-pipelines-and-ml","text":"Even when our data is secure, private entities can still be extracted when the data is consumed. Privacy preserving data pipelines and ML models focus on minimizing the risk of private data exfiltration during data querying or model predictions.","title":"Privacy Preserving Data Pipelines and ML"},{"location":"privacy/privacy-frameworks/#differential-privacy","text":"Differential privacy (DP) is a system that enables one to extract meaningful insights from datasets about subgroups of people, while also providing strong guarantees with regards to protecting any given individual's privacy. This is typically achieved by adding a small statistical noise to every individual's information, thereby introducing uncertainty in the data. However, the insights gleaned still accurately represent what we intend to learn about the population in the aggregate. This approach is known to be robust to re-identification attacks and data reconstruction by adversaries who possess auxiliary information. For a more comprehensive overview, check out Differential privacy: A primer for a non-technical audience . DP has been widely adopted in various scenarios such as learning from census data, user telemetry data analysis, audience engagement to advertisements, and health data insights where PII protection is of paramount importance. However, DP is less suitable for small datasets. Tools that implement DP include SmartNoise , Tensorflow Privacy among some others.","title":"Differential Privacy"},{"location":"privacy/privacy-frameworks/#homomorphic-encryption","text":"Homomorphic Encryption (HE) is a form of encryption allowing one to perform calculations on encrypted data without decrypting it first. The result of the computation F is in an encrypted form, which on decrypting gives us the same result if computation F was done on raw unencrypted data. ( source ) Homomorphic Encryption frameworks: Solution Notes Microsoft SEAL Secure Cloud Storage and Computation, ML Modeling. A widely used open-source library from Microsoft that supports the BFV and the CKKS schemes. 
Palisade A widely used open-source library from a consortium of DARPA-funded defense contractors that supports multiple homomorphic encryption schemes such as BGV, BFV, CKKS, TFHE and FHEW, among others, with multiparty support. Link to repo PySift Private deep learning. PySyft decouples private data from model training, using Federated Learning, Differential Privacy, and Encrypted Computation (like Multi-Party Computation (MPC) and Homomorphic Encryption (HE)) within the main Deep Learning frameworks like PyTorch and TensorFlow. A list of additional OSS tools can be found here .","title":"Homomorphic Encryption"},{"location":"privacy/privacy-frameworks/#federated-learning","text":"Federated learning is a Machine Learning technique which allows the training of ML models in a decentralized way without having to share the actual data. Instead of sending data to the processing engine of the model, the approach is to distribute the model to the different data owners and perform training in a distributed fashion. Federated learning frameworks: Solution Notes TensorFlow Federated Learning OSS federated learning system built on top of TensorFlow FATE An OSS federated learning system with different options for deployment and different algorithms adapted for federated learning IBM Federated Learning A Python based federated learning framework focused on enterprise environments.","title":"Federated Learning"},{"location":"privacy/privacy-frameworks/#data-loss-prevention","text":"Organizations have sensitive information under their control such as financial data, proprietary data, credit card numbers, health records, or social security numbers. To help protect this sensitive data and reduce risk, they need a way to prevent their users from inappropriately sharing it with people who shouldn't have it. This practice is called data loss prevention (DLP) . Below we focus on two aspects of DLP: Sensitive data classification and Access management.","title":"Data Loss Prevention"},{"location":"privacy/privacy-frameworks/#sensitive-data-classification","text":"Sensitive data classification is an important aspect of DLP, as it allows organizations to track, monitor, secure and identify sensitive and private data. Furthermore, different sensitivity levels can be applied to different data items, facilitating proper governance and cataloging. There are typically four levels data classification levels: Public Internal Confidential Restricted / Highly confidential Tools for data classification on Azure: Solution Notes Microsoft Information Protection (MIP) A suite for DLP, sensitive data classification, cataloging and more. Azure Purview A unified data governance service, which includes the classification and cataloging of sensitive data. Azure Purview leverages the MIP technology for data classification and more. Data Discovery & Classification for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Basic capabilities for discovering, classifying, labeling, and reporting the sensitive data in Azure SQL and Synapse databases. Data Discovery & Classification for SQL Server Capabilities for discovering, classifying, labeling & reporting the sensitive data in SQL Server databases. Often, tools used for de-identification can also serve as sensitive data classifiers. Refer to de-identification tools for such tools. 
Additional resources: Example guidelines for data classification Learn about sensitivity levels","title":"Sensitive Data Classification"},{"location":"privacy/privacy-frameworks/#access-management","text":"Access control is an important component of privacy by design and falls into overall data lifecycle protection. Successful access control will restrict access only to authorized individuals that should have access to data. Once data is secure in an environment, it is important to review who should access this data and for what purpose. Access control may be audited with a comprehensive logging strategy which may include the integration of activity logs that can provide insight into operations performed on resources in a subscription. OWASP Access Control Cheat Sheet","title":"Access Management"},{"location":"security/","text":"Security Developers working on projects should adhere to industry-recommended standard practices for secure design and implementation of code. For the purposes of our customers, this means our engineers should understand the OWASP Top 10 Web Application Security Risks , as well as how to mitigate as many of them as possible, using the resources below. If you are looking for a fast way to get started evaluating your application or design, check out the \"Secure Coding Practices Quick Reference\" document below, which contains an itemized checklist of high-level concepts you can validate are being done properly. This checklist covers many common errors associated with the OWASP Top 10 list linked above, and should be the minimum amount of effort being put into security. Requesting Security Reviews When requesting a security review for your application, please make sure you have familiarized yourself with the Rules of Engagement . This will help you to prepare the application for testing, as well as understand the scope limits of the test. Quick Resources Secure Coding Practices Quick Reference Web Application Security Quick Reference Security Mindset/Creating a Security Program Quick Start Credential Scanning / Secret Detection Threat Modelling Azure DevOps Security Security Engineering DevSecOps Practices Azure DevOps Data Protection Overview Security and Identity in Azure DevOps Security Code Analysis DevSecOps Introduce security to your project at early stages. The DevSecOps section covers security practices, automation, tools and frameworks as part of the application CI. OWASP Cheat Sheets Note: OWASP is considered to be the gold-standard in computer security information. OWASP maintains an extensive series of cheat sheets which cover all the OWASP Top 10 and more. Below, many of the more relevant cheat sheets have been summarized. To view all the cheat sheets, check out their Cheat Sheet Index . Attack Surface Analysis Authorization Basics Content Security Policy (CSP) Cross-Site Request Forgery (CSRF) Prevention Cross-Site Scripting (XSS) Prevention Cryptographic Storage Deserialization Docker/Kubernetes (k8s) Security Input Validation Key Management OS Command Injection Defense Query Parameterization Examples Server-Side Request Forgery Prevention SQL Injection Prevention Unvalidated Redirects and Forwards Web Service Security XML Security Recommended Tools Check out the list of tools to help enable security in your projects. Note: Although some tools are agnostic, the below list is geared towards Cloud Native security, with a focus on Kubernetes. Vulnerability Scanning SonarCloud Integrates with Azure Devops with the click of a button. 
Snyk Trivy Cloudsploit Anchore Other tools from OWASP See why you should check for vulnerabilities at all layers of the stack , as well as a couple of other useful tips to reduce surface area for attacks. Runtime Security Falco Tracee Kubelinter May not fully qualify as runtime security, but helps ensure you're enabling best practices. Binary Authorization Binary authorization can happen both at the docker registry layer, and runtime (ie: via a K8s admission controller). The authorization check ensures that the image is signed by a trusted authority. This can occur for both (pre-approved) 3rd party images, and internal images. Taking this a step further the signing should occur only on images where all code has been reviewed and approved. Binary authorization can both reduce the impact of damage from a compromised hosting environment, and the damage from malicious insiders. Harbor Operator available Portieris Notary Note harbor leverages notary internally. TUF Other K8s Security OPA , Gatekeeper , and the Gatekeeper Library cert-manager for easy certificate provisioning and automatic rotation. Quickly enable mTLS between your microservices with Linkerd . Resources Non-Functional Requirements Guidance","title":"Security"},{"location":"security/#security","text":"Developers working on projects should adhere to industry-recommended standard practices for secure design and implementation of code. For the purposes of our customers, this means our engineers should understand the OWASP Top 10 Web Application Security Risks , as well as how to mitigate as many of them as possible, using the resources below. If you are looking for a fast way to get started evaluating your application or design, check out the \"Secure Coding Practices Quick Reference\" document below, which contains an itemized checklist of high-level concepts you can validate are being done properly. This checklist covers many common errors associated with the OWASP Top 10 list linked above, and should be the minimum amount of effort being put into security.","title":"Security"},{"location":"security/#requesting-security-reviews","text":"When requesting a security review for your application, please make sure you have familiarized yourself with the Rules of Engagement . This will help you to prepare the application for testing, as well as understand the scope limits of the test.","title":"Requesting Security Reviews"},{"location":"security/#quick-resources","text":"Secure Coding Practices Quick Reference Web Application Security Quick Reference Security Mindset/Creating a Security Program Quick Start Credential Scanning / Secret Detection Threat Modelling","title":"Quick Resources"},{"location":"security/#azure-devops-security","text":"Security Engineering DevSecOps Practices Azure DevOps Data Protection Overview Security and Identity in Azure DevOps Security Code Analysis","title":"Azure DevOps Security"},{"location":"security/#devsecops","text":"Introduce security to your project at early stages. The DevSecOps section covers security practices, automation, tools and frameworks as part of the application CI.","title":"DevSecOps"},{"location":"security/#owasp-cheat-sheets","text":"Note: OWASP is considered to be the gold-standard in computer security information. OWASP maintains an extensive series of cheat sheets which cover all the OWASP Top 10 and more. Below, many of the more relevant cheat sheets have been summarized. To view all the cheat sheets, check out their Cheat Sheet Index . 
Attack Surface Analysis Authorization Basics Content Security Policy (CSP) Cross-Site Request Forgery (CSRF) Prevention Cross-Site Scripting (XSS) Prevention Cryptographic Storage Deserialization Docker/Kubernetes (k8s) Security Input Validation Key Management OS Command Injection Defense Query Parameterization Examples Server-Side Request Forgery Prevention SQL Injection Prevention Unvalidated Redirects and Forwards Web Service Security XML Security","title":"OWASP Cheat Sheets"},{"location":"security/#recommended-tools","text":"Check out the list of tools to help enable security in your projects. Note: Although some tools are agnostic, the below list is geared towards Cloud Native security, with a focus on Kubernetes. Vulnerability Scanning SonarCloud Integrates with Azure Devops with the click of a button. Snyk Trivy Cloudsploit Anchore Other tools from OWASP See why you should check for vulnerabilities at all layers of the stack , as well as a couple of other useful tips to reduce surface area for attacks. Runtime Security Falco Tracee Kubelinter May not fully qualify as runtime security, but helps ensure you're enabling best practices. Binary Authorization Binary authorization can happen both at the docker registry layer, and runtime (ie: via a K8s admission controller). The authorization check ensures that the image is signed by a trusted authority. This can occur for both (pre-approved) 3rd party images, and internal images. Taking this a step further the signing should occur only on images where all code has been reviewed and approved. Binary authorization can both reduce the impact of damage from a compromised hosting environment, and the damage from malicious insiders. Harbor Operator available Portieris Notary Note harbor leverages notary internally. TUF Other K8s Security OPA , Gatekeeper , and the Gatekeeper Library cert-manager for easy certificate provisioning and automatic rotation. Quickly enable mTLS between your microservices with Linkerd .","title":"Recommended Tools"},{"location":"security/#resources","text":"Non-Functional Requirements Guidance","title":"Resources"},{"location":"security/rules-of-engagement/","text":"Application Security Analysis: Rules of Engagement When performing application security analysis, it is expected that the tester follow the Rules of Engagement as laid out below. This is to standardize the scope of application testing and provide a concrete awareness of what is considered \"out of scope\" for security analysis. Rules of Engagement - For Those Requesting Review Web Application Firewalls can be up and configured, but do not enable any automatic blocking. This can greatly slow down the person performing the test. Similarly, if a service is running on a virtual machine, ensure services such as fail2ban are disabled. You cannot make changes to the running application until the test is complete. This is to prevent accidentally breaking an otherwise valid attack in progress. Any review results are not considered as \"final\". A security review should always be performed by a security team orchestrated by the customer prior to moving an application into production. If a customer requires further assistance, they can engage Premier Support. Rules of Engagement - For Those Performing Tests Do not attempt to perform Denial-of-Service attacks or otherwise crash services. Heavy active scanning is tolerated (and is assumed to be somewhat of a load test) but deliberate takedowns are not permitted. Do not interact with human beings. 
Phishing credentials or other such client-side attacks are off-limits. Detailing XSS and similar attacks is encouraged as a part of the test, but do not leverage these against internal users or customers. Attack from a single point. Especially if the application is currently in the customer's hands, provide the IP address or hostname of the attacking host to avoid setting off alarms.","title":"Application Security Analysis: Rules of Engagement"},{"location":"security/rules-of-engagement/#application-security-analysis-rules-of-engagement","text":"When performing application security analysis, it is expected that the tester follow the Rules of Engagement as laid out below. This is to standardize the scope of application testing and provide a concrete awareness of what is considered \"out of scope\" for security analysis.","title":"Application Security Analysis: Rules of Engagement"},{"location":"security/rules-of-engagement/#rules-of-engagement-for-those-requesting-review","text":"Web Application Firewalls can be up and configured, but do not enable any automatic blocking. This can greatly slow down the person performing the test. Similarly, if a service is running on a virtual machine, ensure services such as fail2ban are disabled. You cannot make changes to the running application until the test is complete. This is to prevent accidentally breaking an otherwise valid attack in progress. Any review results are not considered as \"final\". A security review should always be performed by a security team orchestrated by the customer prior to moving an application into production. If a customer requires further assistance, they can engage Premier Support.","title":"Rules of Engagement - For Those Requesting Review"},{"location":"security/rules-of-engagement/#rules-of-engagement-for-those-performing-tests","text":"Do not attempt to perform Denial-of-Service attacks or otherwise crash services. Heavy active scanning is tolerated (and is assumed to be somewhat of a load test) but deliberate takedowns are not permitted. Do not interact with human beings. Phishing credentials or other such client-side attacks are off-limits. Detailing XSS and similar attacks is encouraged as a part of the test, but do not leverage these against internal users or customers. Attack from a single point. Especially if the application is currently in the customer's hands, provide the IP address or hostname of the attacking host to avoid setting off alarms.","title":"Rules of Engagement - For Those Performing Tests"},{"location":"security/threat-modelling-example/","text":"Threat Modelling Example This document covers the threat models for a sample project which takes video frames from video camera and process these frames on IoTEdge device and send them to Azure Cognitive Service to get the audio output. These models can be considered as reference template to show how we can construct threat modeling document. Each of the labeled entities in the figures below are accompanied by meta-information which describe the threats, recommended mitigations, and the associated security principle or goal . 
Architecture Diagram Assets Asset Entry Point Trust Level Azure Blob Storage Http End point Connection String Azure Monitor Http End Point Connection String Azure Cognitive Service Http End Point Connection String IoTEdge Module: M1 Http End Point Public Access (Local Area Network) IoTEdge Module: M2 Http End Point Public Access (Local Area Network) IoTEdge Module: M3 Http End Point Public Access (Local Area Network) IoTEdge Module: IoTEdgeMetricsCollector Http EndPoint Public Access (Local Area Network) Application Insights Http End Point Connection String Data Flow Diagram Client Browser makes requests to the M1 IoTEdge module. Browser and IoTEdge device are on same network, so browser directly hits the webapp URL. M1 IoTEdge module interacts with other two IoTEdge modules to render live stream from video device and display order scanning results via WebSockets. IoTEdge modules interact with Azure Cognitive service to get the translated text via OCR and audio stream via Text to Speech Service. IoTEdge modules send telemetry information to application insights. IoTEdge device is deployed with IoTEdge runtime which interacts with IoTEdge hub for deployments. IoTEdge module also sends some data to Azure storage which is required for debugging purpose. Cognitive service, application insights and Azure Storage are authenticated using connection strings which are stored in GitHub secrets and deployed using CI/CD pipelines. Threat List Assumptions Secrets like ACR credentials are stored in GitHub secrets store which are deployed to IoTEdge Device by CI/CD pipelines. However, CI/CD pipelines are out of scope. Threats Vector Threat Mitigation (1) Sniff Unencrypted data can be intercepted in transit Not Mitigated (2) Access to M1 IoT Edge Module Unauthorized Access to M1 IoT Edge Module Not Mitigated (3) Access to M2 IoT Edge Module Unauthorized Access to M2 IoT Edge Module Not Mitigated (4) Access to M3 IoT Edge Module Unauthorized Access to M3 IoT Edge Module Not Mitigated (5) Steal Storage Credentials Unauthorized Access to M2 IoTEdge Module where database secrets are used Not Mitigated (6) Denial Of Service Dos attack on all IoTEdge Modules since there is no Authentication Not Mitigated (7) Tampering with Log data Application Insights is connected via Connection String which is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the log data. Not Mitigated (8) Tampering with video camera device. Video camera path is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the video feed or use another video source or fake video stream. Not Mitigated (9) Spoofing Tampering Azure IoT Hub connection string is stored in .env file on IoTEdge Device. Once user gains access to the device, .env file can be read and attacker cause Dos attacks on IoTHub Not Mitigated (10) Denial of Service DDOS attack Azure Cognitive Service connection string is stored in .env file on IoTEdge Device. Once user gains access to the device, .env file can be read and attacker cause DoS attacks on Azure Cognitive Service Not Mitigated (11) Tampering with Storage Storage connection string is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper data on storage or read from the storage. Not Mitigated (12) Tampering with Storage Cognitive Service connection string is stored in .env file on the IoTEdge device. 
Once a user gains access to the device, the .env file can be read and an attacker can use the Cognitive Service APIs for their own purposes, causing increased cost. Not Mitigated Threat Model Threat Properties Notable Threats # Principle Threat Mitigation 1 Authenticity Since the channel from browser to IoTEdge Module is not authenticated, anyone who gains access to the WiFi network can spoof it. Add authentication in all IoTEdge modules. 2 Confidentiality and Integrity As a result of the vulnerability of not encrypting data, plaintext data could be intercepted during transit via a man-in-the-middle (MitM) attack. Sensitive data could be exposed or tampered with to allow further exploits. All products and services must encrypt data in transit using approved cryptographic protocols and algorithms. Use TLS to encrypt all HTTP-based network traffic. Use other mechanisms, such as IPSec, to encrypt non-HTTP network traffic that contains customer or confidential data. Applies to data flow from browser to IoTEdge modules. 3 Confidentiality Data is a valuable target for most threat actors and attacking the data store directly, as opposed to stealing it during transit, allows data exfiltration at a much larger scale. In our scenario we are storing some data in Azure Blob containers. All customer or confidential data must be encrypted before being written to non-volatile storage media (encrypted at-rest) per the following requirements. Use approved algorithms. This includes AES-256, AES-192, or AES-128. Encryption must be enabled before writing data to storage. Applies to all data stores on the diagram. Azure Storage encrypts data at rest by default (AES-256). 4 Confidentiality Broken or non-existent authentication mechanisms may allow attackers to gain access to confidential information. All services within the Azure Trust Boundary must authenticate all incoming requests, including requests coming from the same network. Proper authorizations should also be applied to prevent unnecessary privileges. Whenever available, use Azure Managed Identities to authenticate services. Service Principals may be used if Managed Identities are not supported. External users or services may use UserName + Passwords, Tokens, Certificates or Connection Strings to authenticate, provided these are stored on Key Vault or any other vaulting solution. For authorization, use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Applies to Azure services like Azure IoTHub, Azure Cognitive Service, and Azure Application Insights, which are authenticated using connection strings. 5 Confidentiality and Integrity A large attack surface, particularly one that is exposed on the internet, will increase the probability of a compromise. Minimize the application attack surface by limiting publicly exposed services. Use strong network controls by using virtual networks, subnets and network security groups to protect against unsolicited traffic. Use Azure Private Endpoint for Azure Storage. Applies to Azure storage. 6 Confidentiality and Integrity The browser and IoTEdge device are connected over the in-store WiFi network. Minimize attacks on the WiFi network by using a secure protocol such as WPA2. Applies to the connection between the browser and IoTEdge devices. 7 Integrity Exploitation of insufficient logging and monitoring is the bedrock of nearly every major incident. Attackers rely on the lack of monitoring and timely response to achieve their goals without being detected.
Logging of critical application events must be performed to ensure that, should a security incident occur, incident response and root-cause analysis may be done. Steps must also be taken to ensure that logs are available and cannot be overwritten or destroyed through malicious or accidental occurrences. At a minimum, the following events should be logged. Login/logout events Privilege delegation events Security validation failures (e.g. input validation or authorization check failures) Application errors and system events Application and system start-ups and shut-downs, as well as logging initialization 6 Availability Exploitation of the public endpoint by malicious actors who aim to render the service unavailable to its intended users by interrupting the service's normal activity, for instance by flooding the target service with requests until normal traffic is unable to be processed (Denial of Service) The application is accessed via a web app deployed as one of the IoTEdge modules on the IoTEdge device. This app can be accessed by anyone on the local area network. Hence DDoS attacks are possible if the attacker gains access to the local area network. All services deployed as IoTEdge modules must use authentication. Applies to services deployed on the IoTEdge device 7 Integrity Tampering with data Data at rest in Azure Storage must be encrypted on disk. Data at rest in Azure can be protected further by Azure Advanced Threat Protection. Data at rest in Azure Storage and the Azure Monitor workspace will use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Data in motion between services can be encrypted using TLS 1.2. Applies to data flow between IoTEdge modules and Azure Services. Security Principles Confidentiality refers to the objective of keeping data private or secret. In practice, it\u2019s about controlling access to data to prevent unauthorized disclosure. Integrity is about ensuring that data has not been tampered with and, therefore, can be trusted. It is correct, authentic, and reliable. Availability means that networks, systems, and applications are up and running. It ensures that authorized users have timely, reliable access to resources when they are needed.","title":"Threat Modelling Example"},{"location":"security/threat-modelling-example/#threat-modelling-example","text":"This document covers the threat models for a sample project that takes video frames from a video camera, processes these frames on an IoTEdge device, and sends them to Azure Cognitive Service to get the audio output. These models can be considered a reference template showing how to construct a threat modeling document.
Each of the labeled entities in the figures below are accompanied by meta-information which describe the threats, recommended mitigations, and the associated security principle or goal .","title":"Threat Modelling Example"},{"location":"security/threat-modelling-example/#architecture-diagram","text":"","title":"Architecture Diagram"},{"location":"security/threat-modelling-example/#assets","text":"Asset Entry Point Trust Level Azure Blob Storage Http End point Connection String Azure Monitor Http End Point Connection String Azure Cognitive Service Http End Point Connection String IoTEdge Module: M1 Http End Point Public Access (Local Area Network) IoTEdge Module: M2 Http End Point Public Access (Local Area Network) IoTEdge Module: M3 Http End Point Public Access (Local Area Network) IoTEdge Module: IoTEdgeMetricsCollector Http EndPoint Public Access (Local Area Network) Application Insights Http End Point Connection String","title":"Assets"},{"location":"security/threat-modelling-example/#data-flow-diagram","text":"Client Browser makes requests to the M1 IoTEdge module. Browser and IoTEdge device are on same network, so browser directly hits the webapp URL. M1 IoTEdge module interacts with other two IoTEdge modules to render live stream from video device and display order scanning results via WebSockets. IoTEdge modules interact with Azure Cognitive service to get the translated text via OCR and audio stream via Text to Speech Service. IoTEdge modules send telemetry information to application insights. IoTEdge device is deployed with IoTEdge runtime which interacts with IoTEdge hub for deployments. IoTEdge module also sends some data to Azure storage which is required for debugging purpose. Cognitive service, application insights and Azure Storage are authenticated using connection strings which are stored in GitHub secrets and deployed using CI/CD pipelines.","title":"Data Flow Diagram"},{"location":"security/threat-modelling-example/#threat-list","text":"","title":"Threat List"},{"location":"security/threat-modelling-example/#assumptions","text":"Secrets like ACR credentials are stored in GitHub secrets store which are deployed to IoTEdge Device by CI/CD pipelines. However, CI/CD pipelines are out of scope.","title":"Assumptions"},{"location":"security/threat-modelling-example/#threats","text":"Vector Threat Mitigation (1) Sniff Unencrypted data can be intercepted in transit Not Mitigated (2) Access to M1 IoT Edge Module Unauthorized Access to M1 IoT Edge Module Not Mitigated (3) Access to M2 IoT Edge Module Unauthorized Access to M2 IoT Edge Module Not Mitigated (4) Access to M3 IoT Edge Module Unauthorized Access to M3 IoT Edge Module Not Mitigated (5) Steal Storage Credentials Unauthorized Access to M2 IoTEdge Module where database secrets are used Not Mitigated (6) Denial Of Service Dos attack on all IoTEdge Modules since there is no Authentication Not Mitigated (7) Tampering with Log data Application Insights is connected via Connection String which is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the log data. Not Mitigated (8) Tampering with video camera device. Video camera path is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the video feed or use another video source or fake video stream. Not Mitigated (9) Spoofing Tampering Azure IoT Hub connection string is stored in .env file on IoTEdge Device. 
Once user gains access to the device, .env file can be read and attacker cause Dos attacks on IoTHub Not Mitigated (10) Denial of Service DDOS attack Azure Cognitive Service connection string is stored in .env file on IoTEdge Device. Once user gains access to the device, .env file can be read and attacker cause DoS attacks on Azure Cognitive Service Not Mitigated (11) Tampering with Storage Storage connection string is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper data on storage or read from the storage. Not Mitigated (12) Tampering with Storage Cognitive Service connection string is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker use cognitive service API's for his own purpose causing increase cost to use. Not Mitigated","title":"Threats"},{"location":"security/threat-modelling-example/#threat-model","text":"","title":"Threat Model"},{"location":"security/threat-modelling-example/#threat-properties","text":"Notable Threats # Principle Threat Mitigation 1 Authenticity Since channel from browser to IoTEdge Module is not authenticated, anyone can spoof it once gains access to WiFi network. Add authentication in all IoTEdge modules. 2 Confidentiality and Integrity As a result of the vulnerability of not encrypting data, plaintext data could be intercepted during transit via a man-in-the-middle (MitM) attack. Sensitive data could be exposed or tampered with to allow further exploits. All products and services must encrypt data in transit using approved cryptographic protocols and algorithms. Use TLS to encrypt all HTTP-based network traffic. Use other mechanisms, such as IPSec, to encrypt non-HTTP network traffic that contains customer or confidential data. Applies to data flow from browser to IoTEdge modules. 3 Confidentiality Data is a valuable target for most threat actors and attacking the data store directly, as opposed to stealing it during transit, allows data exfiltration at a much larger scale. In our scenario we are storing some data in Azure Blob containers. All customer or confidential data must be encrypted before being written to non-volatile storage media (encrypted at-rest) per the following requirements. Use approved algorithms. This includes AES-256, AES-192, or AES-128. Encryption must be enabled before writing data to storage. Applies to all data stores on the diagram. Azure Storage encrypt data at rest by default (AES-256). 4 Confidentiality Broken or non-existent authentication mechanisms may allow attackers to gain access to confidential information. All services within the Azure Trust Boundary must authenticate all incoming requests, including requests coming from the same network. Proper authorizations should also be applied to prevent unnecessary privileges. Whenever available, use Azure Managed Identities to authenticate services. Service Principals may be used if Managed Identities are not supported. External users or services may use UserName + Passwords, Tokens, Certificates or Connection Strings to authenticate, provided these are stored on Key Vault or any other vaulting solution. For authorization, use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Applies to Azure services like Azure IoTHub, Azure Cognitive Service, Azure Application Insights are authenticated using connection strings. 
5 Confidentiality and Integrity A large attack surface, particularly those that are exposed on the internet, will increase the probability of a compromise Minimize the application attack surface by limiting publicly exposed services. Use strong network controls by using virtual networks, subnets and network security groups to protect against unsolicited traffic. Use Azure Private Endpoint for Azure Storage. Applies to Azure storage. 6 Confidentiality and Integrity Browser and IoTEdge device are connected over in store WIFI network Minimize the attack on WIFI network by using secure algorithm like WPA2. Applies to connection between browser and IoTEdge devices. 7 Integrity Exploitation of insufficient logging and monitoring is the bedrock of nearly every major incident. Attackers rely on the lack of monitoring and timely response to achieve their goals without being detected. Logging of critical application events must be performed to ensure that, should a security incident occur, incident response and root-cause analysis may be done. Steps must also be taken to ensure that logs are available and cannot be overwritten or destroyed through malicious or accidental occurrences. At a minimum, the following events should be logged. Login/logout events Privilege delegation events Security validation failures (e.g. input validation or authorization check failures) Application errors and system events Application and system start-ups and shut-downs, as well as logging initialization 6 Availability Exploitation of the public endpoint by malicious actors who aim to render the service unavailable to its intended users by interrupting the service normal activity, for instance by flooding the target service with requests until normal traffic is unable to be processed (Denial of Service) Application is accessed via web app deployed as one of the IoTEdge modules on the IoTEdge device. This app can be accessed by anyone in the local area network. Hence DDoS attacks are possible if the attacker gained access to local area network. All services deployed as IoTEdge modules must use authentication. Applies to services deployed on IoTEdge device 7 Integrity Tampering with data Data at rest, in Azure Storage must be encrypted on disk. Data at rest, in Azure can be protected further by Azure Advanced Threat Protection. Data at rest, in Azure Storage and Azure monitor workspace will use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Data in motion between services can be encrypted in TLS 1.2 Applies to data flow between IoTEdge modules and Azure Services.","title":"Threat Properties"},{"location":"security/threat-modelling-example/#security-principles","text":"Confidentiality refers to the objective of keeping data private or secret. In practice, it\u2019s about controlling access to data to prevent unauthorized disclosure. Integrity is about ensuring that data has not been tampered with and, therefore, can be trusted. It is correct, authentic, and reliable. Availability means that networks, systems, and applications are up and running. It ensures that authorized users have timely, reliable access to resources when they are needed.","title":"Security Principles"},{"location":"security/threat-modelling/","text":"Threat Modeling Threat modeling is an effective way to help secure your systems, applications, networks, and services. 
It's a systematic approach that identifies potential threats and recommendations to help reduce risk and meet security objectives earlier in the development lifecycle. Threat Modeling Phases Diagram Capture all requirements for your system and create a data-flow diagram Identify Apply a threat-modeling framework to the data-flow diagram and find potential security issues. Here we can use STRIDE framework to identify the threats. Mitigate Decide how to approach each issue with the appropriate combination of security controls. Validate Verify requirements are met, issues are found, and security controls are implemented. Example of these phases is covered in the threat modelling example. More details about these phases can be found at Threat Modeling Security Fundamentals. Threat Modeling Example Here is an example of a threat modeling document which talks about the architecture and different phases involved in the threat modeling. This document can be used as reference template for creating threat modeling documents. Resources Threat Modeling Microsoft Threat Modeling Tool STRIDE (Threat modeling framework)","title":"Threat Modeling"},{"location":"security/threat-modelling/#threat-modeling","text":"Threat modeling is an effective way to help secure your systems, applications, networks, and services. It's a systematic approach that identifies potential threats and recommendations to help reduce risk and meet security objectives earlier in the development lifecycle.","title":"Threat Modeling"},{"location":"security/threat-modelling/#threat-modeling-phases","text":"Diagram Capture all requirements for your system and create a data-flow diagram Identify Apply a threat-modeling framework to the data-flow diagram and find potential security issues. Here we can use STRIDE framework to identify the threats. Mitigate Decide how to approach each issue with the appropriate combination of security controls. Validate Verify requirements are met, issues are found, and security controls are implemented. Example of these phases is covered in the threat modelling example. More details about these phases can be found at Threat Modeling Security Fundamentals.","title":"Threat Modeling Phases"},{"location":"security/threat-modelling/#threat-modeling-example","text":"Here is an example of a threat modeling document which talks about the architecture and different phases involved in the threat modeling. This document can be used as reference template for creating threat modeling documents.","title":"Threat Modeling Example"},{"location":"security/threat-modelling/#resources","text":"Threat Modeling Microsoft Threat Modeling Tool STRIDE (Threat modeling framework)","title":"Resources"},{"location":"source-control/","text":"Source Control There are many options when working with Source Control. In ISE we use AzureDevOps for private repositories and GitHub for public repositories. Goal Following industry best practice to work in geo-distributed teams which encourage contributions from all across ISE as well as the broader OSS community Improve code quality by enforcing reviews before merging into main branches Improve traceability of features and fixes through a clean commit history General Guidance Consistency is important, so agree to the approach as a team before starting to code. Treat this as a design decision, so include a design proposal and review, in the same way as you would document all design decisions (see Working Agreements and Design Reviews ). 
Creating a New Repository When creating a new repository, the team should at least do the following Agree on the branch , release and merge strategy Define the merge strategy ( linear or non-linear ) Lock the default branch and merge using pull requests (PRs) Agree on branch naming (e.g. user/your_alias/feature_name ) Establish branch/PR policies For public repositories the default branch should contain the following files: LICENSE README.md contributing.md Contributing to an Existing Repository When working on an existing project, git clone the repository and ensure you understand the team's branch, merge and release strategy (e.g. through the projects CONTRIBUTING.md file ). Mixed DevOps Environments For most engagements having a single hosted DevOps environment (i.e. Azure DevOps) is the preferred path but there are times when a mixed DevOps environment (i.e. Azure DevOps for Agile/Work item tracking & GitHub for Source Control) is needed due to customer requirements. When working in a mixed environment: Manually tag PR's in work items Ensure that the scope of work items / tasks align with PR's Resources Git --local-branching-on-the-cheap Azure DevOps ISE Git details details on how to use Git as part of a ISE project. GitHub - Removing sensitive data from a repository How Git Works Pluralsight course Mastering Git Pluralsight course","title":"Source Control"},{"location":"source-control/#source-control","text":"There are many options when working with Source Control. In ISE we use AzureDevOps for private repositories and GitHub for public repositories.","title":"Source Control"},{"location":"source-control/#goal","text":"Following industry best practice to work in geo-distributed teams which encourage contributions from all across ISE as well as the broader OSS community Improve code quality by enforcing reviews before merging into main branches Improve traceability of features and fixes through a clean commit history","title":"Goal"},{"location":"source-control/#general-guidance","text":"Consistency is important, so agree to the approach as a team before starting to code. Treat this as a design decision, so include a design proposal and review, in the same way as you would document all design decisions (see Working Agreements and Design Reviews ).","title":"General Guidance"},{"location":"source-control/#creating-a-new-repository","text":"When creating a new repository, the team should at least do the following Agree on the branch , release and merge strategy Define the merge strategy ( linear or non-linear ) Lock the default branch and merge using pull requests (PRs) Agree on branch naming (e.g. user/your_alias/feature_name ) Establish branch/PR policies For public repositories the default branch should contain the following files: LICENSE README.md contributing.md","title":"Creating a New Repository"},{"location":"source-control/#contributing-to-an-existing-repository","text":"When working on an existing project, git clone the repository and ensure you understand the team's branch, merge and release strategy (e.g. through the projects CONTRIBUTING.md file ).","title":"Contributing to an Existing Repository"},{"location":"source-control/#mixed-devops-environments","text":"For most engagements having a single hosted DevOps environment (i.e. Azure DevOps) is the preferred path but there are times when a mixed DevOps environment (i.e. Azure DevOps for Agile/Work item tracking & GitHub for Source Control) is needed due to customer requirements. 
When working in a mixed environment: Manually tag PRs in work items Ensure that the scope of work items / tasks aligns with PRs","title":"Mixed DevOps Environments"},{"location":"source-control/#resources","text":"Git --local-branching-on-the-cheap Azure DevOps ISE Git details details on how to use Git as part of an ISE project. GitHub - Removing sensitive data from a repository How Git Works Pluralsight course Mastering Git Pluralsight course","title":"Resources"},{"location":"source-control/component-versioning/","text":"Component Versioning Goal Larger applications consist of multiple components that reference each other and rely on compatibility of the interfaces/contracts of the components. To achieve the goal of loosely coupled applications, each component should be versioned independently, hence allowing developers to detect breaking changes or seamless updates just by looking at the version number. Version Numbers and Versioning Schemes For developers or other components to detect breaking changes, the version number of a component is important. There are different version numbering schemes, e.g. major.minor[.build[.revision]] or major.minor[.maintenance[.build]] . Upon build / CI these version numbers are generated. During CD / release, components are pushed to a component repository such as Nuget, NPM, or Docker Hub, where a history of different versions is kept. With each build, the version number is incremented at the last digit. Updating the major / minor version indicates changes of the API / interfaces / contracts: Major Version: A breaking change Minor Version: A backwards-compatible minor change Build / Revision: No API change, just a different build. Semantic Versioning Semantic Versioning is a versioning scheme specifying how to interpret the different version numbers. The most common format is major.minor.patch . The version number is incremented based on the following rules: Major version when you make incompatible API changes, Minor version when you add functionality in a backwards-compatible manner, and Patch version when you make backwards-compatible bug fixes. Examples of semver version numbers: 1.0.0-alpha.1 : +1 commit after the alpha release of 1.0.0 2.1.0-beta : 2.1.0 in beta branch 2.4.2 : 2.4.2 release A common practice is to determine the version number during the build process. For this, the source control repository is utilized to determine the version number automatically based on the source code repository. The GitVersion tool uses the git history to generate a repeatable and unique version number based on the number of commits since the last major or minor release commit messages tags branch names Version updates happen through: Commit messages or tags for Major / Minor / Revision updates. When using commit messages a convention such as Conventional Commits is recommended (see Git Guidance - Commit Message Structure ) Branch names (e.g. develop, release/..) for Alpha / Beta / RC Otherwise: Number of commits (+12, ...) Semantic Versioning Within a Monorepo A monorepo, short for \"monolithic repository\", is a software development practice where multiple related projects, components, or modules are stored within a single version-controlled repository as opposed to maintaining them in separate repositories. Challenges with Versioning in a Monorepo Structure Versioning in a monorepo involves making decisions about how to assign version numbers to different projects and components contained within the repository.
Assigning a single version number to all projects in a monorepo can lead to frequent version increments if changes in one project don't match the significance of changes in another. This might be excessive if some projects undergo rapid development while others evolve more slowly. Ideally, we would want each project within the monorepo to have its own version number. Changes in one project shouldn't necessarily trigger version changes in others. This strategy allows projects to evolve at their own pace, without forcing all projects to adopt the same version number. It aligns well with the differing release cadences of distinct projects. semantic-release Package for Versioning semantic-release simplifies the entire process of releasing a package, which encompasses tasks such as identifying the upcoming version number, producing release notes, and distributing the package. This process severs the direct link between human sentiments and version identifiers. Instead, it rigorously adheres to the Semantic Versioning standards and effectively conveys the significance of alterations to end users. semantic-release relies on commit messages to assess how codebase changes impact consumers. By adhering to structured conventions for commit messages, semantic-release autonomously identifies the subsequent semantic version, compiles a changelog, and releases the software. Angular Commit Message Conventions serve as the default for semantic-release . However, the configuration options of the @semantic-release/commit-analyzer and @semantic-release/release-notes-generator plugins, including presets, can be adjusted to modify the commit message format. The table below shows which commit message gets you which release type when semantic-release runs (using the default configuration): Commit message Release type fix(pencil): stop graphite breaking when too much pressure applied Patch Fix Release feat(pencil): add 'graphiteWidth' option Minor Feature Release perf(pencil): remove graphiteWidth option BREAKING CHANGE: The graphiteWidth option has been removed. The default graphite width of 10mm is always used for performance reasons. Major Breaking Release (Note that the BREAKING CHANGE: token must be in the footer of the commit) The inherent setup of semantic-release presumes a direct correspondence between a GitHub repository and a package. Hence changes anywhere in the project result in a version upgrade for the project. The semantic-release-monorepo tool permits the utilization of semantic-release within a solitary GitHub repository that encompasses numerous packages. Instead of attributing all commits to a single package, commits are assigned to packages based on the files that a commit touched. If a commit touches a file in or below a package's root, it will be considered for that package's next release. A single commit can belong to multiple packages and may trigger the release of multiple packages. In order to avoid version collisions, generated git tags are namespaced using the given package's name: - . semantic-release Configurations semantic-release \u2019s options, mode and plugins can be set via either: A .releaserc file, written in YAML or JSON, with optional extensions: .yaml/.yml/.json/.js/.cjs A release.config.(js|cjs) file that exports an object A release key in the project's package.json file Here is an example .releaserc file which contains the configuration for: 1. git tags for the releases from different types of branches 2. 
Any plugins required, list of supported plugins can be found here . In this file semantic-release-monorepo plugin is extended. { \"ci\" : true , \"repositoryUrl\" : \"your repository url\" , \"branches\" : [ \"master\" , { \"name\" : \"feature/*\" , \"prerelease\" : \"beta-${name.replace(/\\\\//g, '-').replace(/_/g, '-')}\" }, { \"name\" : \"[a-zA-Z0-9_]+/[a-zA-Z0-9-_]+\" , \"prerelease\" : \"dev-${name.replace(/\\\\//g, '-').replace(/_/g, '--')}\" } ], \"plugins\" : [ \"@semantic-release/commit-analyzer\" , \"@semantic-release/release-notes-generator\" , [ \"@semantic-release/exec\" , { \"verifyReleaseCmd\" : \"echo ${nextRelease.name} > .VERSION\" } ], \"semantic-release-ado\" ], \"extends\" : \"semantic-release-monorepo\" } Resources GitVersion Semantic Versioning Versioning in C# semantic-release semantic-release-monorepo","title":"Component Versioning"},{"location":"source-control/component-versioning/#component-versioning","text":"","title":"Component Versioning"},{"location":"source-control/component-versioning/#goal","text":"Larger applications consist of multiple components that reference each other and rely on compatibility of the interfaces/contracts of the components. To achieve the goal of loosely coupled applications, each component should be versioned independently hence allowing developers to detect breaking changes or seamless updates just by looking at the version number.","title":"Goal"},{"location":"source-control/component-versioning/#version-numbers-and-versioning-schemes","text":"For developers or other components to detect breaking changes the version number of a component is important. There is different versioning number schemes, e.g. major.minor[.build[.revision]] or major.minor[.maintenance[.build]] . Upon build / CI these version numbers are being generated. During CD / release components are pushed to a component repository such as Nuget, NPM, Docker Hub where a history of different versions is being kept. Each build the version number is incremented at the last digit. Updating the major / minor version indicates changes of the API / interfaces / contracts: Major Version: A breaking change Minor Version: A backwards-compatible minor change Build / Revision: No API change, just a different build.","title":"Version Numbers and Versioning Schemes"},{"location":"source-control/component-versioning/#semantic-versioning","text":"Semantic Versioning is a versioning scheme specifying how to interpret the different version numbers. The most common format is major.minor.patch . The version number is incremented based on the following rules: Major version when you make incompatible API changes, Minor version when you add functionality in a backwards-compatible manner, and Patch version when you make backwards-compatible bug fixes. Examples of semver version numbers: 1.0.0-alpha.1 : +1 commit after the alpha release of 1.0.0 2.1.0-beta : 2.1.0 in beta branch 2.4.2 : 2.4.2 release A common practice is to determine the version number during the build process. For this the source control repository is utilized to determine the version number automatically based the source code repository. The GitVersion tool uses the git history to generate repeatable and unique version number based on number of commits since last major or minor release commit messages tags branch names Version updates happen through: Commit messages or tags for Major / Minor / Revision updates. 
When using commit messages a convention such as Conventional Commits is recommended (see Git Guidance - Commit Message Structure ) Branch names (e.g. develop, release/..) for Alpha / Beta / RC Otherwise: Number of commits (+12, ...)","title":"Semantic Versioning"},{"location":"source-control/component-versioning/#semantic-versioning-within-a-monorepo","text":"A monorepo, short for \"monolithic repository\", is a software development practice where multiple related projects, components, or modules are stored within a single version-controlled repository as opposed to maintaining them in separate repositories.","title":"Semantic Versioning Within a Monorepo"},{"location":"source-control/component-versioning/#challenges-with-versioning-in-a-monorepo-structure","text":"Versioning in a monorepo involves making decisions about how to assign version numbers to different projects and components contained within the repository. Assigning a single version number to all projects in a monorepo can lead to frequent version increments if changes in one project don't match the significance of changes in another. This might be excessive if some projects undergo rapid development while others evolve more slowly. Ideally, we would want each project within the monorepo to have its own version number. Changes in one project shouldn't necessarily trigger version changes in others. This strategy allows projects to evolve at their own pace, without forcing all projects to adopt the same version number. It aligns well with the differing release cadences of distinct projects.","title":"Challenges with Versioning in a Monorepo Structure"},{"location":"source-control/component-versioning/#semantic-release-package-for-versioning","text":"semantic-release simplifies the entire process of releasing a package, which encompasses tasks such as identifying the upcoming version number, producing release notes, and distributing the package. This process severs the direct link between human sentiments and version identifiers. Instead, it rigorously adheres to the Semantic Versioning standards and effectively conveys the significance of alterations to end users. semantic-release relies on commit messages to assess how codebase changes impact consumers. By adhering to structured conventions for commit messages, semantic-release autonomously identifies the subsequent semantic version, compiles a changelog, and releases the software. Angular Commit Message Conventions serve as the default for semantic-release . However, the configuration options of the @semantic-release/commit-analyzer and @semantic-release/release-notes-generator plugins, including presets, can be adjusted to modify the commit message format. The table below shows which commit message gets you which release type when semantic-release runs (using the default configuration): Commit message Release type fix(pencil): stop graphite breaking when too much pressure applied Patch Fix Release feat(pencil): add 'graphiteWidth' option Minor Feature Release perf(pencil): remove graphiteWidth option BREAKING CHANGE: The graphiteWidth option has been removed. The default graphite width of 10mm is always used for performance reasons. Major Breaking Release (Note that the BREAKING CHANGE: token must be in the footer of the commit) The inherent setup of semantic-release presumes a direct correspondence between a GitHub repository and a package. Hence changes anywhere in the project result in a version upgrade for the project. 
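As a minimal illustration (assuming an npm-based project that already has semantic-release configured; this is not prescribed by the playbook), you can preview the version and release notes that would be produced, without publishing anything, by running a dry run from the package root: npx semantic-release --dry-run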
The semantic-release-monorepo tool permits the utilization of semantic-release within a solitary GitHub repository that encompasses numerous packages. Instead of attributing all commits to a single package, commits are assigned to packages based on the files that a commit touched. If a commit touches a file in or below a package's root, it will be considered for that package's next release. A single commit can belong to multiple packages and may trigger the release of multiple packages. In order to avoid version collisions, generated git tags are namespaced using the given package's name: - .","title":"semantic-release Package for Versioning"},{"location":"source-control/component-versioning/#semantic-release-configurations","text":"semantic-release \u2019s options, mode and plugins can be set via either: A .releaserc file, written in YAML or JSON, with optional extensions: .yaml/.yml/.json/.js/.cjs A release.config.(js|cjs) file that exports an object A release key in the project's package.json file Here is an example .releaserc file which contains the configuration for: 1. git tags for the releases from different types of branches 2. Any plugins required, list of supported plugins can be found here . In this file semantic-release-monorepo plugin is extended. { \"ci\" : true , \"repositoryUrl\" : \"your repository url\" , \"branches\" : [ \"master\" , { \"name\" : \"feature/*\" , \"prerelease\" : \"beta-${name.replace(/\\\\//g, '-').replace(/_/g, '-')}\" }, { \"name\" : \"[a-zA-Z0-9_]+/[a-zA-Z0-9-_]+\" , \"prerelease\" : \"dev-${name.replace(/\\\\//g, '-').replace(/_/g, '--')}\" } ], \"plugins\" : [ \"@semantic-release/commit-analyzer\" , \"@semantic-release/release-notes-generator\" , [ \"@semantic-release/exec\" , { \"verifyReleaseCmd\" : \"echo ${nextRelease.name} > .VERSION\" } ], \"semantic-release-ado\" ], \"extends\" : \"semantic-release-monorepo\" }","title":"semantic-release Configurations"},{"location":"source-control/component-versioning/#resources","text":"GitVersion Semantic Versioning Versioning in C# semantic-release semantic-release-monorepo","title":"Resources"},{"location":"source-control/merge-strategies/","text":"Merge Strategies Agree if you want a linear or non-linear commit history. There are pros and cons to both approaches: Pro linear: Avoid messy git history, use linear history Con linear: Why you should stop using Git rebase Approach for Non-Linear Commit History Merging topic into main A---B---C topic / \\ D---E---F---G---H main git fetch origin git checkout main git merge topic Two Approaches to Achieve a Linear Commit History Rebase Topic Branch Before Merging into Main Before merging topic into main , we rebase topic with the main branch: A---B---C topic / \\ D---E---F-----------G---H main git checkout main git pull git checkout topic git rebase origin/main Create a PR topic --> main in Azure DevOps and approve using the squash merge option Rebase Topic Branch Before Squash Merge into Main Squash merging is a merge option that allows you to condense the Git history of topic branches when you complete a pull request. Instead of adding each commit on topic to the history of main , a squash merge takes all the file changes and adds them to a single new commit on main . A---B---C topic / D---E---F-----------G---H main Create a PR topic --> main in Azure DevOps and approve using the squash merge option","title":"Merge Strategies"},{"location":"source-control/merge-strategies/#merge-strategies","text":"Agree if you want a linear or non-linear commit history. 
There are pros and cons to both approaches: Pro linear: Avoid messy git history, use linear history Con linear: Why you should stop using Git rebase","title":"Merge Strategies"},{"location":"source-control/merge-strategies/#approach-for-non-linear-commit-history","text":"Merging topic into main A---B---C topic / \\ D---E---F---G---H main git fetch origin git checkout main git merge topic","title":"Approach for Non-Linear Commit History"},{"location":"source-control/merge-strategies/#two-approaches-to-achieve-a-linear-commit-history","text":"","title":"Two Approaches to Achieve a Linear Commit History"},{"location":"source-control/merge-strategies/#rebase-topic-branch-before-merging-into-main","text":"Before merging topic into main , we rebase topic with the main branch: A---B---C topic / \\ D---E---F-----------G---H main git checkout main git pull git checkout topic git rebase origin/main Create a PR topic --> main in Azure DevOps and approve using the squash merge option","title":"Rebase Topic Branch Before Merging into Main"},{"location":"source-control/merge-strategies/#rebase-topic-branch-before-squash-merge-into-main","text":"Squash merging is a merge option that allows you to condense the Git history of topic branches when you complete a pull request. Instead of adding each commit on topic to the history of main , a squash merge takes all the file changes and adds them to a single new commit on main . A---B---C topic / D---E---F-----------G---H main Create a PR topic --> main in Azure DevOps and approve using the squash merge option","title":"Rebase Topic Branch Before Squash Merge into Main"},{"location":"source-control/naming-branches/","text":"Naming Branches When contributing to existing projects, look for and stick with the agreed branch naming convention. In open source projects this information is typically found in the contributing instructions, often in a file named CONTRIBUTING.md . In the beginning of a new project the team agrees on the project conventions including the branch naming strategy. Here's an example of a branch naming convention: / [ feature/bug/hotfix ] /_ Which could translate to something as follows: dickinson/feature/271_add_more_cowbell The example above is just that - an example. The team can choose to omit or add parts. Choosing a branch convention can depend on the development model (e.g. trunk-based development ), versioning model, tools used in managing source control, matter of taste etc. Focus on simplicity and reducing ambiguity; a good branch naming strategy allows the team to understand the purpose and ownership of each branch in the repository.","title":"Naming Branches"},{"location":"source-control/naming-branches/#naming-branches","text":"When contributing to existing projects, look for and stick with the agreed branch naming convention. In open source projects this information is typically found in the contributing instructions, often in a file named CONTRIBUTING.md . In the beginning of a new project the team agrees on the project conventions including the branch naming strategy. Here's an example of a branch naming convention: <user alias>/ [ feature/bug/hotfix ] /<work item ID>_<title> Which could translate to something as follows: dickinson/feature/271_add_more_cowbell The example above is just that - an example. The team can choose to omit or add parts. Choosing a branch convention can depend on the development model (e.g. trunk-based development ), versioning model, tools used in managing source control, matter of taste etc. 
Focus on simplicity and reducing ambiguity; a good branch naming strategy allows the team to understand the purpose and ownership of each branch in the repository.","title":"Naming Branches"},{"location":"source-control/secrets-management/","text":"Working with Secrets in Source Control The best way to avoid leaking secrets is to store them in local/private files and exclude these from git tracking with a .gitignore file. E.g. the following pattern will exclude all files with the extension .private.config : # remove private configuration *.private.config For more details on proper management of credentials and secrets in source control, and handling an accidental commit of secrets to source control, please refer to the Secrets Management document which has further information, split by language as well. As an extra security measure, apply credential scanning in your CI/CD pipeline.","title":"Working with Secrets in Source Control"},{"location":"source-control/secrets-management/#working-with-secrets-in-source-control","text":"The best way to avoid leaking secrets is to store them in local/private files and exclude these from git tracking with a .gitignore file. E.g. the following pattern will exclude all files with the extension .private.config : # remove private configuration *.private.config For more details on proper management of credentials and secrets in source control, and handling an accidental commit of secrets to source control, please refer to the Secrets Management document which has further information, split by language as well. As an extra security measure, apply credential scanning in your CI/CD pipeline.","title":"Working with Secrets in Source Control"},{"location":"source-control/git-guidance/","text":"Git Guidance What is Git? Git is a distributed version control system. This means that - unlike SVN or CVS - it doesn't use a central server to synchronize. Instead, every participant has a local copy of the source-code, and the attached history that is kept in sync by comparing commit hashes (SHA hashes of changes between each git commit command) making up the latest version (called HEAD ). For example: repo 1 : A -> B -> C -> D -> HEAD repo 2 : A -> B -> HEAD repo 3 : X -> Y -> Z -> HEAD repo 4 : A -> J -> HEAD Since they share a common history, repo 1 and repo 2 can be synchronized fairly easily, repo 4 may be able to synchronize as well, but it's going to have to add a commit (J, and maybe a merge commit) to repo 1. Repo 3 cannot be easily synchronized with the others. Everything related to these commits is stored in a local .git directory in the root of the repository. In other words, by using Git you are simply creating immutable file histories that uniquely identify the current state and therefore allow sharing whatever comes after. It's a Merkle tree . Be sure to run git help after Git installation to find really in-depth explanations of everything. Installation Git is a tool set that must be installed. Install Git and follow the First-Time Git Setup . A recommended installation is the Git Lens extension for Visual Studio Code . Visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more. You can use these commands as well to configure your Git for Visual Studio Code as an editor for merge conflicts and diff tool. 
git config --global user.name [ YOUR FIRST AND LAST NAME ] git config --global user.email [ YOUR E-MAIL ADDRESS ] git config --global merge.tool vscode git config --global mergetool.vscode.cmd \"code --wait $MERGED \" git config --global diff.tool vscode git config --global difftool.vscode.cmd \"code --wait --diff $LOCAL $REMOTE \" Basic Workflow A basic Git workflow is as follows; you can find more information on the specific steps below. # pull the latest changes git pull # start a new feature branch based on the develop branch git checkout -b feature/123-add-git-instructions develop # edit some files # add and commit the files git add <file> git commit -m \"add basic instructions\" # edit some files # add and commit the files git add <file> git commit -m \"add more advanced instructions\" # check your changes git status # push the branch to the remote repository git push --set-upstream origin feature/123-add-git-instructions Cloning Whenever you want to make a change to a repository, you need to first clone it. Cloning a repository pulls down a full copy of all the repository data, so that you can work on it locally. This copy includes all versions of every file and folder for the project. git clone https://github.com/username/repo-name You only need to clone the repository the first time. Before any subsequent branches you can sync any changes from the remote repository using git pull . Branching To avoid adding code that has not been peer reviewed to the main branch (ex. develop ) we typically work in feature branches, and merge these back to the main trunk with a Pull Request. It's even the case that often the main or develop branch of a repository are locked so that you can't make changes without a Pull Request. Therefore, it is useful to create a separate branch for your local/feature work, so that you can work and track your changes in this branch. Pull the latest changes and create a new branch for your work based on the trunk (in this case develop ). git pull git checkout -b feature/feature-name develop At any point, you can move between the branches with git checkout <branch> as long as you have committed or stashed your work. If you forget the name of your branch use git branch --all to list all branches. Committing To avoid losing work, it is good to commit often in small chunks. This allows you to revert only the last changes if you discover a problem and also neatly explains exactly what changes were made and why. Make changes to your branch Check what files were changed > git status On branch feature/271-basic-commit-info Changes not staged for commit: ( use \"git add <file>...\" to update what will be committed ) ( use \"git restore <file>...\" to discard changes in working directory ) modified: source-control/git-guidance/README.md Track the files you wish to include in the commit. To track all modified files: git add --all Or to track only specific files: git add source-control/git-guidance/README.md Commit the changes to your local branch with a descriptive commit message git commit -m \"add basic git instructions\" Pushing When you are done working, push your changes to a branch in the remote repository using: git push The first time you push, you first need to set an upstream branch as follows. After the first push, the --set-upstream parameter and branch name are not needed anymore. git push --set-upstream origin feature/feature-name Once the feature branch is pushed to the remote repository, it is visible to anyone with access to the code. 
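After pushing with --set-upstream, you can confirm which remote branch your local branch tracks (an optional check, not required by the workflow above) with: git branch -vv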
Merging We encourage the use of Pull Request to merge code to the main repository to make sure that all code in the final product is code reviewed The Pull Request (PR) process in Azure DevOps , GitHub and other similar tools make it easy both to start a PR, review a PR and merge a PR. Merge Conflicts If multiple people make changes to the same files, you may need to resolve any conflicts that have occurred before you can merge. # check out the develop branch and get the latest changes git checkout develop git pull # check out your branch git checkout <your branch> # merge the develop branch into your branch git merge develop # if merge conflicts occur, above command will fail with a message telling you that there are conflicts to be solved # find which files need to be resolved git status You can start an interactive process that will show which files have conflicts. Sometimes you removed a file, where it was changed in dev. Or you made changes to some lines in a file where another developer made changes as well. If you went through the installation steps mentioned before, Visual Studio Code is set up as merge tool. You can also use a merge tool like kdiff3 . When editing conflicts occur, the process will automatically open Visual Studio Code where the conflicting parts are highlighted in green and blue, and you have make a choice: Accept your changes (current) Accept the changes from dev branch (incoming) Accept them both and fix the code (probably needed) Here are lines that are either unchanged from the common ancestor, or cleanly resolved because only one side changed. <<<<<<< yours:sample.txt Conflict resolution is hard; let's go shopping. ======= Git makes conflict resolution easy. >>>>>>> theirs:sample.txt And here is another line that is cleanly resolved or unmodified When this process is completed, make sure you test the result by executing build, checks, test to validate this merged result. # conclude the merge git merge --continue # verify that everything went ok git log # push the changes to the remote branch git push If no other conflicts appear, the PR can now be merged, and your branch deleted. Use squash to reduce your changes into a single commit, so the commit history can be within an acceptable size. Stashing Changes git stash is super handy if you have un-committed changes in your working directory, but you want to work on a different branch. You can run git stash , save the un-committed work, and revert to the HEAD commit. You can retrieve the saved changes by running git stash pop : git stash \u2026 git stash pop Or you can move the current state into a new branch: git stash branch <new_branch_to_save_changes> Recovering Lost Commits If you \"lost\" a commit that you want to return to, for example to revert a git rebase where your commits got squashed, you can use git reflog to find the commit: git reflog Then you can use the reflog reference ( HEAD@{} ) to reset to a specific commit before the rebase: git reset HEAD@ { 2 } Commit Best Practices A commit combines changes into a logical unit. Adding a descriptive commit message can aid in comprehending the code changes and understanding the rationale behind the modifications. Consider the following when making your commits: Make small commits. This makes changes easier to review, and if we need to revert a commit, we lose less work. Consider splitting the commit into separate commits with git add -p if it includes more than one logical change or bug fix. Don't mix whitespace changes with functional code changes. 
It is hard to determine if the line has a functional change or only removes a whitespace, so functional changes may go unnoticed. Commit complete and well tested code. Never commit incomplete code, get in the habit of testing your code before committing. Write good commit messages. Why is it necessary? It may fix a bug, add a feature, improve performance, or just be a change for the sake of correctness What effects does this change have? In addition to the obvious ones, this may include benchmarks, side effects etc. You can specify the default git editor, which allows you to write your commit messages using your favorite editor. The following command makes Visual Studio Code your default git editor: git config --global core.editor \"code --wait\" Commit Message Structure The essential parts of a commit message are: subject line: a short description of the commit, maximum 50 characters long body (optional): a longer description of the commit, wrapped at 72 characters, separated from the subject line by a blank line You are free to structure commit messages; however, git commands like git log utilize above structure. Therefore, it can be helpful to follow a convention within your team and to utilize git best. For example, Conventional Commits is a lightweight convention that complements SemVer , by describing the features, fixes, and breaking changes made in commit messages. See Component Versioning for more information on versioning. For more information on commit message conventions, see: A Note About Git Commit Messages Conventional Commits Git commit best practices How to Write a Git Commit Message How to Write Better Git Commit Messages Information in commit messages On commit messages Managing Remotes A local git repository can have one or more backing remote repositories. You can list the remote repositories using git remote - by default, the remote repository you cloned from will be called origin > git remote -v origin https://github.com/microsoft/code-with-engineering-playbook.git ( fetch ) origin https://github.com/microsoft/code-with-engineering-playbook.git ( push ) Working with Forks You can set multiple remotes. This is useful for example if you want to work with a forked version of the repository. For more info on how to set upstream remotes and syncing repositories when working with forks see GitHub's Working with forks documentation . Updating the Remote if a Repository Changes Names If the repository is changed in some way, for example a name change, or if you want to switch between HTTPS and SSH you need to update the remote # list the existing remotes > git remote -v origin https://hostname/username/repository-name.git ( fetch ) origin https://hostname/username/repository-name.git ( push ) # change the remote url git remote set-url origin https://hostname/username/new-repository-name.git # verify that the remote URL has changed > git remote -v origin https://hostname/username/new-repository-name.git ( fetch ) origin https://hostname/username/new-repository-name.git ( push ) Rolling Back Changes Reverting and Deleting Commits To \"undo\" a commit, run the following two commands: git revert and git reset . git revert creates a new commit that undoes commits while git reset allows deleting commits entirely from the commit history. If you have committed secrets/keys, git reset will remove them from the commit history! 
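For example, to undo a specific commit by creating a new commit that reverses it (a minimal illustration; replace <sha1-commit-id> with the commit you want to undo): git revert <sha1-commit-id>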
To delete the latest commit use HEAD~ : git reset --hard HEAD~1 To delete commits back to a specific commit, use the respective commit id: git reset --hard <sha1-commit-id> after you deleted the unwanted commits, push using force : git push origin HEAD --force Interactive rebase for undoing commits: git rebase -i HEAD~N The above command will open an interactive session in an editor (for example vim) with the last N commits sorted from oldest to newest. To undo a commit, delete the corresponding line of the commit and save the file. Git will rewrite the commits in the order listed in the file and because one (or many) commits were deleted, the commit will no longer be part of the history. Running rebase will locally modify the history, after this one can use force to push the changes to remote without the deleted commit. Using Submodules Submodules can be useful in more complex deployment and/or development scenarios Adding a submodule to your repo git submodule add -b master <your_submodule> Initialize and pull a repo with submodules: git submodule init git submodule update --init --remote git submodule foreach git checkout master git submodule foreach git pull origin Working with Images, Video and Other Binary Content Avoid committing frequently changed binary files, such as large images, video or compiled code to your git repository. Binary content is not diffed like text content, so cloning or pulling from the repository may pull each revision of the binary file. One solution to this problem is Git LFS (Git Large File Storage) - an open source Git extension for versioning large files. You can find more information on Git LFS in the Git LFS and VFS document . Working with Large Repositories When working with a very large repository of which you don't require all the files, you can use VFS for Git - an open source Git extension that virtualize the file system beneath your Git repository, so that you seem to work in a regular working directory but while VFS for Git only downloads objects as they are needed. You can find more information on VFS for Git in the Git LFS and VFS document . Tools Visual Studio Code is a cross-platform powerful source code editor with built in git commands. Within Visual Studio Code editor you can review diffs, stage changes, make commits, pull and push to your git repositories. You can refer to Visual Studio Code Git Support for documentation. Use a shell/terminal to work with Git commands instead of relying on GUI clients . If you're working on Windows, posh-git is a great PowerShell environment for Git. Another option is to use Git bash for Windows . On Linux/Mac, install git and use your favorite shell/terminal.","title":"Git Guidance"},{"location":"source-control/git-guidance/#git-guidance","text":"","title":"Git Guidance"},{"location":"source-control/git-guidance/#what-is-git","text":"Git is a distributed version control system. This means that - unlike SVN or CVS - it doesn't use a central server to synchronize. Instead, every participant has a local copy of the source-code, and the attached history that is kept in sync by comparing commit hashes (SHA hashes of changes between each git commit command) making up the latest version (called HEAD ). 
For example: repo 1 : A -> B -> C -> D -> HEAD repo 2 : A -> B -> HEAD repo 3 : X -> Y -> Z -> HEAD repo 4 : A -> J -> HEAD Since they share a common history, repo 1 and repo 2 can be synchronized fairly easily, repo 4 may be able to synchronize as well, but it's going to have to add a commit (J, and maybe a merge commit) to repo 1. Repo 3 cannot be easily synchronized with the others. Everything related to these commits is stored in a local .git directory in the root of the repository. In other words, by using Git you are simply creating immutable file histories that uniquely identify the current state and therefore allow sharing whatever comes after. It's a Merkle tree . Be sure to run git help after Git installation to find really in-depth explanations of everything.","title":"What is Git?"},{"location":"source-control/git-guidance/#installation","text":"Git is a tool set that must be installed. Install Git and follow the First-Time Git Setup . A recommended installation is the Git Lens extension for Visual Studio Code . Visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more. You can use these commands as well to configure your Git for Visual Studio Code as an editor for merge conflicts and diff tool. git config --global user.name [ YOUR FIRST AND LAST NAME ] git config --global user.email [ YOUR E-MAIL ADDRESS ] git config --global merge.tool vscode git config --global mergetool.vscode.cmd \"code --wait $MERGED \" git config --global diff.tool vscode git config --global difftool.vscode.cmd \"code --wait --diff $LOCAL $REMOTE \"","title":"Installation"},{"location":"source-control/git-guidance/#basic-workflow","text":"A basic Git workflow is as follows; you can find more information on the specific steps below. # pull the latest changes git pull # start a new feature branch based on the develop branch git checkout -b feature/123-add-git-instructions develop # edit some files # add and commit the files git add <file> git commit -m \"add basic instructions\" # edit some files # add and commit the files git add <file> git commit -m \"add more advanced instructions\" # check your changes git status # push the branch to the remote repository git push --set-upstream origin feature/123-add-git-instructions","title":"Basic Workflow"},{"location":"source-control/git-guidance/#cloning","text":"Whenever you want to make a change to a repository, you need to first clone it. Cloning a repository pulls down a full copy of all the repository data, so that you can work on it locally. This copy includes all versions of every file and folder for the project. git clone https://github.com/username/repo-name You only need to clone the repository the first time. Before any subsequent branches you can sync any changes from the remote repository using git pull .","title":"Cloning"},{"location":"source-control/git-guidance/#branching","text":"To avoid adding code that has not been peer reviewed to the main branch (ex. develop ) we typically work in feature branches, and merge these back to the main trunk with a Pull Request. It's even the case that often the main or develop branch of a repository are locked so that you can't make changes without a Pull Request. Therefore, it is useful to create a separate branch for your local/feature work, so that you can work and track your changes in this branch. 
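Before creating a feature branch, it can also help to see which branches already exist and where the trunk currently is. A small sketch using standard commands:
# fetch the latest refs from the remote and prune deleted ones
git fetch --all --prune
# list local and remote branches
git branch --all
# visualize recent history across all branches
git log --oneline --graph --decorate --all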
Pull the latest changes and create a new branch for your work based on the trunk (in this case develop ). git pull git checkout -b feature/feature-name develop At any point, you can move between the branches with git checkout <branch> as long as you have committed or stashed your work. If you forget the name of your branch use git branch --all to list all branches.","title":"Branching"},{"location":"source-control/git-guidance/#committing","text":"To avoid losing work, it is good to commit often in small chunks. This allows you to revert only the last changes if you discover a problem and also neatly explains exactly what changes were made and why. Make changes to your branch Check what files were changed > git status On branch feature/271-basic-commit-info Changes not staged for commit: ( use \"git add <file>...\" to update what will be committed ) ( use \"git restore <file>...\" to discard changes in working directory ) modified: source-control/git-guidance/README.md Track the files you wish to include in the commit. To track all modified files: git add --all Or to track only specific files: git add source-control/git-guidance/README.md Commit the changes to your local branch with a descriptive commit message git commit -m \"add basic git instructions\"","title":"Committing"},{"location":"source-control/git-guidance/#pushing","text":"When you are done working, push your changes to a branch in the remote repository using: git push The first time you push, you first need to set an upstream branch as follows. After the first push, the --set-upstream parameter and branch name are not needed anymore. git push --set-upstream origin feature/feature-name Once the feature branch is pushed to the remote repository, it is visible to anyone with access to the code.","title":"Pushing"},{"location":"source-control/git-guidance/#merging","text":"We encourage the use of Pull Request to merge code to the main repository to make sure that all code in the final product is code reviewed The Pull Request (PR) process in Azure DevOps , GitHub and other similar tools make it easy both to start a PR, review a PR and merge a PR.","title":"Merging"},{"location":"source-control/git-guidance/#merge-conflicts","text":"If multiple people make changes to the same files, you may need to resolve any conflicts that have occurred before you can merge. # check out the develop branch and get the latest changes git checkout develop git pull # check out your branch git checkout <your branch> # merge the develop branch into your branch git merge develop # if merge conflicts occur, above command will fail with a message telling you that there are conflicts to be solved # find which files need to be resolved git status You can start an interactive process that will show which files have conflicts. Sometimes you removed a file, where it was changed in dev. Or you made changes to some lines in a file where another developer made changes as well. If you went through the installation steps mentioned before, Visual Studio Code is set up as merge tool. You can also use a merge tool like kdiff3 . When editing conflicts occur, the process will automatically open Visual Studio Code where the conflicting parts are highlighted in green and blue, and you have make a choice: Accept your changes (current) Accept the changes from dev branch (incoming) Accept them both and fix the code (probably needed) Here are lines that are either unchanged from the common ancestor, or cleanly resolved because only one side changed. 
<<<<<<< yours:sample.txt Conflict resolution is hard; let's go shopping. ======= Git makes conflict resolution easy. >>>>>>> theirs:sample.txt And here is another line that is cleanly resolved or unmodified When this process is completed, make sure you test the result by executing build, checks, test to validate this merged result. # conclude the merge git merge --continue # verify that everything went ok git log # push the changes to the remote branch git push If no other conflicts appear, the PR can now be merged, and your branch deleted. Use squash to reduce your changes into a single commit, so the commit history can be within an acceptable size.","title":"Merge Conflicts"},{"location":"source-control/git-guidance/#stashing-changes","text":"git stash is super handy if you have un-committed changes in your working directory, but you want to work on a different branch. You can run git stash , save the un-committed work, and revert to the HEAD commit. You can retrieve the saved changes by running git stash pop : git stash \u2026 git stash pop Or you can move the current state into a new branch: git stash branch <new_branch_to_save_changes>","title":"Stashing Changes"},{"location":"source-control/git-guidance/#recovering-lost-commits","text":"If you \"lost\" a commit that you want to return to, for example to revert a git rebase where your commits got squashed, you can use git reflog to find the commit: git reflog Then you can use the reflog reference ( HEAD@{} ) to reset to a specific commit before the rebase: git reset HEAD@ { 2 }","title":"Recovering Lost Commits"},{"location":"source-control/git-guidance/#commit-best-practices","text":"A commit combines changes into a logical unit. Adding a descriptive commit message can aid in comprehending the code changes and understanding the rationale behind the modifications. Consider the following when making your commits: Make small commits. This makes changes easier to review, and if we need to revert a commit, we lose less work. Consider splitting the commit into separate commits with git add -p if it includes more than one logical change or bug fix. Don't mix whitespace changes with functional code changes. It is hard to determine if the line has a functional change or only removes a whitespace, so functional changes may go unnoticed. Commit complete and well tested code. Never commit incomplete code, get in the habit of testing your code before committing. Write good commit messages. Why is it necessary? It may fix a bug, add a feature, improve performance, or just be a change for the sake of correctness What effects does this change have? In addition to the obvious ones, this may include benchmarks, side effects etc. You can specify the default git editor, which allows you to write your commit messages using your favorite editor. The following command makes Visual Studio Code your default git editor: git config --global core.editor \"code --wait\"","title":"Commit Best Practices"},{"location":"source-control/git-guidance/#commit-message-structure","text":"The essential parts of a commit message are: subject line: a short description of the commit, maximum 50 characters long body (optional): a longer description of the commit, wrapped at 72 characters, separated from the subject line by a blank line You are free to structure commit messages; however, git commands like git log utilize above structure. Therefore, it can be helpful to follow a convention within your team and to utilize git best. 
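As an illustration, passing -m twice to git commit creates a message with a subject line and a separate body paragraph; the message text here is purely hypothetical:
git commit -m "Add retry logic to the upload client" -m "Transient network failures caused uploads to be dropped silently. Retry up to three times with exponential backoff before surfacing the error to the caller."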
For example, Conventional Commits is a lightweight convention that complements SemVer , by describing the features, fixes, and breaking changes made in commit messages. See Component Versioning for more information on versioning. For more information on commit message conventions, see: A Note About Git Commit Messages Conventional Commits Git commit best practices How to Write a Git Commit Message How to Write Better Git Commit Messages Information in commit messages On commit messages","title":"Commit Message Structure"},{"location":"source-control/git-guidance/#managing-remotes","text":"A local git repository can have one or more backing remote repositories. You can list the remote repositories using git remote - by default, the remote repository you cloned from will be called origin > git remote -v origin https://github.com/microsoft/code-with-engineering-playbook.git ( fetch ) origin https://github.com/microsoft/code-with-engineering-playbook.git ( push )","title":"Managing Remotes"},{"location":"source-control/git-guidance/#working-with-forks","text":"You can set multiple remotes. This is useful for example if you want to work with a forked version of the repository. For more info on how to set upstream remotes and syncing repositories when working with forks see GitHub's Working with forks documentation .","title":"Working with Forks"},{"location":"source-control/git-guidance/#updating-the-remote-if-a-repository-changes-names","text":"If the repository is changed in some way, for example a name change, or if you want to switch between HTTPS and SSH you need to update the remote # list the existing remotes > git remote -v origin https://hostname/username/repository-name.git ( fetch ) origin https://hostname/username/repository-name.git ( push ) # change the remote url git remote set-url origin https://hostname/username/new-repository-name.git # verify that the remote URL has changed > git remote -v origin https://hostname/username/new-repository-name.git ( fetch ) origin https://hostname/username/new-repository-name.git ( push )","title":"Updating the Remote if a Repository Changes Names"},{"location":"source-control/git-guidance/#rolling-back-changes","text":"","title":"Rolling Back Changes"},{"location":"source-control/git-guidance/#reverting-and-deleting-commits","text":"To \"undo\" a commit, run the following two commands: git revert and git reset . git revert creates a new commit that undoes commits while git reset allows deleting commits entirely from the commit history. If you have committed secrets/keys, git reset will remove them from the commit history! To delete the latest commit use HEAD~ : git reset --hard HEAD~1 To delete commits back to a specific commit, use the respective commit id: git reset --hard <sha1-commit-id> after you deleted the unwanted commits, push using force : git push origin HEAD --force Interactive rebase for undoing commits: git rebase -i HEAD~N The above command will open an interactive session in an editor (for example vim) with the last N commits sorted from oldest to newest. To undo a commit, delete the corresponding line of the commit and save the file. Git will rewrite the commits in the order listed in the file and because one (or many) commits were deleted, the commit will no longer be part of the history. 
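To illustrate, the todo list opened by git rebase -i HEAD~3 looks roughly like the following (the hashes and messages are made up); deleting the second line, or changing its pick to drop, removes that commit when the rebase runs:
pick 1a2b3c4 add upload client
pick 5d6e7f8 temporary debug logging
pick 9f8e7d6 add retry logic to upload client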
Running rebase will locally modify the history, after this one can use force to push the changes to remote without the deleted commit.","title":"Reverting and Deleting Commits"},{"location":"source-control/git-guidance/#using-submodules","text":"Submodules can be useful in more complex deployment and/or development scenarios Adding a submodule to your repo git submodule add -b master <your_submodule> Initialize and pull a repo with submodules: git submodule init git submodule update --init --remote git submodule foreach git checkout master git submodule foreach git pull origin","title":"Using Submodules"},{"location":"source-control/git-guidance/#working-with-images-video-and-other-binary-content","text":"Avoid committing frequently changed binary files, such as large images, video or compiled code to your git repository. Binary content is not diffed like text content, so cloning or pulling from the repository may pull each revision of the binary file. One solution to this problem is Git LFS (Git Large File Storage) - an open source Git extension for versioning large files. You can find more information on Git LFS in the Git LFS and VFS document .","title":"Working with Images, Video and Other Binary Content"},{"location":"source-control/git-guidance/#working-with-large-repositories","text":"When working with a very large repository of which you don't require all the files, you can use VFS for Git - an open source Git extension that virtualize the file system beneath your Git repository, so that you seem to work in a regular working directory but while VFS for Git only downloads objects as they are needed. You can find more information on VFS for Git in the Git LFS and VFS document .","title":"Working with Large Repositories"},{"location":"source-control/git-guidance/#tools","text":"Visual Studio Code is a cross-platform powerful source code editor with built in git commands. Within Visual Studio Code editor you can review diffs, stage changes, make commits, pull and push to your git repositories. You can refer to Visual Studio Code Git Support for documentation. Use a shell/terminal to work with Git commands instead of relying on GUI clients . If you're working on Windows, posh-git is a great PowerShell environment for Git. Another option is to use Git bash for Windows . On Linux/Mac, install git and use your favorite shell/terminal.","title":"Tools"},{"location":"source-control/git-guidance/git-lfs-and-vfs/","text":"Using Git LFS and VFS for Git Introduction Git LFS and VFS for Git are solutions for using Git with (large) binary files and large source trees. Git LFS Git is very good and keeping track of changes in text-based files like code, but it is not that good at tracking binary files. For instance, if you store a Photoshop image file (PSD) in a repository, with every change, the complete file is stored again in the history. This can make the history of the Git repo very large, which makes a clone of the repository more and more time-consuming. A solution to work with binary files is using Git LFS (or Git Large File System). This is an extension to Git and must be installed separately, and it can only be used with a repository platform that supports LFS. GitHub.com and Azure DevOps for instance are platforms that have support for LFS. The way it works in short, is that a placeholder file is stored in the repo with information for the LFS system. 
It looks something like this: version https://git-lfs.github.com/spec/v1 oid a747cfbbef63fc0a3f5ffca332ae486ee7bf77c1d1b9b2de02e261ef97d085fe size 4923023 The actual file is stored in a separate storage. This way Git will track changes in this placeholder file, not the large file. The combination of using Git and Git LFS will hide this from the developer though. You will just work with the repository and files as before. When working with these large files yourself, you'll still see the Git history grown on your own machine, as Git will still start tracking these large files locally, but when you clone the repo, the history is actually pretty small. So it's beneficial for others not working directly on the large files. Pros of Git LFS Uses the end to end Git workflow for all files Git LFS supports file locking to avoid conflicts for undiffable assets Git LFS is fully supported in Azure DevOps Services Cons of Git LFS Everyone who contributes to the repository needs to install Git LFS If not set up properly: Binary files committed through Git LFS are not visible as Git will only download the data describing the large file Committing large binaries will push the full binary to the repository Git cannot merge the changes from two different versions of a binary file; file locking mitigates this Azure Repos do not support using SSH for repositories with Git LFS tracked files - for more information see the Git LFS authentication documentation Installation and use of Git LFS Go to https://git-lfs.github.com and download and install the setup from there. For every repository you want to use LFS, you have to go through these steps: Setup LFS for the repo: git lfs install Indicate which files have to be considered as large files (or binary files). As an example, to consider all Photoshop files to be large: git lfs track \"*.psd\" There are more fine-grained ways to indicate files in a folder and more. See the Git LFS Documentation . With these commands a .gitattribute file is created which contains these settings and must be part of the repository. From here on you just use the standard Git commands to work in the repository. The rest will be handled by Git and Git LFS. Common LFS Commands Install Git LFS git lfs install # windows sudo apt-get git-lfs # linux See the Git LFS installation instructions for installation on other systems Track .mp4 files with Git LFS git lfs track '*.mp4' Update the .gitattributes file listing the files and patterns to track *.mp4 filter = lfs diff = lfs merge = lfs -text docs/images/* filter = lfs diff = lfs merge = lfs -text List all patterns tracked git lfs track List all files tracked git lfs ls-files Download files to your working directory git lfs pull git lfs pull --include = \"path/to/file\" VFS for Git Imagine a large repository containing multiple projects, ex. one per feature. As a developer you may only be working on some features, and thus you don't want to download all the projects in the repo. By default, with Git however, cloning the repository means you will download all files/projects. VFS for Git (or Virtual File System for Git) solves this problem, as it will only download what you need to your local machine, but if you look in the file system, e.g. with Windows Explorer, it will show all the folders and files including the correct file sizes. The Git platform must support GVFS to make this work. GitHub.com and Azure DevOps both support this out of the box. Installation and use of VFS for Git Microsoft create VFS for Git and made it open source. 
It can be found at https://github.com/microsoft/VFSForGit . It's only available for Windows. The necessary installers can be found at https://github.com/Microsoft/VFSForGit/releases On the releases page you'll find two important downloads: Git 2.28.0.0 installer, which is a requirement for running VFS for Git. This is not the same as the standard Git for Windows install! SetupGVFS installer. Download those files and install them on your machine. To be able to use VFS for Git for a repository, a .gitattributes file needs to be added to the repo with this line in it: * -text To clone a repository to your machine using VFS for Git you use gvfs instead of git like so: gvfs clone [ URL ] [ dir ] Once this is done, you have a folder which contains a src folder which contains the contents of the repository. This is done because of a practice to put all outputs of build systems outside this tree. This makes it easier to manage .gitignore files and to keep Git performant with lots of files. For working with the repository you just use Git commands as before. To remove a VFS for Git repository from your machine, make sure the VFS process is stopped and execute this command from the main folder: gvfs unmount This will stop the process and unregister it, after that you can safely remove the folder. Resources Git LFS getting started Git LFS manual Git LFS on Azure Repos","title":"Using Git LFS and VFS for Git Introduction"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#using-git-lfs-and-vfs-for-git-introduction","text":"Git LFS and VFS for Git are solutions for using Git with (large) binary files and large source trees.","title":"Using Git LFS and VFS for Git Introduction"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#git-lfs","text":"Git is very good and keeping track of changes in text-based files like code, but it is not that good at tracking binary files. For instance, if you store a Photoshop image file (PSD) in a repository, with every change, the complete file is stored again in the history. This can make the history of the Git repo very large, which makes a clone of the repository more and more time-consuming. A solution to work with binary files is using Git LFS (or Git Large File System). This is an extension to Git and must be installed separately, and it can only be used with a repository platform that supports LFS. GitHub.com and Azure DevOps for instance are platforms that have support for LFS. The way it works in short, is that a placeholder file is stored in the repo with information for the LFS system. It looks something like this: version https://git-lfs.github.com/spec/v1 oid a747cfbbef63fc0a3f5ffca332ae486ee7bf77c1d1b9b2de02e261ef97d085fe size 4923023 The actual file is stored in a separate storage. This way Git will track changes in this placeholder file, not the large file. The combination of using Git and Git LFS will hide this from the developer though. You will just work with the repository and files as before. When working with these large files yourself, you'll still see the Git history grown on your own machine, as Git will still start tracking these large files locally, but when you clone the repo, the history is actually pretty small. 
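If large binaries were committed before LFS tracking was enabled, the existing history can be rewritten so those files become LFS pointers. A hedged sketch; the file pattern is only an example, and because history is rewritten the affected branches need a coordinated force push:
# rewrite history on all refs so matching files are stored in LFS
git lfs migrate import --include="*.psd" --everything
# history has changed, so force push the branches you share (coordinate with the team first)
git push --force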
So it's beneficial for others not working directly on the large files.","title":"Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#pros-of-git-lfs","text":"Uses the end to end Git workflow for all files Git LFS supports file locking to avoid conflicts for undiffable assets Git LFS is fully supported in Azure DevOps Services","title":"Pros of Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#cons-of-git-lfs","text":"Everyone who contributes to the repository needs to install Git LFS If not set up properly: Binary files committed through Git LFS are not visible as Git will only download the data describing the large file Committing large binaries will push the full binary to the repository Git cannot merge the changes from two different versions of a binary file; file locking mitigates this Azure Repos do not support using SSH for repositories with Git LFS tracked files - for more information see the Git LFS authentication documentation","title":"Cons of Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#installation-and-use-of-git-lfs","text":"Go to https://git-lfs.github.com and download and install the setup from there. For every repository you want to use LFS, you have to go through these steps: Setup LFS for the repo: git lfs install Indicate which files have to be considered as large files (or binary files). As an example, to consider all Photoshop files to be large: git lfs track \"*.psd\" There are more fine-grained ways to indicate files in a folder and more. See the Git LFS Documentation . With these commands a .gitattribute file is created which contains these settings and must be part of the repository. From here on you just use the standard Git commands to work in the repository. The rest will be handled by Git and Git LFS.","title":"Installation and use of Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#common-lfs-commands","text":"Install Git LFS git lfs install # windows sudo apt-get git-lfs # linux See the Git LFS installation instructions for installation on other systems Track .mp4 files with Git LFS git lfs track '*.mp4' Update the .gitattributes file listing the files and patterns to track *.mp4 filter = lfs diff = lfs merge = lfs -text docs/images/* filter = lfs diff = lfs merge = lfs -text List all patterns tracked git lfs track List all files tracked git lfs ls-files Download files to your working directory git lfs pull git lfs pull --include = \"path/to/file\"","title":"Common LFS Commands"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#vfs-for-git","text":"Imagine a large repository containing multiple projects, ex. one per feature. As a developer you may only be working on some features, and thus you don't want to download all the projects in the repo. By default, with Git however, cloning the repository means you will download all files/projects. VFS for Git (or Virtual File System for Git) solves this problem, as it will only download what you need to your local machine, but if you look in the file system, e.g. with Windows Explorer, it will show all the folders and files including the correct file sizes. The Git platform must support GVFS to make this work. GitHub.com and Azure DevOps both support this out of the box.","title":"VFS for Git"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#installation-and-use-of-vfs-for-git","text":"Microsoft create VFS for Git and made it open source. It can be found at https://github.com/microsoft/VFSForGit . 
It's only available for Windows. The necessary installers can be found at https://github.com/Microsoft/VFSForGit/releases On the releases page you'll find two important downloads: Git 2.28.0.0 installer, which is a requirement for running VFS for Git. This is not the same as the standard Git for Windows install! SetupGVFS installer. Download those files and install them on your machine. To be able to use VFS for Git for a repository, a .gitattributes file needs to be added to the repo with this line in it: * -text To clone a repository to your machine using VFS for Git you use gvfs instead of git like so: gvfs clone [ URL ] [ dir ] Once this is done, you have a folder which contains a src folder which contains the contents of the repository. This is done because of a practice to put all outputs of build systems outside this tree. This makes it easier to manage .gitignore files and to keep Git performant with lots of files. For working with the repository you just use Git commands as before. To remove a VFS for Git repository from your machine, make sure the VFS process is stopped and execute this command from the main folder: gvfs unmount This will stop the process and unregister it, after that you can safely remove the folder.","title":"Installation and use of VFS for Git"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#resources","text":"Git LFS getting started Git LFS manual Git LFS on Azure Repos","title":"Resources"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"ISE Engineering Fundamentals Playbook An engineer working for a ISE project... Has responsibilities to their team \u2013 mentor, coach, and lead. Knows their playbook . Follows their playbook. Fixes their playbook if it is broken. If they find a better playbook, they copy it. If somebody could use their playbook, they share it. Leads by example. Models the behaviors we desire both interpersonally and technically. Strives to understand how their work fits into a broader context and ensures the outcome. This is our playbook. All contributions are welcome! Please feel free to submit a pull request to get involved. Why Have a Playbook To increase overall efficiency for team members and the whole team in general. To reduce the number of mistakes and avoid common pitfalls. To strive to be better engineers and learn from other people's shared experience. If you do nothing else follow the Engineering Fundamentals Checklist ! The first week of an ISE project is a breakdown of the sections of the playbook according to the structure of an Agile sprint. General Guidance Keep the code quality bar high. Value quality and precision over \u2018getting things done\u2019. Work diligently on the one important thing. As a distributed team take time to share context via wiki, teams and backlog items. Make the simple thing work now. Build fewer features today, but ensure they work amazingly. Then add more features tomorrow. Avoid adding scope to a backlog item, instead add a new backlog item. Our goal is to ship incremental customer value. Keep backlog item details up to date to communicate the state of things with the rest of your team. Report product issues found and provide clear and repeatable engineering feedback! We all own our code and each one of us has an obligation to make all parts of the solution great. 
Contributing See CONTRIBUTING.md for contribution guidelines.","title":"ISE Engineering Fundamentals Playbook"},{"location":"#ise-engineering-fundamentals-playbook","text":"An engineer working for a ISE project... Has responsibilities to their team \u2013 mentor, coach, and lead. Knows their playbook . Follows their playbook. Fixes their playbook if it is broken. If they find a better playbook, they copy it. If somebody could use their playbook, they share it. Leads by example. Models the behaviors we desire both interpersonally and technically. Strives to understand how their work fits into a broader context and ensures the outcome. This is our playbook. All contributions are welcome! Please feel free to submit a pull request to get involved.","title":"ISE Engineering Fundamentals Playbook"},{"location":"#why-have-a-playbook","text":"To increase overall efficiency for team members and the whole team in general. To reduce the number of mistakes and avoid common pitfalls. To strive to be better engineers and learn from other people's shared experience. If you do nothing else follow the Engineering Fundamentals Checklist ! The first week of an ISE project is a breakdown of the sections of the playbook according to the structure of an Agile sprint.","title":"Why Have a Playbook"},{"location":"#general-guidance","text":"Keep the code quality bar high. Value quality and precision over \u2018getting things done\u2019. Work diligently on the one important thing. As a distributed team take time to share context via wiki, teams and backlog items. Make the simple thing work now. Build fewer features today, but ensure they work amazingly. Then add more features tomorrow. Avoid adding scope to a backlog item, instead add a new backlog item. Our goal is to ship incremental customer value. Keep backlog item details up to date to communicate the state of things with the rest of your team. Report product issues found and provide clear and repeatable engineering feedback! We all own our code and each one of us has an obligation to make all parts of the solution great.","title":"General Guidance"},{"location":"#contributing","text":"See CONTRIBUTING.md for contribution guidelines.","title":"Contributing"},{"location":"ISE/","text":"Who is ISE (Industry Solutions Engineering) Our team, ISE (Industry Solutions Engineering), works side-by-side with customers to help them tackle their toughest technical problems both in the cloud and on the edge. We meet customers where they are, work in the languages they use, with the open source frameworks they use, and on the operating systems they use. We work with enterprises and start-ups across many industries from financial services to manufacturing. Our work covers a broad spectrum of domains including IoT, machine learning, and high scale compute. Our \"superpower\" is that we work closely with both our customers\u2019 engineering teams and Microsoft\u2019s product engineering teams, developing real-world expertise that we can use to help our customers grow their business and help Microsoft improve our products and services. We are very community focused in our work, with one foot in Microsoft and one foot in the open source communities that we help. We make pull requests on open source projects to add support for Microsoft platforms and/or improve existing implementations. We build frameworks and other tools to make it easier for developers to use Microsoft platforms. 
We source all the ideas for this work by maintaining very deep connections with these communities and the customers and partners that use them. If you like variety, coding in many languages, using any available tech across our industry, digging in with our customers, hack fests, occasional travel, and telling the story of what you\u2019ve done in blog posts and at conferences, then come talk to us. You can check out some of our work on our Developer Blog","title":"Who is ISE?"},{"location":"ISE/#who-is-ise-industry-solutions-engineering","text":"Our team, ISE (Industry Solutions Engineering), works side-by-side with customers to help them tackle their toughest technical problems both in the cloud and on the edge. We meet customers where they are, work in the languages they use, with the open source frameworks they use, and on the operating systems they use. We work with enterprises and start-ups across many industries from financial services to manufacturing. Our work covers a broad spectrum of domains including IoT, machine learning, and high scale compute. Our \"superpower\" is that we work closely with both our customers\u2019 engineering teams and Microsoft\u2019s product engineering teams, developing real-world expertise that we can use to help our customers grow their business and help Microsoft improve our products and services. We are very community focused in our work, with one foot in Microsoft and one foot in the open source communities that we help. We make pull requests on open source projects to add support for Microsoft platforms and/or improve existing implementations. We build frameworks and other tools to make it easier for developers to use Microsoft platforms. We source all the ideas for this work by maintaining very deep connections with these communities and the customers and partners that use them. If you like variety, coding in many languages, using any available tech across our industry, digging in with our customers, hack fests, occasional travel, and telling the story of what you\u2019ve done in blog posts and at conferences, then come talk to us. You can check out some of our work on our Developer Blog","title":"Who is ISE (Industry Solutions Engineering)"},{"location":"engineering-fundamentals-checklist/","text":"Engineering Fundamentals Checklist This checklist helps to ensure that our projects meet our Engineering Fundamentals. Source Control The default target branch is locked. Merges are done through PRs. PRs reference related work items. Commit history is consistent and commit messages are informative (what, why). Consistent branch naming conventions. Clear documentation of repository structure. Secrets are not part of the commit history or made public. (see Credential scanning ) Public repositories follow the OSS guidelines , see Required files in default branch for public repositories . More details on source control Work Item Tracking All items are tracked in AzDevOps (or similar). The board is organized (swim lanes, feature tags, technology tags). More details on backlog management Testing Unit tests cover the majority of all components (>90% if possible). Integration tests run to test the solution e2e. More details on automated testing CI/CD Project runs CI with automated build and test on each PR. Project uses CD to manage deployments to a replica environment before PRs are merged. Main branch is always shippable. 
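One lightweight way to check whether a suspected secret ever entered the commit history is git's pickaxe search; the search string here is a placeholder:
# list commits on any branch that added or removed the string
git log -S "ConnectionString=" --oneline --all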
More details on continuous integration and continuous delivery Security Access is only granted on an as-needed basis Secrets are stored in secured locations and not checked in to code Data is encrypted in transit (and if necessary at rest) and passwords are hashed Is the system split into logical segments with separation of concerns? This helps limiting security vulnerabilities. More details on security Observability Significant business and functional events are tracked and related metrics collected. Application faults and errors are logged. Health of the system is monitored. The client and server side observability data can be differentiated. Logging configuration can be modified without code changes (eg: verbose mode). Incoming tracing context is propagated to allow for production issue debugging purposes. GDPR compliance is ensured regarding PII (Personally Identifiable Information). More details on observability Agile/Scrum Process Lead (fixed/rotating) runs the daily standup The agile process is clearly defined within team. The Dev Lead (+ PO/Others) are responsible for backlog management and refinement. A working agreement is established between team members and customer. More details on agile development Design Reviews Process for conducting design reviews is included in the Working Agreement . Design reviews for each major component of the solution are carried out and documented, including alternatives. Stories and/or PRs link to the design document. Each user story includes a task for design review by default, which is assigned or removed during sprint planning. Project advisors are invited to design reviews or asked to give feedback to the design decisions captured in documentation. Discover all the reviews that the customer's processes require and plan for them. Clear non-functional requirements captured (see Non-Functional Requirements Guidance ) Risks and opportunities captured (see Risk/Opportunity Management ) More details on design reviews Code Reviews There is a clear agreement in the team as to function of code reviews. The team has a code review checklist or established process. A minimum number of reviewers (usually 2) for a PR merge is enforced by policy. Linters/Code Analyzers, unit tests and successful builds for PR merges are set up. There is a process to enforce a quick review turnaround. More details on code reviews Retrospectives Retrospectives are conducted each week/at the end of each sprint. The team identifies 1-3 proposed experiments to try each week/sprint to improve the process. Experiments have owners and are added to project backlog. The team conducts longer retrospective for Milestones and project completion. More details on retrospectives Engineering Feedback The team submits feedback on business and technical blockers that prevent project success Suggestions for improvements are incorporated in the solution Feedback is detailed and repeatable More details on engineering feedback Developer Experience (DevEx) Developers on the team can: Build/Compile source to verify it is free of syntax errors and compiles. Execute all automated tests (unit, e2e, etc). Start/Launch end-to-end to simulate execution in a deployed environment. Attach a debugger to started solution or running automated tests, set breakpoints, step through code, and inspect variables. Automatically install dependencies by pressing F5 (or equivalent) in their IDE. Use local dev configuration values (i.e. .env, appsettings.development.json). 
More details on developer experience","title":"Engineering Fundamentals Checklist"},{"location":"engineering-fundamentals-checklist/#engineering-fundamentals-checklist","text":"This checklist helps to ensure that our projects meet our Engineering Fundamentals.","title":"Engineering Fundamentals Checklist"},{"location":"engineering-fundamentals-checklist/#source-control","text":"The default target branch is locked. Merges are done through PRs. PRs reference related work items. Commit history is consistent and commit messages are informative (what, why). Consistent branch naming conventions. Clear documentation of repository structure. Secrets are not part of the commit history or made public. (see Credential scanning ) Public repositories follow the OSS guidelines , see Required files in default branch for public repositories . More details on source control","title":"Source Control"},{"location":"engineering-fundamentals-checklist/#work-item-tracking","text":"All items are tracked in AzDevOps (or similar). The board is organized (swim lanes, feature tags, technology tags). More details on backlog management","title":"Work Item Tracking"},{"location":"engineering-fundamentals-checklist/#testing","text":"Unit tests cover the majority of all components (>90% if possible). Integration tests run to test the solution e2e. More details on automated testing","title":"Testing"},{"location":"engineering-fundamentals-checklist/#cicd","text":"Project runs CI with automated build and test on each PR. Project uses CD to manage deployments to a replica environment before PRs are merged. Main branch is always shippable. More details on continuous integration and continuous delivery","title":"CI/CD"},{"location":"engineering-fundamentals-checklist/#security","text":"Access is only granted on an as-needed basis Secrets are stored in secured locations and not checked in to code Data is encrypted in transit (and if necessary at rest) and passwords are hashed Is the system split into logical segments with separation of concerns? This helps limiting security vulnerabilities. More details on security","title":"Security"},{"location":"engineering-fundamentals-checklist/#observability","text":"Significant business and functional events are tracked and related metrics collected. Application faults and errors are logged. Health of the system is monitored. The client and server side observability data can be differentiated. Logging configuration can be modified without code changes (eg: verbose mode). Incoming tracing context is propagated to allow for production issue debugging purposes. GDPR compliance is ensured regarding PII (Personally Identifiable Information). More details on observability","title":"Observability"},{"location":"engineering-fundamentals-checklist/#agilescrum","text":"Process Lead (fixed/rotating) runs the daily standup The agile process is clearly defined within team. The Dev Lead (+ PO/Others) are responsible for backlog management and refinement. A working agreement is established between team members and customer. More details on agile development","title":"Agile/Scrum"},{"location":"engineering-fundamentals-checklist/#design-reviews","text":"Process for conducting design reviews is included in the Working Agreement . Design reviews for each major component of the solution are carried out and documented, including alternatives. Stories and/or PRs link to the design document. Each user story includes a task for design review by default, which is assigned or removed during sprint planning. 
Project advisors are invited to design reviews or asked to give feedback to the design decisions captured in documentation. Discover all the reviews that the customer's processes require and plan for them. Clear non-functional requirements captured (see Non-Functional Requirements Guidance ) Risks and opportunities captured (see Risk/Opportunity Management ) More details on design reviews","title":"Design Reviews"},{"location":"engineering-fundamentals-checklist/#code-reviews","text":"There is a clear agreement in the team as to function of code reviews. The team has a code review checklist or established process. A minimum number of reviewers (usually 2) for a PR merge is enforced by policy. Linters/Code Analyzers, unit tests and successful builds for PR merges are set up. There is a process to enforce a quick review turnaround. More details on code reviews","title":"Code Reviews"},{"location":"engineering-fundamentals-checklist/#retrospectives","text":"Retrospectives are conducted each week/at the end of each sprint. The team identifies 1-3 proposed experiments to try each week/sprint to improve the process. Experiments have owners and are added to project backlog. The team conducts longer retrospective for Milestones and project completion. More details on retrospectives","title":"Retrospectives"},{"location":"engineering-fundamentals-checklist/#engineering-feedback","text":"The team submits feedback on business and technical blockers that prevent project success Suggestions for improvements are incorporated in the solution Feedback is detailed and repeatable More details on engineering feedback","title":"Engineering Feedback"},{"location":"engineering-fundamentals-checklist/#developer-experience-devex","text":"Developers on the team can: Build/Compile source to verify it is free of syntax errors and compiles. Execute all automated tests (unit, e2e, etc). Start/Launch end-to-end to simulate execution in a deployed environment. Attach a debugger to started solution or running automated tests, set breakpoints, step through code, and inspect variables. Automatically install dependencies by pressing F5 (or equivalent) in their IDE. Use local dev configuration values (i.e. .env, appsettings.development.json). More details on developer experience","title":"Developer Experience (DevEx)"},{"location":"the-first-week-of-an-ise-project/","text":"The First Week of an ISE Project The purpose of this document is to: Organize content in the playbook for quick reference and discoverability Provide content in a logical structure which reflects the engineering process Extensible hierarchy to allow teams to share deep subject-matter expertise Before Starting the Project Discuss and start writing the Team Agreements. Update these documents with any process decisions made throughout the project Working Agreement Definition of Ready Definition of Done Estimation Set up the repository/repositories Decide on repository structure/s Add README.md, LICENSE, CONTRIBUTING.md, .gitignore, etc Build a Product Backlog Set up a project in your chosen project management tool (ex. 
Azure DevOps) INVEST in good User Stories and Acceptance Criteria Non-Functional Requirements Guidance Day 1 Plan the first sprint Agree on a sprint goal, and how to measure the sprint progress Determine team capacity Assign user stories to the sprint and split user stories into tasks Set up Work in Progress (WIP) limits Decide on test frameworks and discuss test strategies Discuss the purpose and goals of tests and how to measure test coverage Agree on how to separate unit tests from integration, load and smoke tests Design the first test cases Decide on branch naming Discuss security needs and verify that secrets are kept out of source control Day 2 Set up Source Control Agree on best practices for commits Set up basic Continuous Integration with linters and automated tests Set up meetings for Daily Stand-ups and decide on a Process Lead Discuss purpose, goals, participants and facilitation guidance Discuss timing, and how to run an efficient stand-up If the project has sub-teams, set up a Scrum of Scrums Day 3 Agree on code style and on how to assign Pull Requests Set up Build Validation for Pull Requests (2 reviewers, linters, automated tests) and agree on Definition of Done Agree on a Code Merging strategy and update the CONTRIBUTING.md Agree on logging and observability frameworks and strategies Day 4 Set up Continuous Deployment Determine what environments are appropriate for this solution For each environment discuss purpose, when deployment should trigger, pre-deployment approvers, sing-off for promotion. Decide on a versioning strategy Agree on how to Design a feature and conduct a Design Review Day 5 Conduct a Sprint Demo Conduct a Retrospective Determine required participants, how to capture input (tools) and outcome Set a timeline, and discuss facilitation, meeting structure etc. Refine the Backlog Determine required participants Update the Definition of Ready Update estimates, and the Estimation document Submit Engineering Feedback for issues encountered","title":"The First Week of an ISE Project"},{"location":"the-first-week-of-an-ise-project/#the-first-week-of-an-ise-project","text":"The purpose of this document is to: Organize content in the playbook for quick reference and discoverability Provide content in a logical structure which reflects the engineering process Extensible hierarchy to allow teams to share deep subject-matter expertise","title":"The First Week of an ISE Project"},{"location":"the-first-week-of-an-ise-project/#before-starting-the-project","text":"Discuss and start writing the Team Agreements. Update these documents with any process decisions made throughout the project Working Agreement Definition of Ready Definition of Done Estimation Set up the repository/repositories Decide on repository structure/s Add README.md, LICENSE, CONTRIBUTING.md, .gitignore, etc Build a Product Backlog Set up a project in your chosen project management tool (ex. 
Azure DevOps) INVEST in good User Stories and Acceptance Criteria Non-Functional Requirements Guidance","title":"Before Starting the Project"},{"location":"the-first-week-of-an-ise-project/#day-1","text":"Plan the first sprint Agree on a sprint goal, and how to measure the sprint progress Determine team capacity Assign user stories to the sprint and split user stories into tasks Set up Work in Progress (WIP) limits Decide on test frameworks and discuss test strategies Discuss the purpose and goals of tests and how to measure test coverage Agree on how to separate unit tests from integration, load and smoke tests Design the first test cases Decide on branch naming Discuss security needs and verify that secrets are kept out of source control","title":"Day 1"},{"location":"the-first-week-of-an-ise-project/#day-2","text":"Set up Source Control Agree on best practices for commits Set up basic Continuous Integration with linters and automated tests Set up meetings for Daily Stand-ups and decide on a Process Lead Discuss purpose, goals, participants and facilitation guidance Discuss timing, and how to run an efficient stand-up If the project has sub-teams, set up a Scrum of Scrums","title":"Day 2"},{"location":"the-first-week-of-an-ise-project/#day-3","text":"Agree on code style and on how to assign Pull Requests Set up Build Validation for Pull Requests (2 reviewers, linters, automated tests) and agree on Definition of Done Agree on a Code Merging strategy and update the CONTRIBUTING.md Agree on logging and observability frameworks and strategies","title":"Day 3"},{"location":"the-first-week-of-an-ise-project/#day-4","text":"Set up Continuous Deployment Determine what environments are appropriate for this solution For each environment discuss purpose, when deployment should trigger, pre-deployment approvers, sing-off for promotion. Decide on a versioning strategy Agree on how to Design a feature and conduct a Design Review","title":"Day 4"},{"location":"the-first-week-of-an-ise-project/#day-5","text":"Conduct a Sprint Demo Conduct a Retrospective Determine required participants, how to capture input (tools) and outcome Set a timeline, and discuss facilitation, meeting structure etc. Refine the Backlog Determine required participants Update the Definition of Ready Update estimates, and the Estimation document Submit Engineering Feedback for issues encountered","title":"Day 5"},{"location":"CI-CD/","text":"Continuous Integration and Continuous Delivery Continuous Integration (CI) is the engineering practice of frequently committing code in a shared repository, ideally several times a day, and performing an automated build on it. These changes are built with other simultaneous changes to the system, which enables early detection of integration issues between multiple developers working on a project. Build breaks due to integration failures are treated as the highest priority issue for all the developers on a team and generally work stops until they are fixed. Paired with an automated testing approach, continuous integration also allows us to also test the integrated build such that we can verify that not only does the code base still build correctly, but also is still functionally correct. This is also a best practice for building robust and flexible software systems. Continuous Delivery (CD) takes the Continuous Integration (CI) concept further to also test deployments of the integrated code base on a replica of the environment it will be ultimately deployed on. 
This enables us to learn early about any unforeseen operational issues that arise from our changes as quickly as possible and also learn about gaps in our test coverage. The goal of all of this is to ensure that the main branch is always shippable, meaning that we could, if we needed to, take a build from the main branch of our code base and ship it on production. If these concepts are unfamiliar to you, take a few minutes and read through Continuous Integration and Continuous Delivery . Our expectation is that CI/CD should be used in all the engineering projects that we do with our customers and that we are building, testing, and deploying each change we make to any software system that we are building. For a much deeper understanding of all of these concepts, the books Continuous Integration and Continuous Delivery provide a comprehensive background. Why CI/CD We want to have an automated build and deployment of our software We want automated configuration of all components We want to be able to quickly re-build the environment from scratch in case of disaster We want the latest version of the code to always be deployed to our dev/test environments We want a reliable release strategy, where the policies for release are well understood by all The Fundamentals We run a quality pipeline (with linting, unit tests etc.) on each PR/update of the main branch All cloud resources (including secrets and permissions) are provisioned through infrastructure as code templates \u2013 ex. Terraform, Bicep (ARM), Pulumi etc. All release candidates are deployed to a non-production environment through an automated process (ex Azure DevOps or Github pipelines) Releases are deployed to the production environment through an automated process Release rollbacks are carried out through a repeatable process Our release pipeline runs automated tests, validating all release candidate artifact(s) end-to-end against a non-production environment Tools Azure Pipelines Our tooling at Microsoft has made setting up integration and delivery systems like this easy. If you are unfamiliar with it, take a few moments now to read through Azure Pipelines (Previously VSTS) and for a practical walkthrough of how this works in practice, one example you can read through is CI/CD on Kubernetes with VSTS . Jenkins Jenkins is one of the most commonly used tools across the open source community. It is well-known with hundreds of plugins for every build requirement. Jenkins is free but requires a dedicated server. You can easily create a Jenkins VM using this template TravisCI Travis CI can be used for open source projects at no cost but developers must purchase an enterprise plan for private projects. This service is ideal for validation of PR's on GitHub because it is lightweight and easy to set up with no need for dedicated server setup. It also supports a Build matrix feature which allows accelerating the build and testing process by breaking them into parts. CircleCI CircleCI is a free service for open source projects with no dedicated server required. It is also ideal for validation of PR's on GitHub. CircleCI also allows workflows, parallelism and splitting your tests across any number of containers with a wide array of packages pre-installed on the build containers. 
AppVeyor AppVeyor is another free CI service for open source projects which also supports Windows-based builds.","title":"Continuous Integration and Continuous Delivery"},{"location":"CI-CD/#continuous-integration-and-continuous-delivery","text":"Continuous Integration (CI) is the engineering practice of frequently committing code in a shared repository, ideally several times a day, and performing an automated build on it. These changes are built with other simultaneous changes to the system, which enables early detection of integration issues between multiple developers working on a project. Build breaks due to integration failures are treated as the highest priority issue for all the developers on a team and generally work stops until they are fixed. Paired with an automated testing approach, continuous integration also allows us to also test the integrated build such that we can verify that not only does the code base still build correctly, but also is still functionally correct. This is also a best practice for building robust and flexible software systems. Continuous Delivery (CD) takes the Continuous Integration (CI) concept further to also test deployments of the integrated code base on a replica of the environment it will be ultimately deployed on. This enables us to learn early about any unforeseen operational issues that arise from our changes as quickly as possible and also learn about gaps in our test coverage. The goal of all of this is to ensure that the main branch is always shippable, meaning that we could, if we needed to, take a build from the main branch of our code base and ship it on production. If these concepts are unfamiliar to you, take a few minutes and read through Continuous Integration and Continuous Delivery . Our expectation is that CI/CD should be used in all the engineering projects that we do with our customers and that we are building, testing, and deploying each change we make to any software system that we are building. For a much deeper understanding of all of these concepts, the books Continuous Integration and Continuous Delivery provide a comprehensive background.","title":"Continuous Integration and Continuous Delivery"},{"location":"CI-CD/#why-cicd","text":"We want to have an automated build and deployment of our software We want automated configuration of all components We want to be able to quickly re-build the environment from scratch in case of disaster We want the latest version of the code to always be deployed to our dev/test environments We want a reliable release strategy, where the policies for release are well understood by all","title":"Why CI/CD"},{"location":"CI-CD/#the-fundamentals","text":"We run a quality pipeline (with linting, unit tests etc.) on each PR/update of the main branch All cloud resources (including secrets and permissions) are provisioned through infrastructure as code templates \u2013 ex. Terraform, Bicep (ARM), Pulumi etc. 
All release candidates are deployed to a non-production environment through an automated process (ex Azure DevOps or Github pipelines) Releases are deployed to the production environment through an automated process Release rollbacks are carried out through a repeatable process Our release pipeline runs automated tests, validating all release candidate artifact(s) end-to-end against a non-production environment","title":"The Fundamentals"},{"location":"CI-CD/#tools","text":"","title":"Tools"},{"location":"CI-CD/#azure-pipelines","text":"Our tooling at Microsoft has made setting up integration and delivery systems like this easy. If you are unfamiliar with it, take a few moments now to read through Azure Pipelines (Previously VSTS) and for a practical walkthrough of how this works in practice, one example you can read through is CI/CD on Kubernetes with VSTS .","title":"Azure Pipelines"},{"location":"CI-CD/#jenkins","text":"Jenkins is one of the most commonly used tools across the open source community. It is well-known with hundreds of plugins for every build requirement. Jenkins is free but requires a dedicated server. You can easily create a Jenkins VM using this template","title":"Jenkins"},{"location":"CI-CD/#travisci","text":"Travis CI can be used for open source projects at no cost but developers must purchase an enterprise plan for private projects. This service is ideal for validation of PR's on GitHub because it is lightweight and easy to set up with no need for dedicated server setup. It also supports a Build matrix feature which allows accelerating the build and testing process by breaking them into parts.","title":"TravisCI"},{"location":"CI-CD/#circleci","text":"CircleCI is a free service for open source projects with no dedicated server required. It is also ideal for validation of PR's on GitHub. CircleCI also allows workflows, parallelism and splitting your tests across any number of containers with a wide array of packages pre-installed on the build containers.","title":"CircleCI"},{"location":"CI-CD/#appveyor","text":"AppVeyor is another free CI service for open source projects which also supports Windows-based builds.","title":"AppVeyor"},{"location":"CI-CD/continuous-delivery/","text":"Continuous Delivery The inspiration behind continuous delivery is constantly delivering valuable software to users and developers more frequently. Applying the principles and practices laid out in this readme will help you reduce risk, eliminate manual operations and increase quality and confidence. Deploying software involves the following principles: Provision and manage the cloud environment runtime for your application (cloud resources, infrastructure, hardware, services, etc). Install the target application version across your cloud environments. Configure your application, including any required data. A continuous delivery pipeline is an automated manifestation of your process to streamline these very principles in a consistent and repeatable manner. Goal Follow industry best practices for delivering software changes to customers and developers. Establish consistency for the guiding principles and best practices when assembling continuous delivery workflows. General Guidance Define a Release Strategy It's important to establish a common understanding between the Dev Lead and application stakeholder(s) around the release strategy / design during the planning phase of a project. This common understanding includes the deployment and maintenance of the application throughout its SDLC. 
Release Strategy Principles Continuous Delivery by Jez Humble, David Farley cover the key considerations to follow when creating a release strategy: Parties in charge of deployments to each environment, as well as in charge of the release. An asset and configuration management strategy. An enumeration of the environments available for acceptance, capacity, integration, and user acceptance testing, and the process by which builds will be moved through these environments. A description of the processes to be followed for deployment into testing and production environments, such as change requests to be opened and approvals that need to be granted. A discussion of the method by which the application\u2019s deploy-time and runtime configuration will be managed, and how this relates to the automated deployment process. _Description of the integration with any external systems. At what stage and how are they tested as part of a release? How does the technical operator communicate with the provider in the event of a problem? _A disaster recovery plan so that the application\u2019s state can be recovered following a disaster. Which steps will need to be in place to restart or redeploy the application should it fail. _Production sizing and capacity planning: How much data will your live application create? How many log files or databases will you need? How much bandwidth and disk space will you need? What latency are clients expecting? How the initial deployment to production works. How fixing defects and applying patches to the production environment will be handled. How upgrades to the production environment will be handled, including data migration. How will upgrades be carried out to the application without destroying its state. Application Release and Environment Promotion Your release manifestation process should take the deployable build artifact created from your commit stage and deploy them across all cloud environments, starting with your test environment. The test environment ( often called Integration ) acts as a gate to validate if your test suite completes successfully for all release candidates. This validation should always begin in a test environment while inspecting the deployed release integrated from the feature / release branch containing your code changes. Code changes released into the test environment typically targets the main branch (when doing trunk ) or release branch (when doing gitflow ). The First Deployment The very first deployment of any application should be showcased to the customer in a production-like environment ( UAT ) to solicit feedback early. The UAT environment is used to obtain product owner sign-off acceptance to ultimately promote the release to production. Criteria for a Production-Like Environment Runs the same operating system as production. Has the same software installed as production. Is sized and configured the same way as production. Mirrors production's networking topology. Simulated production-like load tests are executed following a release to surface any latency or throughput degradation. Modeling Your Release Pipeline It's critical to model your test and release process to establish a common understanding between the application engineers and customer stakeholders. Specifically aligning expectations for how many cloud environments need to be pre-provisioned as well as defining sign-off gate roles and responsibilities. 
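One lightweight way to capture this shared model is as plain data that both engineers and customer stakeholders can review in the repo, and from which the pipeline definition can later be generated. The sketch below is purely illustrative; the environment names, triggers, and approver groups are hypothetical and would be replaced by whatever the team agrees on:

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """One stage a release candidate must pass through on its way to production."""
    name: str
    deploy_trigger: str                                   # e.g. "on merge to main", "manual"
    approvers: list[str] = field(default_factory=list)    # RBAC groups that can sign off the gate


# Hypothetical promotion path agreed with the customer; names are examples only.
PROMOTION_PATH = [
    Environment("integration", deploy_trigger="on merge to main"),
    Environment("uat", deploy_trigger="manual", approvers=["product-owner"]),
    Environment("production", deploy_trigger="manual", approvers=["cloud-admins", "release-managers"]),
]


def next_stage(current: str) -> Environment | None:
    """Return the environment a release candidate is promoted to after `current`."""
    names = [env.name for env in PROMOTION_PATH]
    index = names.index(current)
    return PROMOTION_PATH[index + 1] if index + 1 < len(PROMOTION_PATH) else None
```

Keeping the model in a reviewable form like this makes the number of pre-provisioned environments and the gate owners explicit before any pipeline is built.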
Release Pipeline Modeling Considerations Depict all stages an application change would have to go through before it is released to production. Define all release gate controls. Determine customer-specific Cloud RBAC groups which have the authority to approve release candidates per environment. Release Pipeline Stages The stages within your release workflow are ultimately testing a version of your application to validate it can be released in accordance to your acceptance criteria. The release pipeline should account for the following conditions: Release Selection: The developer carrying out application testing should have the capability to select which release version to deploy to the testing environment. Deployment - Release the application deployable build artifact ( created from the CI stage ) to the target cloud environment. Configuration - Applications should be configured consistently across all your environments. This configuration is applied at the time of deployment. Sensitive data like app secrets and certificates should be mastered in a fully managed PaaS key and secret store (eg Key Vault , KMS ). Any secrets used by the application should be sourced internally within the application itself. Application Secrets should not be exposed within the runtime environment. We encourage 12 Factor principles, especially when it comes to configuration management . Data Migration - Pre populate application state and/or data records which is needed for your runtime environment. This may also include test data required for your end-to-end integration test suite. Deployment smoke test. Your smoke test should also verify that your application is pointing to the correct configuration (e.g. production pointing to a UAT Database). Perform any manual or automated acceptance test scenarios. Approve the release gate to promote the application version to the target cloud environment. This promotion should also include the environment's configuration state (e.g. new env settings, feature flags, etc). Live Release Warm Up A release should be running for a period of time before it's considered live and allowed to accept user traffic. These warm up activities may include application server(s) and database(s) pre-fill any dependent cache(s) as well as establish all service connections (eg connection pool allocations, etc ). Pre-production Releases Application release candidates should be deployed to a staging environment similar to production for carrying out final manual/automated tests ( including capacity testing ). Your production and staging / pre-prod cloud environments should be setup at the beginning of your project. Application warm up should be a quantified measurement that's validated as part of your pre-prod smoke tests. Rolling-Back Releases Your release strategy should account for rollback scenarios in the event of unexpected failures following a deployment. Rolling back releases can get tricky, especially when database record/object changes occur in result of your deployment ( either inadvertently or intentionally ). If there are no data changes which need to be backed out, then you can simply trigger a new release candidate for the last known production version and promote that release along your CD pipeline. For rollback scenarios involving data changes, there are several approaches to mitigating this which fall outside the scope of this guide. Some involve database record versioning, time machining database records / objects, etc. 
All data files and databases should be backed up prior to each release so they could be restored. The mitigation strategy for this scenario will vary across our projects. The expectation is that this mitigation strategy should be covered as part of your release strategy. Another approach to consider when designing your release strategy is deployment rings . This approach simplifies rollback scenarios while limiting the impact of your release to end-users by gradually deploying and validating your changes in production. Zero Downtime Releases A hot deployment follows a process of switching users from one release to another with no impact to the user experience. As an example, Azure managed app services allows developers to validate app changes in a staging deployment slot before swapping it with the production slot. App Service slot swapping can also be fully automated once the source slot is fully warmed up (and auto swap is enabled). Slot swapping also simplifies release rollbacks once a technical operator restores the slots to their pre-swap states. Kubernetes natively supports rolling updates . Blue-Green Deployments Blue / Green is a deployment technique which reduces downtime by running two identical instances of a production environment called Blue and Green . Only one of these environments accepts live production traffic at a given time. In the above example, live production traffic is routed to the Green environment. During application releases, the new version is deployed to the blue environment which occurs independently from the Green environment. Live traffic is unaffected from Blue environment releases. You can point your end-to-end test suite against the Blue environment as one of your test checkouts. Migrating users to the new application version is as simple as changing the router configuration to direct all traffic to the Blue environment. This technique simplifies rollback scenarios as we can simply switch the router back to Green. Database providers like Cosmos and Azure SQL natively support data replication to help enable fully synchronized Blue Green database environments. Canary Releasing Canary releasing enables development teams to gather faster feedback when deploying new features to production. These releases are rolled out to a subset of production nodes ( where no users are routed to ) to collect early insights around capacity testing and functional completeness and impact. Once smoke and capacity tests are completed, you can route a small subset of users to the production nodes hosting the release candidate. Canary releases simplify rollbacks as you can avoid routing users to bad application versions. Try to limit the number of versions of your application running parallel in production, as it can complicate maintenance and monitoring controls. Low Code Solutions Low code solutions have increased their participation in the applications and processes and because of that it is required that a proper conjunction of disciplines improve their development. Here is a guide for continuous deployment for Low Code Solutions . Resources Continuous Delivery by Jez Humble, David Farley. Continuous integration vs. continuous delivery vs. continuous deployment Deployment Rings Tools Check out the below tools to help with some CD best practices listed above: Flux for gitops CI/CD workflow using GitOps Tekton for Kubernetes native pipelines Note Jenkins-X uses Tekton under the hood. 
Argo Workflows Flagger for powerful, Kubernetes native releases including blue/green, canary, and A/B testing. Not quite CD related, but checkout jsonnet , a templating language to reduce boilerplate and increase sharing between your yaml/json manifests.","title":"Continuous Delivery"},{"location":"CI-CD/continuous-delivery/#continuous-delivery","text":"The inspiration behind continuous delivery is constantly delivering valuable software to users and developers more frequently. Applying the principles and practices laid out in this readme will help you reduce risk, eliminate manual operations and increase quality and confidence. Deploying software involves the following principles: Provision and manage the cloud environment runtime for your application (cloud resources, infrastructure, hardware, services, etc). Install the target application version across your cloud environments. Configure your application, including any required data. A continuous delivery pipeline is an automated manifestation of your process to streamline these very principles in a consistent and repeatable manner.","title":"Continuous Delivery"},{"location":"CI-CD/continuous-delivery/#goal","text":"Follow industry best practices for delivering software changes to customers and developers. Establish consistency for the guiding principles and best practices when assembling continuous delivery workflows.","title":"Goal"},{"location":"CI-CD/continuous-delivery/#general-guidance","text":"","title":"General Guidance"},{"location":"CI-CD/continuous-delivery/#define-a-release-strategy","text":"It's important to establish a common understanding between the Dev Lead and application stakeholder(s) around the release strategy / design during the planning phase of a project. This common understanding includes the deployment and maintenance of the application throughout its SDLC.","title":"Define a Release Strategy"},{"location":"CI-CD/continuous-delivery/#release-strategy-principles","text":"Continuous Delivery by Jez Humble, David Farley cover the key considerations to follow when creating a release strategy: Parties in charge of deployments to each environment, as well as in charge of the release. An asset and configuration management strategy. An enumeration of the environments available for acceptance, capacity, integration, and user acceptance testing, and the process by which builds will be moved through these environments. A description of the processes to be followed for deployment into testing and production environments, such as change requests to be opened and approvals that need to be granted. A discussion of the method by which the application\u2019s deploy-time and runtime configuration will be managed, and how this relates to the automated deployment process. _Description of the integration with any external systems. At what stage and how are they tested as part of a release? How does the technical operator communicate with the provider in the event of a problem? _A disaster recovery plan so that the application\u2019s state can be recovered following a disaster. Which steps will need to be in place to restart or redeploy the application should it fail. _Production sizing and capacity planning: How much data will your live application create? How many log files or databases will you need? How much bandwidth and disk space will you need? What latency are clients expecting? How the initial deployment to production works. How fixing defects and applying patches to the production environment will be handled. 
How upgrades to the production environment will be handled, including data migration. How will upgrades be carried out to the application without destroying its state.","title":"Release Strategy Principles"},{"location":"CI-CD/continuous-delivery/#application-release-and-environment-promotion","text":"Your release manifestation process should take the deployable build artifact created from your commit stage and deploy them across all cloud environments, starting with your test environment. The test environment ( often called Integration ) acts as a gate to validate if your test suite completes successfully for all release candidates. This validation should always begin in a test environment while inspecting the deployed release integrated from the feature / release branch containing your code changes. Code changes released into the test environment typically targets the main branch (when doing trunk ) or release branch (when doing gitflow ).","title":"Application Release and Environment Promotion"},{"location":"CI-CD/continuous-delivery/#the-first-deployment","text":"The very first deployment of any application should be showcased to the customer in a production-like environment ( UAT ) to solicit feedback early. The UAT environment is used to obtain product owner sign-off acceptance to ultimately promote the release to production.","title":"The First Deployment"},{"location":"CI-CD/continuous-delivery/#criteria-for-a-production-like-environment","text":"Runs the same operating system as production. Has the same software installed as production. Is sized and configured the same way as production. Mirrors production's networking topology. Simulated production-like load tests are executed following a release to surface any latency or throughput degradation.","title":"Criteria for a Production-Like Environment"},{"location":"CI-CD/continuous-delivery/#modeling-your-release-pipeline","text":"It's critical to model your test and release process to establish a common understanding between the application engineers and customer stakeholders. Specifically aligning expectations for how many cloud environments need to be pre-provisioned as well as defining sign-off gate roles and responsibilities.","title":"Modeling Your Release Pipeline"},{"location":"CI-CD/continuous-delivery/#release-pipeline-modeling-considerations","text":"Depict all stages an application change would have to go through before it is released to production. Define all release gate controls. Determine customer-specific Cloud RBAC groups which have the authority to approve release candidates per environment.","title":"Release Pipeline Modeling Considerations"},{"location":"CI-CD/continuous-delivery/#release-pipeline-stages","text":"The stages within your release workflow are ultimately testing a version of your application to validate it can be released in accordance to your acceptance criteria. The release pipeline should account for the following conditions: Release Selection: The developer carrying out application testing should have the capability to select which release version to deploy to the testing environment. Deployment - Release the application deployable build artifact ( created from the CI stage ) to the target cloud environment. Configuration - Applications should be configured consistently across all your environments. This configuration is applied at the time of deployment. Sensitive data like app secrets and certificates should be mastered in a fully managed PaaS key and secret store (eg Key Vault , KMS ). 
Any secrets used by the application should be sourced internally within the application itself. Application Secrets should not be exposed within the runtime environment. We encourage 12 Factor principles, especially when it comes to configuration management . Data Migration - Pre populate application state and/or data records which is needed for your runtime environment. This may also include test data required for your end-to-end integration test suite. Deployment smoke test. Your smoke test should also verify that your application is pointing to the correct configuration (e.g. production pointing to a UAT Database). Perform any manual or automated acceptance test scenarios. Approve the release gate to promote the application version to the target cloud environment. This promotion should also include the environment's configuration state (e.g. new env settings, feature flags, etc).","title":"Release Pipeline Stages"},{"location":"CI-CD/continuous-delivery/#live-release-warm-up","text":"A release should be running for a period of time before it's considered live and allowed to accept user traffic. These warm up activities may include application server(s) and database(s) pre-fill any dependent cache(s) as well as establish all service connections (eg connection pool allocations, etc ).","title":"Live Release Warm Up"},{"location":"CI-CD/continuous-delivery/#pre-production-releases","text":"Application release candidates should be deployed to a staging environment similar to production for carrying out final manual/automated tests ( including capacity testing ). Your production and staging / pre-prod cloud environments should be setup at the beginning of your project. Application warm up should be a quantified measurement that's validated as part of your pre-prod smoke tests.","title":"Pre-production Releases"},{"location":"CI-CD/continuous-delivery/#rolling-back-releases","text":"Your release strategy should account for rollback scenarios in the event of unexpected failures following a deployment. Rolling back releases can get tricky, especially when database record/object changes occur in result of your deployment ( either inadvertently or intentionally ). If there are no data changes which need to be backed out, then you can simply trigger a new release candidate for the last known production version and promote that release along your CD pipeline. For rollback scenarios involving data changes, there are several approaches to mitigating this which fall outside the scope of this guide. Some involve database record versioning, time machining database records / objects, etc. All data files and databases should be backed up prior to each release so they could be restored. The mitigation strategy for this scenario will vary across our projects. The expectation is that this mitigation strategy should be covered as part of your release strategy. Another approach to consider when designing your release strategy is deployment rings . This approach simplifies rollback scenarios while limiting the impact of your release to end-users by gradually deploying and validating your changes in production.","title":"Rolling-Back Releases"},{"location":"CI-CD/continuous-delivery/#zero-downtime-releases","text":"A hot deployment follows a process of switching users from one release to another with no impact to the user experience. As an example, Azure managed app services allows developers to validate app changes in a staging deployment slot before swapping it with the production slot. 
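Returning to the deployment smoke-test step listed in the release pipeline stages above: the check can be as small as a script the pipeline runs right after deployment, probing the release and asserting it is wired to the configuration intended for that environment. This is a minimal sketch under the assumption that the service exposes a health endpoint and an endpoint that reports its configured environment; both endpoint paths are hypothetical:

```python
import json
import sys
import urllib.request


def smoke_test(base_url: str, expected_env: str) -> None:
    """Fail fast if the deployed release is unhealthy or mis-configured."""
    # 1. Basic liveness: the release should answer on its health endpoint (hypothetical path).
    with urllib.request.urlopen(f"{base_url}/healthz", timeout=10) as resp:
        if resp.status != 200:
            sys.exit(f"Health check failed with HTTP {resp.status}")

    # 2. Configuration check: e.g. production must not be pointing at a UAT database.
    with urllib.request.urlopen(f"{base_url}/config/info", timeout=10) as resp:  # hypothetical path
        info = json.load(resp)
    if info.get("environment") != expected_env:
        sys.exit(f"Expected environment '{expected_env}' but release reports '{info.get('environment')}'")

    print("Smoke test passed")


if __name__ == "__main__":
    # Example: python smoke_test.py https://myapp-uat.example.com uat
    smoke_test(sys.argv[1], sys.argv[2])
```

A non-zero exit code here should fail the release stage and block the gate approval.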
App Service slot swapping can also be fully automated once the source slot is fully warmed up (and auto swap is enabled). Slot swapping also simplifies release rollbacks once a technical operator restores the slots to their pre-swap states. Kubernetes natively supports rolling updates .","title":"Zero Downtime Releases"},{"location":"CI-CD/continuous-delivery/#blue-green-deployments","text":"Blue / Green is a deployment technique which reduces downtime by running two identical instances of a production environment called Blue and Green . Only one of these environments accepts live production traffic at a given time. In the above example, live production traffic is routed to the Green environment. During application releases, the new version is deployed to the blue environment which occurs independently from the Green environment. Live traffic is unaffected from Blue environment releases. You can point your end-to-end test suite against the Blue environment as one of your test checkouts. Migrating users to the new application version is as simple as changing the router configuration to direct all traffic to the Blue environment. This technique simplifies rollback scenarios as we can simply switch the router back to Green. Database providers like Cosmos and Azure SQL natively support data replication to help enable fully synchronized Blue Green database environments.","title":"Blue-Green Deployments"},{"location":"CI-CD/continuous-delivery/#canary-releasing","text":"Canary releasing enables development teams to gather faster feedback when deploying new features to production. These releases are rolled out to a subset of production nodes ( where no users are routed to ) to collect early insights around capacity testing and functional completeness and impact. Once smoke and capacity tests are completed, you can route a small subset of users to the production nodes hosting the release candidate. Canary releases simplify rollbacks as you can avoid routing users to bad application versions. Try to limit the number of versions of your application running parallel in production, as it can complicate maintenance and monitoring controls.","title":"Canary Releasing"},{"location":"CI-CD/continuous-delivery/#low-code-solutions","text":"Low code solutions have increased their participation in the applications and processes and because of that it is required that a proper conjunction of disciplines improve their development. Here is a guide for continuous deployment for Low Code Solutions .","title":"Low Code Solutions"},{"location":"CI-CD/continuous-delivery/#resources","text":"Continuous Delivery by Jez Humble, David Farley. Continuous integration vs. continuous delivery vs. continuous deployment Deployment Rings","title":"Resources"},{"location":"CI-CD/continuous-delivery/#tools","text":"Check out the below tools to help with some CD best practices listed above: Flux for gitops CI/CD workflow using GitOps Tekton for Kubernetes native pipelines Note Jenkins-X uses Tekton under the hood. Argo Workflows Flagger for powerful, Kubernetes native releases including blue/green, canary, and A/B testing. 
Not quite CD related, but checkout jsonnet , a templating language to reduce boilerplate and increase sharing between your yaml/json manifests.","title":"Tools"},{"location":"CI-CD/continuous-integration/","text":"Continuous Integration We encourage engineering teams to make an upfront investment during Sprint 0 of a project to establish an automated and repeatable pipeline which continuously integrates code and releases system executable(s) to target cloud environments. Each integration should be verified by an automated build process that asserts a suite of validation tests pass and surface any errors across the developer team. We encourage teams to implement the CI/CD pipelines before any service code is written for customers, which usually happens in Sprint 0(N). This way, the engineering team can develop and test their work in isolation without impacting other developers and promote a consistent devops workflow throughout the engagement. These principles map directly agile software development lifecycle practices . Goals Continuous integration automation is an integral part of the software development lifecycle intended to reduce build integration errors and maximize velocity across a dev crew. A robust build automation pipeline will: Accelerate team velocity Prevent integration problems Avoid last minute chaos during release dates Provide a quick feedback cycle for system-wide impact of local changes Separate build and deployment stages Measure and report metrics around build failures / success(s) Increase visibility across the team enabling tighter communication Reduce human errors, which is probably the most important part of automating the builds Build Definition Managed in Git Code / Manifest Artifacts Required to Build Your Project Should be Maintained Within Your Projects Git Repository CI provider-specific build pipeline definition(s) should reside within your project(s) git repository(s). Build Automation An automated build should encompass the following principles: Build Task A single step within your build pipeline that compiles your code project into a single build artifact. Unit Testing Your build definition includes validation steps to execute a suite of automated unit tests to ensure that application components meets its design and behaves as intended. Code Style Checks Code across an engineering team must be formatted to agreed coding standards. Such standards keep code consistent, and most importantly easy for the team and customer(s) to read and refactor. Code styling consistency encourages collective ownership for project scrum teams and our partners. There are several open source code style validation tools available to choose from ( code style checks , StyleCop ). The Code Review recipes section of the playbook has suggestions for linters and preferred styles for a number of languages. Your code and documentation should avoid the use of non-inclusive language wherever possible. Follow the Inclusive Linting section to ensure your project promotes an inclusive work environment for both the team and for customers. We recommend incorporating security analysis tools within the build stage of your pipeline such as: code credential scanner, security risk detection, static analysis, etc. For Azure DevOps, you can add a security scan task to your pipeline by installing the Microsoft Security Code Analysis Extension . GitHub Actions supports a similar extension with the RIPS security scan solution . Code standards are maintained within a single configuration file. 
There should be a step in your build pipeline that asserts code in the latest commit conforms to the known style definition. Build Script Target A single command should have the capability of building the system. This is also true for builds running on a CI server or on a developers local machine. No IDE Dependencies It's essential to have a build that's runnable through standalone scripts and not dependent on a particular IDE. Build pipeline targets can be triggered locally on their desktops through their IDE of choice. The build process should maintain enough flexibility to run within a CI server as well. As an example, dockerizing your build process offers this level of flexibility as VSCode and IntelliJ supports docker plugin extensions. DevOps Security Checks Introduce security to your project at early stages. Follow the DevSecOps section to introduce security practices, automation, tools and frameworks as part of the CI. Build Environment Dependencies Automated Local Environment Setup We encourage maintaining a consistent developer experience for all team members. There should be a central automated manifest / process that streamlines the installation and setup of any software dependencies. This way developers can replicate the same build environment locally as the one running on a CI server. Build automation scripts often require specific software packages and version pre-installed within the runtime environment of the OS. This presents some challenges as build processes typically version lock these dependencies. All developers on the team should be able to emulate the build environment from their local desktop regardless of their OS. For projects using VS Code, leveraging Dev Containers can really help standardize the local developer experience across the team. Well established software packaging tools like Docker, Maven, npm, etc should be considered when designing your build automation tool chain. Document Local Setup The setup process for setting up a local build environment should be well documented and easy for developers to follow. Infrastructure as Code Manage as much of the following as possible, as code: Configuration Files Configuration Management(ie environment variable automation via terraform ) Secret Management(ie creating Azure secrets via terraform ) Cloud Resource Provisioning Role Assignments Load Test Scenarios Availability Alerting / Monitoring Rules and Conditions Decoupling infrastructure from the application codebase simplifies engineering teams move to cloud native applications. Terraform resource providers like Azure DevOps is making it easier for developers to manage build pipeline variables, service connections and CI/CD pipeline definitions. Sample DevOps Workflow using Terraform and Cobalt Why Repeatable and auditable changes to infrastructure make it easier to roll back to known good configurations and to rapidly expand to new stages and regions without having to hand-wire cloud resources Battle tested and templated IAC reference projects like Cobalt and Bedrock enable more engineering teams deploy secure and scalable solutions at a much more rapid pace Simplify \u201clift and shift\u201d scenarios by abstracting the complexities of cloud-native computing away from application developer teams. IAC DevOPS: Operations by Pull Request The Infrastructure deployment process built around a repo that holds the current expected state of the system / Azure environment. Operational changes are made to the running system by making commits on this repo. 
Git also provides a simple model for auditing deployments and rolling back to a previous state. Infrastructure Advocated Patterns You define infrastructure as code in Terraform / ARM / Ansible templates Templates are repeatable cloud resource stacks with a focus on configuration sets aligned with app scaling and throughput needs. IAC Principles Automate the Azure Environment All cloud resources are provisioned through a set of infrastructure as code templates. This also includes secrets, service configuration settings, role assignments and monitoring conditions. Azure Portal should provide a read-only view on environment resources. Any change applied to the environment should be made through the IAC CI tool-chain only. Provisioning cloud environments should be a repeatable process that's driven off the infrastructure code artifacts checked into our git repository. IAC CI Workflow When the IAC template files change through a git-based workflow, A CI build pipeline builds, validates and reconciles the target infrastructure environment's current state with the expected state. The infrastructure execution plan candidate for these fixed environments are reviewed by a cloud administrator as a gate check prior to the deployment stage of the pipeline applying the execution plan. Developer Read-Only Access to Cloud Resources Developer accounts in the Azure portal should have read-only access to IAC environment resources in Azure. Secret Automation IAC templates are deployed via a CI/CD system that has secrets automation integrated. Avoid applying changes to secrets and/or certificates directly in the Azure Portal. Infrastructure Integration Test Automation End-to-end integration tests are run as part of your IAC CI process to inspect and validate that an azure environment is ready for use. Infrastructure Documentation The deployment and cloud resource template topology should be documented and well understood within the README of the IAC git repo. Local environment and CI workflow setup steps should be documented. Configuration Validation Applications use configuration to allow different runtime behaviors and it\u2019s quite common to use files to store these settings. As developers, we might introduce errors while editing these files which would cause issues for the application to start and/or run correctly. By applying validation techniques on both syntax and semantics of our configuration, we can detect errors before the application is deployed and execute, improving the developer (user) experience. Application Configuration Files Examples JSON, with support for complex data types and data structures YAML, a super set of JSON with support for complex data types and structures TOML, a super set of JSON and a formally specified configuration file format Why Validate Application Configuration as a Separate Step? Easier Debugging & Time saving - With a configuration validation step in our pipeline, we can avoid running the application just to find it fails. It saves time on having to deploy & run, wait and then realize something is wrong in configuration. In addition, it also saves time on going through the logs to figure out what failed and why. Better user/developer experience - A simple reminder to the user that something in the configuration isn't in the right format can make all the difference between the joy of a successful deployment process and the intense frustration of having to guess what went wrong. 
For example, when there is a Boolean value expected, it can either be a string value like \"True\" or \"False\" or an integer value such as \"0\" or \"1\". With configuration validation we make sure the meaning is correct for our application. Avoid data corruption and security breaches - Since the data arrives from an untrusted source, such as a user or an external webservice, it\u2019s particularly important to validate the input. Otherwise, the application runs the risk of errors, data corruption, or, worse, vulnerability to a whole array of injection attacks. What is JSON Schema? JSON Schema is the standard for JSON documents that describes the structure and the requirements of your JSON data. Although it is called JSON Schema, it is also common to use this method for YAML, as YAML is a superset of JSON. The schema itself is very simple: it points out which fields may exist, which are required or optional, and what data format they use. Other validation rules can be added on top of that basic premise, along with human-readable information. The metadata lives in schemas which are .json files as well. In addition, JSON Schema has the widest adoption among all standards for JSON validation as it covers a big part of validation scenarios. It uses easy-to-parse JSON documents for schemas and is easily extensible. How to Implement Schema Validation? Implementing schema validation is divided into two parts - generating the schemas and validating YAML/JSON files against those schemas. Generation There are two options to generate a schema: From code - we can leverage the existing models and objects in the code and generate a customized schema. From data - we can take YAML/JSON samples which reflect the configuration in general and use the various online tools to generate a schema. Validation JSON Schema has 30+ validators for different languages, including 10+ for JavaScript, so there is no need to code it yourself. Integration Validation An effective way to identify bugs in your build at a rapid pace is to invest early in a reliable suite of automated tests that validate the baseline functionality of the system: End-to-End Integration Tests Include tests in your pipeline to validate the build candidate conforms to automated business functionality assertions. Any bugs or broken code should be reported in the test results including the failed test and relevant stack trace. All tests should be invoked through a single command. Keep the build fast. Consider automated test runtime when deciding to pull in dependencies like databases, external services and mock data loading into your test harness. Slow builds often become a bottleneck for dev teams when parallel builds on a CI server are not an option. Consider adding max timeout limits for lengthy validations to fail fast and maintain high velocity across the team. Avoid Checking in Broken Builds Automated build checks, tests, lint runs, etc. should be validated locally before committing your changes to the SCM repo. Test Driven Development is a practice dev crews should consider to help identify bugs and failures as early as possible within the development lifecycle. Reporting Build Failures If the build step happens to fail then the build pipeline run status should be reported as failed, including relevant logs and stack traces. Test Automation Data Dependencies Any mocked dataset(s) used for unit and end-to-end integration tests should be checked into the mainline repository. Minimize any external data dependencies with your build process.
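To make the schema-validation step described earlier in this section concrete, here is a minimal sketch using the jsonschema and PyYAML packages (assumed to be available in the build environment); the file name and schema content are illustrative only:

```python
import sys

import yaml                                   # PyYAML, assumed installed in the build environment
from jsonschema import ValidationError, validate

# Illustrative schema; in practice it lives in its own .json file in the repo.
CONFIG_SCHEMA = {
    "type": "object",
    "properties": {
        "environment": {"type": "string", "enum": ["integration", "uat", "production"]},
        "enable_cache": {"type": "boolean"},              # a real boolean, not "True" or "0"
        "max_retries": {"type": "integer", "minimum": 0},
    },
    "required": ["environment", "enable_cache"],
    "additionalProperties": False,
}


def validate_config(path: str) -> None:
    """Validate a YAML (or JSON) configuration file before the application is ever deployed."""
    with open(path, encoding="utf-8") as handle:
        config = yaml.safe_load(handle)       # YAML is a superset of JSON, so this reads both
    try:
        validate(instance=config, schema=CONFIG_SCHEMA)
    except ValidationError as err:
        sys.exit(f"{path}: invalid configuration - {err.message}")
    print(f"{path}: configuration is valid")


if __name__ == "__main__":
    validate_config(sys.argv[1] if len(sys.argv) > 1 else "appsettings.yaml")  # hypothetical file name
```

For the "from code" generation option mentioned above, libraries that describe configuration as typed models (for example pydantic) can emit a JSON Schema directly from those models, so the schema and the code never drift apart.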
Code Coverage Checks We recommend integrating code coverage tools within your build stage. Most coverage tools fail builds when the test coverage falls below a minimum threshold(80% coverage). The coverage report should be published to your CI system to track a time series of variations. Git Driven Workflow Build on Commit Every commit to the baseline repository should trigger the CI pipeline to create a new build candidate. Build artifact(s) are built, packaged, validated and deployed continuously into a non-production environment per commit. Each commit against the repository results into a CI run which checks out the sources onto the integration machine, initiates a build, and notifies the committer of the result of the build. Avoid Commenting Out Failing Tests Avoid commenting out tests in the mainline branch. By commenting out tests, we get an incorrect indication of the status of the build. Branch Policy Enforcement Protected branch policies should be setup on the main branch to ensure that CI stage(s) have passed prior to starting a code review. Code review approvers will only start reviewing a pull request once the CI pipeline run passes for the latest pushed git commit. Broken builds should block pull request reviews. Prevent commits directly into main branch. Branch Strategy Release branches should auto trigger the deployment of a build artifact to its target cloud environment. You can find additional guidance on the Azure DevOps documentation site under the Manage deployments section Deliver Quickly and Daily \"By committing regularly, every committer can reduce the number of conflicting changes. Checking in a week's worth of work runs the risk of conflicting with other features and can be very difficult to resolve. Early, small conflicts in an area of the system cause team members to communicate about the change they are making.\" In the spirit of transparency and embracing frequent communication across a dev crew, we encourage developers to commit code on a daily cadence. This approach provides visibility to feature progress and accelerates pair programming across the team. Here are some principles to consider: Everyone Commits to the Git Repository Each Day End of day checked-in code should contain unit tests at the minimum. Run the build locally before checking in to avoid CI pipeline failure saturation. You should verify what caused the error, and try to solve it as soon as possible instead of committing your code. We encourage developers to follow a lean SDLC principles . Isolate work into small chunks which ties directly to business value and refactor incrementally. Isolated Environments One of the key goals of build validation is to isolate and identify failures in staging environment(s) and minimize any disruption to live production traffic. Our E2E automated tests should run in an environment which mimics our production environment(as much as possible). This includes consistent software versions, OS, test data volume simulations, network traffic parity with production, etc. Test in a Clone of Production The production environment should be duplicated into a staging environment(QA and/or Pre-Prod) at a minimum. Pull Request Updates Trigger Staged Releases New commits related to a pull request should trigger a build / release into an integration environment. The production environment should be fully isolated from this process. 
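As a concrete illustration of the per-commit validation described above, the quality gate that runs on every pull request can be a single script that developers also run locally before pushing, keeping the local and CI checks identical. This is a sketch only, assuming flake8 and pytest are the agreed linter and test runner; substitute whatever tooling the team has standardized on:

```python
import subprocess
import sys

# Each check is a single command; the same script runs locally and on the CI server.
CHECKS = [
    ("lint", ["flake8", "."]),
    ("unit tests", ["pytest", "tests/unit", "--quiet"]),
    # A coverage threshold can be enforced here too, e.g. via pytest-cov's --cov-fail-under flag.
]


def main() -> int:
    for name, command in CHECKS:
        print(f"Running {name}: {' '.join(command)}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"Quality gate failed at step: {name}", file=sys.stderr)
            return result.returncode
    print("Quality gate passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Wiring the branch policy to require this script's success before review keeps broken builds out of the main branch without adding any manual steps.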
Promote Infrastructure Changes Across Fixed Environments Infrastructure as code changes should be tested in an integration environment and promoted to all staging environment(s) then migrated to production with zero downtime for system users. Testing in Production There are various approaches with safely carrying out automated tests for production deployments. Some of these may include: Feature flagging A/B testing Traffic shifting Developer Access to the Latest Release Artifacts Our devops workflow should enable developers to get, install and run the latest system executable. Release executable(s) should be auto generated as part of our CI/CD pipeline(s). Developers can Access the Latest Executable The latest system executable is available for all developers on the team. There should be a well-known place where developers can reference the release artifact. Release Artifacts are Published for Each Pull Request or Merges into the Main Branch Integration Observability Applied state changes to the mainline build should be made available and communicated across the team. Centralizing logs and status(s) from build and release pipeline failures are essential for developers investigating broken builds. We recommend integrating Teams or Slack with CI/CD pipeline runs which helps keep the team continuously plugged into failures and build candidate status(s). Continuous Integration Top Level Dashboard Modern CI providers have the capability to consolidate and report build status(s) within a given dashboard. Your CI dashboard should be able to correlate a build failure with a git commit. Build Status Badge in the Project Readme There should be a build status badge included in the root README of the project. Build Notifications Your CI process should be configured to send notifications to messaging platforms like Teams / Slack once the build completes. We recommend creating a separate channel to help consolidate and isolate these notifications. Resources Martin Fowler's Continuous Integration Best Practices Bedrock Getting Started Quick Guide Cobalt Quick Start Guide Terraform Azure DevOps Provider Azure DevOps multi stage pipelines Azure Pipeline Key Concepts Azure Pipeline Environments Artifacts in Azure Pipelines Azure Pipeline permission and security roles Azure Environment approvals and checks Terraform Getting Started Guide with Azure Terraform Remote State Azure Setup Terratest - Unit and Integration Infrastructure Framework","title":"Continuous Integration"},{"location":"CI-CD/continuous-integration/#continuous-integration","text":"We encourage engineering teams to make an upfront investment during Sprint 0 of a project to establish an automated and repeatable pipeline which continuously integrates code and releases system executable(s) to target cloud environments. Each integration should be verified by an automated build process that asserts a suite of validation tests pass and surface any errors across the developer team. We encourage teams to implement the CI/CD pipelines before any service code is written for customers, which usually happens in Sprint 0(N). This way, the engineering team can develop and test their work in isolation without impacting other developers and promote a consistent devops workflow throughout the engagement. 
These principles map directly agile software development lifecycle practices .","title":"Continuous Integration"},{"location":"CI-CD/continuous-integration/#goals","text":"Continuous integration automation is an integral part of the software development lifecycle intended to reduce build integration errors and maximize velocity across a dev crew. A robust build automation pipeline will: Accelerate team velocity Prevent integration problems Avoid last minute chaos during release dates Provide a quick feedback cycle for system-wide impact of local changes Separate build and deployment stages Measure and report metrics around build failures / success(s) Increase visibility across the team enabling tighter communication Reduce human errors, which is probably the most important part of automating the builds","title":"Goals"},{"location":"CI-CD/continuous-integration/#build-definition-managed-in-git","text":"","title":"Build Definition Managed in Git"},{"location":"CI-CD/continuous-integration/#code-manifest-artifacts-required-to-build-your-project-should-be-maintained-within-your-projects-git-repository","text":"CI provider-specific build pipeline definition(s) should reside within your project(s) git repository(s).","title":"Code / Manifest Artifacts Required to Build Your Project Should be Maintained Within Your Projects Git Repository"},{"location":"CI-CD/continuous-integration/#build-automation","text":"An automated build should encompass the following principles:","title":"Build Automation"},{"location":"CI-CD/continuous-integration/#build-task","text":"A single step within your build pipeline that compiles your code project into a single build artifact.","title":"Build Task"},{"location":"CI-CD/continuous-integration/#unit-testing","text":"Your build definition includes validation steps to execute a suite of automated unit tests to ensure that application components meets its design and behaves as intended.","title":"Unit Testing"},{"location":"CI-CD/continuous-integration/#code-style-checks","text":"Code across an engineering team must be formatted to agreed coding standards. Such standards keep code consistent, and most importantly easy for the team and customer(s) to read and refactor. Code styling consistency encourages collective ownership for project scrum teams and our partners. There are several open source code style validation tools available to choose from ( code style checks , StyleCop ). The Code Review recipes section of the playbook has suggestions for linters and preferred styles for a number of languages. Your code and documentation should avoid the use of non-inclusive language wherever possible. Follow the Inclusive Linting section to ensure your project promotes an inclusive work environment for both the team and for customers. We recommend incorporating security analysis tools within the build stage of your pipeline such as: code credential scanner, security risk detection, static analysis, etc. For Azure DevOps, you can add a security scan task to your pipeline by installing the Microsoft Security Code Analysis Extension . GitHub Actions supports a similar extension with the RIPS security scan solution . Code standards are maintained within a single configuration file. There should be a step in your build pipeline that asserts code in the latest commit conforms to the known style definition.","title":"Code Style Checks"},{"location":"CI-CD/continuous-integration/#build-script-target","text":"A single command should have the capability of building the system. 
This is also true for builds running on a CI server or on a developers local machine.","title":"Build Script Target"},{"location":"CI-CD/continuous-integration/#no-ide-dependencies","text":"It's essential to have a build that's runnable through standalone scripts and not dependent on a particular IDE. Build pipeline targets can be triggered locally on their desktops through their IDE of choice. The build process should maintain enough flexibility to run within a CI server as well. As an example, dockerizing your build process offers this level of flexibility as VSCode and IntelliJ supports docker plugin extensions.","title":"No IDE Dependencies"},{"location":"CI-CD/continuous-integration/#devops-security-checks","text":"Introduce security to your project at early stages. Follow the DevSecOps section to introduce security practices, automation, tools and frameworks as part of the CI.","title":"DevOps Security Checks"},{"location":"CI-CD/continuous-integration/#build-environment-dependencies","text":"","title":"Build Environment Dependencies"},{"location":"CI-CD/continuous-integration/#automated-local-environment-setup","text":"We encourage maintaining a consistent developer experience for all team members. There should be a central automated manifest / process that streamlines the installation and setup of any software dependencies. This way developers can replicate the same build environment locally as the one running on a CI server. Build automation scripts often require specific software packages and version pre-installed within the runtime environment of the OS. This presents some challenges as build processes typically version lock these dependencies. All developers on the team should be able to emulate the build environment from their local desktop regardless of their OS. For projects using VS Code, leveraging Dev Containers can really help standardize the local developer experience across the team. Well established software packaging tools like Docker, Maven, npm, etc should be considered when designing your build automation tool chain.","title":"Automated Local Environment Setup"},{"location":"CI-CD/continuous-integration/#document-local-setup","text":"The setup process for setting up a local build environment should be well documented and easy for developers to follow.","title":"Document Local Setup"},{"location":"CI-CD/continuous-integration/#infrastructure-as-code","text":"Manage as much of the following as possible, as code: Configuration Files Configuration Management(ie environment variable automation via terraform ) Secret Management(ie creating Azure secrets via terraform ) Cloud Resource Provisioning Role Assignments Load Test Scenarios Availability Alerting / Monitoring Rules and Conditions Decoupling infrastructure from the application codebase simplifies engineering teams move to cloud native applications. 
Terraform resource providers, such as the Azure DevOps provider, are making it easier for developers to manage build pipeline variables, service connections and CI/CD pipeline definitions.","title":"Infrastructure as Code"},{"location":"CI-CD/continuous-integration/#sample-devops-workflow-using-terraform-and-cobalt","text":"","title":"Sample DevOps Workflow using Terraform and Cobalt"},{"location":"CI-CD/continuous-integration/#why","text":"Repeatable and auditable changes to infrastructure make it easier to roll back to known good configurations and to rapidly expand to new stages and regions without having to hand-wire cloud resources Battle tested and templated IAC reference projects like Cobalt and Bedrock enable more engineering teams to deploy secure and scalable solutions at a much more rapid pace Simplify \u201clift and shift\u201d scenarios by abstracting the complexities of cloud-native computing away from application developer teams.","title":"Why"},{"location":"CI-CD/continuous-integration/#iac-devops-operations-by-pull-request","text":"The infrastructure deployment process is built around a repo that holds the current expected state of the system / Azure environment. Operational changes are made to the running system by making commits on this repo. Git also provides a simple model for auditing deployments and rolling back to a previous state.","title":"IAC DevOPS: Operations by Pull Request"},{"location":"CI-CD/continuous-integration/#infrastructure-advocated-patterns","text":"You define infrastructure as code in Terraform / ARM / Ansible templates Templates are repeatable cloud resource stacks with a focus on configuration sets aligned with app scaling and throughput needs.","title":"Infrastructure Advocated Patterns"},{"location":"CI-CD/continuous-integration/#iac-principles","text":"","title":"IAC Principles"},{"location":"CI-CD/continuous-integration/#automate-the-azure-environment","text":"All cloud resources are provisioned through a set of infrastructure as code templates. This also includes secrets, service configuration settings, role assignments and monitoring conditions. Azure Portal should provide a read-only view on environment resources. Any change applied to the environment should be made through the IAC CI tool-chain only. Provisioning cloud environments should be a repeatable process that's driven off the infrastructure code artifacts checked into our git repository.","title":"Automate the Azure Environment"},{"location":"CI-CD/continuous-integration/#iac-ci-workflow","text":"When the IAC template files change through a git-based workflow, a CI build pipeline builds, validates and reconciles the target infrastructure environment's current state with the expected state. The infrastructure execution plan candidate for these fixed environments is reviewed by a cloud administrator as a gate check prior to the deployment stage of the pipeline applying the execution plan.","title":"IAC CI Workflow"},{"location":"CI-CD/continuous-integration/#developer-read-only-access-to-cloud-resources","text":"Developer accounts in the Azure portal should have read-only access to IAC environment resources in Azure.","title":"Developer Read-Only Access to Cloud Resources"},{"location":"CI-CD/continuous-integration/#secret-automation","text":"IAC templates are deployed via a CI/CD system that has secrets automation integrated.
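A sketch of that secret automation in an Azure Pipeline, assuming the secrets already live in a Key Vault (the service connection, vault, and secret names are placeholders):

```yaml
steps:
  # Pull secrets from Key Vault at run time and expose them as secret
  # pipeline variables; nothing sensitive is stored in the YAML or the repo.
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: 'iac-service-connection'   # placeholder service connection
      KeyVaultName: 'my-project-kv'                 # placeholder vault
      SecretsFilter: 'sqlAdminPassword'             # placeholder secret name
      RunAsPreJob: false

  # Illustrative use of the fetched secret in an IaC step.
  - script: terraform apply -auto-approve -var "sql_admin_password=$(sqlAdminPassword)"
    displayName: Apply infrastructure with injected secret
```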
Avoid applying changes to secrets and/or certificates directly in the Azure Portal.","title":"Secret Automation"},{"location":"CI-CD/continuous-integration/#infrastructure-integration-test-automation","text":"End-to-end integration tests are run as part of your IAC CI process to inspect and validate that an azure environment is ready for use.","title":"Infrastructure Integration Test Automation"},{"location":"CI-CD/continuous-integration/#infrastructure-documentation","text":"The deployment and cloud resource template topology should be documented and well understood within the README of the IAC git repo. Local environment and CI workflow setup steps should be documented.","title":"Infrastructure Documentation"},{"location":"CI-CD/continuous-integration/#configuration-validation","text":"Applications use configuration to allow different runtime behaviors and it\u2019s quite common to use files to store these settings. As developers, we might introduce errors while editing these files which would cause issues for the application to start and/or run correctly. By applying validation techniques on both syntax and semantics of our configuration, we can detect errors before the application is deployed and execute, improving the developer (user) experience.","title":"Configuration Validation"},{"location":"CI-CD/continuous-integration/#application-configuration-files-examples","text":"JSON, with support for complex data types and data structures YAML, a super set of JSON with support for complex data types and structures TOML, a super set of JSON and a formally specified configuration file format","title":"Application Configuration Files Examples"},{"location":"CI-CD/continuous-integration/#why-validate-application-configuration-as-a-separate-step","text":"Easier Debugging & Time saving - With a configuration validation step in our pipeline, we can avoid running the application just to find it fails. It saves time on having to deploy & run, wait and then realize something is wrong in configuration. In addition, it also saves time on going through the logs to figure out what failed and why. Better user/developer experience - A simple reminder to the user that something in the configuration isn't in the right format can make all the difference between the joy of a successful deployment process and the intense frustration of having to guess what went wrong. For example, when there is a Boolean value expected, it can either be a string value like \"True\" or \"False\" or an integer value such as \"0\" or \"1\" . With configuration validation we make sure the meaning is correct for our application. Avoid data corruption and security breaches - Since the data arrives from an untrusted source, such as a user or an external webservice, it\u2019s particularly important to validate the input . Otherwise, it will run at the risk of performing errors, corrupting data, or, worse, be vulnerable to a whole array of injection attacks.","title":"Why Validate Application Configuration as a Separate Step?"},{"location":"CI-CD/continuous-integration/#what-is-json-schema","text":"JSON-Schema is the standard of JSON documents that describes the structure and the requirements of your JSON data. Although it is called JSON-Schema, it also common to use this method for YAMLs, as it is a super set of JSON. The schema is very simple; point out which fields might exist, which are required or optional, what data format they use. 
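For illustration only (the field names are invented), a minimal schema for an application configuration could look like the following, written in YAML for readability even though schemas are most commonly stored as .json files:

```yaml
# config.schema.yaml - which fields exist, which are required, and their types
"$schema": "https://json-schema.org/draft/2020-12/schema"
type: object
required: [connectionString, retryCount]
properties:
  connectionString:
    type: string
  retryCount:
    type: integer
    minimum: 0
  enableTracing:
    type: boolean
additionalProperties: false
```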
Other validation rules can be added on top of that basic premise, along with human-readable information. The metadata lives in schemas which are .json files as well. In addition, schema has the widest adoption among all standards for JSON validation as it covers a big part of validation scenarios. It uses easy-to-parse JSON documents for schemas and is easily extensible.","title":"What is Json Schema?"},{"location":"CI-CD/continuous-integration/#how-to-implement-schema-validation","text":"Implementing schema validation is divided in two - the generation of the schemas and the validation of yaml/json files with those schemas.","title":"How to Implement Schema Validation?"},{"location":"CI-CD/continuous-integration/#generation","text":"There are two options to generate a schema: From code - we can leverage the existing models and objects in the code and generate a customized schema. From data - we can take yaml/json samples which reflect the configuration in general and use the various online tools to generate a schema.","title":"Generation"},{"location":"CI-CD/continuous-integration/#validation","text":"The schema has 30+ validators for different languages, including 10+ for JavaScript, so no need to code it yourself.","title":"Validation"},{"location":"CI-CD/continuous-integration/#integration-validation","text":"An effective way to identify bugs in your build at a rapid pace is to invest early into a reliable suite of automated tests that validate the baseline functionality of the system:","title":"Integration Validation"},{"location":"CI-CD/continuous-integration/#end-to-end-integration-tests","text":"Include tests in your pipeline to validate the build candidate conforms to automated business functionality assertions. Any bugs or broken code should be reported in the test results including the failed test and relevant stack trace. All tests should be invoked through a single command. Keep the build fast. Consider automated test runtime when deciding to pull in dependencies like databases, external services and mock data loading into your test harness. Slow builds often become a bottleneck for dev teams when parallel builds on a CI server are not an option. Consider adding max timeout limits for lengthy validations to fail fast and maintain high velocity across the team.","title":"End-to-End Integration Tests"},{"location":"CI-CD/continuous-integration/#avoid-checking-in-broken-builds","text":"Automated build checks, tests, lint runs, etc should be validated locally before committing your changes to the scm repo. Test Driven Development is a practice dev crews should consider to help identify bugs and failures as early as possible within the development lifecycle.","title":"Avoid Checking in Broken Builds"},{"location":"CI-CD/continuous-integration/#reporting-build-failures","text":"If the build step happens to fail then the build pipeline run status should be reported as failed including relevant logs and stack traces.","title":"Reporting Build Failures"},{"location":"CI-CD/continuous-integration/#test-automation-data-dependencies","text":"Any mocked dataset(s) used for unit and end-to-end integration tests should be checked into the mainline repository. Minimize any external data dependencies with your build process.","title":"Test Automation Data Dependencies"},{"location":"CI-CD/continuous-integration/#code-coverage-checks","text":"We recommend integrating code coverage tools within your build stage. 
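One way to wire coverage into the build stage, sketched for a Python project in Azure Pipelines (the tool choice, paths, and threshold are illustrative):

```yaml
steps:
  # Run unit tests with coverage and fail the step when coverage drops below the bar.
  - script: |
      pip install pytest pytest-cov
      pytest --cov=src --cov-report=xml --cov-fail-under=80
    displayName: Unit tests with coverage gate

  # Publish the report so the CI system can track coverage over time.
  - task: PublishCodeCoverageResults@2
    inputs:
      summaryFileLocation: '$(System.DefaultWorkingDirectory)/coverage.xml'
```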
Most coverage tools fail builds when the test coverage falls below a minimum threshold(80% coverage). The coverage report should be published to your CI system to track a time series of variations.","title":"Code Coverage Checks"},{"location":"CI-CD/continuous-integration/#git-driven-workflow","text":"","title":"Git Driven Workflow"},{"location":"CI-CD/continuous-integration/#build-on-commit","text":"Every commit to the baseline repository should trigger the CI pipeline to create a new build candidate. Build artifact(s) are built, packaged, validated and deployed continuously into a non-production environment per commit. Each commit against the repository results into a CI run which checks out the sources onto the integration machine, initiates a build, and notifies the committer of the result of the build.","title":"Build on Commit"},{"location":"CI-CD/continuous-integration/#avoid-commenting-out-failing-tests","text":"Avoid commenting out tests in the mainline branch. By commenting out tests, we get an incorrect indication of the status of the build.","title":"Avoid Commenting Out Failing Tests"},{"location":"CI-CD/continuous-integration/#branch-policy-enforcement","text":"Protected branch policies should be setup on the main branch to ensure that CI stage(s) have passed prior to starting a code review. Code review approvers will only start reviewing a pull request once the CI pipeline run passes for the latest pushed git commit. Broken builds should block pull request reviews. Prevent commits directly into main branch.","title":"Branch Policy Enforcement"},{"location":"CI-CD/continuous-integration/#branch-strategy","text":"Release branches should auto trigger the deployment of a build artifact to its target cloud environment. You can find additional guidance on the Azure DevOps documentation site under the Manage deployments section","title":"Branch Strategy"},{"location":"CI-CD/continuous-integration/#deliver-quickly-and-daily","text":"\"By committing regularly, every committer can reduce the number of conflicting changes. Checking in a week's worth of work runs the risk of conflicting with other features and can be very difficult to resolve. Early, small conflicts in an area of the system cause team members to communicate about the change they are making.\" In the spirit of transparency and embracing frequent communication across a dev crew, we encourage developers to commit code on a daily cadence. This approach provides visibility to feature progress and accelerates pair programming across the team. Here are some principles to consider:","title":"Deliver Quickly and Daily"},{"location":"CI-CD/continuous-integration/#everyone-commits-to-the-git-repository-each-day","text":"End of day checked-in code should contain unit tests at the minimum. Run the build locally before checking in to avoid CI pipeline failure saturation. You should verify what caused the error, and try to solve it as soon as possible instead of committing your code. We encourage developers to follow a lean SDLC principles . Isolate work into small chunks which ties directly to business value and refactor incrementally.","title":"Everyone Commits to the Git Repository Each Day"},{"location":"CI-CD/continuous-integration/#isolated-environments","text":"One of the key goals of build validation is to isolate and identify failures in staging environment(s) and minimize any disruption to live production traffic. 
Our E2E automated tests should run in an environment which mimics our production environment(as much as possible). This includes consistent software versions, OS, test data volume simulations, network traffic parity with production, etc.","title":"Isolated Environments"},{"location":"CI-CD/continuous-integration/#test-in-a-clone-of-production","text":"The production environment should be duplicated into a staging environment(QA and/or Pre-Prod) at a minimum.","title":"Test in a Clone of Production"},{"location":"CI-CD/continuous-integration/#pull-request-updates-trigger-staged-releases","text":"New commits related to a pull request should trigger a build / release into an integration environment. The production environment should be fully isolated from this process.","title":"Pull Request Updates Trigger Staged Releases"},{"location":"CI-CD/continuous-integration/#promote-infrastructure-changes-across-fixed-environments","text":"Infrastructure as code changes should be tested in an integration environment and promoted to all staging environment(s) then migrated to production with zero downtime for system users.","title":"Promote Infrastructure Changes Across Fixed Environments"},{"location":"CI-CD/continuous-integration/#testing-in-production","text":"There are various approaches with safely carrying out automated tests for production deployments. Some of these may include: Feature flagging A/B testing Traffic shifting","title":"Testing in Production"},{"location":"CI-CD/continuous-integration/#developer-access-to-the-latest-release-artifacts","text":"Our devops workflow should enable developers to get, install and run the latest system executable. Release executable(s) should be auto generated as part of our CI/CD pipeline(s).","title":"Developer Access to the Latest Release Artifacts"},{"location":"CI-CD/continuous-integration/#developers-can-access-the-latest-executable","text":"The latest system executable is available for all developers on the team. There should be a well-known place where developers can reference the release artifact.","title":"Developers can Access the Latest Executable"},{"location":"CI-CD/continuous-integration/#release-artifacts-are-published-for-each-pull-request-or-merges-into-the-main-branch","text":"","title":"Release Artifacts are Published for Each Pull Request or Merges into the Main Branch"},{"location":"CI-CD/continuous-integration/#integration-observability","text":"Applied state changes to the mainline build should be made available and communicated across the team. Centralizing logs and status(s) from build and release pipeline failures are essential for developers investigating broken builds. We recommend integrating Teams or Slack with CI/CD pipeline runs which helps keep the team continuously plugged into failures and build candidate status(s).","title":"Integration Observability"},{"location":"CI-CD/continuous-integration/#continuous-integration-top-level-dashboard","text":"Modern CI providers have the capability to consolidate and report build status(s) within a given dashboard. 
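One lightweight way to implement the Teams / Slack integration recommended above is a final step that posts to an incoming-webhook URL held as a secret pipeline variable (the variable name and message format are assumptions for illustration):

```yaml
steps:
  # ...build and test steps...

  # Notify the channel only when an earlier step has failed.
  - script: >
      curl -sS -H "Content-Type: application/json"
      -d "{\"text\": \"Build $(Build.BuildNumber) failed on $(Build.SourceBranchName)\"}"
      "$(TEAMS_WEBHOOK_URL)"
    displayName: Notify team of build failure
    condition: failed()
```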
Your CI dashboard should be able to correlate a build failure with a git commit.","title":"Continuous Integration Top Level Dashboard"},{"location":"CI-CD/continuous-integration/#build-status-badge-in-the-project-readme","text":"There should be a build status badge included in the root README of the project.","title":"Build Status Badge in the Project Readme"},{"location":"CI-CD/continuous-integration/#build-notifications","text":"Your CI process should be configured to send notifications to messaging platforms like Teams / Slack once the build completes. We recommend creating a separate channel to help consolidate and isolate these notifications.","title":"Build Notifications"},{"location":"CI-CD/continuous-integration/#resources","text":"Martin Fowler's Continuous Integration Best Practices Bedrock Getting Started Quick Guide Cobalt Quick Start Guide Terraform Azure DevOps Provider Azure DevOps multi stage pipelines Azure Pipeline Key Concepts Azure Pipeline Environments Artifacts in Azure Pipelines Azure Pipeline permission and security roles Azure Environment approvals and checks Terraform Getting Started Guide with Azure Terraform Remote State Azure Setup Terratest - Unit and Integration Infrastructure Framework","title":"Resources"},{"location":"CI-CD/dev-sec-ops/","text":"DevSecOps The Concept of DevSecOps DevSecOps or DevOps security is about introducing security earlier in the life cycle of application development (a.k.a shift-left), thus minimizing the impact of vulnerabilities and bringing security closer to development team. Why By embracing shift-left mentality, DevSecOps encourages organizations to bridge the gap that often exists between development and security teams to the point where many of the security processes are automated and are effectively handled by the development team. DevSecOps Practices This section covers different tools, frameworks and resources allowing introduction of DevSecOps best practices to your project at early stages of development. Topics covered: Credential Scanning - automatically inspecting a project to ensure that no secrets are included in the project's source code. Secrets Rotation - automated process by which the secret, used by the application, is refreshed and replaced by a new secret. Static Code Analysis - analyze source code or compiled versions of code to help find security flaws. Penetration Testing - a simulated attack against your application to check for exploitable vulnerabilities. Container Dependencies Scanning - search for vulnerabilities in container operating systems, language packages and application dependencies. 
Evaluation of Open Source Libraries - make it harder to apply open source supply chain attacks by evaluating the libraries you use.","title":"DevSecOps"},{"location":"CI-CD/dev-sec-ops/#devsecops","text":"","title":"DevSecOps"},{"location":"CI-CD/dev-sec-ops/#the-concept-of-devsecops","text":"DevSecOps or DevOps security is about introducing security earlier in the life cycle of application development (a.k.a shift-left), thus minimizing the impact of vulnerabilities and bringing security closer to development team.","title":"The Concept of DevSecOps"},{"location":"CI-CD/dev-sec-ops/#why","text":"By embracing shift-left mentality, DevSecOps encourages organizations to bridge the gap that often exists between development and security teams to the point where many of the security processes are automated and are effectively handled by the development team.","title":"Why"},{"location":"CI-CD/dev-sec-ops/#devsecops-practices","text":"This section covers different tools, frameworks and resources allowing introduction of DevSecOps best practices to your project at early stages of development. Topics covered: Credential Scanning - automatically inspecting a project to ensure that no secrets are included in the project's source code. Secrets Rotation - automated process by which the secret, used by the application, is refreshed and replaced by a new secret. Static Code Analysis - analyze source code or compiled versions of code to help find security flaws. Penetration Testing - a simulated attack against your application to check for exploitable vulnerabilities. Container Dependencies Scanning - search for vulnerabilities in container operating systems, language packages and application dependencies. Evaluation of Open Source Libraries - make it harder to apply open source supply chain attacks by evaluating the libraries you use.","title":"DevSecOps Practices"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/","text":"Azure DevOps Service Connection Security Service Connections are used in Azure DevOps Pipelines to connect to external services, like Azure, GitHub, Docker, Kubernetes, and many other services. Service Connections can be used to authenticate to these external services and to invoke diverse types of commands, like create and update resources in Azure, upload container images to Docker, or deploy applications to Kubernetes. To be able to invoke these commands, Service Connections need to have the right permissions to do so, for most types of Service Connections the permissions can be scoped to a subset of resources to limit the access they have. To improve the principle of least privilege, it's often very common to have separate Service Connections for different environments like Dev/Test/QA/Prod. Secure Service Connection Securing Service Connections can be achieved by using several methods. User permissions can be configured to ensure only the correct users can create, view, use, and manage the Service Connection. Pipeline-level permissions can be configured to ensure only approved YAML pipelines are able to use the Service Connection. Project permissions can be configured to ensure only certain Azure DevOps projects are able to use the Service Connection. After using the above methods, what is secured is who can use the Service Connections. What still isn't secured however, is what can be done with the Service Connections. 
Because Service Connections have all the necessary permissions in the external services, it is crucial to secure Service Connections so they cannot be misused by accident or by malicious users. An example of this is a Azure DevOps Pipeline that uses a Service Connection to an Azure Resource Group (or entire subscription) to list all resources and then delete those resources. Without the correct security in place, it could be possible to execute this Pipeline, without any validation or reviews being done. pool : vmImage : ubuntu-latest steps : - task : AzureCLI@2 inputs : azureSubscription : 'Production Service Connection' scriptType : 'pscore' scriptLocation : 'inlineScript' inlineScript : | $resources = az resource list foreach ($resource in $resources) { az resource delete --ids $resource.id } Pipeline Security Caveat YAML pipelines can be triggered without the need for a pull request, this introduces a security risk. In good practice, Pull Requests and Code Reviews should be used to ensure the code that is being deployed, is being reviewed by a second person and potentially automatically being checked for vulnerabilities and other security issues. However, YAML Pipelines can be executed without the need for a Pull Request and Code Reviews. This allows the (malicious) user to make changes using the Service Connection which would normally require a reviewer. The configuration of when a pipeline should be triggered is specified in the YAML Pipeline itself and therefore a pipeline can be configured to execute on changes in a temporary branch. In this temporary branch, any changes made to the pipeline itself will be executed without being reviewed. If the given pipeline has been granted Pipeline-level permissions to use a specific Service Connection, any command can be executed using that Service Connection, without anyone reviewing the command. Since Service Connections can have a lot of permissions in the external service, executing any pipeline without review could potentially have big consequences. Service Connection Checks To prevent accidental mis-use of Service Connections there are several checks that can be configured. These checks are configured on the Service Connection itself and therefore can only be configured by the owner or administrator of that Service Connection. A user of a certain YAML Pipeline cannot modify these checks since the checks are not defined in the YAML file itself. Configuration can be done in the Approvals and Checks menu on the Service Connection. Branch Control By configuring Branch Control on a Service Connection, you can control that the Service Connection can only be used in a YAML Pipeline if the pipeline is running from a specific branch. By configuring Branch Control to only allow the main branch (and potentially release branches) you can ensure a YAML Pipeline can only use the Service Connection after any changes to that pipeline have been merged into the main branch, and therefore has passed any Pull Requests checks and Code Reviews. As an additional check, Branch Control can verify if Branch Protections (like required Pull Requests and Code Reviews) are actually configured on the allowed branches. With Branch Control in place, in combination with Branch Protections, it is not possible anymore to run any commands against a Service Connection without having multiple persons review the commands. Therefore accidental, or malicious, mis-use of the permissions a Service Connection has is not possible anymore. 
Note: When setting a wildcard for the Allowed Branches, anyone could still create a branch matching that wildcard and would be able to use the Service Connection. Using git permissions it can be configured so only administrators are allowed to create certain branches, like release branches.*","title":"Azure DevOps Service Connection Security"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#azure-devops-service-connection-security","text":"Service Connections are used in Azure DevOps Pipelines to connect to external services, like Azure, GitHub, Docker, Kubernetes, and many other services. Service Connections can be used to authenticate to these external services and to invoke diverse types of commands, like create and update resources in Azure, upload container images to Docker, or deploy applications to Kubernetes. To be able to invoke these commands, Service Connections need to have the right permissions to do so, for most types of Service Connections the permissions can be scoped to a subset of resources to limit the access they have. To improve the principle of least privilege, it's often very common to have separate Service Connections for different environments like Dev/Test/QA/Prod.","title":"Azure DevOps Service Connection Security"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#secure-service-connection","text":"Securing Service Connections can be achieved by using several methods. User permissions can be configured to ensure only the correct users can create, view, use, and manage the Service Connection. Pipeline-level permissions can be configured to ensure only approved YAML pipelines are able to use the Service Connection. Project permissions can be configured to ensure only certain Azure DevOps projects are able to use the Service Connection. After using the above methods, what is secured is who can use the Service Connections. What still isn't secured however, is what can be done with the Service Connections. Because Service Connections have all the necessary permissions in the external services, it is crucial to secure Service Connections so they cannot be misused by accident or by malicious users. An example of this is a Azure DevOps Pipeline that uses a Service Connection to an Azure Resource Group (or entire subscription) to list all resources and then delete those resources. Without the correct security in place, it could be possible to execute this Pipeline, without any validation or reviews being done. pool : vmImage : ubuntu-latest steps : - task : AzureCLI@2 inputs : azureSubscription : 'Production Service Connection' scriptType : 'pscore' scriptLocation : 'inlineScript' inlineScript : | $resources = az resource list foreach ($resource in $resources) { az resource delete --ids $resource.id }","title":"Secure Service Connection"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#pipeline-security-caveat","text":"YAML pipelines can be triggered without the need for a pull request, this introduces a security risk. In good practice, Pull Requests and Code Reviews should be used to ensure the code that is being deployed, is being reviewed by a second person and potentially automatically being checked for vulnerabilities and other security issues. However, YAML Pipelines can be executed without the need for a Pull Request and Code Reviews. This allows the (malicious) user to make changes using the Service Connection which would normally require a reviewer. 
The configuration of when a pipeline should be triggered is specified in the YAML Pipeline itself and therefore a pipeline can be configured to execute on changes in a temporary branch. In this temporary branch, any changes made to the pipeline itself will be executed without being reviewed. If the given pipeline has been granted Pipeline-level permissions to use a specific Service Connection, any command can be executed using that Service Connection, without anyone reviewing the command. Since Service Connections can have a lot of permissions in the external service, executing any pipeline without review could potentially have big consequences.","title":"Pipeline Security Caveat"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#service-connection-checks","text":"To prevent accidental mis-use of Service Connections there are several checks that can be configured. These checks are configured on the Service Connection itself and therefore can only be configured by the owner or administrator of that Service Connection. A user of a certain YAML Pipeline cannot modify these checks since the checks are not defined in the YAML file itself. Configuration can be done in the Approvals and Checks menu on the Service Connection.","title":"Service Connection Checks"},{"location":"CI-CD/dev-sec-ops/azure-devops-service-connection-security/#branch-control","text":"By configuring Branch Control on a Service Connection, you can control that the Service Connection can only be used in a YAML Pipeline if the pipeline is running from a specific branch. By configuring Branch Control to only allow the main branch (and potentially release branches) you can ensure a YAML Pipeline can only use the Service Connection after any changes to that pipeline have been merged into the main branch, and therefore has passed any Pull Requests checks and Code Reviews. As an additional check, Branch Control can verify if Branch Protections (like required Pull Requests and Code Reviews) are actually configured on the allowed branches. With Branch Control in place, in combination with Branch Protections, it is not possible anymore to run any commands against a Service Connection without having multiple persons review the commands. Therefore accidental, or malicious, mis-use of the permissions a Service Connection has is not possible anymore. Note: When setting a wildcard for the Allowed Branches, anyone could still create a branch matching that wildcard and would be able to use the Service Connection. Using git permissions it can be configured so only administrators are allowed to create certain branches, like release branches.*","title":"Branch Control"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/","text":"Dependency and Container Scanning Dependency and Container scanning is performed in order to search for vulnerabilities in operating systems, language and application packages. Why Dependency and Container Scanning Container images are standard application delivery format in cloud-native environments. Having a broad selection of images from the community, we often choose a community base image, and then add packages that we need to it, which might also come from community sources. Those arbitrary dependencies might introduce vulnerabilities to our image and application. Applying Dependency and Container Scanning Images that contain software with security vulnerabilities become exploitable at runtime. 
When building an image in your CI pipeline, image scanning must be a requirement for a build to pass. Images that did not pass scanning should never be pushed to your production-accessible container registry. Dependency and Container scanning best practices: Base Image - if your image is built on top of a third-party base image, validate the following: The image comes from a well-known company or open-source group. It is hosted on a reputable registry. The Dockerfile is available, and check for dependencies installed in it. The image is frequently updated - old images might not contain the latest security updates. Remove Non-Essential Software - Start with a minimal base image and install only the tools, libraries and configuration files that are required by your application. Avoid installing the following tools or remove them if present: - Network tools and clients: e.g., wget, curl, netcat, ssh. - Shells: e.g. sh, bash. Note that removing shells also prevents the use of shell scripts at runtime. Instead, use an executable when possible. - Compilers and debuggers. These should be used only in build and development containers, but never in production containers. Container images should be immutable - download and include all the required dependencies during the image build. Scan for vulnerabilities in software dependencies - today there is likely no software project without some form of external libraries, dependencies or open source. While it allows the development team to focus on their application code, the dependency brings forth an expected downside where the security posture of the real application is now resting on it. To detect vulnerabilities contained within a project\u2019s dependencies use container scanning tools which as part of their analysis scan the software dependencies (see \"Dependency and Container Scanning Frameworks and Tools\"). Dependency and Container Scanning Frameworks and Tools Trivy - a simple and comprehensive vulnerability scanner for containers (doesn't support Windows containers) Aqua - dependency and container scanning for applications running on AKS, ACI and Windows Containers. Has an integration with AzDO pipelines. Dependency-Check Plugin for SonarQube - OnPrem dependency scanning Mend (previously WhiteSource) - Open Source Scanning Software Conclusion A powerful technology such as containers should be used carefully. Install the minimal requirements needed for your application, be aware of the software dependencies your application is using and make sure to maintain it over time by using container and dependencies scanning tools.","title":"Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#dependency-and-container-scanning","text":"Dependency and Container scanning is performed in order to search for vulnerabilities in operating systems, language and application packages.","title":"Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#why-dependency-and-container-scanning","text":"Container images are standard application delivery format in cloud-native environments. Having a broad selection of images from the community, we often choose a community base image, and then add packages that we need to it, which might also come from community sources. 
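To make image scanning the hard gate described above, a step along these lines can sit between building and pushing the image (Trivy is one of the tools listed in this section; the image variable and severity gate are illustrative):

```yaml
steps:
  - script: docker build -t $(imageName):$(Build.BuildId) .
    displayName: Build image

  # Fail the run on HIGH or CRITICAL findings so a vulnerable image
  # never reaches the production-accessible registry.
  - script: >
      trivy image --exit-code 1 --severity HIGH,CRITICAL
      $(imageName):$(Build.BuildId)
    displayName: Scan image before push
```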
Those arbitrary dependencies might introduce vulnerabilities to our image and application.","title":"Why Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#applying-dependency-and-container-scanning","text":"Images that contain software with security vulnerabilities become exploitable at runtime. When building an image in your CI pipeline, image scanning must be a requirement for a build to pass. Images that did not pass scanning should never be pushed to your production-accessible container registry. Dependency and Container scanning best practices: Base Image - if your image is built on top of a third-party base image, validate the following: The image comes from a well-known company or open-source group. It is hosted on a reputable registry. The Dockerfile is available, and check for dependencies installed in it. The image is frequently updated - old images might not contain the latest security updates. Remove Non-Essential Software - Start with a minimal base image and install only the tools, libraries and configuration files that are required by your application. Avoid installing the following tools or remove them if present: - Network tools and clients: e.g., wget, curl, netcat, ssh. - Shells: e.g. sh, bash. Note that removing shells also prevents the use of shell scripts at runtime. Instead, use an executable when possible. - Compilers and debuggers. These should be used only in build and development containers, but never in production containers. Container images should be immutable - download and include all the required dependencies during the image build. Scan for vulnerabilities in software dependencies - today there is likely no software project without some form of external libraries, dependencies or open source. While it allows the development team to focus on their application code, the dependency brings forth an expected downside where the security posture of the real application is now resting on it. To detect vulnerabilities contained within a project\u2019s dependencies use container scanning tools which as part of their analysis scan the software dependencies (see \"Dependency and Container Scanning Frameworks and Tools\").","title":"Applying Dependency and Container Scanning"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#dependency-and-container-scanning-frameworks-and-tools","text":"Trivy - a simple and comprehensive vulnerability scanner for containers (doesn't support Windows containers) Aqua - dependency and container scanning for applications running on AKS, ACI and Windows Containers. Has an integration with AzDO pipelines. Dependency-Check Plugin for SonarQube - OnPrem dependency scanning Mend (previously WhiteSource) - Open Source Scanning Software","title":"Dependency and Container Scanning Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/dependency-and-container-scanning/#conclusion","text":"A powerful technology such as containers should be used carefully. 
Install the minimal requirements needed for your application, be aware of the software dependencies your application is using and make sure to maintain it over time by using container and dependencies scanning tools.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/","text":"Evaluate Open Source Software Given the rise in threat of open source software supply chain attacks , developers should identify potential candidates for open-source dependencies and evaluate them against your needs and the required security posture. Why Evaluate Open Source Software Open source software is a critical part of modern software development. It is important to evaluate the open source software uses to ensure it meets the needs and is secure. Security is not a given with open source software, and furthermore, what is secure today may not be secure tomorrow so scanning dependencies for known vulnerabilities doesn't always cover all bases. This is why we need to look for evidence of a strong security posture and a commitment to security from the maintainers of the open source software we use. When to Evaluate Open Source Software You should evaluate open source software before you use it in your project. This is especially important if the software is a dependency of your project, as it can introduce security vulnerabilities and other issues into your project. Code reviewers should also be aware of the open source software used in the project and be able to use the tools and resources mentioned below to evaluate the security of the open source software that is being added to the project. Applying Open Source Software Evaluation When evaluating open source software, consider the following: Can you avoid adding it as a dependency? The best dependency is the one you don't have. Is it maintained? How often and at what engineering rigor (i.e. code reviews, branch protection, tests) Is there evidence that effort is taken to make it secure? Can you find a reference that it is used significantly downstream by other projects or is referenced by known and trusted documentation? How many stars and forks does it have on GitHub? Is it easy to use securely? Does the license allow you to use it in your project? Are there instructions on how to report vulnerabilities? Does it have any known vulnerabilities or security issues? Are its dependencies secure, or at least up to date and actively maintained? Has it been audited by a third party such as the OpenSSF Security Reviews ? Tools for Evaluating Open Source Software OpenSSF Scorecards - This tool actually automates some of the checks in the list above and can be used to evaluate the security posture of open source projects. This can run as a GitHub action or in the Command Line Interface (CLI) to provide a security scorecard for open source projects. Note which metrics are important to you, your organization and the customer's. This tool is used by known open source program offices (OSPO) for measuring open source contributions by their employees. OWASP Dependency-Check - a software composition analysis utility that identifies project dependencies and checks if there are any known, publicly disclosed, vulnerabilities. 
Concise Guide for Evaluating Open Source Software - a guide to help you expand upon the knowledge in this page to evaluate open source software.","title":"Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#evaluate-open-source-software","text":"Given the rise in threat of open source software supply chain attacks , developers should identify potential candidates for open-source dependencies and evaluate them against your needs and the required security posture.","title":"Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#why-evaluate-open-source-software","text":"Open source software is a critical part of modern software development. It is important to evaluate the open source software uses to ensure it meets the needs and is secure. Security is not a given with open source software, and furthermore, what is secure today may not be secure tomorrow so scanning dependencies for known vulnerabilities doesn't always cover all bases. This is why we need to look for evidence of a strong security posture and a commitment to security from the maintainers of the open source software we use.","title":"Why Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#when-to-evaluate-open-source-software","text":"You should evaluate open source software before you use it in your project. This is especially important if the software is a dependency of your project, as it can introduce security vulnerabilities and other issues into your project. Code reviewers should also be aware of the open source software used in the project and be able to use the tools and resources mentioned below to evaluate the security of the open source software that is being added to the project.","title":"When to Evaluate Open Source Software"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#applying-open-source-software-evaluation","text":"When evaluating open source software, consider the following: Can you avoid adding it as a dependency? The best dependency is the one you don't have. Is it maintained? How often and at what engineering rigor (i.e. code reviews, branch protection, tests) Is there evidence that effort is taken to make it secure? Can you find a reference that it is used significantly downstream by other projects or is referenced by known and trusted documentation? How many stars and forks does it have on GitHub? Is it easy to use securely? Does the license allow you to use it in your project? Are there instructions on how to report vulnerabilities? Does it have any known vulnerabilities or security issues? Are its dependencies secure, or at least up to date and actively maintained? Has it been audited by a third party such as the OpenSSF Security Reviews ?","title":"Applying Open Source Software Evaluation"},{"location":"CI-CD/dev-sec-ops/evaluate-open-source-software/#tools-for-evaluating-open-source-software","text":"OpenSSF Scorecards - This tool actually automates some of the checks in the list above and can be used to evaluate the security posture of open source projects. This can run as a GitHub action or in the Command Line Interface (CLI) to provide a security scorecard for open source projects. Note which metrics are important to you, your organization and the customer's. This tool is used by known open source program offices (OSPO) for measuring open source contributions by their employees. 
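Since Scorecards also runs from the CLI, a scheduled pipeline step is one way to keep an eye on a key dependency; a sketch using the container image published by the OpenSSF project (the repository and token variable are placeholders):

```yaml
steps:
  # GITHUB_AUTH_TOKEN is assumed to be a secret pipeline variable holding a GitHub token.
  - script: |
      docker run --rm -e GITHUB_AUTH_TOKEN=$(GITHUB_AUTH_TOKEN) \
        gcr.io/openssf/scorecard:stable \
        --repo=github.com/example-org/example-dependency --format=json > scorecard.json
    displayName: Run OpenSSF Scorecard against a dependency
```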
OWASP Dependency-Check - a software composition analysis utility that identifies project dependencies and checks if there are any known, publicly disclosed, vulnerabilities. Concise Guide for Evaluating Open Source Software - a guide to help you expand upon the knowledge in this page to evaluate open source software.","title":"Tools for Evaluating Open Source Software"},{"location":"CI-CD/dev-sec-ops/penetration-testing/","text":"Penetration Testing A penetration test is a simulated attack against your application to check for exploitable security issues. Why Penetration Testing Penetration testing is performed on a running application. As such, it tests the application E2E with all of its layers. Its output is a real simulated attack on the application that succeeded; therefore it is a critical issue in your application and should be addressed as soon as possible. Applying Penetration Testing Many organizations perform manual penetration testing. But new vulnerabilities are found every day. Therefore, it is a good practice to have automated penetration testing performed. To achieve this automation, use penetration testing tools to uncover vulnerabilities, such as unsanitized inputs that are susceptible to code injection attacks. Insights provided by the penetration test can then be used to fine-tune your WAF security policies and patch detected vulnerabilities. Penetration Testing Frameworks and Tools OWASP Zed Attack Proxy (ZAP) - OWASP penetration testing tool for web applications. Conclusion Penetration testing is essential to check for vulnerabilities in your application and protect it from simulated attacks. Insights provided by penetration testing can identify weak spots in an organization's security posture, as well as measure the compliance of its security policy, test the staff's awareness of security issues and determine whether -- and how -- the organization would be subject to security disasters.","title":"Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#penetration-testing","text":"A penetration test is a simulated attack against your application to check for exploitable security issues.","title":"Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#why-penetration-testing","text":"Penetration testing is performed on a running application. As such, it tests the application E2E with all of its layers. Its output is a real simulated attack on the application that succeeded; therefore it is a critical issue in your application and should be addressed as soon as possible.","title":"Why Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#applying-penetration-testing","text":"Many organizations perform manual penetration testing. But new vulnerabilities are found every day. Therefore, it is a good practice to have automated penetration testing performed. To achieve this automation, use penetration testing tools to uncover vulnerabilities, such as unsanitized inputs that are susceptible to code injection attacks.
Insights provided by the penetration test can then be used to fine-tune your WAF security policies and patch detected vulnerabilities.","title":"Applying Penetration Testing"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#penetration-testing-frameworks-and-tools","text":"OWASP Zed Attack Proxy (ZAP) - OWASP penetration testing tool for web applications.","title":"Penetration Testing Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/penetration-testing/#conclusion","text":"Penetration testing is essential to check for vulnerabilities in your application and protect it from simulated attacks. Insights provided by Penetration testing can identify weak spots in an organization's security posture, as well as measure the compliance of its security policy, test the staff's awareness of security issues and determine whether -- and how -- the organization would be subject to security disasters.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/","text":"Secrets Management Secret management refers to the tools and practices used to manage digital authentication credentials (like API keys, tokens, passwords, and certificates). These secrets are used to protect access to sensitive data and services, making their management critical for security. We should assume any repo we work on may go public at any time and protect our secrets, even if the repo is initially private. Importance of Secrets Management In modern software development, applications often need to interact with other software components, APIs, and services. These interactions often require authentication, which is typically handled using secrets. If these secrets are not managed properly, they can be exposed, leading to potential security breaches. Best Practices for Secrets Management Centralized Secret Storage: Store all secrets in a centralized, encrypted location. This reduces the risk of secrets being lost or exposed. Access Control: Implement strict access control policies. Only authorized entities should have access to secrets. Rotation of Secrets: Regularly change secrets to reduce the risk if a secret is compromised. Audit Trails: Keep a record of when and who accessed which secret. This can help in identifying suspicious activities. Automated Secret Management: Automate the processes of secret creation, rotation, and deletion. This reduces the risk of human error. Remember, the goal of secret management is to protect sensitive information from unauthorized access and potential security threats. General Approach The general approach is to keep secrets in separate configuration files that are not checked in to the repo. Add the files to the .gitignore to prevent that they're checked in. Each developer maintains their own local version of the file or, if required, circulate them via private channels e.g. a Teams chat. In a production system, assuming Azure, create the secrets in the environment of the running process. We can do this by manually editing the 'Applications Settings' section of the resource, but a script using the Azure CLI to do the same is a useful time-saving utility. See az webapp config appsettings for more details. It's best practice to maintain separate secrets configurations for each environment that you run. e.g. dev, test, prod, local etc The secrets-per-branch recipe describes a simple way to manage separate secrets configurations for each environment. Note: even if the secret was only pushed to a feature branch and never merged, it's still a part of the git history. 
Follow these instructions to remove any sensitive data and/or regenerate any keys and other sensitive information added to the repo. If a key or secret made it into the code base, rotate the key/secret so that it's no longer active Keeping Secrets Secret The care taken to protect our secrets applies both to how we get and store them, but also to how we use them. Don't log secrets Don't put them in reporting Don't send them to other applications, as part of URLs, forms, or in any other way other than to make a request to the service that requires that secret Enhanced-Security Applications The techniques outlined below provide good security and a common pattern for a wide range of languages. They rely on the fact that Azure keeps application settings (the environment) encrypted until your app runs. They do not prevent secrets from existing in plaintext in memory at runtime. In particular, for garbage collected languages those values may exist for longer than the lifetime of the variable, and may be visible when debugging a memory dump of the process. If you are working on an application with enhanced security requirements you should consider using additional techniques to maintain encryption on secrets throughout the application lifetime. Always rotate encryption keys on a regular basis. Techniques for Secrets Management These techniques make the loading of secrets transparent to the developer. C#/.NET Modern .NET Solution For .NET SDK (version 2.0 or higher) we have dotnet secrets , a tool provided by the .NET SDK that allows you to manage and protect sensitive information, such as API keys, connection strings, and other secrets, during development. The secrets are stored securely on your machine and can be accessed by your .NET applications. # Initialize dotnet secret dotnet user-secrets init # Adding secret # dotnet user-secrets set <KEY> <VALUE> dotnet user-secrets set ExternalServiceApiKey my-api-key-12345 # Update Secret dotnet user-secrets set ExternalServiceApiKey updated-api-key-67890 To access the secrets; using Microsoft.Extensions.Configuration ; var builder = new ConfigurationBuilder () . AddUserSecrets < Startup > (); var configuration = builder . Build (); var externalServiceApiKey = configuration [ \"ExternalServiceApiKey\" ]; Deployment Considerations When deploying your application to production, it's essential to ensure that your secrets are securely managed. Here are some deployment-related implications: Remove Development Secrets: Before deploying to production, remove any development secrets from your application configuration. You can use environment variables or a more secure secret management solution like Azure Key Vault or AWS Secrets Manager in production. Secure Deployment: Ensure that your production server is secure, and access to secrets is controlled. Never store secrets directly in source code or configuration files. Key Rotation: Consider implementing a secret rotation policy to regularly update your secrets in production. .NET Framework Solution Use the file attribute of the appSettings element to load secrets from a local file. <?xml version=\"1.0\" encoding=\"utf-8\"?> <configuration> <appSettings file= \"..\\..\\secrets.config\" > \u2026 </appSettings> <startup> <supportedRuntime version= \"v4.0\" sku= \".NETFramework,Version=v4.6.1\" /> </startup> \u2026 </configuration> Access secrets: static void Main ( string [] args ) { String mySecret = System . Configuration . ConfigurationManager . 
AppSettings [ \"mySecret\" ]; } When running in Azure, ConfigurationManager will load these settings from the process environment. We don't need to upload secrets files to the server or change any code. Node Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables require('dotenv').config() let mySecret = process.env(\"MY_SECRET\") Python Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables import os from dotenv import load_dotenv load_dotenv () my_secret = os . getenv ( 'MY_SECRET' ) Another good library for reading environment variables is environs from environs import Env env = Env () env . read_env () my_secret = os . environ [ \"MY_SECRET\" ] Databricks Databricks has the option of using dbutils as a secure way to retrieve credentials and not reveal them within the notebooks running on Databricks The following steps lay out a clear pathway to creating new secrets and then utilizing them within a notebook on Databricks: Install and configure the Databricks CLI on your local machine Get the Databricks personal access token Create a scope for the secrets Create secrets Validation Automated credential scanning can be performed on the code regardless of the programming language.","title":"Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#secrets-management","text":"Secret management refers to the tools and practices used to manage digital authentication credentials (like API keys, tokens, passwords, and certificates). These secrets are used to protect access to sensitive data and services, making their management critical for security. We should assume any repo we work on may go public at any time and protect our secrets, even if the repo is initially private.","title":"Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#importance-of-secrets-management","text":"In modern software development, applications often need to interact with other software components, APIs, and services. These interactions often require authentication, which is typically handled using secrets. If these secrets are not managed properly, they can be exposed, leading to potential security breaches.","title":"Importance of Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#best-practices-for-secrets-management","text":"Centralized Secret Storage: Store all secrets in a centralized, encrypted location. This reduces the risk of secrets being lost or exposed. Access Control: Implement strict access control policies. Only authorized entities should have access to secrets. Rotation of Secrets: Regularly change secrets to reduce the risk if a secret is compromised. Audit Trails: Keep a record of when and who accessed which secret. This can help in identifying suspicious activities. Automated Secret Management: Automate the processes of secret creation, rotation, and deletion. This reduces the risk of human error. Remember, the goal of secret management is to protect sensitive information from unauthorized access and potential security threats.","title":"Best Practices for Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#general-approach","text":"The general approach is to keep secrets in separate configuration files that are not checked in to the repo. Add the files to the .gitignore to prevent that they're checked in. 
Each developer maintains their own local version of the file or, if required, circulate them via private channels e.g. a Teams chat. In a production system, assuming Azure, create the secrets in the environment of the running process. We can do this by manually editing the 'Applications Settings' section of the resource, but a script using the Azure CLI to do the same is a useful time-saving utility. See az webapp config appsettings for more details. It's best practice to maintain separate secrets configurations for each environment that you run. e.g. dev, test, prod, local etc The secrets-per-branch recipe describes a simple way to manage separate secrets configurations for each environment. Note: even if the secret was only pushed to a feature branch and never merged, it's still a part of the git history. Follow these instructions to remove any sensitive data and/or regenerate any keys and other sensitive information added to the repo. If a key or secret made it into the code base, rotate the key/secret so that it's no longer active","title":"General Approach"},{"location":"CI-CD/dev-sec-ops/secrets-management/#keeping-secrets-secret","text":"The care taken to protect our secrets applies both to how we get and store them, but also to how we use them. Don't log secrets Don't put them in reporting Don't send them to other applications, as part of URLs, forms, or in any other way other than to make a request to the service that requires that secret","title":"Keeping Secrets Secret"},{"location":"CI-CD/dev-sec-ops/secrets-management/#enhanced-security-applications","text":"The techniques outlined below provide good security and a common pattern for a wide range of languages. They rely on the fact that Azure keeps application settings (the environment) encrypted until your app runs. They do not prevent secrets from existing in plaintext in memory at runtime. In particular, for garbage collected languages those values may exist for longer than the lifetime of the variable, and may be visible when debugging a memory dump of the process. If you are working on an application with enhanced security requirements you should consider using additional techniques to maintain encryption on secrets throughout the application lifetime. Always rotate encryption keys on a regular basis.","title":"Enhanced-Security Applications"},{"location":"CI-CD/dev-sec-ops/secrets-management/#techniques-for-secrets-management","text":"These techniques make the loading of secrets transparent to the developer.","title":"Techniques for Secrets Management"},{"location":"CI-CD/dev-sec-ops/secrets-management/#cnet","text":"","title":"C#/.NET"},{"location":"CI-CD/dev-sec-ops/secrets-management/#modern-net-solution","text":"For .NET SDK (version 2.0 or higher) we have dotnet secrets , a tool provided by the .NET SDK that allows you to manage and protect sensitive information, such as API keys, connection strings, and other secrets, during development. The secrets are stored securely on your machine and can be accessed by your .NET applications. # Initialize dotnet secret dotnet user-secrets init # Adding secret # dotnet user-secrets set <KEY> <VALUE> dotnet user-secrets set ExternalServiceApiKey my-api-key-12345 # Update Secret dotnet user-secrets set ExternalServiceApiKey updated-api-key-67890 To access the secrets; using Microsoft.Extensions.Configuration ; var builder = new ConfigurationBuilder () . AddUserSecrets < Startup > (); var configuration = builder . 
Build (); var externalServiceApiKey = configuration [ \"ExternalServiceApiKey\" ];","title":"Modern .NET Solution"},{"location":"CI-CD/dev-sec-ops/secrets-management/#deployment-considerations","text":"When deploying your application to production, it's essential to ensure that your secrets are securely managed. Here are some deployment-related implications: Remove Development Secrets: Before deploying to production, remove any development secrets from your application configuration. You can use environment variables or a more secure secret management solution like Azure Key Vault or AWS Secrets Manager in production. Secure Deployment: Ensure that your production server is secure, and access to secrets is controlled. Never store secrets directly in source code or configuration files. Key Rotation: Consider implementing a secret rotation policy to regularly update your secrets in production.","title":"Deployment Considerations"},{"location":"CI-CD/dev-sec-ops/secrets-management/#net-framework-solution","text":"Use the file attribute of the appSettings element to load secrets from a local file. <?xml version=\"1.0\" encoding=\"utf-8\"?> <configuration> <appSettings file= \"..\\..\\secrets.config\" > \u2026 </appSettings> <startup> <supportedRuntime version= \"v4.0\" sku= \".NETFramework,Version=v4.6.1\" /> </startup> \u2026 </configuration> Access secrets: static void Main ( string [] args ) { String mySecret = System . Configuration . ConfigurationManager . AppSettings [ \"mySecret\" ]; } When running in Azure, ConfigurationManager will load these settings from the process environment. We don't need to upload secrets files to the server or change any code.","title":".NET Framework Solution"},{"location":"CI-CD/dev-sec-ops/secrets-management/#node","text":"Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables require('dotenv').config() let mySecret = process.env(\"MY_SECRET\")","title":"Node"},{"location":"CI-CD/dev-sec-ops/secrets-management/#python","text":"Store secrets in environment variables or in a .env file $ cat .env MY_SECRET = mySecret Use the dotenv package to load and access environment variables import os from dotenv import load_dotenv load_dotenv () my_secret = os . getenv ( 'MY_SECRET' ) Another good library for reading environment variables is environs from environs import Env env = Env () env . read_env () my_secret = os . environ [ \"MY_SECRET\" ]","title":"Python"},{"location":"CI-CD/dev-sec-ops/secrets-management/#databricks","text":"Databricks has the option of using dbutils as a secure way to retrieve credentials and not reveal them within the notebooks running on Databricks The following steps lay out a clear pathway to creating new secrets and then utilizing them within a notebook on Databricks: Install and configure the Databricks CLI on your local machine Get the Databricks personal access token Create a scope for the secrets Create secrets","title":"Databricks"},{"location":"CI-CD/dev-sec-ops/secrets-management/#validation","text":"Automated credential scanning can be performed on the code regardless of the programming language.","title":"Validation"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/","text":"Credential Scanning Credential scanning is the practice of automatically inspecting a project to ensure that no secrets are included in the project's source code. 
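For the Databricks steps listed above, a hedged sketch using the Databricks CLI might look like this (the scope and key names are placeholders, and the exact flags can differ between CLI versions):

```bash
# Hypothetical walk-through of the Databricks steps above.
# Scope and key names are placeholders; verify the flags against your CLI version.

# configure the CLI with your workspace URL and personal access token
databricks configure --token

# create a scope to group related secrets
databricks secrets create-scope --scope my-project-scope

# add a secret to the scope (the CLI prompts for the secret value)
databricks secrets put --scope my-project-scope --key storage-connection-string

# a notebook can then read the secret without revealing it:
#   dbutils.secrets.get(scope="my-project-scope", key="storage-connection-string")
```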
Secrets include database passwords, storage connection strings, admin logins, service principals, etc. Why Credential Scanning Including secrets in a project's source code is a significant risk, as it might make those secrets available to unwanted parties. Even if it seems that the source code is accessible to the same people who are privy to the secrets, this situation is likely to change as the project grows. Spreading secrets in different places makes them harder to manage, access control, and revoke efficiently. Secrets that are committed to source control are also harder to discard of, since they will persist in the source's history. Another consideration is that coupling the project's code to its infrastructure and deployment specifics is limiting and considered a bad practice. From a software design perspective, the code should be independent of the runtime configuration that will be used to run it, and that runtime configuration includes secrets. As such, there should be a clear boundary between code and secrets: secrets should be managed outside of the source code and credential scanning should be employed to ensure that this boundary is never violated. Applying Credential Scanning Ideally, credential scanning should be run as part of a developer's workflow (e.g. via a git pre-commit hook ), however, to protect against developer error, credential scanning must also be enforced as part of the continuous integration process to ensure that no credentials ever get merged to a project's main branch. To implement credential scanning for a project, consider the following: Store secrets in an external secure store that is meant to store sensitive information Use secrets scanning tools to asses your repositories current state by scanning it's full history for secrets Incorporate an automated secrets scanning tool into your CI pipeline to detect unintentional committing of secrets Avoid git add . commands on git Add sensitive files to .gitignore Credential Scanning Frameworks and Tools Recipes and Scenarios - detect-secrets is an aptly named module for detecting secrets within a code base. Use detect-secrets inside Azure DevOps Pipeline Microsoft Security Code Analysis extension Additional Tools - CodeQL \u2013 GitHub security. CodeQL lets you query code as if it was data. Write a query to find all variants of a vulnerability Git-secrets - Prevents you from committing passwords and other sensitive information to a git repository. Conclusion Secret management is essential to every project. Storing secrets in external secrets store and incorporating this mindset into your workflow will improve your security posture and will result in cleaner code.","title":"Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#credential-scanning","text":"Credential scanning is the practice of automatically inspecting a project to ensure that no secrets are included in the project's source code. Secrets include database passwords, storage connection strings, admin logins, service principals, etc.","title":"Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#why-credential-scanning","text":"Including secrets in a project's source code is a significant risk, as it might make those secrets available to unwanted parties. Even if it seems that the source code is accessible to the same people who are privy to the secrets, this situation is likely to change as the project grows. 
Spreading secrets in different places makes them harder to manage, access control, and revoke efficiently. Secrets that are committed to source control are also harder to dispose of, since they will persist in the source's history. Another consideration is that coupling the project's code to its infrastructure and deployment specifics is limiting and considered a bad practice. From a software design perspective, the code should be independent of the runtime configuration that will be used to run it, and that runtime configuration includes secrets. As such, there should be a clear boundary between code and secrets: secrets should be managed outside of the source code and credential scanning should be employed to ensure that this boundary is never violated.","title":"Why Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#applying-credential-scanning","text":"Ideally, credential scanning should be run as part of a developer's workflow (e.g. via a git pre-commit hook ), however, to protect against developer error, credential scanning must also be enforced as part of the continuous integration process to ensure that no credentials ever get merged to a project's main branch. To implement credential scanning for a project, consider the following: Store secrets in an external secure store that is meant to store sensitive information Use secrets scanning tools to assess your repository's current state by scanning its full history for secrets Incorporate an automated secrets scanning tool into your CI pipeline to detect unintentional committing of secrets Avoid git add . commands on git Add sensitive files to .gitignore","title":"Applying Credential Scanning"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#credential-scanning-frameworks-and-tools","text":"Recipes and Scenarios - detect-secrets is an aptly named module for detecting secrets within a code base. Use detect-secrets inside Azure DevOps Pipeline Microsoft Security Code Analysis extension Additional Tools - CodeQL \u2013 GitHub security. CodeQL lets you query code as if it was data. Write a query to find all variants of a vulnerability Git-secrets - Prevents you from committing passwords and other sensitive information to a git repository.","title":"Credential Scanning Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/secrets-management/credential_scanning/#conclusion","text":"Secret management is essential to every project. Storing secrets in an external secrets store and incorporating this mindset into your workflow will improve your security posture and will result in cleaner code.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/","text":"Secrets Rotation Secret rotation is the process of refreshing the secrets that are used by the application. The best way to authenticate to Azure services is by using a managed identity, but there are some scenarios where that isn't an option. In those cases, access keys or secrets are used. You should periodically rotate access keys or secrets. Why Secrets Rotation Secrets are an asset and as such have the potential to be leaked or stolen. By rotating the secrets, we are revoking any secrets that may have been compromised. Therefore, secrets should be rotated frequently. Managed Identity Azure Managed identities are automatically issued by Azure in order to identify individual resources, and can be used for authentication in place of secrets and passwords.
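As a hedged sketch of what this looks like in practice, the commands below enable a system-assigned identity on a web app and grant it read access to a Key Vault (all names are placeholders, and the access-policy model shown is only one of the possible permission models):

```bash
# Hypothetical sketch: enable a system-assigned managed identity and let it
# read secrets from a Key Vault. All resource names are placeholders.
PRINCIPAL_ID=$(az webapp identity assign \
  --resource-group my-resource-group \
  --name my-web-app \
  --query principalId --output tsv)

# grant the identity permission to read secrets (access-policy model)
az keyvault set-policy \
  --name my-key-vault \
  --object-id "$PRINCIPAL_ID" \
  --secret-permissions get list
```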
The appeal of using Managed Identities is the elimination of the management of secrets and credentials. They are not required on developers' machines or checked into source control, and they don't need to be rotated. Managed identities are considered safer than the alternatives and are the recommended choice. Applying Secrets Rotation If Azure Managed Identity can't be used, this and the following sections explain how rotation of secrets can be achieved: To promote frequent rotation of a secret - define an automated periodic secret rotation process. The secret rotation process might result in downtime when the application is restarted to introduce the new secret. A common solution for that is to have two versions of the secret available, also referred to as Blue/Green Secret rotation. By having a second secret at hand, we can start a second instance of the application with that secret before the previous secret is revoked, thus avoiding any downtime. Secrets Rotation Frameworks and Tools For rotation of a secret for resources that use one set of authentication credentials click here For rotation of a secret for resources that have two sets of authentication credentials click here Conclusion Refreshing secrets is important to ensure that your secret stays a secret without causing downtime to your application.","title":"Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#secrets-rotation","text":"Secret rotation is the process of refreshing the secrets that are used by the application. The best way to authenticate to Azure services is by using a managed identity, but there are some scenarios where that isn't an option. In those cases, access keys or secrets are used. You should periodically rotate access keys or secrets.","title":"Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#why-secrets-rotation","text":"Secrets are an asset and as such have the potential to be leaked or stolen. By rotating the secrets, we are revoking any secrets that may have been compromised. Therefore, secrets should be rotated frequently.","title":"Why Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#managed-identity","text":"Azure Managed identities are automatically issued by Azure in order to identify individual resources, and can be used for authentication in place of secrets and passwords. The appeal of using Managed Identities is the elimination of the management of secrets and credentials. They are not required on developers' machines or checked into source control, and they don't need to be rotated. Managed identities are considered safer than the alternatives and are the recommended choice.","title":"Managed Identity"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#applying-secrets-rotation","text":"If Azure Managed Identity can't be used, this and the following sections explain how rotation of secrets can be achieved: To promote frequent rotation of a secret - define an automated periodic secret rotation process. The secret rotation process might result in downtime when the application is restarted to introduce the new secret. A common solution for that is to have two versions of the secret available, also referred to as Blue/Green Secret rotation.
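A hedged sketch of that Blue/Green idea for a resource that exposes two sets of credentials, using an Azure Storage account purely as an illustration (all names are placeholders):

```bash
# Hypothetical Blue/Green-style rotation for a resource with two keys.
# All resource names and setting names are placeholders.

# 1. point the application at key2 (the "green" credential)
GREEN_KEY=$(az storage account keys list \
  --resource-group my-resource-group \
  --account-name mystorageaccount \
  --query "[?keyName=='key2'].value" --output tsv)
az webapp config appsettings set \
  --resource-group my-resource-group \
  --name my-web-app \
  --settings STORAGE_KEY="$GREEN_KEY"

# 2. once the app is healthy on key2, regenerate key1 (the "blue" credential)
az storage account keys renew \
  --resource-group my-resource-group \
  --account-name mystorageaccount \
  --key key1
```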
By having a second secret at hand, we can start a second instance of the application with that secret before the previous secret is revoked, thus avoiding any downtime.","title":"Applying Secrets Rotation"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#secrets-rotation-frameworks-and-tools","text":"For rotation of a secret for resources that use one set of authentication credentials click here For rotation of a secret for resources that have two sets of authentication credentials click here","title":"Secrets Rotation Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/secrets-management/secrets_rotation/#conclusion","text":"Refreshing secrets is important to ensure that your secret stays a secret without causing downtime to your application.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/","text":"Static Code Analysis Static code analysis is a method of detecting security issues by examining the source code of the application. Why Static Code Analysis Compared to code reviews, static code analysis tools are faster, more accurate, and more thorough. As they operate on the source code itself, they are a very early indicator of issues, and coding errors found earlier are less costly to fix. Applying Static Code Analysis Static Code Analysis should be integrated into your build process. There are many tools available for Static Code Analysis, choose the ones that match your programming language and development practices. Static Code Analysis Frameworks and Tools SonarCloud - static code analysis delivered as a cloud-based software-as-a-service product. OWASP Source code Analysis - OWASP recommendations for source code analysis tools Conclusion Static code analysis is essential to identify potential problems and security issues in the code. It allows you to detect bugs and security issues at an early stage.","title":"Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#static-code-analysis","text":"Static code analysis is a method of detecting security issues by examining the source code of the application.","title":"Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#why-static-code-analysis","text":"Compared to code reviews, static code analysis tools are faster, more accurate, and more thorough. As they operate on the source code itself, they are a very early indicator of issues, and coding errors found earlier are less costly to fix.","title":"Why Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#applying-static-code-analysis","text":"Static Code Analysis should be integrated into your build process. There are many tools available for Static Code Analysis, choose the ones that match your programming language and development practices.","title":"Applying Static Code Analysis"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#static-code-analysis-frameworks-and-tools","text":"SonarCloud - static code analysis delivered as a cloud-based software-as-a-service product. OWASP Source code Analysis - OWASP recommendations for source code analysis tools","title":"Static Code Analysis Frameworks and Tools"},{"location":"CI-CD/dev-sec-ops/secrets-management/static-code-analysis/#conclusion","text":"Static code analysis is essential to identify potential problems and security issues in the code.
It allows you to detect bugs and security issues at an early stage.","title":"Conclusion"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/","text":"Running detect-secrets in Azure DevOps Pipelines Overview In this article, you can find information on how to integrate YELP detect-secrets into your Azure DevOps Pipeline. The proposed code can be part of the classic CI process or (preferred way) build validation for PRs before merging to the main branch. Azure DevOps Pipeline Proposed Azure DevOps Pipeline contains multiple steps described below: Set Python 3 as default Install detect-secrets using pip Run detect-secrets tool Publish results in the Pipeline Artifact Note: It's an optional step, but for future investigation .json file with results may be helpful. Analyzing detect-secrets results Note: This step does a simple analysis of the .json file. If any secret has been detected, then break the build with exit code 1. Note: The below example has 2 jobs: for Linux and Windows agents. You do not have to use both jobs - just adjust the pipeline to your needs. Note: Windows example does not use the latest version of detect-secrets. It is related to the bug in the detect-secret tool (see more in Issue#452 ). It is highly recommended to monitor the fix for the issue and use the latest version if possible by removing version tag ==1.0.3 in the pip install command. trigger : - none jobs : - job : ubuntu displayName : \"detect-secrets on Ubuntu Linux agent\" pool : vmImage : ubuntu-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - bash : pip install detect-secrets displayName : \"Install detect-secrets using pip\" - bash : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins --exclude-files FETCH_HEAD > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-ubuntu\" publishLocation : \"pipeline\" - bash : | dsjson=$(cat $(Pipeline.Workspace)/detect-secrets.json) echo \"${dsjson}\" count=$(echo \"${dsjson}\" | jq -c -r '.results | length') if [ $count -gt 0 ]; then msg=\"Secrets were detected in code. 
${count} file(s) affected.\" echo \"##vso[task.logissue type=error]${msg}\" echo \"##vso[task.complete result=Failed;]${msg}.\" else echo \"##vso[task.complete result=Succeeded;]No secrets detected.\" fi displayName : \"Analyzing detect-secrets results\" - job : windows displayName : \"detect-secrets on Windows agent\" pool : vmImage : windows-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - script : pip install detect-secrets==1.0.3 displayName : \"Install detect-secrets using pip\" - script : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-windows\" publishLocation : \"pipeline\" - pwsh : | $dsjson = Get-Content $(Pipeline.Workspace)/detect-secrets.json Write-Output $dsjson $dsObj = $dsjson | ConvertFrom-Json $count = ($dsObj.results | Get-Member -MemberType NoteProperty).Count if ($count -gt 0) { $msg = \"Secrets were detected in code. $count file(s) affected. \" Write-Host \"##vso[task.logissue type=error]$msg\" Write-Host \"##vso[task.complete result=Failed;]$msg\" } else { Write-Host \"##vso[task.complete result=Succeeded;]No secrets detected.\" } displayName : \"Analyzing detect-secrets results\"","title":"Running `detect-secrets` in Azure DevOps Pipelines"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/#running-detect-secrets-in-azure-devops-pipelines","text":"","title":"Running detect-secrets in Azure DevOps Pipelines"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/#overview","text":"In this article, you can find information on how to integrate YELP detect-secrets into your Azure DevOps Pipeline. The proposed code can be part of the classic CI process or (preferred way) build validation for PRs before merging to the main branch.","title":"Overview"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/#azure-devops-pipeline","text":"Proposed Azure DevOps Pipeline contains multiple steps described below: Set Python 3 as default Install detect-secrets using pip Run detect-secrets tool Publish results in the Pipeline Artifact Note: It's an optional step, but for future investigation .json file with results may be helpful. Analyzing detect-secrets results Note: This step does a simple analysis of the .json file. If any secret has been detected, then break the build with exit code 1. Note: The below example has 2 jobs: for Linux and Windows agents. You do not have to use both jobs - just adjust the pipeline to your needs. Note: Windows example does not use the latest version of detect-secrets. It is related to the bug in the detect-secret tool (see more in Issue#452 ). It is highly recommended to monitor the fix for the issue and use the latest version if possible by removing version tag ==1.0.3 in the pip install command. 
trigger : - none jobs : - job : ubuntu displayName : \"detect-secrets on Ubuntu Linux agent\" pool : vmImage : ubuntu-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - bash : pip install detect-secrets displayName : \"Install detect-secrets using pip\" - bash : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins --exclude-files FETCH_HEAD > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-ubuntu\" publishLocation : \"pipeline\" - bash : | dsjson=$(cat $(Pipeline.Workspace)/detect-secrets.json) echo \"${dsjson}\" count=$(echo \"${dsjson}\" | jq -c -r '.results | length') if [ $count -gt 0 ]; then msg=\"Secrets were detected in code. ${count} file(s) affected.\" echo \"##vso[task.logissue type=error]${msg}\" echo \"##vso[task.complete result=Failed;]${msg}.\" else echo \"##vso[task.complete result=Succeeded;]No secrets detected.\" fi displayName : \"Analyzing detect-secrets results\" - job : windows displayName : \"detect-secrets on Windows agent\" pool : vmImage : windows-latest steps : - task : UsePythonVersion@0 displayName : \"Set Python 3 as default\" inputs : versionSpec : \"3\" addToPath : true architecture : \"x64\" - script : pip install detect-secrets==1.0.3 displayName : \"Install detect-secrets using pip\" - script : | detect-secrets --version detect-secrets scan --all-files --force-use-all-plugins > $(Pipeline.Workspace)/detect-secrets.json displayName : \"Run detect-secrets tool\" - task : PublishPipelineArtifact@1 displayName : \"Publish results in the Pipeline Artifact\" inputs : targetPath : \"$(Pipeline.Workspace)/detect-secrets.json\" artifact : \"detect-secrets-windows\" publishLocation : \"pipeline\" - pwsh : | $dsjson = Get-Content $(Pipeline.Workspace)/detect-secrets.json Write-Output $dsjson $dsObj = $dsjson | ConvertFrom-Json $count = ($dsObj.results | Get-Member -MemberType NoteProperty).Count if ($count -gt 0) { $msg = \"Secrets were detected in code. $count file(s) affected. \" Write-Host \"##vso[task.logissue type=error]$msg\" Write-Host \"##vso[task.complete result=Failed;]$msg\" } else { Write-Host \"##vso[task.complete result=Succeeded;]No secrets detected.\" } displayName : \"Analyzing detect-secrets results\"","title":"Azure DevOps Pipeline"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/","text":"Credential Scanning Tool: detect-secrets Background The detect-secrets tool is an open source project that uses heuristics and rules to scan for a wide range of secrets. We can extend the tool with custom rules and heuristics via a simple Python plugin API . Unlike other credential scanning tools, detect-secrets does not attempt to check a project's entire git history when invoked, but instead scans the project's current state. This means that the tool runs quickly which makes it ideal for use in continuous integration pipelines. detect-secrets employs the concept of a \"baseline file\", i.e. a list of known secrets already present in the repository, and we can configure it to ignore any of these pre-existing secrets when running. This makes it easy to gradually introduce the tool into a pre-existing project. 
The baseline file also provides a simple and convenient way of handling false positives. We can white-list the false positive in the baseline file to ignore it on future invocations of the tool. Setup # install system dependencies: diff, jq, python3 (if on Linux-based OS) apt-get install -y diffutils jq python3 python3-pip # install system dependencies: diff, jq, python3 (if on Windows) winget install Python.Python.3 choco install diffutils jq -y # install the detect-secrets tool python3 -m pip install detect-secrets # run the tool to establish a list of known secrets # review this file thoroughly and check it into the repository detect-secrets scan > .secrets.baseline Pre-Commit Hook It is recommended to use detect-secrets in your development environment as a Git pre-commit hook. First, follow the pre-commit installation instructions to install the tool in your development environment. Then, add the following to your .pre-commit-config.yaml : repos : - repo : https://github.com/Yelp/detect-secrets rev : v1.4.0 hooks : - id : detect-secrets args : [ '--baseline' , '.secrets.baseline' ] Usage in CI Pipelines # backup the list of known secrets cp .secrets.baseline .secrets.new # find all the secrets in the repository detect-secrets scan --baseline .secrets.new $( find . -type f ! -name '.secrets.*' ! -path '*/.git*' ) # if there is any difference between the known and newly detected secrets, break the build list_secrets () { jq -r '.results | keys[] as $key | \"\\($key),\\(.[$key] | .[] | .hashed_secret)\"' \" $1 \" | sort ; } if ! diff < ( list_secrets .secrets.baseline ) < ( list_secrets .secrets.new ) > & 2 ; then echo \"Detected new secrets in the repo\" > & 2 exit 1 fi","title":"Credential Scanning Tool: `detect-secrets`"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#credential-scanning-tool-detect-secrets","text":"","title":"Credential Scanning Tool: detect-secrets"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#background","text":"The detect-secrets tool is an open source project that uses heuristics and rules to scan for a wide range of secrets. We can extend the tool with custom rules and heuristics via a simple Python plugin API . Unlike other credential scanning tools, detect-secrets does not attempt to check a project's entire git history when invoked, but instead scans the project's current state. This means that the tool runs quickly which makes it ideal for use in continuous integration pipelines. detect-secrets employs the concept of a \"baseline file\", i.e. a list of known secrets already present in the repository, and we can configure it to ignore any of these pre-existing secrets when running. This makes it easy to gradually introduce the tool into a pre-existing project. The baseline file also provides a simple and convenient way of handling false positives. 
We can white-list the false positive in the baseline file to ignore it on future invocations of the tool.","title":"Background"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#setup","text":"# install system dependencies: diff, jq, python3 (if on Linux-based OS) apt-get install -y diffutils jq python3 python3-pip # install system dependencies: diff, jq, python3 (if on Windows) winget install Python.Python.3 choco install diffutils jq -y # install the detect-secrets tool python3 -m pip install detect-secrets # run the tool to establish a list of known secrets # review this file thoroughly and check it into the repository detect-secrets scan > .secrets.baseline","title":"Setup"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#pre-commit-hook","text":"It is recommended to use detect-secrets in your development environment as a Git pre-commit hook. First, follow the pre-commit installation instructions to install the tool in your development environment. Then, add the following to your .pre-commit-config.yaml : repos : - repo : https://github.com/Yelp/detect-secrets rev : v1.4.0 hooks : - id : detect-secrets args : [ '--baseline' , '.secrets.baseline' ]","title":"Pre-Commit Hook"},{"location":"CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/#usage-in-ci-pipelines","text":"# backup the list of known secrets cp .secrets.baseline .secrets.new # find all the secrets in the repository detect-secrets scan --baseline .secrets.new $( find . -type f ! -name '.secrets.*' ! -path '*/.git*' ) # if there is any difference between the known and newly detected secrets, break the build list_secrets () { jq -r '.results | keys[] as $key | \"\\($key),\\(.[$key] | .[] | .hashed_secret)\"' \" $1 \" | sort ; } if ! diff < ( list_secrets .secrets.baseline ) < ( list_secrets .secrets.new ) > & 2 ; then echo \"Detected new secrets in the repo\" > & 2 exit 1 fi","title":"Usage in CI Pipelines"},{"location":"CI-CD/gitops/deploying-with-gitops/","text":"Deploying with GitOps What is GitOps? \"GitOps is an operational framework that takes DevOps best practices used for application development such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation.\" See GitLab: What is GitOps? . Why should I use GitOps? GitOps simply allows faster deployments by having git repositories in the center offering a clear audit trail via git commits and no direct environment access. Read more on Why should I use GitOps? The below diagram compares traditional CI/CD vs GitOps workflow: Tools for GitOps Some popular GitOps frameworks for Kubernetes backed by CNCF community: Flux V2 Argo CD Rancher Fleet Deploying Using GitOps GitOps with Flux v2 can be enabled in Azure Kubernetes Service (AKS) managed clusters or Azure Arc-enabled Kubernetes connected clusters as a cluster extension. After the microsoft.flux cluster extension is installed, you can create one or more fluxConfigurations resources that sync your Git repository sources to the cluster and reconcile the cluster to the desired state. With GitOps, you can use your Git repository as the source of truth for cluster configuration and application deployment. 
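A hedged sketch of enabling this on an AKS cluster with the Azure CLI (cluster, repository, and path names are placeholders; the tutorials listed next walk through the full setup):

```bash
# Hypothetical sketch: create a fluxConfiguration that reconciles a Git
# repository into an AKS cluster. This also installs the microsoft.flux
# extension if it is not already present. All names and paths are placeholders.
az k8s-configuration flux create \
  --resource-group my-resource-group \
  --cluster-name my-aks-cluster \
  --cluster-type managedClusters \
  --name cluster-config \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/my-org/my-gitops-repo \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true
```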
Tutorial: Deploy configurations using GitOps on an Azure Arc-enabled Kubernetes cluster Tutorial: Implement CI/CD with GitOps Multi-cluster and multi-tenant environment with Flux v2","title":"Deploying with GitOps"},{"location":"CI-CD/gitops/deploying-with-gitops/#deploying-with-gitops","text":"","title":"Deploying with GitOps"},{"location":"CI-CD/gitops/deploying-with-gitops/#what-is-gitops","text":"\"GitOps is an operational framework that takes DevOps best practices used for application development such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation.\" See GitLab: What is GitOps? .","title":"What is GitOps?"},{"location":"CI-CD/gitops/deploying-with-gitops/#why-should-i-use-gitops","text":"GitOps simply allows faster deployments by having git repositories in the center offering a clear audit trail via git commits and no direct environment access. Read more on Why should I use GitOps? The below diagram compares traditional CI/CD vs GitOps workflow:","title":"Why should I use GitOps?"},{"location":"CI-CD/gitops/deploying-with-gitops/#tools-for-gitops","text":"Some popular GitOps frameworks for Kubernetes backed by CNCF community: Flux V2 Argo CD Rancher Fleet","title":"Tools for GitOps"},{"location":"CI-CD/gitops/deploying-with-gitops/#deploying-using-gitops","text":"GitOps with Flux v2 can be enabled in Azure Kubernetes Service (AKS) managed clusters or Azure Arc-enabled Kubernetes connected clusters as a cluster extension. After the microsoft.flux cluster extension is installed, you can create one or more fluxConfigurations resources that sync your Git repository sources to the cluster and reconcile the cluster to the desired state. With GitOps, you can use your Git repository as the source of truth for cluster configuration and application deployment. Tutorial: Deploy configurations using GitOps on an Azure Arc-enabled Kubernetes cluster Tutorial: Implement CI/CD with GitOps Multi-cluster and multi-tenant environment with Flux v2","title":"Deploying Using GitOps"},{"location":"CI-CD/gitops/github-workflows/","text":"GitHub Workflows A workflow is a configurable automated process made up of one or more jobs where each of these jobs can be an action in GitHub. Currently, a YAML file format is supported for defining a workflow in GitHub. Additional information on GitHub actions and GitHub Workflows in the links posted in the resources section below. Workflow per Environment The general approach is to have one pipeline, where the code is built, tested and deployed, and the artifact is then promoted to the next environment, eventually to be deployed into production. There are multiple ways in GitHub that an environment setup can be achieved. One way it can be done is to have one workflow for multiple environments, but the complexity increases as additional processes and jobs are added to a workflow, which does not mean it cannot be done for small pipelines. The plus point of having one workflow is that, when an artifact flows from one environment to another the state and environment values between the deployment environments can be passed easily. One way to get around the complexity of a single workflow is to have separate workflows for different environments, making sure that only the artifacts created and validated are promoted from one environment to another, as well as, the workflow is small enough, to debug any issues seen in any of the workflows. 
In this case, the state and environment values need to be passed from one deployment environment to another. Multiple workflows also helps to keep the deployments to the environments independent thus reducing the time to deploy and find issues earlier than later in the process. Also, since the environments are independent of each other, any failures in deploying to one environment does not block deployments to other environments. One tradeoff in this method, is that with different workflows for each environment, the maintenance increases as the complexity of workflows increase over time. Resources GitHub Actions GitHub Workflows","title":"GitHub Workflows"},{"location":"CI-CD/gitops/github-workflows/#github-workflows","text":"A workflow is a configurable automated process made up of one or more jobs where each of these jobs can be an action in GitHub. Currently, a YAML file format is supported for defining a workflow in GitHub. Additional information on GitHub actions and GitHub Workflows in the links posted in the resources section below.","title":"GitHub Workflows"},{"location":"CI-CD/gitops/github-workflows/#workflow-per-environment","text":"The general approach is to have one pipeline, where the code is built, tested and deployed, and the artifact is then promoted to the next environment, eventually to be deployed into production. There are multiple ways in GitHub that an environment setup can be achieved. One way it can be done is to have one workflow for multiple environments, but the complexity increases as additional processes and jobs are added to a workflow, which does not mean it cannot be done for small pipelines. The plus point of having one workflow is that, when an artifact flows from one environment to another the state and environment values between the deployment environments can be passed easily. One way to get around the complexity of a single workflow is to have separate workflows for different environments, making sure that only the artifacts created and validated are promoted from one environment to another, as well as, the workflow is small enough, to debug any issues seen in any of the workflows. In this case, the state and environment values need to be passed from one deployment environment to another. Multiple workflows also helps to keep the deployments to the environments independent thus reducing the time to deploy and find issues earlier than later in the process. Also, since the environments are independent of each other, any failures in deploying to one environment does not block deployments to other environments. One tradeoff in this method, is that with different workflows for each environment, the maintenance increases as the complexity of workflows increase over time.","title":"Workflow per Environment"},{"location":"CI-CD/gitops/github-workflows/#resources","text":"GitHub Actions GitHub Workflows","title":"Resources"},{"location":"CI-CD/gitops/secret-management/","text":"Secrets Management with GitOps GitOps projects have git repositories in the center that are considered a source of truth for managing both infrastructure and application. This infrastructure and application will require secured access to other resources of the system through secrets. Committing clear-text secrets into git repositories is unacceptable even if the repositories are private to your team and organization. Teams need a secure way to handle secrets when using GitOps. 
There are many ways to manage secrets with GitOps and at high level can be categorized into: Encrypted secrets in git repositories Reference to secrets stored in the external key vault TLDR : Referencing secrets in an external key vault is the recommended approach. It is easier to orchestrate secret rotation and more scalable with multiple clusters and/or teams. Encrypted Secrets in Git Repositories In this approach, Developers manually encrypt secrets using a public key, and the key can only be decrypted by the custom Kubernetes controller running in the target cluster. Some popular tools for his approach are Bitnami Sealed Secrets , Mozilla SOPS All the secret encryption tools share the following: Secret changes are managed by making changes within the GitOps repository which provides great traceability All secrets can be rotated by making changes in GitOps, without accessing the cluster They support fully disconnected gitops scenarios Secrets are stored encrypted in the gitops repository, if the private encryption key is leaked and the attacker has access to the repo, all secrets can be decrypted Bitnami Sealed Secrets Sealed Secrets use asymmetric encryption to encrypt secrets. A Kubernetes controller generates a key-pair (private-public) and stores the private key in the cluster's etcd database as a Kubernetes secret. Developers use Kubeseal CLI to seal secrets before committing to the git repo. Some of the key points of using Sealed Secrets are: Support automatic key rotation for the private key and can be used to enforce re-encryption of secrets Due to automatic renewal of the sealing key , the key needs to be prefetched from the cluster or cluster set up to store the sealing key on renewal in a secondary location Multi-tenancy support at the namespace level can be enforced by the controller When sealing secrets developers need a connection to the cluster control plane to fetch the public key or the public key has to be explicitly shared with the developer If the private key in the cluster is lost for some reason all secrets need to be re-encrypted followed by a new key-pair generation Does not scale with multi-cluster, because every cluster will require a controller having its own key pair Can only encrypt secret resource type The Flux documentation has inconsistences in the Azure Key Vault examples Mozilla SOPS SOPS: Secrets OPerationS is an encryption tool that supports YAML, JSON, ENV, INI, and BINARY formats and encrypts with AWS KMS, GCP KMS, Azure Key Vault, age, and PGP and is not just limited to Kubernetes. It supports integration with some common key management systems including Azure Key Vault, where one or more key management system is used to store the encryption key for encrypting secrets and not the actual secrets. Some of the key points of using SOPS are: Flux has native support for SOPS with cluster-side decryption Provides an added layer of security as the private key used for decryption is protected in an external key vault To use the Helm CLI for encryption the ( Helm Secrets ) plugin is needed Needs ( KSOPS )( kustomize-sopssecretgenerator ) plugin to work with Kustomization Does not scale with larger teams as each developer has to encrypt the secrets The public key is sufficient for creating brand new files. The secret key is required for decrypting and editing existing files because SOPS computes a MAC on all values. 
When using the public key solely to add or remove a field, the whole file should be deleted and recreated Supports several types of keys that can be used in both connected and disconnected state. A secret can have a list of keys and will try do decrypt with all of them. Reference to Secrets Stored in an External Key Vault (Recommended) This approach relies on a key management system like Azure Key Vault to hold the secrets and the git manifest in the repositories has reference to the key vault secrets. Developers do not perform any cryptographic operations with files in repositories. Kubernetes operators running in the target cluster are responsible for pulling the secrets from the key vault and making them available either as Kubernetes secrets or secrets volume mounted to the pod. All the below tools share the following: Secrets are not stored in the repository Supports Prometheus metrics for observability Supports sync with Kubernetes Secrets Supports Linux and Windows containers Provides enterprise-grade external secret management Easily scalable with multi-cluster and larger teams Both solutions support either Azure Active Directory (Azure AD) service principal or managed identity for authentication with the Key Vault . For secret rotation ideas, see Secrets Rotation on Environment Variables and Mounted Secrets For how to authenticate private container registries with a service principal see: Authenticated Private Container Registry Azure Key Vault Provider for Secrets Store CSI Driver Azure Key Vault Provider (AKVP) for Kubernetes secret store CSI Driver allows you to get secret contents stored in an Azure Key Vault instance and use the Secrets Store CSI driver interface to mount them into Kubernetes pods. Mounts secrets/keys/certs to pod using a CSI Inline volume. Azure Key Vault Provider for Secrets Store CSI Driver install guide . CSI driver will need access to Azure Key Vault either through a service principal or managed identity (recommended). To make this access secure you can leverage Azure AD Workload Identity (recommended) or AAD Pod Identity . Please note AAD pod identity will soon be replaced by workload identity. Product Group Links provided for AKVP with SSCSID: 1. Differences between ESO / SSCSID ( GitHub Issue ) 2. Secrets Management on K8S talk here (Native Secrets, Vault.io, and ESO vs. SSCSID) Advantages: Supports pod portability with the SecretProviderClass CRD Supports auto rotation of secrets with customizable sync intervals per cluster . Seems to be the MSFT choice (Secrets Store CSI driver is heavily contributed by MSFT and Kubernetes-SIG) Disadvantages: Missing disconnected scenario support : When the node is offline the SSCSID fails to fetch the secret and thus mounting the volume fails, making scaling and restarting pods not possible while being offline AKVP can only access Key Vault from a non-Azure environment using a service principal The Kubernetes Secret containing the service principal credentials need to be created as a secret in the same namespace as the application pod. If pods in multiple namespaces need to use the same SP to access Key Vault, this Kubernetes Secret needs to be created in each namespace. 
The GitOps repo must contain the name of the Key Vault within the SecretProviderClass Must mount secrets as volumes to allow syncing into Kubernetes Secrets Uses more resources (4 pods; CSI Storage driver and provider) and is a daemonset - not test on RPS / resource usage External Secrets Operator with Azure Key Vault The External Secrets Operator (ESO) is an open-sourced Kubernetes operator that can read secrets from external secret stores (e.g., Azure Key Vault) and sync those into Kubernetes Secrets. In contrast to the CSI Driver, the ESO controller creates the secrets on the cluster as K8s secrets, instead of mounting them as volumes to pods. Docs on using ESO Azure Key vault provider here . ESO will need access to Azure Key Vault either through the use of a service principal or managed identity (via Azure AD Workload Identity (recommended) or AAD Pod Identity ). Advantages: Supports auto rotation of secrets with customizable sync intervals per secret . Components are split into different CRDs for namespace (ExternalSecret, SecretStore) and cluster-wide (ClusterSecretStore, ClusterExternalSecret) making syncing more manageable i.r.t. different deployments/pods etc. Service Principal secret for the (Cluster)SecretStores could placed in a namespaced that only the ESO can access (see Shared ClusterSecretStore ). Resource efficient (single pod) - not test on RPS / resource usage. Open source and high contributions, ( GitHub ) Mounting Secrets as volumes is supported via K8S's APIs (see here ) Partial disconnected scenario support: As ESO is using native K8s secrets the cluster can be offline, and it does not have any implications towards restarting and scaling pods while being offline Disadvantages: The GitOps repo must contain the name of the Key Vault within the SecretStore / ClusterSecretStore or a ConfigMap linking to it Must create secrets as K8s secrets Resources Sealed Secrets with Flux v2 Mozilla SOPS with Flux v2 Secret Management with Argo CD Secret management Workflow Appendix Authenticated Private Container Registry An option on how to authenticate private container registries (e.g., ACR): Use a dockerconfigjson Kubernetes Secret on Pod-Level with ImagePullSecret (This can be also defined on namespace-level )","title":"Secrets Management with GitOps"},{"location":"CI-CD/gitops/secret-management/#secrets-management-with-gitops","text":"GitOps projects have git repositories in the center that are considered a source of truth for managing both infrastructure and application. This infrastructure and application will require secured access to other resources of the system through secrets. Committing clear-text secrets into git repositories is unacceptable even if the repositories are private to your team and organization. Teams need a secure way to handle secrets when using GitOps. There are many ways to manage secrets with GitOps and at high level can be categorized into: Encrypted secrets in git repositories Reference to secrets stored in the external key vault TLDR : Referencing secrets in an external key vault is the recommended approach. It is easier to orchestrate secret rotation and more scalable with multiple clusters and/or teams.","title":"Secrets Management with GitOps"},{"location":"CI-CD/gitops/secret-management/#encrypted-secrets-in-git-repositories","text":"In this approach, Developers manually encrypt secrets using a public key, and the key can only be decrypted by the custom Kubernetes controller running in the target cluster. 
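A hedged sketch of that encrypt-before-commit flow using Bitnami's kubeseal CLI, one of the tools covered in this section (all names are placeholders):

```bash
# Hypothetical sketch: seal a Kubernetes Secret locally so that only the
# controller in the target cluster can decrypt it. All names are placeholders.

# create a plain Secret manifest locally (never commit this file)
kubectl create secret generic db-credentials \
  --namespace my-app \
  --from-literal=password='example-value' \
  --dry-run=client -o yaml > db-credentials.yaml

# encrypt it with the cluster's public sealing key
kubeseal --format yaml < db-credentials.yaml > db-credentials-sealed.yaml

# commit only the sealed manifest to the GitOps repository
rm db-credentials.yaml
git add db-credentials-sealed.yaml
```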
Some popular tools for his approach are Bitnami Sealed Secrets , Mozilla SOPS All the secret encryption tools share the following: Secret changes are managed by making changes within the GitOps repository which provides great traceability All secrets can be rotated by making changes in GitOps, without accessing the cluster They support fully disconnected gitops scenarios Secrets are stored encrypted in the gitops repository, if the private encryption key is leaked and the attacker has access to the repo, all secrets can be decrypted","title":"Encrypted Secrets in Git Repositories"},{"location":"CI-CD/gitops/secret-management/#bitnami-sealed-secrets","text":"Sealed Secrets use asymmetric encryption to encrypt secrets. A Kubernetes controller generates a key-pair (private-public) and stores the private key in the cluster's etcd database as a Kubernetes secret. Developers use Kubeseal CLI to seal secrets before committing to the git repo. Some of the key points of using Sealed Secrets are: Support automatic key rotation for the private key and can be used to enforce re-encryption of secrets Due to automatic renewal of the sealing key , the key needs to be prefetched from the cluster or cluster set up to store the sealing key on renewal in a secondary location Multi-tenancy support at the namespace level can be enforced by the controller When sealing secrets developers need a connection to the cluster control plane to fetch the public key or the public key has to be explicitly shared with the developer If the private key in the cluster is lost for some reason all secrets need to be re-encrypted followed by a new key-pair generation Does not scale with multi-cluster, because every cluster will require a controller having its own key pair Can only encrypt secret resource type The Flux documentation has inconsistences in the Azure Key Vault examples","title":"Bitnami Sealed Secrets"},{"location":"CI-CD/gitops/secret-management/#mozilla-sops","text":"SOPS: Secrets OPerationS is an encryption tool that supports YAML, JSON, ENV, INI, and BINARY formats and encrypts with AWS KMS, GCP KMS, Azure Key Vault, age, and PGP and is not just limited to Kubernetes. It supports integration with some common key management systems including Azure Key Vault, where one or more key management system is used to store the encryption key for encrypting secrets and not the actual secrets. Some of the key points of using SOPS are: Flux has native support for SOPS with cluster-side decryption Provides an added layer of security as the private key used for decryption is protected in an external key vault To use the Helm CLI for encryption the ( Helm Secrets ) plugin is needed Needs ( KSOPS )( kustomize-sopssecretgenerator ) plugin to work with Kustomization Does not scale with larger teams as each developer has to encrypt the secrets The public key is sufficient for creating brand new files. The secret key is required for decrypting and editing existing files because SOPS computes a MAC on all values. When using the public key solely to add or remove a field, the whole file should be deleted and recreated Supports several types of keys that can be used in both connected and disconnected state. 
A secret can have a list of keys, and SOPS will try to decrypt with all of them.","title":"Mozilla SOPS"},{"location":"CI-CD/gitops/secret-management/#reference-to-secrets-stored-in-an-external-key-vault-recommended","text":"This approach relies on a key management system like Azure Key Vault to hold the secrets, while the git manifests in the repositories hold references to the key vault secrets. Developers do not perform any cryptographic operations with files in repositories. Kubernetes operators running in the target cluster are responsible for pulling the secrets from the key vault and making them available either as Kubernetes secrets or as secret volumes mounted to the pod. All the tools below share the following: Secrets are not stored in the repository Supports Prometheus metrics for observability Supports sync with Kubernetes Secrets Supports Linux and Windows containers Provides enterprise-grade external secret management Easily scalable with multi-cluster and larger teams Both solutions support either Azure Active Directory (Azure AD) service principal or managed identity for authentication with the Key Vault . For secret rotation ideas, see Secrets Rotation on Environment Variables and Mounted Secrets For how to authenticate private container registries with a service principal see: Authenticated Private Container Registry","title":"Reference to Secrets Stored in an External Key Vault (Recommended)"},{"location":"CI-CD/gitops/secret-management/#azure-key-vault-provider-for-secrets-store-csi-driver","text":"The Azure Key Vault Provider (AKVP) for the Kubernetes Secrets Store CSI Driver allows you to get secret contents stored in an Azure Key Vault instance and use the Secrets Store CSI driver interface to mount them into Kubernetes pods. Mounts secrets/keys/certs to pods using a CSI inline volume. Azure Key Vault Provider for Secrets Store CSI Driver install guide . The CSI driver will need access to Azure Key Vault either through a service principal or managed identity (recommended). To make this access secure you can leverage Azure AD Workload Identity (recommended) or AAD Pod Identity . Please note AAD Pod Identity will soon be replaced by workload identity. Product Group links provided for AKVP with SSCSID: 1. Differences between ESO / SSCSID ( GitHub Issue ) 2. Secrets Management on K8S talk here (Native Secrets, Vault.io, and ESO vs. SSCSID) Advantages: Supports pod portability with the SecretProviderClass CRD Supports auto rotation of secrets with customizable sync intervals per cluster . Seems to be the MSFT choice (the Secrets Store CSI driver is heavily contributed to by MSFT and Kubernetes-SIG) Disadvantages: Missing disconnected scenario support : when the node is offline, SSCSID fails to fetch the secret and thus mounting the volume fails, making scaling and restarting pods impossible while offline AKVP can only access Key Vault from a non-Azure environment using a service principal The Kubernetes Secret containing the service principal credentials needs to be created in the same namespace as the application pod. If pods in multiple namespaces need to use the same SP to access Key Vault, this Kubernetes Secret needs to be created in each namespace.
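As an illustration of the SecretProviderClass referenced in these trade-offs, here is a minimal sketch for the Azure provider. The Key Vault name and tenant ID are placeholders, and the full parameter list is in the install guide linked above; the class and secret names reuse the akvp-app / poc-creds examples from the secrets-rotation section later in this document.

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: akvp-app                          # referenced by the pod's CSI volume
spec:
  provider: azure
  parameters:
    keyvaultName: "my-kv"                 # placeholder Key Vault name
    tenantId: "00000000-0000-0000-0000-000000000000"
    objects: |
      array:
        - |
          objectName: EventhubConnectionString
          objectType: secret
  # Optional: also sync the mounted object into a native Kubernetes Secret.
  secretObjects:
    - secretName: poc-creds
      type: Opaque
      data:
        - objectName: EventhubConnectionString
          key: EventhubConnectionString
```

Note that the sync into poc-creds only happens while at least one pod mounts the class as a CSI volume, which is the "must mount secrets as volumes" disadvantage noted next.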
The GitOps repo must contain the name of the Key Vault within the SecretProviderClass Must mount secrets as volumes to allow syncing into Kubernetes Secrets Uses more resources (4 pods; CSI Storage driver and provider) and is a daemonset - not test on RPS / resource usage","title":"Azure Key Vault Provider for Secrets Store CSI Driver"},{"location":"CI-CD/gitops/secret-management/#external-secrets-operator-with-azure-key-vault","text":"The External Secrets Operator (ESO) is an open-sourced Kubernetes operator that can read secrets from external secret stores (e.g., Azure Key Vault) and sync those into Kubernetes Secrets. In contrast to the CSI Driver, the ESO controller creates the secrets on the cluster as K8s secrets, instead of mounting them as volumes to pods. Docs on using ESO Azure Key vault provider here . ESO will need access to Azure Key Vault either through the use of a service principal or managed identity (via Azure AD Workload Identity (recommended) or AAD Pod Identity ). Advantages: Supports auto rotation of secrets with customizable sync intervals per secret . Components are split into different CRDs for namespace (ExternalSecret, SecretStore) and cluster-wide (ClusterSecretStore, ClusterExternalSecret) making syncing more manageable i.r.t. different deployments/pods etc. Service Principal secret for the (Cluster)SecretStores could placed in a namespaced that only the ESO can access (see Shared ClusterSecretStore ). Resource efficient (single pod) - not test on RPS / resource usage. Open source and high contributions, ( GitHub ) Mounting Secrets as volumes is supported via K8S's APIs (see here ) Partial disconnected scenario support: As ESO is using native K8s secrets the cluster can be offline, and it does not have any implications towards restarting and scaling pods while being offline Disadvantages: The GitOps repo must contain the name of the Key Vault within the SecretStore / ClusterSecretStore or a ConfigMap linking to it Must create secrets as K8s secrets","title":"External Secrets Operator with Azure Key Vault"},{"location":"CI-CD/gitops/secret-management/#resources","text":"Sealed Secrets with Flux v2 Mozilla SOPS with Flux v2 Secret Management with Argo CD Secret management Workflow","title":"Resources"},{"location":"CI-CD/gitops/secret-management/#appendix","text":"","title":"Appendix"},{"location":"CI-CD/gitops/secret-management/#authenticated-private-container-registry","text":"An option on how to authenticate private container registries (e.g., ACR): Use a dockerconfigjson Kubernetes Secret on Pod-Level with ImagePullSecret (This can be also defined on namespace-level )","title":"Authenticated Private Container Registry"},{"location":"CI-CD/gitops/secret-management/azure-devops-secret-management-per-branch/","text":"Azure DevOps: Managing Settings on a Per-Branch Basis When using Azure DevOps Pipelines for CI/CD, it's convenient to leverage the built-in pipeline variables for secrets management , but using pipeline variables for secrets management has its disadvantages: Pipeline variables are managed outside the code that references them. This makes it easy to introduce drift between the source code and the secrets, e.g. adding a reference to a new secret in code but forgetting to add it to the pipeline variables (leads to confusing build breaks), or deleting a reference to a secret in code and forgetting to remote it from the pipeline variables (leads to confusing pipeline variables). Pipeline variables are global shared state. 
This can lead to confusing situations and hard to debug problems when developers make concurrent changes to the pipeline variables which may override each other. Having a single global set of pipeline variables also makes it impossible for secrets to vary per environment (e.g. when using a branch-based deployment model where 'master' deploys using the production secrets, 'development' deploys using the staging secrets, and so forth). A solution to these limitations is to manage secrets in the Git repository jointly with the project's source code. As described in secrets management , don't check secrets into the repository in plain text. Instead we can add an encrypted version of our secrets to the repository and enable our CI/CD agents and developers to decrypt the secrets for local usage with some pre-shared key. This gives us the best of both worlds: a secure storage for secrets as well as side-by-side management of secrets and code. # first, make sure that we never commit our plain text secrets and generate a strong encryption key echo \".env\" >> .gitignore ENCRYPTION_KEY = \" $( LC_ALL = C < /dev/urandom tr -dc '_A-Z-a-z-0-9' | head -c128 ) \" # now let's add some secret to our .env file echo \"MY_SECRET=...\" >> .env # also update our secrets documentation file cat >> .env.template <<< \" # enter description of your secret here MY_SECRET= \" # next, encrypt the plain text secrets; the resulting .env.enc file can safely be committed to the repository echo \" ${ ENCRYPTION_KEY } \" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env -out .env.enc git add .env.enc .env.template git commit -m \"Update secrets\" When running the CI/CD, the build server can now access the secrets by decrypting them. E.g. for Azure DevOps, configure ENCRYPTION_KEY as a secret pipeline variable and then add the following step to azure-pipelines.yml : steps : - script : echo \"$(ENCRYPTION_KEY)\" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env.enc -out .env -d displayName : Decrypt secrets You can also use variable groups linked directly to Azure key vault for your pipelines to manage all secrets in one location.","title":"Azure DevOps: Managing Settings on a Per-Branch Basis"},{"location":"CI-CD/gitops/secret-management/azure-devops-secret-management-per-branch/#azure-devops-managing-settings-on-a-per-branch-basis","text":"When using Azure DevOps Pipelines for CI/CD, it's convenient to leverage the built-in pipeline variables for secrets management , but using pipeline variables for secrets management has its disadvantages: Pipeline variables are managed outside the code that references them. This makes it easy to introduce drift between the source code and the secrets, e.g. adding a reference to a new secret in code but forgetting to add it to the pipeline variables (leads to confusing build breaks), or deleting a reference to a secret in code and forgetting to remote it from the pipeline variables (leads to confusing pipeline variables). Pipeline variables are global shared state. This can lead to confusing situations and hard to debug problems when developers make concurrent changes to the pipeline variables which may override each other. Having a single global set of pipeline variables also makes it impossible for secrets to vary per environment (e.g. when using a branch-based deployment model where 'master' deploys using the production secrets, 'development' deploys using the staging secrets, and so forth). 
A solution to these limitations is to manage secrets in the Git repository jointly with the project's source code. As described in secrets management , don't check secrets into the repository in plain text. Instead we can add an encrypted version of our secrets to the repository and enable our CI/CD agents and developers to decrypt the secrets for local usage with some pre-shared key. This gives us the best of both worlds: a secure storage for secrets as well as side-by-side management of secrets and code. # first, make sure that we never commit our plain text secrets and generate a strong encryption key echo \".env\" >> .gitignore ENCRYPTION_KEY = \" $( LC_ALL = C < /dev/urandom tr -dc '_A-Z-a-z-0-9' | head -c128 ) \" # now let's add some secret to our .env file echo \"MY_SECRET=...\" >> .env # also update our secrets documentation file cat >> .env.template <<< \" # enter description of your secret here MY_SECRET= \" # next, encrypt the plain text secrets; the resulting .env.enc file can safely be committed to the repository echo \" ${ ENCRYPTION_KEY } \" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env -out .env.enc git add .env.enc .env.template git commit -m \"Update secrets\" When running the CI/CD, the build server can now access the secrets by decrypting them. E.g. for Azure DevOps, configure ENCRYPTION_KEY as a secret pipeline variable and then add the following step to azure-pipelines.yml : steps : - script : echo \"$(ENCRYPTION_KEY)\" | openssl enc -aes-256-cbc -md sha512 -pass stdin -in .env.enc -out .env -d displayName : Decrypt secrets You can also use variable groups linked directly to Azure key vault for your pipelines to manage all secrets in one location.","title":"Azure DevOps: Managing Settings on a Per-Branch Basis"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/","text":"Secrets Rotation of Environment Variables and Mounted Secrets in Pods This document covers some ways you can do secret rotation with environment variables and mounted secrets in Kubernetes pods Mapping Secrets via secretKeyRef with Environment Variables If we map a K8s native secret via a secretKeyRef into an environment variable and we rotate keys the environment variable is not updated even though the K8s native secret has been updated. We need to restart the Pod so changes get populated. Reloader solves this issue with a K8S controller. ... env : - name : EVENTHUB_CONNECTION_STRING valueFrom : secretKeyRef : name : poc-creds key : EventhubConnectionString ... Mapping Secrets via volumeMounts (ESO Way) If we map a K8s native secret via a volume mount and we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... volumeMounts : - name : mounted-secret mountPath : /mnt/secrets-store readOnly : true volumes : - name : mounted-secret secret : secretName : poc-creds ... Mapping Secrets via volumeMounts (AKVP SSCSID Way) SSCSID focuses on mounting external secrets into the CSI. Thus if we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... 
volumeMounts : - name : app-secrets-store-inline mountPath : \"/mnt/app-secrets-store\" readOnly : true volumes : - name : app-secrets-store-inline csi : driver : secrets-store.csi.k8s.io readOnly : true volumeAttributes : secretProviderClass : akvp-app nodePublishSecretRef : name : secrets-store-sp-creds ...","title":"Secrets Rotation of Environment Variables and Mounted Secrets in Pods"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#secrets-rotation-of-environment-variables-and-mounted-secrets-in-pods","text":"This document covers some ways you can do secret rotation with environment variables and mounted secrets in Kubernetes pods","title":"Secrets Rotation of Environment Variables and Mounted Secrets in Pods"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#mapping-secrets-via-secretkeyref-with-environment-variables","text":"If we map a K8s native secret via a secretKeyRef into an environment variable and we rotate keys the environment variable is not updated even though the K8s native secret has been updated. We need to restart the Pod so changes get populated. Reloader solves this issue with a K8S controller. ... env : - name : EVENTHUB_CONNECTION_STRING valueFrom : secretKeyRef : name : poc-creds key : EventhubConnectionString ...","title":"Mapping Secrets via secretKeyRef with Environment Variables"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#mapping-secrets-via-volumemounts-eso-way","text":"If we map a K8s native secret via a volume mount and we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... volumeMounts : - name : mounted-secret mountPath : /mnt/secrets-store readOnly : true volumes : - name : mounted-secret secret : secretName : poc-creds ...","title":"Mapping Secrets via volumeMounts (ESO Way)"},{"location":"CI-CD/gitops/secret-management/secret-rotation-in-pods/#mapping-secrets-via-volumemounts-akvp-sscsid-way","text":"SSCSID focuses on mounting external secrets into the CSI. Thus if we rotate keys the file gets updated. The application needs to then be able pick up the changes without a restart (requiring most likely custom logic in the application to support this). Then no restart of the application is required. ... volumeMounts : - name : app-secrets-store-inline mountPath : \"/mnt/app-secrets-store\" readOnly : true volumes : - name : app-secrets-store-inline csi : driver : secrets-store.csi.k8s.io readOnly : true volumeAttributes : secretProviderClass : akvp-app nodePublishSecretRef : name : secrets-store-sp-creds ...","title":"Mapping Secrets via volumeMounts (AKVP SSCSID Way)"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/","text":"Continuous Delivery on Low-Code and No-Code Solutions Low-code and no-code platforms have taken a spot in a wide variety of Business Solutions involving process automation, AI models, Bots, Business Applications and Business Intelligence. The scenarios enabled by these platforms are constantly evolving and opening a spot for productive roles. This has been exactly the reason why bringing more professional tools to their development have become necessary such as controlled and automated delivery. 
In the case of Power Platform products, the adoption of a CI/CD process may seem to increase the development complexity to a solution oriented to Citizen Developers it is more important to make the development process more scalable and capable of dealing with new features and bug corrections in a faster way. Environments in Power Platform Solutions Environments are spaces where Power Platform Solutions exists. They store, manage and share everything related to the solution like data, apps, chat bots, flows and models. They also serve as containers to separate apps that might have different roles, security requirements or just target audiences. They can be used to create different stages of the solution development process, the expected model of working with environments in a CI/CD process will be as the following image suggests. Environments Considerations Whenever an environment has been created, its resources can be only accessed by users within the same tenant which is an Azure Active Directory tenant in fact. When you create an app in an environment that app can only interact with data sources that are also deployed in that same environment, this includes connections, flows and Dataverse databases. This is an important consideration when dealing with a CD process. Deployment Strategy With three environments already created to represent the stages of the deployment, the goal now is to automate the deployment from one environment to another. Each environment will require the creation of its own solution: business logic and data. Step 1 Development team will be working in a Dev environment. These environments according to the team could be one for the team or one for each developer. Once changes have been made, the first step will be packaging the solution and export it into source control. Step 2 Second step is about the solution, you need to have a managed solution to deploy to other environments such as Stage or Production so now you should use a JIT environment where you would import your unmanaged solution and export them as managed. These solution files won't be checked into source control but will be stored as a build artifact in the pipeline making them available to be deployed in the release pipeline. This is where the second environment will be used. This second environment will be responsible of receiving the output managed solution coming from the artifact. Step 3 Third and final step will import the solution into the production environment, this means that this stage will take the artifact from last step and will export it. When working in this environment you can also version your product in order to make a better trace of the product. Tools Most used tools to get this process completed are: Power Platform Build Tools There is also a non graphical tool that could be used to work with this CD process. The Power CLI tool. Resources Application lifecycle management with Microsoft Power Platform","title":"CD on low code solutions"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#continuous-delivery-on-low-code-and-no-code-solutions","text":"Low-code and no-code platforms have taken a spot in a wide variety of Business Solutions involving process automation, AI models, Bots, Business Applications and Business Intelligence. The scenarios enabled by these platforms are constantly evolving and opening a spot for productive roles. 
This has been exactly the reason why bringing more professional tools to their development have become necessary such as controlled and automated delivery. In the case of Power Platform products, the adoption of a CI/CD process may seem to increase the development complexity to a solution oriented to Citizen Developers it is more important to make the development process more scalable and capable of dealing with new features and bug corrections in a faster way.","title":"Continuous Delivery on Low-Code and No-Code Solutions"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#environments-in-power-platform-solutions","text":"Environments are spaces where Power Platform Solutions exists. They store, manage and share everything related to the solution like data, apps, chat bots, flows and models. They also serve as containers to separate apps that might have different roles, security requirements or just target audiences. They can be used to create different stages of the solution development process, the expected model of working with environments in a CI/CD process will be as the following image suggests.","title":"Environments in Power Platform Solutions"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#environments-considerations","text":"Whenever an environment has been created, its resources can be only accessed by users within the same tenant which is an Azure Active Directory tenant in fact. When you create an app in an environment that app can only interact with data sources that are also deployed in that same environment, this includes connections, flows and Dataverse databases. This is an important consideration when dealing with a CD process.","title":"Environments Considerations"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#deployment-strategy","text":"With three environments already created to represent the stages of the deployment, the goal now is to automate the deployment from one environment to another. Each environment will require the creation of its own solution: business logic and data.","title":"Deployment Strategy"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#step-1","text":"Development team will be working in a Dev environment. These environments according to the team could be one for the team or one for each developer. Once changes have been made, the first step will be packaging the solution and export it into source control.","title":"Step 1"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#step-2","text":"Second step is about the solution, you need to have a managed solution to deploy to other environments such as Stage or Production so now you should use a JIT environment where you would import your unmanaged solution and export them as managed. These solution files won't be checked into source control but will be stored as a build artifact in the pipeline making them available to be deployed in the release pipeline. This is where the second environment will be used. This second environment will be responsible of receiving the output managed solution coming from the artifact.","title":"Step 2"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#step-3","text":"Third and final step will import the solution into the production environment, this means that this stage will take the artifact from last step and will export it. 
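To connect the export/import steps above to an actual pipeline, the sketch below shows roughly what an export-as-managed stage could look like with the Power Platform Build Tools tasks. The task versions, input names, service connection, and solution name are assumptions and must be verified against the extension version installed in your organization.

```yaml
# Hypothetical Azure DevOps steps: export the solution as managed from the JIT environment
# and publish it as a build artifact for the release pipeline (input names are assumptions).
steps:
  - task: PowerPlatformToolInstaller@2
  - task: PowerPlatformExportSolution@2
    inputs:
      authenticationType: PowerPlatformSPN              # service-principal auth (assumed input name)
      PowerPlatformSPN: my-powerplatform-connection     # placeholder service connection
      SolutionName: MySolution                          # placeholder solution name
      SolutionOutputFile: $(Build.ArtifactStagingDirectory)/MySolution_managed.zip
      Managed: true
  - publish: $(Build.ArtifactStagingDirectory)/MySolution_managed.zip
    artifact: managed-solution
```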
When working in this environment you can also version your product in order to make a better trace of the product.","title":"Step 3"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#tools","text":"Most used tools to get this process completed are: Power Platform Build Tools There is also a non graphical tool that could be used to work with this CD process. The Power CLI tool.","title":"Tools"},{"location":"CI-CD/recipes/cd-on-low-code-solutions/#resources","text":"Application lifecycle management with Microsoft Power Platform","title":"Resources"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/","text":"CI Pipeline for Better Documentation Introduction Most projects start with spikes, where developers and analysts produce lots of documentation. Sometimes, these documents don't have a standard and each team member writes them accordingly with their preference. Add to that the time a reviewer will spend confirming grammar, searching for typos or non-inclusive language. This pipeline helps address that! The Pipeline The pipeline uses the following npm modules: markdownlint : add standardization using rules markdown-link-check : check the links in the documentation and report broken ones write-good : linter for English prose We have been using this pipeline for more than one year in different engagements and always received great feedback from the customers! How Does it Work To start using this pipeline: Download the files from this repository Unzip the folders and files to your repository root if the repository is empty - if it's not brand new, copy the files and make the required adjustments: - check .azdo so it matches your repository standard - check package.json so you don't overwrite one you already have in the process. Also update the file if you changed the name of the .azdo folder. Create the pipeline in Azure DevOps or GitHub Resources Markdown Code Reviews in the Engineering Fundamentals Playbook","title":"CI pipeline for better documentation"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#ci-pipeline-for-better-documentation","text":"","title":"CI Pipeline for Better Documentation"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#introduction","text":"Most projects start with spikes, where developers and analysts produce lots of documentation. Sometimes, these documents don't have a standard and each team member writes them accordingly with their preference. Add to that the time a reviewer will spend confirming grammar, searching for typos or non-inclusive language. This pipeline helps address that!","title":"Introduction"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#the-pipeline","text":"The pipeline uses the following npm modules: markdownlint : add standardization using rules markdown-link-check : check the links in the documentation and report broken ones write-good : linter for English prose We have been using this pipeline for more than one year in different engagements and always received great feedback from the customers!","title":"The Pipeline"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#how-does-it-work","text":"To start using this pipeline: Download the files from this repository Unzip the folders and files to your repository root if the repository is empty - if it's not brand new, copy the files and make the required adjustments: - check .azdo so it matches your repository standard - check package.json so you don't overwrite one you already have in the process. 
Also update the file if you changed the name of the .azdo folder. Create the pipeline in Azure DevOps or GitHub","title":"How Does it Work"},{"location":"CI-CD/recipes/ci-pipeline-for-better-documentation/#resources","text":"Markdown Code Reviews in the Engineering Fundamentals Playbook","title":"Resources"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/","text":"CI with Jupyter Notebooks As Azure DevOps doesn't allow code reviewers to comment directly in Jupyter Notebooks, Data Scientists(DSs) have to convert the notebooks to scripts before they commit and push these files to the repository. This document aims to automate this process in Azure DevOps, so the DSs don't need to execute anything locally. Problem Statement A Data Science repository has this folder structure: . \u251c\u2500\u2500 notebooks \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 00 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 01 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 02 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 03 .ipynb \u2514\u2500\u2500 scripts \u251c\u2500\u2500 Machine Learning Experiments - 00 .py \u251c\u2500\u2500 Machine Learning Experiments - 01 .py \u251c\u2500\u2500 Machine Learning Experiments - 02 .py \u2514\u2500\u2500 Machine Learning Experiments - 03 .py The python files are needed to allow Pull Request reviewers to add comments to the notebooks, they can add comments to the Python scripts and we apply these comments to the notebooks. Since we have to run this process manually before we add files to a commit, this manual process is error prone, e.g. If we create a notebook, generate the script from it, but later make some changes and forget to generate a new script for the changes. Solution One way to avoid this is to create the scripts in the repository from the commit. This document will describe this process. We can add a pipeline with the following steps to the repository to run in ipynb files: Go to the Project Settings -> Repositories -> Security -> User Permissions Add the Build Service in Users the permission to Contribute Create a new pipeline. In the newly created pipeline we add: Trigger to run on ipynb files: trigger: paths: include: - '*.ipynb' - '**/*.ipynb' Select the pool as Linux: pool: vmImage: ubuntu-latest Set the directory where we want to store the scripts: variables: REPO_URL: # Azure DevOps URL in the format: dev.azure.com/<Organization>/<Project>/_git/<RepoName> Now we will start the core of the pipeline: 1. Upgrade pip - script: | python -m pip install --upgrade pip displayName: 'Upgrade pip' 1. Install nbconvert and ipython : - script: | pip install nbconvert ipython displayName: 'install nbconvert & ipython' 1. Install pandoc : - script: | sudo apt install -y pandoc displayName: \"Install pandoc\" 1. Find the notebook files ( ipynb ) in the last commit to the repo and convert it to scripts ( py ): - task: Bash@3 inputs: targetType: 'inline' script: | IPYNB_PATH=($(git diff-tree --no-commit-id --name-only -r $(Build.SourceVersion) | grep '[.]ipynb$')) echo $IPYNB_PATH [ -z \"$IPYNB_PATH\" ] && echo \"Nothing to convert\" || jupyter nbconvert --to script $IPYNB_PATH displayName: \"Convert Notebook to script\" 1. Commit these changes to the repository: - bash: | git config --global user.email \"build@dev.azure.com\" git config --global user.name \"build\" git add . 
git commit -m 'Convert Jupyter notebooks' || echo \"No changes to commit\" && NO_CHANGES=1 [ -z \"$NO_CHANGES\" ] || git push https://$(System.AccessToken)@$(REPO_URL) HEAD:$(Build.SourceBranchName) displayName: \"Commit notebook to repository\" Now we have a pipeline that will generate the scripts as we commit our notebooks.","title":"CI with jupyter notebooks"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/#ci-with-jupyter-notebooks","text":"As Azure DevOps doesn't allow code reviewers to comment directly in Jupyter Notebooks, Data Scientists(DSs) have to convert the notebooks to scripts before they commit and push these files to the repository. This document aims to automate this process in Azure DevOps, so the DSs don't need to execute anything locally.","title":"CI with Jupyter Notebooks"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/#problem-statement","text":"A Data Science repository has this folder structure: . \u251c\u2500\u2500 notebooks \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 00 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 01 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 02 .ipynb \u2502 \u251c\u2500\u2500 Machine Learning Experiments - 03 .ipynb \u2514\u2500\u2500 scripts \u251c\u2500\u2500 Machine Learning Experiments - 00 .py \u251c\u2500\u2500 Machine Learning Experiments - 01 .py \u251c\u2500\u2500 Machine Learning Experiments - 02 .py \u2514\u2500\u2500 Machine Learning Experiments - 03 .py The python files are needed to allow Pull Request reviewers to add comments to the notebooks, they can add comments to the Python scripts and we apply these comments to the notebooks. Since we have to run this process manually before we add files to a commit, this manual process is error prone, e.g. If we create a notebook, generate the script from it, but later make some changes and forget to generate a new script for the changes.","title":"Problem Statement"},{"location":"CI-CD/recipes/ci-with-jupyter-notebooks/#solution","text":"One way to avoid this is to create the scripts in the repository from the commit. This document will describe this process. We can add a pipeline with the following steps to the repository to run in ipynb files: Go to the Project Settings -> Repositories -> Security -> User Permissions Add the Build Service in Users the permission to Contribute Create a new pipeline. In the newly created pipeline we add: Trigger to run on ipynb files: trigger: paths: include: - '*.ipynb' - '**/*.ipynb' Select the pool as Linux: pool: vmImage: ubuntu-latest Set the directory where we want to store the scripts: variables: REPO_URL: # Azure DevOps URL in the format: dev.azure.com/<Organization>/<Project>/_git/<RepoName> Now we will start the core of the pipeline: 1. Upgrade pip - script: | python -m pip install --upgrade pip displayName: 'Upgrade pip' 1. Install nbconvert and ipython : - script: | pip install nbconvert ipython displayName: 'install nbconvert & ipython' 1. Install pandoc : - script: | sudo apt install -y pandoc displayName: \"Install pandoc\" 1. Find the notebook files ( ipynb ) in the last commit to the repo and convert it to scripts ( py ): - task: Bash@3 inputs: targetType: 'inline' script: | IPYNB_PATH=($(git diff-tree --no-commit-id --name-only -r $(Build.SourceVersion) | grep '[.]ipynb$')) echo $IPYNB_PATH [ -z \"$IPYNB_PATH\" ] && echo \"Nothing to convert\" || jupyter nbconvert --to script $IPYNB_PATH displayName: \"Convert Notebook to script\" 1. 
Commit these changes to the repository: - bash: | git config --global user.email \"build@dev.azure.com\" git config --global user.name \"build\" git add . git commit -m 'Convert Jupyter notebooks' || echo \"No changes to commit\" && NO_CHANGES=1 [ -z \"$NO_CHANGES\" ] || git push https://$(System.AccessToken)@$(REPO_URL) HEAD:$(Build.SourceBranchName) displayName: \"Commit notebook to repository\" Now we have a pipeline that will generate the scripts as we commit our notebooks.","title":"Solution"},{"location":"CI-CD/recipes/inclusive-linting/","text":"Inclusive Linting As software professionals we should strive to promote an inclusive work environment, which naturally extends to the code and documentation we write. It's important to keep the use of inclusive language consistent across an entire project or repository. To achieve this, we recommend using a text file analysis tool such as an inclusive linter and including this as a step in your CI pipelines. What to Lint for The primary goal of an inclusive linter is to flag any occurrences of non-inclusive language within source code (and optionally suggest some alternatives). Non-inclusive words or phrases in a project can be found anywhere from comments and documentation to variable names. An inclusive linter may include its own dictionary of \"default\" non-inclusive words and phrases to run against as a good starting point. These tools can also be customizable, oftentimes offering the ability to omit some terms and/or add your own. The ability to add additional terms to your linter has the added benefit of enabling linting of sensitive language on top of inclusive linting. This can prevent things such as customer names or other non-public information from making it into your git history, for instance. Getting Started with an Inclusive Linter woke One inclusive linter we recommend is woke . It is a language-agnostic CLI tool that detects non-inclusive language in your source code and recommends alternatives. While woke automatically applies a default ruleset with non-inclusive terms to lint for, you can also apply a custom rule config (via a yaml file) with additional terms to lint for. Running the tool locally on a file or directory is relatively straightforward: $ woke test.txt test.txt:2:2-6: ` guys ` may be insensitive, use ` folks ` , ` people ` instead ( warning ) * guys ^ woke can be run locally on your machine or CI/CD system via CLI and is also available as a two GitHub Actions: Run woke Run woke with Reviewdog To use the standard \"Run woke\" GitHub Action with the default ruleset in a CI pipeline: Add the woke action as a step in your project's CI pipeline yaml: name : ci on : - pull_request jobs : woke : name : woke runs-on : ubuntu-latest steps : - name : Checkout uses : actions/checkout@v2 - name : woke uses : get-woke/woke-action@v0 with : # Cause the check to fail on any broke rules fail-on-error : true Run your pipeline View the output in the \"Actions\" tab in the main repository view Resources woke default ruleset example.yaml Run woke Run woke with reviewdog docs","title":"Inclusive Linting"},{"location":"CI-CD/recipes/inclusive-linting/#inclusive-linting","text":"As software professionals we should strive to promote an inclusive work environment, which naturally extends to the code and documentation we write. It's important to keep the use of inclusive language consistent across an entire project or repository. 
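As an example of the custom rule config mentioned above, a project-specific ruleset for woke might look like the following sketch; the rule schema mirrors the default ruleset linked in the Resources, and the extra sensitive term is a placeholder.

```yaml
# .woke.yaml - custom rules layered on top of (or replacing) the default ruleset.
rules:
  - name: whitelist
    terms:
      - whitelist
      - white-list
    alternatives:
      - allowlist
    severity: warning
  # Example of flagging sensitive, project-specific language (placeholder term).
  - name: internal-codename
    terms:
      - project-falcon
    alternatives:
      - the customer engagement
    severity: error
```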
To achieve this, we recommend using a text file analysis tool such as an inclusive linter and including this as a step in your CI pipelines.","title":"Inclusive Linting"},{"location":"CI-CD/recipes/inclusive-linting/#what-to-lint-for","text":"The primary goal of an inclusive linter is to flag any occurrences of non-inclusive language within source code (and optionally suggest some alternatives). Non-inclusive words or phrases in a project can be found anywhere from comments and documentation to variable names. An inclusive linter may include its own dictionary of \"default\" non-inclusive words and phrases to run against as a good starting point. These tools can also be customizable, oftentimes offering the ability to omit some terms and/or add your own. The ability to add additional terms to your linter has the added benefit of enabling linting of sensitive language on top of inclusive linting. This can prevent things such as customer names or other non-public information from making it into your git history, for instance.","title":"What to Lint for"},{"location":"CI-CD/recipes/inclusive-linting/#getting-started-with-an-inclusive-linter","text":"","title":"Getting Started with an Inclusive Linter"},{"location":"CI-CD/recipes/inclusive-linting/#woke","text":"One inclusive linter we recommend is woke . It is a language-agnostic CLI tool that detects non-inclusive language in your source code and recommends alternatives. While woke automatically applies a default ruleset with non-inclusive terms to lint for, you can also apply a custom rule config (via a yaml file) with additional terms to lint for. Running the tool locally on a file or directory is relatively straightforward: $ woke test.txt test.txt:2:2-6: ` guys ` may be insensitive, use ` folks ` , ` people ` instead ( warning ) * guys ^ woke can be run locally on your machine or CI/CD system via CLI and is also available as a two GitHub Actions: Run woke Run woke with Reviewdog To use the standard \"Run woke\" GitHub Action with the default ruleset in a CI pipeline: Add the woke action as a step in your project's CI pipeline yaml: name : ci on : - pull_request jobs : woke : name : woke runs-on : ubuntu-latest steps : - name : Checkout uses : actions/checkout@v2 - name : woke uses : get-woke/woke-action@v0 with : # Cause the check to fail on any broke rules fail-on-error : true Run your pipeline View the output in the \"Actions\" tab in the main repository view","title":"woke"},{"location":"CI-CD/recipes/inclusive-linting/#resources","text":"woke default ruleset example.yaml Run woke Run woke with reviewdog docs","title":"Resources"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/","text":"Reusing Dev Containers Within a Pipeline Given a repository with a local development container a.k.a. dev container that contains all the tooling required for development, would it make sense to reuse that container for running the tooling in the Continuous Integration pipelines? Options for Building Dev Containers Within a Pipeline There are three ways to build devcontainers within pipeline: With GitHub - devcontainers/ci builds the container with the devcontainer.json . Example here: devcontainers/ci \u00b7 Getting Started . With GitHub - devcontainers/cli , which is the same as the above, but using the underlying CLI directly without tasks. Building the DockerFile with docker build . This option excludes all configuration/features specified within the devcontainer.json . 
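For the first option, a minimal GitHub workflow using the devcontainers/ci action might look like the sketch below; the image name and command are placeholders for whatever your repository actually builds and runs.

```yaml
name: devcontainer-ci
on: [pull_request]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Build dev container and run CI inside it
        uses: devcontainers/ci@v0.3
        with:
          imageName: ghcr.io/my-org/my-repo-devcontainer   # placeholder image name
          push: never                                      # build only, do not publish
          runCmd: make lint test                           # placeholder CI command
```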
Considered Options Run CI pipelines in the native environment Run CI pipelines in the dev container via building image locally Run CI pipelines in the dev container with a container registry Here are below pros and cons for both approaches: Run CI Pipelines in the Native Environment Pros Cons Can use any pipeline tasks available Need to keep two sets of tooling and their versions in sync No container registry Can take some time to start, based on tools/dependencies required Agent will always be up to date with security patches The dev container should always be built within each run of the CI pipeline, to verify the changes within the branch haven't broken anything Run CI Pipelines in the Dev Container Without Image Caching Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built Rules used (for linting or unit tests) will be the same on the CI Not everything in the container is needed for the CI pipeline\u00b9 No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Some pipeline tasks will not be available All tooling and their versions defined in a single place Building the image for each pipeline run is slow\u00b2 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken \u00b9: container size can be reduced by exporting the layer that contains only the tooling needed for the CI pipeline \u00b2: could be mitigated via adding image caching without using a container registry Run CI Pipelines in the Dev Container with Image Registry Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Not everything in the container is needed for the CI pipeline\u00b9 Rules used (for linting or unit tests) will be the same on the CI Some pipeline tasks will not be available\u00b2 All tooling and their versions defined in a single place Require access to a container registry to host the container within the pipeline\u00b3 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken Publishing the container built from devcontainer.json allows you to reference it in the cacheFrom in devcontainer.json (see docs ). By doing this, VS Code will use the published image as a layer cache when building \u00b9: container size can be reduces by exporting the layer that contains only the tooling needed for the CI pipeline. This would require building the image without tasks \u00b2: using container jobs in AzDO you can use all tasks (as far as I can tell). Reference: Dockerizing DevOps V2 - AzDO container jobs - DEV Community \u00b3: within GH actions, the default Github Actions token can be used for accessing GHCR without setting up separate registry, see the example below. 
Note: This does not build the Dockerfile together with the devcontainer.json - uses : whoan/docker-build-with-cache-action@v5 id : cache with : username : $GITHUB_ACTOR password : \"${{ secrets.GITHUB_TOKEN }}\" registry : docker.pkg.github.com image_name : devcontainer dockerfile : .devcontainer/Dockerfile","title":"Reusing Dev Containers Within a Pipeline"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#reusing-dev-containers-within-a-pipeline","text":"Given a repository with a local development container a.k.a. dev container that contains all the tooling required for development, would it make sense to reuse that container for running the tooling in the Continuous Integration pipelines?","title":"Reusing Dev Containers Within a Pipeline"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#options-for-building-dev-containers-within-a-pipeline","text":"There are three ways to build devcontainers within pipeline: With GitHub - devcontainers/ci builds the container with the devcontainer.json . Example here: devcontainers/ci \u00b7 Getting Started . With GitHub - devcontainers/cli , which is the same as the above, but using the underlying CLI directly without tasks. Building the DockerFile with docker build . This option excludes all configuration/features specified within the devcontainer.json .","title":"Options for Building Dev Containers Within a Pipeline"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#considered-options","text":"Run CI pipelines in the native environment Run CI pipelines in the dev container via building image locally Run CI pipelines in the dev container with a container registry Here are below pros and cons for both approaches:","title":"Considered Options"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#run-ci-pipelines-in-the-native-environment","text":"Pros Cons Can use any pipeline tasks available Need to keep two sets of tooling and their versions in sync No container registry Can take some time to start, based on tools/dependencies required Agent will always be up to date with security patches The dev container should always be built within each run of the CI pipeline, to verify the changes within the branch haven't broken anything","title":"Run CI Pipelines in the Native Environment"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#run-ci-pipelines-in-the-dev-container-without-image-caching","text":"Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built Rules used (for linting or unit tests) will be the same on the CI Not everything in the container is needed for the CI pipeline\u00b9 No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Some pipeline tasks will not be available All tooling and their versions defined in a single place Building the image for each pipeline run is slow\u00b2 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken \u00b9: container size can be reduced by exporting the layer that contains only the tooling needed for the CI pipeline \u00b2: could be mitigated via adding image caching without using a container registry","title":"Run CI Pipelines in the Dev Container Without Image 
Caching"},{"location":"CI-CD/recipes/reusing-devcontainers-within-a-pipeline/#run-ci-pipelines-in-the-dev-container-with-image-registry","text":"Pros Cons Utilities scripts will work out of the box Need to rebuild the container for each run, given that there may be changes within the branch being built No surprise for the developers, local outputs (of linting for instance) will be the same in the CI Not everything in the container is needed for the CI pipeline\u00b9 Rules used (for linting or unit tests) will be the same on the CI Some pipeline tasks will not be available\u00b2 All tooling and their versions defined in a single place Require access to a container registry to host the container within the pipeline\u00b3 Tools/dependencies are already present The dev container is being tested to include all new tooling in addition to not being broken Publishing the container built from devcontainer.json allows you to reference it in the cacheFrom in devcontainer.json (see docs ). By doing this, VS Code will use the published image as a layer cache when building \u00b9: container size can be reduces by exporting the layer that contains only the tooling needed for the CI pipeline. This would require building the image without tasks \u00b2: using container jobs in AzDO you can use all tasks (as far as I can tell). Reference: Dockerizing DevOps V2 - AzDO container jobs - DEV Community \u00b3: within GH actions, the default Github Actions token can be used for accessing GHCR without setting up separate registry, see the example below. Note: This does not build the Dockerfile together with the devcontainer.json - uses : whoan/docker-build-with-cache-action@v5 id : cache with : username : $GITHUB_ACTOR password : \"${{ secrets.GITHUB_TOKEN }}\" registry : docker.pkg.github.com image_name : devcontainer dockerfile : .devcontainer/Dockerfile","title":"Run CI Pipelines in the Dev Container with Image Registry"},{"location":"CI-CD/recipes/github-actions/runtime-variables/","text":"Runtime Variables in GitHub Actions Objective While GitHub Actions is a popular choice for writing and running CI/CD pipelines, especially for open source projects hosted on GitHub, it lacks specific quality of life features found in other CI/CD environments. One key feature that GitHub Actions has not yet implemented is the ability to mock and inject runtime variables into a workflow, in order to test the pipeline itself. This provides a bridge between a pre-existing feature in Azure DevOps, and one that has not yet released inside GitHub Actions. Target Audience This guide assumes that you are familiar with CI/CD, and understand the security implications of CI/CD pipelines. We also assume basic knowledge with GitHub Actions, including how to write and run a basic CI/CD pipeline, checkout repositories inside the action, use Marketplace Actions with version control, etc. We assume that you, as a CI/CD engineer, want to inject environment variables or environment flags into your pipelines and workflows in order to test them, and are using GitHub Actions to accomplish this. Usage Scenario Many integration or end-to-end workflows require specific environment variables that are only available at runtime. For example, a workflow might be doing the following: In this situation, testing the pipeline is extremely difficult without having to make external calls to the resource. In many cases, making external calls to the resource can be expensive or time-consuming, significantly slowing down inner loop development. 
Azure DevOps, as an example, offers a way to define pipeline variables on a manual trigger: GitHub Actions does not do so yet. Solution To workaround this, the easiest solution is to add runtime variables to either commit messages or the PR Body, and grep for the variable. GitHub Actions provides grep functionality natively using a contains function, which is what we shall be specifically using. In scope: We will scope this to injecting a single environment variable into a pipeline, with a previously known key and value. Out of Scope: While the solution is obviously extensible using shell scripting or any other means of creating variables, this solution serves well as the proof of the basic concept. No such scripting is provided in this guide. Additionally, teams may wish to formalize this process using a PR Template that has an additional section for the variables being provided. This is not however included in this guide. Security Warning: This is NOT for injecting secrets as the commit messages and PR body can be retrieved by a third party, are stored in git log , and can otherwise be read by a malicious individual using a variety of tools. Rather, this is for testing a workflow that needs simple variables to be injected into it, as above. If you need to retrieve secrets or sensitive information , use the GitHub Action for Azure Key Vault or some other similar secret storage and retrieval service. Commit Message Variables How to inject a single variable into the environment for use, with a specified key and value. In this example, the key is COMMIT_VAR and the value is [commit var] . Pre-requisites: Pipeline triggers are correctly set up to trigger on pushed commits (Here we will use actions-test-branch as the branch of choice) Code Snippet: on : push : branches : - actions-test-branch jobs : Echo-On-Commit : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the code is setting up Push triggers on the working branch and checking out the repository, so we will not explore that in detail. - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} This is a named step inside the only Job in our GitHub Actions pipeline. Here, we set an environment variable for the step: Any code or action that the step calls will now have the environment variable available. contains is a GitHub Actions function that is available by default in all workflows. It returns a Boolean true or false value. In this situation, it checks to see if the commit message on the last push, accessed using github.event.head_commit.message . The ${{...}} is necessary to use the GitHub Context and make the functions and github.event variables available for the command. run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi The run command here checks to see if the COMMIT_VAR variable has been set to true , and if it has, it sets a secondary flag to true, and echoes this behavior. 
It does the same if the variable is false . The specific reason to do this is to allow for the flag variable to be used in further steps instead of having to reuse the COMMIT_VAR in every step. Further, it allows for the flag to be used in the if step of an action, as in the next part of the snippet. - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" In this part of the snippet, the next step in the same job is now using the flag that was set in the previous step. This allows the user to: Reuse the flag instead of repeatedly accessing the GitHub Context Set the flag using multiple conditions, instead of just one. For example, a different step might ALSO set the flag to true or false for different reasons. Change the variable in exactly one place instead of having to change it in multiple places Shorter Alternative: The \"Set flag from commit\" step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | echo \"flag=${COMMIT_VAR}\" >> $GITHUB_ENV echo \"set flag to ${COMMIT_VAR}\" Usage: Including the Variable Push to branch master : > git add. > git commit -m \"Running GitHub Actions Test [commit var]\" > git push This triggers the workflow (as will any push). As the [commit var] is in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to true and result in the following: Not Including the Variable Push to branch master : > git add. > git commit -m \"Running GitHub Actions Test\" > git push This triggers the workflow (as will any push). As the [commit var] is not in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to false and result in the following: PR Body Variables When a PR is made, the PR Body can also be used to set up variables. These variables can be made available to all the workflow runs that stem from that PR, which can help ensure that commit messages are more informative and less cluttered, and reduces the work on the developer. Once again, this for an expected key and value. In this case, the key is PR_VAR and the value is [pr var] . Pre-requisites: Pipeline triggers are correctly set up to trigger on a pull request into a specific branch. (Here we will use master as the destination branch.) Code Snippet: on : pull_request : branches : - master jobs : Echo-On-PR : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | if ${PR_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the YAML file simply sets up the Pull Request Trigger. The majority of the following code is identical, so we will only explain the differences. - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} In this section, the PR_VAR environment variable is set to true or false depending on whether the [pr var] string is in the PR Body. 
Shorter Alternative: Similarly to the above, the YAML step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | echo \"flag=${PR_VAR}\" >> $GITHUB_ENV echo \"set flag to ${PR_VAR}\" Usage: Create a Pull Request into master , and include the expected variable in the body somewhere: The GitHub Action will trigger automatically, and since [pr var] is present in the PR Body, it will set the flag to true, as shown below: Real World Scenarios There are many real world scenarios where controlling environment variables can be extremely useful. Some are outlined below: Avoiding Expensive External Calls Developer A is in the process of writing and testing an integration pipeline. The integration pipeline needs to make a call to an external service such as Azure Data Factory or Databricks, wait for a result, and then echo that result. The workflow could look like this: The workflow inherently takes time and is expensive to run, as it involves maintaining a Databricks cluster while also waiting for the response. This external dependency can be removed by essentially mocking the response for the duration of writing and testing other parts of the workflow, and mocking the response in situations where the actual response either does not matter, or is not being directly tested. Skipping Long CI processes Developer B is in the process of writing and testing a CI/CD pipeline. The pipeline has multiple CI stages, each of which runs sequentially. The workflow might look like this: In this case, each CI stage needs to run before the next one starts, and errors in the middle of the process can cause the entire pipeline to fail. While this might be intended behavior for the pipeline in some situations (Perhaps you don't want to run a more involved, longer build or run a time-consuming test coverage suite if the CI process is failing), it means that steps need to be commented out or deleted when testing the pipeline itself. Instead, an additional step could check for a [skip ci $N] tag in either the commit messages or PR Body, and skip a specific stage of the CI build. This ensures that the final pipeline does not have changes committed to it that render it broken, as sometimes happens when commenting out/deleting steps. It additionally allows for a mechanism to repeatedly test individual steps by skipping the others, making developing the pipeline significantly easier.","title":"Runtime Variables in GitHub Actions"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#runtime-variables-in-github-actions","text":"","title":"Runtime Variables in GitHub Actions"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#objective","text":"While GitHub Actions is a popular choice for writing and running CI/CD pipelines, especially for open source projects hosted on GitHub, it lacks specific quality of life features found in other CI/CD environments. One key feature that GitHub Actions has not yet implemented is the ability to mock and inject runtime variables into a workflow, in order to test the pipeline itself. 
This provides a bridge between a pre-existing feature in Azure DevOps, and one that has not yet released inside GitHub Actions.","title":"Objective"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#target-audience","text":"This guide assumes that you are familiar with CI/CD, and understand the security implications of CI/CD pipelines. We also assume basic knowledge with GitHub Actions, including how to write and run a basic CI/CD pipeline, checkout repositories inside the action, use Marketplace Actions with version control, etc. We assume that you, as a CI/CD engineer, want to inject environment variables or environment flags into your pipelines and workflows in order to test them, and are using GitHub Actions to accomplish this.","title":"Target Audience"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#usage-scenario","text":"Many integration or end-to-end workflows require specific environment variables that are only available at runtime. For example, a workflow might be doing the following: In this situation, testing the pipeline is extremely difficult without having to make external calls to the resource. In many cases, making external calls to the resource can be expensive or time-consuming, significantly slowing down inner loop development. Azure DevOps, as an example, offers a way to define pipeline variables on a manual trigger: GitHub Actions does not do so yet.","title":"Usage Scenario"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#solution","text":"To workaround this, the easiest solution is to add runtime variables to either commit messages or the PR Body, and grep for the variable. GitHub Actions provides grep functionality natively using a contains function, which is what we shall be specifically using. In scope: We will scope this to injecting a single environment variable into a pipeline, with a previously known key and value. Out of Scope: While the solution is obviously extensible using shell scripting or any other means of creating variables, this solution serves well as the proof of the basic concept. No such scripting is provided in this guide. Additionally, teams may wish to formalize this process using a PR Template that has an additional section for the variables being provided. This is not however included in this guide. Security Warning: This is NOT for injecting secrets as the commit messages and PR body can be retrieved by a third party, are stored in git log , and can otherwise be read by a malicious individual using a variety of tools. Rather, this is for testing a workflow that needs simple variables to be injected into it, as above. If you need to retrieve secrets or sensitive information , use the GitHub Action for Azure Key Vault or some other similar secret storage and retrieval service.","title":"Solution"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#commit-message-variables","text":"How to inject a single variable into the environment for use, with a specified key and value. In this example, the key is COMMIT_VAR and the value is [commit var] . 
Pre-requisites: Pipeline triggers are correctly set up to trigger on pushed commits (Here we will use actions-test-branch as the branch of choice) Code Snippet: on : push : branches : - actions-test-branch jobs : Echo-On-Commit : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the code is setting up Push triggers on the working branch and checking out the repository, so we will not explore that in detail. - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} This is a named step inside the only Job in our GitHub Actions pipeline. Here, we set an environment variable for the step: Any code or action that the step calls will now have the environment variable available. contains is a GitHub Actions function that is available by default in all workflows. It returns a Boolean true or false value. In this situation, it checks to see if the commit message on the last push, accessed using github.event.head_commit.message . The ${{...}} is necessary to use the GitHub Context and make the functions and github.event variables available for the command. run : | if ${COMMIT_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi The run command here checks to see if the COMMIT_VAR variable has been set to true , and if it has, it sets a secondary flag to true, and echoes this behavior. It does the same if the variable is false . The specific reason to do this is to allow for the flag variable to be used in further steps instead of having to reuse the COMMIT_VAR in every step. Further, it allows for the flag to be used in the if step of an action, as in the next part of the snippet. - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" In this part of the snippet, the next step in the same job is now using the flag that was set in the previous step. This allows the user to: Reuse the flag instead of repeatedly accessing the GitHub Context Set the flag using multiple conditions, instead of just one. For example, a different step might ALSO set the flag to true or false for different reasons. Change the variable in exactly one place instead of having to change it in multiple places Shorter Alternative: The \"Set flag from commit\" step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from Commit\" env : COMMIT_VAR : ${{ contains(github.event.head_commit.message, '[commit var]') }} run : | echo \"flag=${COMMIT_VAR}\" >> $GITHUB_ENV echo \"set flag to ${COMMIT_VAR}\" Usage: Including the Variable Push to branch master : > git add. > git commit -m \"Running GitHub Actions Test [commit var]\" > git push This triggers the workflow (as will any push). As the [commit var] is in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to true and result in the following: Not Including the Variable Push to branch master : > git add. 
> git commit -m \"Running GitHub Actions Test\" > git push This triggers the workflow (as will any push). As the [commit var] is not in the commit message, the ${COMMIT_VAR} variable in the workflow will be set to false and result in the following:","title":"Commit Message Variables"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#pr-body-variables","text":"When a PR is made, the PR Body can also be used to set up variables. These variables can be made available to all the workflow runs that stem from that PR, which can help ensure that commit messages are more informative and less cluttered, and reduces the work on the developer. Once again, this for an expected key and value. In this case, the key is PR_VAR and the value is [pr var] . Pre-requisites: Pipeline triggers are correctly set up to trigger on a pull request into a specific branch. (Here we will use master as the destination branch.) Code Snippet: on : pull_request : branches : - master jobs : Echo-On-PR : runs-on : ubuntu-latest steps : - name : \"Checkout Repository\" uses : actions/checkout@v2 - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | if ${PR_VAR} == true; then echo \"flag=true\" >> $GITHUB_ENV echo \"flag set to true\" else echo \"flag=false\" >> $GITHUB_ENV echo \"flag set to false\" fi - name : \"Use flag if true\" if : env.flag run : echo \"Flag is available and true\" Code Explanation: The first part of the YAML file simply sets up the Pull Request Trigger. The majority of the following code is identical, so we will only explain the differences. - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} In this section, the PR_VAR environment variable is set to true or false depending on whether the [pr var] string is in the PR Body. Shorter Alternative: Similarly to the above, the YAML step can be simplified to the following in order to make the code much shorter, although not necessarily more readable: - name : \"Set flag from PR\" env : PR_VAR : ${{ contains(github.event.pull_request.body, '[pr var]') }} run : | echo \"flag=${PR_VAR}\" >> $GITHUB_ENV echo \"set flag to ${PR_VAR}\" Usage: Create a Pull Request into master , and include the expected variable in the body somewhere: The GitHub Action will trigger automatically, and since [pr var] is present in the PR Body, it will set the flag to true, as shown below:","title":"PR Body Variables"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#real-world-scenarios","text":"There are many real world scenarios where controlling environment variables can be extremely useful. Some are outlined below:","title":"Real World Scenarios"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#avoiding-expensive-external-calls","text":"Developer A is in the process of writing and testing an integration pipeline. The integration pipeline needs to make a call to an external service such as Azure Data Factory or Databricks, wait for a result, and then echo that result. The workflow could look like this: The workflow inherently takes time and is expensive to run, as it involves maintaining a Databricks cluster while also waiting for the response. 
This external dependency can be removed by essentially mocking the response for the duration of writing and testing other parts of the workflow, and mocking the response in situations where the actual response either does not matter, or is not being directly tested.","title":"Avoiding Expensive External Calls"},{"location":"CI-CD/recipes/github-actions/runtime-variables/#skipping-long-ci-processes","text":"Developer B is in the process of writing and testing a CI/CD pipeline. The pipeline has multiple CI stages, each of which runs sequentially. The workflow might look like this: In this case, each CI stage needs to run before the next one starts, and errors in the middle of the process can cause the entire pipeline to fail. While this might be intended behavior for the pipeline in some situations (Perhaps you don't want to run a more involved, longer build or run a time-consuming test coverage suite if the CI process is failing), it means that steps need to be commented out or deleted when testing the pipeline itself. Instead, an additional step could check for a [skip ci $N] tag in either the commit messages or PR Body, and skip a specific stage of the CI build. This ensures that the final pipeline does not have changes committed to it that render it broken, as sometimes happens when commenting out/deleting steps. It additionally allows for a mechanism to repeatedly test individual steps by skipping the others, making developing the pipeline significantly easier.","title":"Skipping Long CI processes"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/","text":"Save Terraform Output to a Variable Group (Azure DevOps) This recipe applies only to terraform usage with Azure DevOps. It assumes you're familiar with terraform commands and Azure Pipelines. Context When terraform is used to automate the provisioning of the infrastructure, an Azure Pipeline is generally dedicated to applying terraform configuration files. It will create, update, or delete Azure resources to provision your infrastructure changes. Once files are applied, some Output Values (for instance resource group name, app service name) can be referenced and outputted by terraform. These values generally must be retrieved afterwards and used as input variables for the deployment of services happening in separate pipelines. output \"core_resource_group_name\" { description = \"The resource group name\" value = module.core.resource_group_name } output \"core_key_vault_name\" { description = \"The key vault name.\" value = module.core.key_vault_name } output \"core_key_vault_url\" { description = \"The key vault url.\" value = module.core.key_vault_url } The purpose of this recipe is to answer the following question: How to make terraform output values available across multiple pipelines ? Solution One suggested solution is to store outputted values in the Library with a Variable Group . Variable groups are a convenient way to store values you might want to pass into a YAML pipeline. In addition, all assets defined in the Library share a common security model. You can control who can define new items in a library, and who can use an existing item. For this purpose, we are using the following commands: terraform output to extract the value of an output variable from the state file (provided by Terraform CLI ) az pipelines variable-group to manage variable groups (provided by Azure DevOps CLI ) You can use the following script once terraform apply is completed to create/update the variable group.
Script (update-variablegroup.sh) Parameters Name Description DEVOPS_ORGANIZATION The URI of the Azure DevOps organization. DEVOPS_PROJECT The name or ID of the Azure DevOps project. GROUP_NAME The name of the variable group targeted. Implementation choices: If a variable group already exists, a valid option could be to delete and rebuild the group from scratch. However, as authorization could have been updated at the group level, we prefer to avoid this option. Instead, the script removes all variables in the targeted group and adds them back with the latest values. Permissions are not impacted. A variable group cannot be empty. It must contain at least one variable. A temporary uuid value is created to mitigate this issue, and removed once variables are updated. #!/bin/bash set -e export DEVOPS_ORGANIZATION = $1 export DEVOPS_PROJECT = $2 export GROUP_NAME = $3 # configure the azure devops cli az devops configure --defaults organization = ${ DEVOPS_ORGANIZATION } project = ${ DEVOPS_PROJECT } --use-git-aliases true # get the variable group id (if already exists) group_id = $( az pipelines variable-group list --group-name ${ GROUP_NAME } --query '[0].id' -o json ) if [ -z \" ${ group_id } \" ] ; then # create a new variable group tf_output = $( terraform output -json | jq -r 'to_entries[] | \"\\(.key)=\\(.value.value)\"' ) az pipelines variable-group create --name ${ GROUP_NAME } --variables ${ tf_output } --authorize true else # get existing variables var_list = $( az pipelines variable-group variable list --group-id ${ group_id } ) # add temporary uuid variable (a variable group cannot be empty) uuid = $( cat /proc/sys/kernel/random/uuid ) az pipelines variable-group variable create --group-id ${ group_id } --name ${ uuid } # delete existing variables for row in $( echo ${ var_list } | jq -r 'to_entries[] | \"\\(.key)\"' ) ; do az pipelines variable-group variable delete --group-id ${ group_id } --name ${ row } --yes done # create variables with latest values (from terraform) for row in $( terraform output -json | jq -c 'to_entries[]' ) ; do _jq () { echo ${ row } | jq -r ${ 1 } } az pipelines variable-group variable create --group-id ${ group_id } --name $( _jq '.key' ) --value $( _jq '.value.value' ) --secret $( _jq '.value.sensitive' ) done # delete temporary uuid variable az pipelines variable-group variable delete --group-id ${ group_id } --name ${ uuid } --yes fi Authenticate with Azure DevOps Most commands used in the previous script interact with Azure DevOps and require authentication. You can authenticate using the System.AccessToken security token used by the running pipeline, by assigning it to an environment variable named AZURE_DEVOPS_EXT_PAT , as shown in the following example (see Azure DevOps CLI in Azure Pipeline YAML for additional information). In addition, note that we are also using predefined variables to target the Azure DevOps organization and project (respectively System.TeamFoundationCollectionUri and System.TeamProjectId ). - task : Bash@3 displayName : 'Update variable group using terraform outputs' inputs : targetType : filePath arguments : $(System.TeamFoundationCollectionUri) $(System.TeamProjectId) \"Platform-VG\" workingDirectory : $(terraformDirectory) filePath : $(scriptsDirectory)/update-variablegroup.sh env : AZURE_DEVOPS_EXT_PAT : $(System.AccessToken) System variables Description System.AccessToken Special variable that carries the security token used by the running build. System.TeamFoundationCollectionUri The URI of the Azure DevOps organization.
System.TeamProjectId The ID of the project that this build belongs to. Library security Roles are defined for Library items, and membership of these roles governs the operations you can perform on those items. Role for library item Description Reader Can view the item. User Can use the item when authoring build or release pipelines. For example, you must be a 'User' for a variable group to use it in a release pipeline. Administrator Can also manage membership of all other roles for the item. The user who created an item gets automatically added to the Administrator role for that item. By default, the following groups get added to the Administrator role of the library: Build Administrators, Release Administrators, and Project Administrators. Creator Can create new items in the library, but this role doesn't include Reader or User permissions. The Creator role can't manage permissions for other users. When using System.AccessToken , service account <ProjectName> Build Service identity will be used to access the Library. Please ensure in Pipelines > Library > Security section that this service account has Administrator role at the Library or Variable Group level to create/update/delete variables (see. Library of assets for additional information).","title":"Save Terraform Output to a Variable Group (Azure DevOps)"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#save-terraform-output-to-a-variable-group-azure-devops","text":"This recipe applies only to terraform usage with Azure DevOps. It assumes your familiar with terraform commands and Azure Pipelines.","title":"Save Terraform Output to a Variable Group (Azure DevOps)"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#context","text":"When terraform is used to automate the provisioning of the infrastructure, an Azure Pipeline is generally dedicated to apply terraform configuration files. It will create, update, delete Azure resources to provision your infrastructure changes. Once files are applied, some Output Values (for instance resource group name, app service name) can be referenced and outputted by terraform. These values must be generally retrieved afterwards, used as input variables for the deployment of services happening in separate pipelines. output \"core_resource_group_name\" { description = \"The resource group name\" value = module.core.resource_group_name } output \"core_key_vault_name\" { description = \"The key vault name.\" value = module.core.key_vault_name } output \"core_key_vault_url\" { description = \"The key vault url.\" value = module.core.key_vault_url } The purpose of this recipe is to answer the following statement: How to make terraform output values available across multiple pipelines ?","title":"Context"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#solution","text":"One suggested solution is to store outputted values in the Library with a Variable Group . Variable groups is a convenient way store values you might want to be passed into a YAML pipeline. In addition, all assets defined in the Library share a common security model. You can control who can define new items in a library, and who can use an existing item. 
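As an illustration of the consuming side (not shown in the original recipe), any downstream pipeline can then read the stored values simply by referencing the group. The sketch below assumes the group is named Platform-VG, the name used in the pipeline task later in this recipe, and uses the core_resource_group_name output from the example above:

variables:
  - group: 'Platform-VG'

steps:
  - script: echo 'Core resource group is $(core_resource_group_name)'
    displayName: 'Use a Terraform output stored in the variable group'

The rest of this recipe focuses on the producing side: creating and updating that variable group from the Terraform pipeline.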
For this purpose, we are using the following commands: terraform output to extract the value of an output variable from the state file (provided by Terraform CLI ) az pipelines variable-group to manage variable groups (provided by Azure DevOps CLI ) You can use the following script once terraform apply is completed to create/update the variable group.","title":"Solution"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#script-update-variablegroupsh","text":"","title":"Script (update-variablegroup.sh)"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#parameters","text":"Name Description DEVOPS_ORGANIZATION The URI of the Azure DevOps organization. DEVOPS_PROJECT The name or id of the Azure DevOps project. GROUP_NAME The name of the variable group targeted. Implementation choices: If a variable group already exists, a valid option could be to delete and rebuild the group from scratch. However, as authorization could have been updated at the group level, we prefer to avoid this option. The script remove instead all variables in the targeted group and add them back with latest values. Permissions are not impacted. A variable group cannot be empty. It must contains at least one variable. A temporary uuid value is created to mitigate this issue, and removed once variables are updated. #!/bin/bash set -e export DEVOPS_ORGANIZATION = $1 export DEVOPS_PROJECT = $2 export GROUP_NAME = $3 # configure the azure devops cli az devops configure --defaults organization = ${ DEVOPS_ORGANIZATION } project = ${ DEVOPS_PROJECT } --use-git-aliases true # get the variable group id (if already exists) group_id = $( az pipelines variable-group list --group-name ${ GROUP_NAME } --query '[0].id' -o json ) if [ -z \" ${ group_id } \" ] ; then # create a new variable group tf_output = $( terraform output -json | jq -r 'to_entries[] | \"\\(.key)=\\(.value.value)\"' ) az pipelines variable-group create --name ${ GROUP_NAME } --variables ${ tf_output } --authorize true else # get existing variables var_list = $( az pipelines variable-group variable list --group-id ${ group_id } ) # add temporary uuid variable (a variable group cannot be empty) uuid = $( cat /proc/sys/kernel/random/uuid ) az pipelines variable-group variable create --group-id ${ group_id } --name ${ uuid } # delete existing variables for row in $( echo ${ var_list } | jq -r 'to_entries[] | \"\\(.key)\"' ) ; do az pipelines variable-group variable delete --group-id ${ group_id } --name ${ row } --yes done # create variables with latest values (from terraform) for row in $( terraform output -json | jq -c 'to_entries[]' ) ; do _jq () { echo ${ row } | jq -r ${ 1 } } az pipelines variable-group variable create --group-id ${ group_id } --name $( _jq '.key' ) --value $( _jq '.value.value' ) --secret $( _jq '.value.sensitive' ) done # delete temporary uuid variable az pipelines variable-group variable delete --group-id ${ group_id } --name ${ uuid } --yes fi","title":"Parameters"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#authenticate-with-azure-devops","text":"Most commands used in previous script interact with Azure DevOps and do require authentication. You can authenticate using the System.AccessToken security token used by the running pipeline, by assigning it to an environment variable named AZURE_DEVOPS_EXT_PAT , as shown in the following example (see Azure DevOps CLI in Azure Pipeline YAML for additional information). 
In addition, you can notice we are also using predefined variables to target the Azure DevOps organization and project (respectively System.TeamFoundationCollectionUri and System.TeamProjectId ). - task : Bash@3 displayName : 'Update variable group using terraform outputs' inputs : targetType : filePath arguments : $(System.TeamFoundationCollectionUri) $(System.TeamProjectId) \"Platform-VG\" workingDirectory : $(terraformDirectory) filePath : $(scriptsDirectory)/update-variablegroup.sh env : AZURE_DEVOPS_EXT_PAT : $(System.AccessToken) System variables Description System.AccessToken Special variable that carries the security token used by the running build. System.TeamFoundationCollectionUri The URI of the Azure DevOps organization. System.TeamProjectId The ID of the project that this build belongs to.","title":"Authenticate with Azure DevOps"},{"location":"CI-CD/recipes/terraform/save-output-to-variable-group/#library-security","text":"Roles are defined for Library items, and membership of these roles governs the operations you can perform on those items. Role for library item Description Reader Can view the item. User Can use the item when authoring build or release pipelines. For example, you must be a 'User' for a variable group to use it in a release pipeline. Administrator Can also manage membership of all other roles for the item. The user who created an item gets automatically added to the Administrator role for that item. By default, the following groups get added to the Administrator role of the library: Build Administrators, Release Administrators, and Project Administrators. Creator Can create new items in the library, but this role doesn't include Reader or User permissions. The Creator role can't manage permissions for other users. When using System.AccessToken , service account <ProjectName> Build Service identity will be used to access the Library. Please ensure in Pipelines > Library > Security section that this service account has Administrator role at the Library or Variable Group level to create/update/delete variables (see. Library of assets for additional information).","title":"Library security"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/","text":"Sharing Common Variables / Naming Conventions Between Terraform Modules What are we Trying to Solve? When deploying infrastructure using code, it's common practice to split the code into different modules that are responsible for the deployment of a part or a component of the infrastructure. In Terraform, this can be done by using modules . In this case, it is useful to be able to share some common variables as well as centralize naming conventions of the different resources, to ensure it will be easy to refactor when it has to change, despite the dependencies that exist between modules. For example, let's consider 2 modules: Network module, responsible for deploying Virtual Network, Subnets, NSGs and Private DNS Zones Azure Kubernetes Service module responsible for deploying AKS cluster There are dependencies between these modules, like the Kubernetes cluster that will be deployed into the virtual network from the Network module. To do that, it must reference the name of the virtual network, as well as the resource group it is deployed in. And ideally, we would like these dependencies to be loosely coupled, as much as possible, to keep agility in how the modules are deployed and keep independent lifecycle. This page explains a way to solve this with Terraform. How to Do It? 
Context Let's consider the following structure for our modules: modules \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf Now, assume that you deploy a virtual network for the development environment, with the following properties: name: vnet-dev resource group: rg-dev-network Then at some point, you need to inject these values into the Kubernetes module, to get a reference to it through a data source, for example: data \"azurem_virtual_network\" \"vnet\" { name = var.vnet_name resource_group_name = var.vnet_rg_name } In the snippet above, the virtual network name and resource group are defined through variable. This is great, but if this changes in the future, then the values of these variables must change too. In every module they are used. Being able to manage naming in a central place will make sure the code can easily be refactored in the future, without updating all modules. About Terraform Variables In Terraform, every input variable must be defined at the configuration (or module) level, using the variable block. By convention, this is often done in a variables.tf file, in the module. This file contains variable declaration and default values. Values can be set using variables configuration files (.tfvars), environment variables or CLI arg when using the terraform plan or apply commands. One of the limitation of the variables declaration is that it's not possible to compose variables, locals or Terraform built-in functions are used for that. Common Terraform Module One way to bypass this limitations is to introduce a \"common\" module, that will not deploy any resources, but just compute / calculate and output the resource names and shared variables, and be used by all other modules, as a dependency. modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf variables.tf: variable \"environment_name\" { type = string description = \"The name of the environment.\" } variable \"location\" { type = string description = \"The Azure region where the resources will be created. 
Default is westeurope.\" default = \"westeurope\" } output.tf: # Shared variables output \"location\" { value = var.location } output \"subscription\" { value = var.subscription } # Virtual Network Naming output \"vnet_rg_name\" { value = \"rg-network-${var.environment_name}\" } output \"vnet_name\" { value = \"vnet-${var.environment_name}\" } # AKS Naming output \"aks_rg_name\" { value = \"rg-aks-${var.environment_name}\" } output \"aks_name\" { value = \"aks-${var.environment_name}\" } Now, if you execute the Terraform apply for the common module, you get all the shared/common variables in outputs: $ terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" Changes to Outputs: + aks_name = \"aks-dev\" + aks_rg_name = \"rg-aks-dev\" + location = \"westeurope\" + subscription = \"01010101-1010-0101-1010-010101010101\" + vnet_name = \"vnet-dev\" + vnet_rg_name = \"rg-network-dev\" You can apply this plan to save these new output values to the Terraform state, without changing any real infrastructure. Use the Common Terraform Module Using the common Terraform module in any other module is super easy. For example, this is what you can do in the Azure Kubernetes module main.tf file: module \"common\" { source = \"../common\" environment_name = var.environment_name subscription = var.subscription } data \"azurerm_subnet\" \"aks_subnet\" { name = \"AksSubnet\" virtual_network_name = module.common.vnet_name resource_group_name = module.common.vnet_rg_name } resource \"azurerm_kubernetes_cluster\" \"aks\" { name = module.common.aks_name resource_group_name = module.common.aks_rg_name location = module.common.location dns_prefix = module.common.aks_name identity { type = \"SystemAssigned\" } default_node_pool { name = \"default\" vm_size = \"Standard_DS2_v2\" vnet_subnet_id = data.azurerm_subnet.aks_subnet.id } } Then, you can execute the terraform plan and terraform apply commands to deploy! terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" data.azurerm_subnet.aks_subnet: Reading... data.azurerm_subnet.aks_subnet: Read complete after 1s [ id = /subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet ] Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols: + create Terraform will perform the following actions: # azurerm_kubernetes_cluster.aks will be created + resource \"azurerm_kubernetes_cluster\" \"aks\" { + dns_prefix = \"aks-dev\" + fqdn = ( known after apply ) + id = ( known after apply ) + kube_admin_config = ( known after apply ) + kube_admin_config_raw = ( sensitive value ) + kube_config = ( known after apply ) + kube_config_raw = ( sensitive value ) + kubernetes_version = ( known after apply ) + location = \"westeurope\" + name = \"aks-dev\" + node_resource_group = ( known after apply ) + portal_fqdn = ( known after apply ) + private_cluster_enabled = ( known after apply ) + private_cluster_public_fqdn_enabled = false + private_dns_zone_id = ( known after apply ) + private_fqdn = ( known after apply ) + private_link_enabled = ( known after apply ) + public_network_access_enabled = true + resource_group_name = \"rg-aks-dev\" + sku_tier = \"Free\" [ ... 
] truncated + default_node_pool { + kubelet_disk_type = ( known after apply ) + max_pods = ( known after apply ) + name = \"default\" + node_count = ( known after apply ) + node_labels = ( known after apply ) + orchestrator_version = ( known after apply ) + os_disk_size_gb = ( known after apply ) + os_disk_type = \"Managed\" + os_sku = ( known after apply ) + type = \"VirtualMachineScaleSets\" + ultra_ssd_enabled = false + vm_size = \"Standard_DS2_v2\" + vnet_subnet_id = \"/subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet\" } + identity { + principal_id = ( known after apply ) + tenant_id = ( known after apply ) + type = \"SystemAssigned\" } [ ... ] truncated } Plan: 1 to add, 0 to change, 0 to destroy. Note: the usage of a common module is also valid if you decide to deploy all your modules in the same operation from a main Terraform configuration file, like: module \"common\" { source = \"./common\" environment_name = var.environment_name subscription = var.subscription } module \"network\" { source = \"./network\" vnet_name = module.common.vnet_name vnet_rg_name = module.common.vnet_rg_name } module \"kubernetes\" { source = \"./kubernetes\" aks_name = module.common.aks_name aks_rg = module.common.aks_rg_name } Centralize Input Variables Definitions In case you chose to define variables values directly in the source control (e.g. gitops scenario) using variables definitions files ( .tfvars ), having a common module will also help to not have to duplicate the common variables definitions in all modules. Indeed, it is possible to have a global file that is defined once, at the common module level, and merge it with a module-specific variables definitions files at Terraform plan or apply time. Let's consider the following structure: modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf The common module as well as all other modules contain variables files for dev and prod environment. tfvars files from the common module will define all the global variables that will be shared with other modules (like subscription, environment name, etc.) and .tfvars files of each module will define only the module-specific values. Then, it's possible to merge these files when running the terraform apply or terraform plan command, using the following syntax: terraform plan -var-file = < ( cat ../common/dev.tfvars ./dev.tfvars ) Note: using this, it is really important to ensure that you have not the same variable names in both files, otherwise that will generate an error. Conclusion By having a common module that owns shared variables as well as naming convention, it is now easier to refactor your Terraform configuration code base. 
Imagine that for some reason you need change the pattern that is used for the virtual network name: you change it in the common module output files, and just have to re-apply all modules!","title":"Sharing Common Variables / Naming Conventions Between Terraform Modules"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#sharing-common-variables-naming-conventions-between-terraform-modules","text":"","title":"Sharing Common Variables / Naming Conventions Between Terraform Modules"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#what-are-we-trying-to-solve","text":"When deploying infrastructure using code, it's common practice to split the code into different modules that are responsible for the deployment of a part or a component of the infrastructure. In Terraform, this can be done by using modules . In this case, it is useful to be able to share some common variables as well as centralize naming conventions of the different resources, to ensure it will be easy to refactor when it has to change, despite the dependencies that exist between modules. For example, let's consider 2 modules: Network module, responsible for deploying Virtual Network, Subnets, NSGs and Private DNS Zones Azure Kubernetes Service module responsible for deploying AKS cluster There are dependencies between these modules, like the Kubernetes cluster that will be deployed into the virtual network from the Network module. To do that, it must reference the name of the virtual network, as well as the resource group it is deployed in. And ideally, we would like these dependencies to be loosely coupled, as much as possible, to keep agility in how the modules are deployed and keep independent lifecycle. This page explains a way to solve this with Terraform.","title":"What are we Trying to Solve?"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#how-to-do-it","text":"","title":"How to Do It?"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#context","text":"Let's consider the following structure for our modules: modules \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf Now, assume that you deploy a virtual network for the development environment, with the following properties: name: vnet-dev resource group: rg-dev-network Then at some point, you need to inject these values into the Kubernetes module, to get a reference to it through a data source, for example: data \"azurem_virtual_network\" \"vnet\" { name = var.vnet_name resource_group_name = var.vnet_rg_name } In the snippet above, the virtual network name and resource group are defined through variable. This is great, but if this changes in the future, then the values of these variables must change too. In every module they are used. Being able to manage naming in a central place will make sure the code can easily be refactored in the future, without updating all modules.","title":"Context"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#about-terraform-variables","text":"In Terraform, every input variable must be defined at the configuration (or module) level, using the variable block. By convention, this is often done in a variables.tf file, in the module. 
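As a small generic illustration (the default shown here is purely illustrative and not taken from the modules in this page), a declaration in such a variables.tf file looks like this:

variable \"environment_name\" {
  type        = string
  description = \"The name of the environment (for example dev or prod).\"
  default     = \"dev\"
}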
This file contains variable declaration and default values. Values can be set using variables configuration files (.tfvars), environment variables or CLI arg when using the terraform plan or apply commands. One of the limitation of the variables declaration is that it's not possible to compose variables, locals or Terraform built-in functions are used for that.","title":"About Terraform Variables"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#common-terraform-module","text":"One way to bypass this limitations is to introduce a \"common\" module, that will not deploy any resources, but just compute / calculate and output the resource names and shared variables, and be used by all other modules, as a dependency. modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf variables.tf: variable \"environment_name\" { type = string description = \"The name of the environment.\" } variable \"location\" { type = string description = \"The Azure region where the resources will be created. Default is westeurope.\" default = \"westeurope\" } output.tf: # Shared variables output \"location\" { value = var.location } output \"subscription\" { value = var.subscription } # Virtual Network Naming output \"vnet_rg_name\" { value = \"rg-network-${var.environment_name}\" } output \"vnet_name\" { value = \"vnet-${var.environment_name}\" } # AKS Naming output \"aks_rg_name\" { value = \"rg-aks-${var.environment_name}\" } output \"aks_name\" { value = \"aks-${var.environment_name}\" } Now, if you execute the Terraform apply for the common module, you get all the shared/common variables in outputs: $ terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" Changes to Outputs: + aks_name = \"aks-dev\" + aks_rg_name = \"rg-aks-dev\" + location = \"westeurope\" + subscription = \"01010101-1010-0101-1010-010101010101\" + vnet_name = \"vnet-dev\" + vnet_rg_name = \"rg-network-dev\" You can apply this plan to save these new output values to the Terraform state, without changing any real infrastructure.","title":"Common Terraform Module"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#use-the-common-terraform-module","text":"Using the common Terraform module in any other module is super easy. For example, this is what you can do in the Azure Kubernetes module main.tf file: module \"common\" { source = \"../common\" environment_name = var.environment_name subscription = var.subscription } data \"azurerm_subnet\" \"aks_subnet\" { name = \"AksSubnet\" virtual_network_name = module.common.vnet_name resource_group_name = module.common.vnet_rg_name } resource \"azurerm_kubernetes_cluster\" \"aks\" { name = module.common.aks_name resource_group_name = module.common.aks_rg_name location = module.common.location dns_prefix = module.common.aks_name identity { type = \"SystemAssigned\" } default_node_pool { name = \"default\" vm_size = \"Standard_DS2_v2\" vnet_subnet_id = data.azurerm_subnet.aks_subnet.id } } Then, you can execute the terraform plan and terraform apply commands to deploy! 
terraform plan -var environment_name = \"dev\" -var subscription = \" $( az account show --query id -o tsv ) \" data.azurerm_subnet.aks_subnet: Reading... data.azurerm_subnet.aks_subnet: Read complete after 1s [ id = /subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet ] Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols: + create Terraform will perform the following actions: # azurerm_kubernetes_cluster.aks will be created + resource \"azurerm_kubernetes_cluster\" \"aks\" { + dns_prefix = \"aks-dev\" + fqdn = ( known after apply ) + id = ( known after apply ) + kube_admin_config = ( known after apply ) + kube_admin_config_raw = ( sensitive value ) + kube_config = ( known after apply ) + kube_config_raw = ( sensitive value ) + kubernetes_version = ( known after apply ) + location = \"westeurope\" + name = \"aks-dev\" + node_resource_group = ( known after apply ) + portal_fqdn = ( known after apply ) + private_cluster_enabled = ( known after apply ) + private_cluster_public_fqdn_enabled = false + private_dns_zone_id = ( known after apply ) + private_fqdn = ( known after apply ) + private_link_enabled = ( known after apply ) + public_network_access_enabled = true + resource_group_name = \"rg-aks-dev\" + sku_tier = \"Free\" [ ... ] truncated + default_node_pool { + kubelet_disk_type = ( known after apply ) + max_pods = ( known after apply ) + name = \"default\" + node_count = ( known after apply ) + node_labels = ( known after apply ) + orchestrator_version = ( known after apply ) + os_disk_size_gb = ( known after apply ) + os_disk_type = \"Managed\" + os_sku = ( known after apply ) + type = \"VirtualMachineScaleSets\" + ultra_ssd_enabled = false + vm_size = \"Standard_DS2_v2\" + vnet_subnet_id = \"/subscriptions/01010101-1010-0101-1010-010101010101/resourceGroups/rg-network-dev/providers/Microsoft.Network/virtualNetworks/vnet-dev/subnets/AksSubnet\" } + identity { + principal_id = ( known after apply ) + tenant_id = ( known after apply ) + type = \"SystemAssigned\" } [ ... ] truncated } Plan: 1 to add, 0 to change, 0 to destroy. Note: the usage of a common module is also valid if you decide to deploy all your modules in the same operation from a main Terraform configuration file, like: module \"common\" { source = \"./common\" environment_name = var.environment_name subscription = var.subscription } module \"network\" { source = \"./network\" vnet_name = module.common.vnet_name vnet_rg_name = module.common.vnet_rg_name } module \"kubernetes\" { source = \"./kubernetes\" aks_name = module.common.aks_name aks_rg = module.common.aks_rg_name }","title":"Use the Common Terraform Module"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#centralize-input-variables-definitions","text":"In case you chose to define variables values directly in the source control (e.g. gitops scenario) using variables definitions files ( .tfvars ), having a common module will also help to not have to duplicate the common variables definitions in all modules. Indeed, it is possible to have a global file that is defined once, at the common module level, and merge it with a module-specific variables definitions files at Terraform plan or apply time. 
Let's consider the following structure: modules \u251c\u2500\u2500 common \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 output.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 kubernetes \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf \u251c\u2500\u2500 network \u2502 \u251c\u2500\u2500 dev.tfvars \u2502 \u251c\u2500\u2500 prod.tfvars \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u2514\u2500\u2500 variables.tf The common module as well as all other modules contain variables files for dev and prod environment. tfvars files from the common module will define all the global variables that will be shared with other modules (like subscription, environment name, etc.) and .tfvars files of each module will define only the module-specific values. Then, it's possible to merge these files when running the terraform apply or terraform plan command, using the following syntax: terraform plan -var-file = < ( cat ../common/dev.tfvars ./dev.tfvars ) Note: using this, it is really important to ensure that you have not the same variable names in both files, otherwise that will generate an error.","title":"Centralize Input Variables Definitions"},{"location":"CI-CD/recipes/terraform/share-common-variables-naming-conventions/#conclusion","text":"By having a common module that owns shared variables as well as naming convention, it is now easier to refactor your Terraform configuration code base. Imagine that for some reason you need change the pattern that is used for the virtual network name: you change it in the common module output files, and just have to re-apply all modules!","title":"Conclusion"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/","text":"Guidelines on Structuring and Testing the Terraform Configuration Context When creating an infrastructure configuration, it is important to follow a consistent and organized structure to ensure maintainability, scalability and reusability of the code. The goal of this section is to briefly describe how to structure your Terraform configuration in order to achieve this. Structuring the Terraform Configuration The recommended structure is as follows: Place each component you want to configure in its own module folder. Analyze your infrastructure code and identify the logical components that can be separated into reusable modules. This will give you a clear separation of concerns and will make it straight forward to include new resources, update existing ones or reuse them in the future. For more details on modules and when to use them, see the Terraform guidance . Place the .tf module files at the root of each folder and make sure to include a README file in a markdown format which can be automatically generated based on the module code. It's recommended to follow this approach as this file structure will be automatically picked up by the Terraform Registry . Use a consistent set of files to structure your modules. While this can vary depending on the specific needs of the project, one good example can be the following: provider.tf : defines the list of providers according to the plugins used data.tf : defines information read from different data sources main.tf : defines the infrastructure objects needed for your configuration (e.g. 
resource group, role assignment, container registry) backend.tf : backend configuration file outputs.tf : defines structured data that is exported variables.tf : defines static, reusable values Include in each module sub folders for documentation, examples and tests. The documentation includes basic information about the module: what is it installing, what are the options, an example use case and so on. You can also add here any other relevant details you might have. The example folder can include one or more examples of how to use the module, each example having the same set of configuration files decided on the previous step. It's recommended to also include a README providing a clear understanding of how it can be used in practice. The tests folder includes one or more files to test the example module together with a documentation file with instructions on how these tests can be executed . Place the root module in a separate folder called main : this is the primary entry point for the configuration. Like for the other modules, it will contain its corresponding configuration files. An example configuration structure obtained using the guidelines above is: modules \u251c\u2500\u2500 mlops \u2502 \u251c\u2500\u2500 doc \u2502 \u251c\u2500\u2500 example \u2502 \u251c\u2500\u2500 test \u2502 \u251c\u2500\u2500 backend.tf \u2502 \u251c\u2500\u2500 data.tf \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 outputs.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u251c\u2500\u2500 variables.tf \u2502 \u251c\u2500\u2500 README.md \u251c\u2500\u2500 common \u251c\u2500\u2500 main Testing the Configuration To test Terraform configurations, the Terratest library is utilized. A comprehensive guide to best practices with Terratest, including unit tests, integration tests, and end-to-end tests, is available for reference here . Types of tests Unit Test for Module / Resource : Write unit tests for individual modules / resources to ensure that each module behaves as expected in isolation. They are particularly valuable in larger, more complex Terraform configurations where individual modules can be reused and are generally quicker in terms of execution time. Integration Test : These tests verify that the different modules and resources work together as intended. For simple Terraform configurations, extensive unit testing might be overkill. Integration tests might be sufficient in such cases. However, as the complexity grows, unit tests become more valuable. Key aspects to consider Syntax and validation : Use terraform fmt and terraform validate to check the syntax and validate the Terraform configuration during development or in the deployment script / pipeline. This ensures that the configuration is correctly formatted and free of syntax errors. Deployment and existence : Terraform providers, like the Azure provider, perform certain checks during the execution of terraform apply. If Terraform successfully applies a configuration, it typically means that the specified resources were created or modified as expected. In your code you can skip this validation and focus on particular resource configurations that are more critical, described in the next points. Resource properties that can break the functionality : The expectation here is that we're not interested in testing each property of a resource, but to identify the ones that could cause an issue in the system if they are changed, such as access or network policies, service principal permissions and others. 
Validation of Key Vault contents : Ensuring the presence of necessary keys, certificates, or secrets in the Azure Key Vault that are stored as part of resource configuration. Properties that can influence the cost or location : This can be achieved by asserting the locations, service tiers, storage settings, depending on the properties available for the resources. Naming Convention When naming Terraform variables, it's essential to use clear and consistent naming conventions that are easy to understand and follow. The general convention is to use lowercase letters and numbers, with underscores instead of dashes, for example: \"azurerm_resource_group\". When naming resources, start with the provider's name, followed by the target resource, separated by underscores. For instance, \"azurerm_postgresql_server\" is an appropriate name for an Azure provider resource. When it comes to data sources, use a similar naming convention, but make sure to use plural names for lists of items. For example, \"azurerm_resource_groups\" is a good name for a data source that represents a list of resource groups. Variable and output names should be descriptive and reflect the purpose or use of the variable. It's also helpful to group related items together using a common prefix. For example, all variables related to storage accounts could start with \"storage_\". Keep in mind that outputs should be understandable outside of their scope. A useful naming pattern to follow is \"{name}_{attribute}\", where \"name\" represents a resource or data source name, and \"attribute\" is the attribute returned by the output. For example, \"storage_primary_connection_string\" could be a valid output name. Make sure you include a description for outputs and variables, as well as marking the values as 'default' or 'sensitive' when the case. This information will be captured in the generated documentation. Generating the Documentation The documentation can be automatically generated based on the configuration code in your modules with the help of terraform-docs . To generate the Terraform module documentation, go to the module folder and enter this command: terraform-docs markdown table --output-file README.md --output-mode inject . Then, the documentation will be generated inside the component root directory. Conclusion The approach presented in this section is designed to be flexible and easy to use, making it straight forward to add new resources or update existing ones. The separation of concerns also makes it easy to reuse existing components in other projects, with all the information (modules, examples, documentation and tests) located in one place. Resources Terraform-docs Terraform Registry Terraform Module Guidance Terratest Testing HashiCorp Terraform Build Infrastructure - Terraform Azure Example","title":"Guidelines on Structuring and Testing the Terraform Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#guidelines-on-structuring-and-testing-the-terraform-configuration","text":"","title":"Guidelines on Structuring and Testing the Terraform Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#context","text":"When creating an infrastructure configuration, it is important to follow a consistent and organized structure to ensure maintainability, scalability and reusability of the code. 
The goal of this section is to briefly describe how to structure your Terraform configuration in order to achieve this.","title":"Context"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#structuring-the-terraform-configuration","text":"The recommended structure is as follows: Place each component you want to configure in its own module folder. Analyze your infrastructure code and identify the logical components that can be separated into reusable modules. This will give you a clear separation of concerns and will make it straight forward to include new resources, update existing ones or reuse them in the future. For more details on modules and when to use them, see the Terraform guidance . Place the .tf module files at the root of each folder and make sure to include a README file in a markdown format which can be automatically generated based on the module code. It's recommended to follow this approach as this file structure will be automatically picked up by the Terraform Registry . Use a consistent set of files to structure your modules. While this can vary depending on the specific needs of the project, one good example can be the following: provider.tf : defines the list of providers according to the plugins used data.tf : defines information read from different data sources main.tf : defines the infrastructure objects needed for your configuration (e.g. resource group, role assignment, container registry) backend.tf : backend configuration file outputs.tf : defines structured data that is exported variables.tf : defines static, reusable values Include in each module sub folders for documentation, examples and tests. The documentation includes basic information about the module: what is it installing, what are the options, an example use case and so on. You can also add here any other relevant details you might have. The example folder can include one or more examples of how to use the module, each example having the same set of configuration files decided on the previous step. It's recommended to also include a README providing a clear understanding of how it can be used in practice. The tests folder includes one or more files to test the example module together with a documentation file with instructions on how these tests can be executed . Place the root module in a separate folder called main : this is the primary entry point for the configuration. Like for the other modules, it will contain its corresponding configuration files. An example configuration structure obtained using the guidelines above is: modules \u251c\u2500\u2500 mlops \u2502 \u251c\u2500\u2500 doc \u2502 \u251c\u2500\u2500 example \u2502 \u251c\u2500\u2500 test \u2502 \u251c\u2500\u2500 backend.tf \u2502 \u251c\u2500\u2500 data.tf \u2502 \u251c\u2500\u2500 main.tf \u2502 \u251c\u2500\u2500 outputs.tf \u2502 \u251c\u2500\u2500 provider.tf \u2502 \u251c\u2500\u2500 variables.tf \u2502 \u251c\u2500\u2500 README.md \u251c\u2500\u2500 common \u251c\u2500\u2500 main","title":"Structuring the Terraform Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#testing-the-configuration","text":"To test Terraform configurations, the Terratest library is utilized. 
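To make the Terratest-based approach concrete, below is a minimal Go sketch of a test file that could live in a module's tests folder. It is not taken from the playbook's modules: the example folder path (`../example`), the `environment` input variable, the `resource_group_name` output, and the substring being asserted are all illustrative placeholders, assumed here only to show the shape of a fast plan-only check next to a full apply-and-assert check.

```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

// newOptions points Terratest at the example configuration that exercises
// the module under test. The folder name and input variables are
// illustrative placeholders, not part of this playbook's modules.
func newOptions(t *testing.T) *terraform.Options {
	return terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../example",
		Vars: map[string]interface{}{
			"environment": "test", // hypothetical input variable
		},
	})
}

// TestModulePlan is a fast, unit-style check: it only runs `terraform init`
// and `terraform plan`, catching syntax and reference errors without
// creating any cloud resources.
func TestModulePlan(t *testing.T) {
	terraform.InitAndPlan(t, newOptions(t))
}

// TestModuleApply is an integration-style check: it applies the example,
// asserts one property that would break the system if it changed, and
// always destroys what it created.
func TestModuleApply(t *testing.T) {
	options := newOptions(t)
	defer terraform.Destroy(t, options)

	terraform.InitAndApply(t, options)

	// The output name and expected substring are illustrative; assert the
	// critical properties of your own module here (naming, policies, etc.).
	rgName := terraform.Output(t, options, "resource_group_name")
	assert.Contains(t, rgName, "test")
}
```

Such tests are run from the tests folder with `go test` (a generous `-timeout` is usually needed for apply-based tests), and the apply-based test requires valid cloud credentials in the environment, for example an Azure login or service principal configured for the azurerm provider.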
A comprehensive guide to best practices with Terratest, including unit tests, integration tests, and end-to-end tests, is available for reference here .","title":"Testing the Configuration"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#types-of-tests","text":"Unit Test for Module / Resource : Write unit tests for individual modules / resources to ensure that each module behaves as expected in isolation. They are particularly valuable in larger, more complex Terraform configurations where individual modules can be reused and are generally quicker in terms of execution time. Integration Test : These tests verify that the different modules and resources work together as intended. For simple Terraform configurations, extensive unit testing might be overkill. Integration tests might be sufficient in such cases. However, as the complexity grows, unit tests become more valuable.","title":"Types of tests"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#key-aspects-to-consider","text":"Syntax and validation : Use terraform fmt and terraform validate to check the syntax and validate the Terraform configuration during development or in the deployment script / pipeline. This ensures that the configuration is correctly formatted and free of syntax errors. Deployment and existence : Terraform providers, like the Azure provider, perform certain checks during the execution of terraform apply. If Terraform successfully applies a configuration, it typically means that the specified resources were created or modified as expected. In your code you can skip this validation and focus on particular resource configurations that are more critical, described in the next points. Resource properties that can break the functionality : The expectation here is that we're not interested in testing each property of a resource, but to identify the ones that could cause an issue in the system if they are changed, such as access or network policies, service principal permissions and others. Validation of Key Vault contents : Ensuring the presence of necessary keys, certificates, or secrets in the Azure Key Vault that are stored as part of resource configuration. Properties that can influence the cost or location : This can be achieved by asserting the locations, service tiers, storage settings, depending on the properties available for the resources.","title":"Key aspects to consider"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#naming-convention","text":"When naming Terraform variables, it's essential to use clear and consistent naming conventions that are easy to understand and follow. The general convention is to use lowercase letters and numbers, with underscores instead of dashes, for example: \"azurerm_resource_group\". When naming resources, start with the provider's name, followed by the target resource, separated by underscores. For instance, \"azurerm_postgresql_server\" is an appropriate name for an Azure provider resource. When it comes to data sources, use a similar naming convention, but make sure to use plural names for lists of items. For example, \"azurerm_resource_groups\" is a good name for a data source that represents a list of resource groups. Variable and output names should be descriptive and reflect the purpose or use of the variable. It's also helpful to group related items together using a common prefix. For example, all variables related to storage accounts could start with \"storage_\". 
Keep in mind that outputs should be understandable outside of their scope. A useful naming pattern to follow is \"{name}_{attribute}\", where \"name\" represents a resource or data source name, and \"attribute\" is the attribute returned by the output. For example, \"storage_primary_connection_string\" could be a valid output name. Make sure you include a description for outputs and variables, as well as marking the values as 'default' or 'sensitive' when the case. This information will be captured in the generated documentation.","title":"Naming Convention"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#generating-the-documentation","text":"The documentation can be automatically generated based on the configuration code in your modules with the help of terraform-docs . To generate the Terraform module documentation, go to the module folder and enter this command: terraform-docs markdown table --output-file README.md --output-mode inject . Then, the documentation will be generated inside the component root directory.","title":"Generating the Documentation"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#conclusion","text":"The approach presented in this section is designed to be flexible and easy to use, making it straight forward to add new resources or update existing ones. The separation of concerns also makes it easy to reuse existing components in other projects, with all the information (modules, examples, documentation and tests) located in one place.","title":"Conclusion"},{"location":"CI-CD/recipes/terraform/terraform-structure-guidelines/#resources","text":"Terraform-docs Terraform Registry Terraform Module Guidance Terratest Testing HashiCorp Terraform Build Infrastructure - Terraform Azure Example","title":"Resources"},{"location":"UI-UX/","text":"User Interface and User Experience Engineering Also known as UI/UX , Front End Development , or Web Development , user interface and user experience engineering is a broad topic and encompasses many different aspects of modern application development. When a user interface is required, ISE primarily develops a web application . Web apps can be built in a variety of ways with many different tools. Goal The goal of the User Interface section is to provide guidance on developing web applications. Everyone should begin by reading the General Guidance for a quick introduction to the four main aspects of every web application project. From there, readers are encouraged to dive deeper into each topic, or begin reviewing technical guidance that pertains to their engagement. All UI/UX projects should begin with a detailed design document. Review the Design Process section for more details, and a template to get started. Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience. General Guidance The state of web platform engineering is fast moving. There is no one-size-fits-all solution. For any team to be successful in building a UI, they need to have an understanding of the higher-level aspects of all UI project. Accessibility - ensuring your application is usable and enjoyed by as many people as possible is at the heart of accessibility and inclusive design. Usability - how effortless should it be for any given user to use the application? Do they need special training or a document to understand how to use it, or will it be intuitive? 
Maintainability - is the application just a proof of concept to showcase an idea for future work, or will it be an MVP and act as the starting point for a larger, production-ready application? Sometimes you don't need React or any other framework. Sometimes you need React, but not all the bells and whistles from create-react-app. Understanding project maintainability requirements can simplify an engagement\u2019s tooling needs significantly and let folks iterate without headaches. Stability - what is the cost of adding a dependency? Is it actively stable/updated/maintained? If not, can you afford the tech debt (sometimes the answer can be yes!)? Could you get 90% of the way there without adding another dependency? More information is available for each general guidance section in the corresponding pages. Design Process All user interface applications begin with the design process. The true definition for \"the design process\" is ever changing and highly opinion based as well. This sections aims to deliver a general overview of a design process any engineering team could conduct when starting an UI application engagement. When committing to a UI/UX project, be certain to not over-promise on the web application requirements. Delivering a production-ready application involves a large number of engineering complexities resulting in a very long timeline. Always start with a proof-of-concept or minimum-viable-product first. These projects can easily be achieved within a couple month timeline (and sometimes even less). The first step in the design process is to understand the problem at hand and outline what the solution should achieve. Commonly referred to as Desired Outcomes , the output of this first step should be a generalized list of outcomes that the solution will accomplish. Consider the following example: A public library has a set of data containing information about its collection. The data stores text, images, and the status of a book (borrowed, available, reserved). The library librarian wants to share this data with its users. As the librarian, I want to notify users before they receive late penalties for overdue books As the librarian, I want to notify users when a book they have reserved becomes available With the desired outcomes in mind, the next step in the design process is to define user personas. Regardless of the solution for a given problem, understanding the user needs leads to a better understanding of feature development and technological choices. Personas are written as prose-like paragraphs that describe different types of users. Considering the previous example, the various user personas could be: An individual with no disabilities, but is unfamiliar with using software interfaces An individual with no disabilities, and is familiar with using software interfaces An individual with disabilities, and is unfamiliar with using software interfaces (with or without the use of accessibility tooling) An individual with disabilities, but familiar with using software interfaces through the use of accessibility tooling After defining these personas it is clear that whatever the solution is, it requires a lot of accessibility and user experience design work. Sometimes personas can be simpler than this, but always include disabled users . Even when a user set is predefined as a group of individuals without disabilities, there is no guarantee that the user set will remain that way. 
After defining the desired outcomes as well as the personas , the next step in the design process is to begin conducting Trade Studies for potential solutions. The first trade study should be high-level and solution oriented. It will utilize the results of previous steps and propose multiple solutions for achieving the desired outcomes with the listed personas in mind. Continuing with the library example, this first trade study may compare various application solutions such as automated emails or text messages, an RSS feed, or an user interface application. There are pros and cons for each solution both from an user experience and a developer experience perspective, but at this stage it is important to focus on the users. After arriving on the best solution, the next trade study can dive into different implementation methods. It is in this subsequent trade studies that developer experience becomes more important. The benefit of building software applications is that there are truly infinite ways to build something. A team can use the latest shiny tools, or they can utilize the tried-and-tested ones. It is for this reason that focussing completely on the user until a solution is defined is better than obsessing over technology choices. Within ISE, we often reach for tools such as the React framework. React is a great tool when wielded by an experienced team. Otherwise, it can create more hurdles than it is worth. Keep in mind that even if you feel capable with React, the rest of your team and your customer's dev team needs to as well. Some other great options to consider when building a proof-of-concept or minimum-viable-product are: HTML/CSS/JavaScript Back to the basics! Start with a single index.html , include a popular CSS framework such as Bootstrap using their CDN link, and start prototyping! Rarely will you have to support legacy browsers; thus, you can rely on modern JavaScript language features! No need for build tools or even TypeScript (did you know you can type check JavaScript ). Web Component frameworks Web Components are now standardized in all modern browsers Microsoft has their own, stable & actively-maintained framework, Fast For more information of choosing the right implementation tool, read the Recommended Technologies document. Continue reading the Trade Study section of this site for more information on completing this step in the design process. After iterating through multiple trade study documents, this design process can be considered complete! With an agreed upon solution and implementation in mind, it is now time to begin development. A natural continuation of the design process is to get users (or stakeholders) involved as early as possible. Constantly look for design and usability feedback, and utilize this to improve the application as it is being developed.","title":"User Interface and User Experience Engineering"},{"location":"UI-UX/#user-interface-and-user-experience-engineering","text":"Also known as UI/UX , Front End Development , or Web Development , user interface and user experience engineering is a broad topic and encompasses many different aspects of modern application development. When a user interface is required, ISE primarily develops a web application . Web apps can be built in a variety of ways with many different tools.","title":"User Interface and User Experience Engineering"},{"location":"UI-UX/#goal","text":"The goal of the User Interface section is to provide guidance on developing web applications. 
Everyone should begin by reading the General Guidance for a quick introduction to the four main aspects of every web application project. From there, readers are encouraged to dive deeper into each topic, or begin reviewing technical guidance that pertains to their engagement. All UI/UX projects should begin with a detailed design document. Review the Design Process section for more details, and a template to get started. Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience.","title":"Goal"},{"location":"UI-UX/#general-guidance","text":"The state of web platform engineering is fast moving. There is no one-size-fits-all solution. For any team to be successful in building a UI, they need to have an understanding of the higher-level aspects of all UI project. Accessibility - ensuring your application is usable and enjoyed by as many people as possible is at the heart of accessibility and inclusive design. Usability - how effortless should it be for any given user to use the application? Do they need special training or a document to understand how to use it, or will it be intuitive? Maintainability - is the application just a proof of concept to showcase an idea for future work, or will it be an MVP and act as the starting point for a larger, production-ready application? Sometimes you don't need React or any other framework. Sometimes you need React, but not all the bells and whistles from create-react-app. Understanding project maintainability requirements can simplify an engagement\u2019s tooling needs significantly and let folks iterate without headaches. Stability - what is the cost of adding a dependency? Is it actively stable/updated/maintained? If not, can you afford the tech debt (sometimes the answer can be yes!)? Could you get 90% of the way there without adding another dependency? More information is available for each general guidance section in the corresponding pages.","title":"General Guidance"},{"location":"UI-UX/#design-process","text":"All user interface applications begin with the design process. The true definition for \"the design process\" is ever changing and highly opinion based as well. This sections aims to deliver a general overview of a design process any engineering team could conduct when starting an UI application engagement. When committing to a UI/UX project, be certain to not over-promise on the web application requirements. Delivering a production-ready application involves a large number of engineering complexities resulting in a very long timeline. Always start with a proof-of-concept or minimum-viable-product first. These projects can easily be achieved within a couple month timeline (and sometimes even less). The first step in the design process is to understand the problem at hand and outline what the solution should achieve. Commonly referred to as Desired Outcomes , the output of this first step should be a generalized list of outcomes that the solution will accomplish. Consider the following example: A public library has a set of data containing information about its collection. The data stores text, images, and the status of a book (borrowed, available, reserved). The library librarian wants to share this data with its users. 
As the librarian, I want to notify users before they receive late penalties for overdue books As the librarian, I want to notify users when a book they have reserved becomes available With the desired outcomes in mind, the next step in the design process is to define user personas. Regardless of the solution for a given problem, understanding the user needs leads to a better understanding of feature development and technological choices. Personas are written as prose-like paragraphs that describe different types of users. Considering the previous example, the various user personas could be: An individual with no disabilities, but is unfamiliar with using software interfaces An individual with no disabilities, and is familiar with using software interfaces An individual with disabilities, and is unfamiliar with using software interfaces (with or without the use of accessibility tooling) An individual with disabilities, but familiar with using software interfaces through the use of accessibility tooling After defining these personas it is clear that whatever the solution is, it requires a lot of accessibility and user experience design work. Sometimes personas can be simpler than this, but always include disabled users . Even when a user set is predefined as a group of individuals without disabilities, there is no guarantee that the user set will remain that way. After defining the desired outcomes as well as the personas , the next step in the design process is to begin conducting Trade Studies for potential solutions. The first trade study should be high-level and solution oriented. It will utilize the results of previous steps and propose multiple solutions for achieving the desired outcomes with the listed personas in mind. Continuing with the library example, this first trade study may compare various application solutions such as automated emails or text messages, an RSS feed, or an user interface application. There are pros and cons for each solution both from an user experience and a developer experience perspective, but at this stage it is important to focus on the users. After arriving on the best solution, the next trade study can dive into different implementation methods. It is in this subsequent trade studies that developer experience becomes more important. The benefit of building software applications is that there are truly infinite ways to build something. A team can use the latest shiny tools, or they can utilize the tried-and-tested ones. It is for this reason that focussing completely on the user until a solution is defined is better than obsessing over technology choices. Within ISE, we often reach for tools such as the React framework. React is a great tool when wielded by an experienced team. Otherwise, it can create more hurdles than it is worth. Keep in mind that even if you feel capable with React, the rest of your team and your customer's dev team needs to as well. Some other great options to consider when building a proof-of-concept or minimum-viable-product are: HTML/CSS/JavaScript Back to the basics! Start with a single index.html , include a popular CSS framework such as Bootstrap using their CDN link, and start prototyping! Rarely will you have to support legacy browsers; thus, you can rely on modern JavaScript language features! No need for build tools or even TypeScript (did you know you can type check JavaScript ). 
Web Component frameworks Web Components are now standardized in all modern browsers Microsoft has their own, stable & actively-maintained framework, Fast For more information of choosing the right implementation tool, read the Recommended Technologies document. Continue reading the Trade Study section of this site for more information on completing this step in the design process. After iterating through multiple trade study documents, this design process can be considered complete! With an agreed upon solution and implementation in mind, it is now time to begin development. A natural continuation of the design process is to get users (or stakeholders) involved as early as possible. Constantly look for design and usability feedback, and utilize this to improve the application as it is being developed.","title":"Design Process"},{"location":"UI-UX/recommended-technologies/","text":"Recommended Technologies The purpose of this page is to review the commonly selected technology options when developing user interface applications. To reiterate from the general guidance section: Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience. Additionally, while some of these technologies are presented as alternate options, many can be combined together. For example, you can use React in a basic HTML/CSS/JS workflow by inline-importing React along with Babel. See the Add React to a Website for more details. Similarly, any Fast web component can be integrated into any existing React application . And of course, every JavaScript technology can also be used with TypeScript! TypeScript TypeScript is JavaScript with syntax for types. TypeScript is a strongly typed programming language that builds on JavaScript, giving you better tooling at any scale. typescriptlang.org TypeScript is highly recommended for all new web application projects. The stability it provides for teams is unmatched, and can make it easier for folks with C# backgrounds to work with web technologies. There are many ways to integrate TypeScript into a web application. The easiest way to get started is by reviewing the TypeScript Tooling in 5 Minutes guide from the official TypeScript docs. The other sections on this page contain information regarding integration with TypeScript. React React is a framework developed and maintained by Facebook. React is used throughout Microsoft and has a vast open source community. Documentation & Recommended Resources One can expect to find a multitude of guides, answers, and posts on how to work with React; don't take everything at face value. The best place to review React concepts is the React documentation. From there, you can review articles from various sources such as React Community Articles , Kent C Dodd's Blog , CSS Tricks Articles , and Awesome React . The React API has changed dramatically over time. Older resources may contain solutions or patterns that have since been changed and improved upon. Modern React development uses the React Hooks pattern. Rarely will you have to implement something using React Class pattern. If you're reading an article/answer/docs that instruct you to use the class pattern you may be looking at an out-of-date resource. Bootstrapping There are many different ways to bootstrap a React application. Two great tool sets to use are create-react-app and vite . 
create-react-app From Adding TypeScript npx create-react-app my-app --template typescript Vite From Scaffolding your First Vite Project # npm 6.x npm init vite@latest my-app --template react-ts # npm 7.x npm init vite@latest my-app -- --template react-ts","title":"Recommended Technologies"},{"location":"UI-UX/recommended-technologies/#recommended-technologies","text":"The purpose of this page is to review the commonly selected technology options when developing user interface applications. To reiterate from the general guidance section: Keep in mind that like all software, there is no \"right way\" to build a user interface application. Leverage and trust your team's or your customer's experience and expertise for the best development experience. Additionally, while some of these technologies are presented as alternate options, many can be combined together. For example, you can use React in a basic HTML/CSS/JS workflow by inline-importing React along with Babel. See the Add React to a Website for more details. Similarly, any Fast web component can be integrated into any existing React application . And of course, every JavaScript technology can also be used with TypeScript!","title":"Recommended Technologies"},{"location":"UI-UX/recommended-technologies/#typescript","text":"TypeScript is JavaScript with syntax for types. TypeScript is a strongly typed programming language that builds on JavaScript, giving you better tooling at any scale. typescriptlang.org TypeScript is highly recommended for all new web application projects. The stability it provides for teams is unmatched, and can make it easier for folks with C# backgrounds to work with web technologies. There are many ways to integrate TypeScript into a web application. The easiest way to get started is by reviewing the TypeScript Tooling in 5 Minutes guide from the official TypeScript docs. The other sections on this page contain information regarding integration with TypeScript.","title":"TypeScript"},{"location":"UI-UX/recommended-technologies/#react","text":"React is a framework developed and maintained by Facebook. React is used throughout Microsoft and has a vast open source community.","title":"React"},{"location":"UI-UX/recommended-technologies/#documentation-recommended-resources","text":"One can expect to find a multitude of guides, answers, and posts on how to work with React; don't take everything at face value. The best place to review React concepts is the React documentation. From there, you can review articles from various sources such as React Community Articles , Kent C Dodd's Blog , CSS Tricks Articles , and Awesome React . The React API has changed dramatically over time. Older resources may contain solutions or patterns that have since been changed and improved upon. Modern React development uses the React Hooks pattern. Rarely will you have to implement something using React Class pattern. If you're reading an article/answer/docs that instruct you to use the class pattern you may be looking at an out-of-date resource.","title":"Documentation & Recommended Resources"},{"location":"UI-UX/recommended-technologies/#bootstrapping","text":"There are many different ways to bootstrap a React application. 
Two great tool sets to use are create-react-app and vite .","title":"Bootstrapping"},{"location":"UI-UX/recommended-technologies/#create-react-app","text":"From Adding TypeScript npx create-react-app my-app --template typescript","title":"create-react-app"},{"location":"UI-UX/recommended-technologies/#vite","text":"From Scaffolding your First Vite Project # npm 6.x npm init vite@latest my-app --template react-ts # npm 7.x npm init vite@latest my-app -- --template react-ts","title":"Vite"},{"location":"agile-development/","text":"Agile Development In this documentation we refer to the team working on an engagement a \"Crew\" . This includes the dev team, dev lead, PM, data scientists, etc. Why Agile We want to be quick to respond to change We want to get to a state of working software fast, and iterate on it to improve it We want to keep the customer/end users involved all the way through We care about individuals and interactions over documents and processes The Fundamentals We care about the goal for each activity, but not necessarily about how they are accomplished. The suggestions in parenthesis are common ways to accomplish the goals. We keep a shared backlog of work, that everyone in the team can always access (ex. Azure DevOps or GitHub) We plan our work in iterations with clear goals (ex. sprints) We have a clear idea of when work items are ready to implement (ex. definition of ready) We have a clear idea of when work items are completed (ex. definition of done) We communicate the progress in one place that everyone can access, and keep the progress up to date (ex. sprint board and daily standups) We reflect on our work regularly to make improvements (ex. retrospectives) The team has a clear idea of the roles and responsibilities in the project (ex. Dev lead, TPM, Process Lead etc.) The team has a joint idea of how we work together (ex. team agreement) We value and respect the opinions and work of all team members. References What Is Scrum? Essential Scrum: A Practical Guide to The Most Popular Agile Process","title":"Agile Development"},{"location":"agile-development/#agile-development","text":"In this documentation we refer to the team working on an engagement a \"Crew\" . This includes the dev team, dev lead, PM, data scientists, etc.","title":"Agile Development"},{"location":"agile-development/#why-agile","text":"We want to be quick to respond to change We want to get to a state of working software fast, and iterate on it to improve it We want to keep the customer/end users involved all the way through We care about individuals and interactions over documents and processes","title":"Why Agile"},{"location":"agile-development/#the-fundamentals","text":"We care about the goal for each activity, but not necessarily about how they are accomplished. The suggestions in parenthesis are common ways to accomplish the goals. We keep a shared backlog of work, that everyone in the team can always access (ex. Azure DevOps or GitHub) We plan our work in iterations with clear goals (ex. sprints) We have a clear idea of when work items are ready to implement (ex. definition of ready) We have a clear idea of when work items are completed (ex. definition of done) We communicate the progress in one place that everyone can access, and keep the progress up to date (ex. sprint board and daily standups) We reflect on our work regularly to make improvements (ex. retrospectives) The team has a clear idea of the roles and responsibilities in the project (ex. Dev lead, TPM, Process Lead etc.) 
The team has a joint idea of how we work together (ex. team agreement) We value and respect the opinions and work of all team members.","title":"The Fundamentals"},{"location":"agile-development/#references","text":"What Is Scrum? Essential Scrum: A Practical Guide to The Most Popular Agile Process","title":"References"},{"location":"agile-development/backlog-management/","text":"Backlog Management Backlog Goals User stories have a clear acceptance criteria and definition of done. Design activities are planned as part of the backlog (a design for a story that needs it should be done before it is added in a Sprint). Suggestions Consider the backlog refinement as an ongoing activity, that expands outside of the typical \"Refinement meeting\". The team should decide on and have a clear understanding of a definition of ready and a definition of done . The team should have a clear understanding of what constitutes good acceptance criteria for a story/task, and decide on how stories/tasks are handled. Eg. in some projects, stories are refined as a crew, but tasks are created by individual developers on an as needed bases. Technical debt is mostly due to shortcuts made in the implementation as well as the future maintenance cost as the natural result of continuous improvement. Shortcuts should generally be avoided. In some rare instances where they happen, prioritizing and planning improvement activities to reduce this debt at a later time is the recommended approach. Resources Product Backlog Sprint Backlog Acceptance Criteria Definition of Done Definition of Ready Estimation Basics in Agile","title":"Backlog Management"},{"location":"agile-development/backlog-management/#backlog-management","text":"","title":"Backlog Management"},{"location":"agile-development/backlog-management/#backlog","text":"Goals User stories have a clear acceptance criteria and definition of done. Design activities are planned as part of the backlog (a design for a story that needs it should be done before it is added in a Sprint). Suggestions Consider the backlog refinement as an ongoing activity, that expands outside of the typical \"Refinement meeting\". The team should decide on and have a clear understanding of a definition of ready and a definition of done . The team should have a clear understanding of what constitutes good acceptance criteria for a story/task, and decide on how stories/tasks are handled. Eg. in some projects, stories are refined as a crew, but tasks are created by individual developers on an as needed bases. Technical debt is mostly due to shortcuts made in the implementation as well as the future maintenance cost as the natural result of continuous improvement. Shortcuts should generally be avoided. In some rare instances where they happen, prioritizing and planning improvement activities to reduce this debt at a later time is the recommended approach.","title":"Backlog"},{"location":"agile-development/backlog-management/#resources","text":"Product Backlog Sprint Backlog Acceptance Criteria Definition of Done Definition of Ready Estimation Basics in Agile","title":"Resources"},{"location":"agile-development/ceremonies/","text":"Agile Ceremonies Sprint Planning Goals The planning supports Diversity and Inclusion principles and provides equal opportunities. The Planning defines how the work is going to be completed in the sprint. Stories fit in a sprint and are designed and ready before the planning. 
Note: Self assignment by team members can give a feeling of fairness in how work is split in the team. Sometime, this ends up not being the case as it can give an advantage to the loudest or more experienced voices in the team. Individuals also tend to stay in their comfort zone, which might not be the right approach for their own growth.* Sprint Goal Consider defining a sprint goal, or list of goals for each sprint. Effective sprint goals are a concise bullet point list of items. A Sprint goal can be created first and used as an input to choose the Stories for the sprint. A sprint goal could also be created from the list of stories that were picked for the Sprint. The sprint goal can be used: At the end of each stand up meeting, to remember the north star for the Sprint and help everyone taking a step back During the sprint review (\"was the goal achieved?\", \"If not, why?\") Note: A simple way to define a sprint goal, is to create a User Story in each sprint backlog and name it \"Sprint XX goal\". You can add the bullet points in the description.* Stories Example 1: Preparing in advance The dev lead and product owner plan time to prepare the sprint backlog ahead of sprint planning. The dev lead uses their experience (past and on the current project) and the estimation made for these stories to gauge how many should be in the sprint. The dev lead asks the entire team to look at the tentative sprint backlog in advance of the sprint planning. The dev lead assigns stories to specific developers after confirming with them that it makes sense During the sprint planning meeting, the team reviews the sprint goal and the stories. Everyone confirm they understand the plan and feel it's reasonable. Example 2: Building during the planning meeting The product owner ensures that the highest priority items of the product backlog is refined and estimated following the team estimation process. During the Sprint planning meeting, the product owner describe each stories, one by one, starting by highest priority. For each story, the dev lead and the team confirm they understand what needs to be done and add the story to the sprint backlog. The team keeps considering more stories up to a point where they agree the sprint backlog is full. This should be informed by the estimation, past developer experience and past experience in this specific project. Stories are assigned during the planning meeting: Option 1: The dev lead makes suggestion on who could work on each stories. Each engineer agrees or discuss if required. Option 2: The team review each story and engineer volunteer select the one they want to be assigned to. Note : this option might cause issues with the first core expectations. Who gets to work on what? Ultimately, it is the dev lead responsibility to ensure each engineer gets the opportunity to work on what makes sense for their growth.) Tasks Examples of approaches for task creation and assignment: Stories are split into tasks ahead of time by dev lead and assigned before/during sprint planning to engineers. Stories are assigned to more senior engineers who are responsible for splitting into tasks. Stories are split into tasks during the Sprint planning meeting by the entire team. Note : Depending on the seniority of the team, consider splitting into tasks before sprint planning. This can help getting out of sprint planning with all work assigned. It also increase clarity for junior engineers. 
Sprint Planning Resources Definition of Ready Sprint Goal Template Planning Refinement User Stories Applied: For Software Development Estimation Goals Estimation supports the predictability of the team work and delivery. Estimation re-enforces the value of accountability to the team. The estimation process is improved over time and discussed on a regular basis. Estimation is inclusive of the different individuals in the team. Rough estimation is usually done for a generic SE 2 dev. Example 1: T-shirt Sizes The team use t-shirt sizes (S, M, L, XL) and agrees in advance which size fits a sprint. In this example: S, M fits a sprint, L, XL too big for a sprint and need to be split / refined The dev lead with support of the team roughly estimates how much S and M stories can be done in the first sprints This rough estimation is refined over time and used to as an input for future sprint planning and to adjust project end date forecasting Example 2: Single Indicator The team uses a single indicator: \"does this story fits in one sprint?\", if not, the story needs to be split The dev lead with support of the team roughly estimates how many stories can be done in the first sprints How many stories are done in each sprint on average is used as an input for future sprint planning and as an indicator to adjust project end date forecasting Example 3: Planning Poker The team does planning poker and estimates in story points Story points are roughly used to estimate how much can be done in next sprint The dev lead and the TPM uses the past sprints and observed velocity to adjust project end date forecasting Other Considerations Estimating stories using story points in smaller project does not always provide the value it would in bigger ones. Avoid converting story points or t-shirt sizes to days. Measure Estimation Accuracy Collect data to monitor estimation accuracy and sprint completion over time to drive improvements. Use the sprint goal to understand if the estimation was correct. If the sprint goal is met: does anything else matter? Scrum Practices While Scrum does not prescribe how to size work, Professional Scrum is biased away from absolute estimation (hours, function points, ideal-days, etc.) and towards relative sizing. Planning Poker Planning Poker is a collaborative technique to assign relative size. Developers may choose whatever units they want - story points and t-shirt sizes are examples of units. 'Same-Size' Product Backlog Items (PBIs) 'Same-Size' PBIs is a relative estimation approach that involves breaking items down small enough that they are roughly the same size. Velocity can be understood as a count of PBIs; this is sometimes used by teams doing continuously delivery. 'Right-Size' Product Backlog Items (PBIs) 'Right-Size' PBIs is a relative estimation approach that involves breaking things down small enough to deliver value in a certain time period (i.e. get to Done by the end of a Sprint). This is sometimes associated with teams utilizing flow for forecasting. Teams use historical data to determine if they think they can get the PBI done within the confidence level that their historical data says they typically get a PBI done. Estimation Resources The Most Important Thing You Are Missing about Estimation Retrospectives Goals Retrospectives lead to actionable items that help grow the team's engineering practices. These items are in the backlog, assigned, and prioritized to be fixed by a date agreed upon (default being next retrospective). 
Retrospectives are used to ask the hard questions (\"we usually don't finish what we plan, let's talk about this\") when necessary. Suggestions Consider other retro formats available outside of Mad Sad Glad. Gather Data: Triple Nickels, Timeline, Mad Sad Glad, Team Radar Generate Insights: 5 Whys, Fishbone, Patterns and Shifts Consider setting a retro focus area. Schedule enough time to ensure that you can have the conversation you need to get the correct plan an action and improve how you work. Bring in a neutral facilitator for project retros or retros that introspect after a difficult period. Use the following retrospectives techniques to address specific trends that might be emerging on an engagement 5 Whys If a team is confronting a problem and is unsure of the exact root cause, the 5 whys exercise taken from the business analysis sector can help get to the bottom of it. For example, if a team cannot get to Done each Sprint, that would go at the top of the whiteboard. The team then asks why that problem exists, writing that answer in the box below. Next, the team asks why again, but this time in response to the why they just identified. Continue this process until the team identifies an actual root cause, which usually becomes apparent within five steps. Processes, Tools, Individuals, Interactions and the Definition of Done This approach encourages team members to think more broadly. Ask team members to identify what is going well and ideas for improvement within the categories of processes, tools, individuals/interactions, and the Definition of Done. Then, ask team members to vote on which improvement ideas to focus on during the upcoming Sprint. Focus This retrospective technique incorporates the concept of visioning. Using this technique, you ask team members where they would like to go? Decide what the team should look like in 4 weeks, and then ask what is holding them back from that and how they can resolve the impediment. If you are focusing on specific improvements, you can use this technique for one or two Retrospectives in a row so that the team can see progress over time. Retrospective Resources Agile Retrospective: Making Good Teams Great Retrospective Sprint Demo Goals Each sprint ends with demos that illustrate the sprint goal and how it fits in the engagement goal. Suggestions Consider not pre-recording sprint demos in advance. You can record the demo meeting and archive them. A demo does not have to be about running code. It can be showing documentation that was written. Sprint Demo Resources Sprint Review/Demo Stand-Up Goals The stand-up is run efficiently. The stand-up helps the team understand what was done, what will be done and what are the blockers. The stand-up helps the team understand if they will meet the sprint goal or not. Suggestions Keep stand up short and efficient. Table the longer conversations for a parking lot section, or for a conversation that will be planned later. Run daily stand ups: 15 minutes of stand up and 15 minutes of parking lot. If someone cannot make the stand-up exceptionally: Ask them to do a written stand up in advance. Stand ups should include everyone involved in the project, including the customer. Projects with widely divergent time zones should be avoided if possible, but if you are on one, you should adapt the standups to meet the needs and time constraints of all team members. 
Stand-Up Resources Stand-Up/Daily Scrum","title":"Agile Ceremonies"},{"location":"agile-development/ceremonies/#agile-ceremonies","text":"","title":"Agile Ceremonies"},{"location":"agile-development/ceremonies/#sprint-planning","text":"Goals The planning supports Diversity and Inclusion principles and provides equal opportunities. The Planning defines how the work is going to be completed in the sprint. Stories fit in a sprint and are designed and ready before the planning. Note: Self assignment by team members can give a feeling of fairness in how work is split in the team. Sometime, this ends up not being the case as it can give an advantage to the loudest or more experienced voices in the team. Individuals also tend to stay in their comfort zone, which might not be the right approach for their own growth.*","title":"Sprint Planning"},{"location":"agile-development/ceremonies/#sprint-goal","text":"Consider defining a sprint goal, or list of goals for each sprint. Effective sprint goals are a concise bullet point list of items. A Sprint goal can be created first and used as an input to choose the Stories for the sprint. A sprint goal could also be created from the list of stories that were picked for the Sprint. The sprint goal can be used: At the end of each stand up meeting, to remember the north star for the Sprint and help everyone taking a step back During the sprint review (\"was the goal achieved?\", \"If not, why?\") Note: A simple way to define a sprint goal, is to create a User Story in each sprint backlog and name it \"Sprint XX goal\". You can add the bullet points in the description.*","title":"Sprint Goal"},{"location":"agile-development/ceremonies/#stories","text":"Example 1: Preparing in advance The dev lead and product owner plan time to prepare the sprint backlog ahead of sprint planning. The dev lead uses their experience (past and on the current project) and the estimation made for these stories to gauge how many should be in the sprint. The dev lead asks the entire team to look at the tentative sprint backlog in advance of the sprint planning. The dev lead assigns stories to specific developers after confirming with them that it makes sense During the sprint planning meeting, the team reviews the sprint goal and the stories. Everyone confirm they understand the plan and feel it's reasonable. Example 2: Building during the planning meeting The product owner ensures that the highest priority items of the product backlog is refined and estimated following the team estimation process. During the Sprint planning meeting, the product owner describe each stories, one by one, starting by highest priority. For each story, the dev lead and the team confirm they understand what needs to be done and add the story to the sprint backlog. The team keeps considering more stories up to a point where they agree the sprint backlog is full. This should be informed by the estimation, past developer experience and past experience in this specific project. Stories are assigned during the planning meeting: Option 1: The dev lead makes suggestion on who could work on each stories. Each engineer agrees or discuss if required. Option 2: The team review each story and engineer volunteer select the one they want to be assigned to. Note : this option might cause issues with the first core expectations. Who gets to work on what? 
Ultimately, it is the dev lead responsibility to ensure each engineer gets the opportunity to work on what makes sense for their growth.)","title":"Stories"},{"location":"agile-development/ceremonies/#tasks","text":"Examples of approaches for task creation and assignment: Stories are split into tasks ahead of time by dev lead and assigned before/during sprint planning to engineers. Stories are assigned to more senior engineers who are responsible for splitting into tasks. Stories are split into tasks during the Sprint planning meeting by the entire team. Note : Depending on the seniority of the team, consider splitting into tasks before sprint planning. This can help getting out of sprint planning with all work assigned. It also increase clarity for junior engineers.","title":"Tasks"},{"location":"agile-development/ceremonies/#sprint-planning-resources","text":"Definition of Ready Sprint Goal Template Planning Refinement User Stories Applied: For Software Development","title":"Sprint Planning Resources"},{"location":"agile-development/ceremonies/#estimation","text":"Goals Estimation supports the predictability of the team work and delivery. Estimation re-enforces the value of accountability to the team. The estimation process is improved over time and discussed on a regular basis. Estimation is inclusive of the different individuals in the team. Rough estimation is usually done for a generic SE 2 dev.","title":"Estimation"},{"location":"agile-development/ceremonies/#example-1-t-shirt-sizes","text":"The team use t-shirt sizes (S, M, L, XL) and agrees in advance which size fits a sprint. In this example: S, M fits a sprint, L, XL too big for a sprint and need to be split / refined The dev lead with support of the team roughly estimates how much S and M stories can be done in the first sprints This rough estimation is refined over time and used to as an input for future sprint planning and to adjust project end date forecasting","title":"Example 1: T-shirt Sizes"},{"location":"agile-development/ceremonies/#example-2-single-indicator","text":"The team uses a single indicator: \"does this story fits in one sprint?\", if not, the story needs to be split The dev lead with support of the team roughly estimates how many stories can be done in the first sprints How many stories are done in each sprint on average is used as an input for future sprint planning and as an indicator to adjust project end date forecasting","title":"Example 2: Single Indicator"},{"location":"agile-development/ceremonies/#example-3-planning-poker","text":"The team does planning poker and estimates in story points Story points are roughly used to estimate how much can be done in next sprint The dev lead and the TPM uses the past sprints and observed velocity to adjust project end date forecasting","title":"Example 3: Planning Poker"},{"location":"agile-development/ceremonies/#other-considerations","text":"Estimating stories using story points in smaller project does not always provide the value it would in bigger ones. Avoid converting story points or t-shirt sizes to days.","title":"Other Considerations"},{"location":"agile-development/ceremonies/#measure-estimation-accuracy","text":"Collect data to monitor estimation accuracy and sprint completion over time to drive improvements. Use the sprint goal to understand if the estimation was correct. 
If the sprint goal is met: does anything else matter?","title":"Measure Estimation Accuracy"},{"location":"agile-development/ceremonies/#scrum-practices","text":"While Scrum does not prescribe how to size work, Professional Scrum is biased away from absolute estimation (hours, function points, ideal-days, etc.) and towards relative sizing. Planning Poker Planning Poker is a collaborative technique to assign relative size. Developers may choose whatever units they want - story points and t-shirt sizes are examples of units. 'Same-Size' Product Backlog Items (PBIs) 'Same-Size' PBIs is a relative estimation approach that involves breaking items down small enough that they are roughly the same size. Velocity can be understood as a count of PBIs; this is sometimes used by teams doing continuous delivery. 'Right-Size' Product Backlog Items (PBIs) 'Right-Size' PBIs is a relative estimation approach that involves breaking things down small enough to deliver value in a certain time period (i.e. get to Done by the end of a Sprint). This is sometimes associated with teams utilizing flow for forecasting. Teams use historical data to determine whether they can get a PBI done within the confidence level at which their historical data says they typically get a PBI done.","title":"Scrum Practices"},{"location":"agile-development/ceremonies/#estimation-resources","text":"The Most Important Thing You Are Missing about Estimation","title":"Estimation Resources"},{"location":"agile-development/ceremonies/#retrospectives","text":"Goals Retrospectives lead to actionable items that help grow the team's engineering practices. These items are in the backlog, assigned, and prioritized to be fixed by a date agreed upon (default being next retrospective). Retrospectives are used to ask the hard questions (\"we usually don't finish what we plan, let's talk about this\") when necessary. Suggestions Consider other retro formats available outside of Mad Sad Glad. Gather Data: Triple Nickels, Timeline, Mad Sad Glad, Team Radar Generate Insights: 5 Whys, Fishbone, Patterns and Shifts Consider setting a retro focus area. Schedule enough time to ensure that you can have the conversation you need to get the correct plan of action and improve how you work. Bring in a neutral facilitator for project retros or retros that introspect after a difficult period. Use the following retrospective techniques to address specific trends that might be emerging on an engagement.","title":"Retrospectives"},{"location":"agile-development/ceremonies/#5-whys","text":"If a team is confronting a problem and is unsure of the exact root cause, the 5 whys exercise taken from the business analysis sector can help get to the bottom of it. For example, if a team cannot get to Done each Sprint, that would go at the top of the whiteboard. The team then asks why that problem exists, writing that answer in the box below. Next, the team asks why again, but this time in response to the why they just identified. Continue this process until the team identifies an actual root cause, which usually becomes apparent within five steps.","title":"5 Whys"},{"location":"agile-development/ceremonies/#processes-tools-individuals-interactions-and-the-definition-of-done","text":"This approach encourages team members to think more broadly. Ask team members to identify what is going well and ideas for improvement within the categories of processes, tools, individuals/interactions, and the Definition of Done.
Then, ask team members to vote on which improvement ideas to focus on during the upcoming Sprint.","title":"Processes, Tools, Individuals, Interactions and the Definition of Done"},{"location":"agile-development/ceremonies/#focus","text":"This retrospective technique incorporates the concept of visioning. Using this technique, you ask team members where they would like to go. Decide what the team should look like in 4 weeks, and then ask what is holding them back from that and how they can resolve the impediment. If you are focusing on specific improvements, you can use this technique for one or two Retrospectives in a row so that the team can see progress over time.","title":"Focus"},{"location":"agile-development/ceremonies/#retrospective-resources","text":"Agile Retrospective: Making Good Teams Great Retrospective","title":"Retrospective Resources"},{"location":"agile-development/ceremonies/#sprint-demo","text":"Goals Each sprint ends with demos that illustrate the sprint goal and how it fits in the engagement goal. Suggestions Consider not pre-recording sprint demos. You can record the demo meeting and archive the recording. A demo does not have to be about running code. It can be showing documentation that was written.","title":"Sprint Demo"},{"location":"agile-development/ceremonies/#sprint-demo-resources","text":"Sprint Review/Demo","title":"Sprint Demo Resources"},{"location":"agile-development/ceremonies/#stand-up","text":"Goals The stand-up is run efficiently. The stand-up helps the team understand what was done, what will be done, and what the blockers are. The stand-up helps the team understand if they will meet the sprint goal or not. Suggestions Keep the stand-up short and efficient. Table the longer conversations for a parking lot section, or for a conversation that will be planned later. Run daily stand-ups: 15 minutes of stand-up and 15 minutes of parking lot. If someone exceptionally cannot make the stand-up: ask them to do a written stand-up in advance. Stand-ups should include everyone involved in the project, including the customer. Projects with widely divergent time zones should be avoided if possible, but if you are on one, you should adapt the stand-ups to meet the needs and time constraints of all team members.","title":"Stand-Up"},{"location":"agile-development/ceremonies/#stand-up-resources","text":"Stand-Up/Daily Scrum","title":"Stand-Up Resources"},{"location":"agile-development/roles/","text":"Agile/Scrum Roles We prefer using \"process lead\" over \"scrum master\". It describes the same role. This section has links directing you to definitions for the traditional roles within Agile/Scrum. After reading through the best practices, you should have a basic understanding of the key Agile roles in terms of what they are and the expectations for the role. Product Owner Scrum Master Development Team","title":"Agile/Scrum Roles"},{"location":"agile-development/roles/#agilescrum-roles","text":"We prefer using \"process lead\" over \"scrum master\". It describes the same role. This section has links directing you to definitions for the traditional roles within Agile/Scrum. After reading through the best practices, you should have a basic understanding of the key Agile roles in terms of what they are and the expectations for the role.
Product Owner Scrum Master Development Team","title":"Agile/Scrum Roles"},{"location":"agile-development/advanced-topics/backlog-management/external-feedback/","text":"External Feedback Various stakeholders can provide feedback to the working product during a project, beyond any formal review and feedback sessions required by the organization. The frequency and method of collecting feedback through reviews varies depending on the case, but a couple of good practices are: Capture each review in the backlog as a separate user story. Standardize the tasks that implement this user story. Plan for a review user story per Epic / Feature in your backlog proactively.","title":"External Feedback"},{"location":"agile-development/advanced-topics/backlog-management/external-feedback/#external-feedback","text":"Various stakeholders can provide feedback to the working product during a project, beyond any formal review and feedback sessions required by the organization. The frequency and method of collecting feedback through reviews varies depending on the case, but a couple of good practices are: Capture each review in the backlog as a separate user story. Standardize the tasks that implement this user story. Plan for a review user story per Epic / Feature in your backlog proactively.","title":"External Feedback"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/","text":"Minimal Slices Always Deliver Your Work Using Minimal Valuable Slices Split your work item into small chunks that are contributed in incremental commits. Contribute your chunks frequently. Follow an iterative approach by regularly providing updates and changes to the team. This allows for instant feedback and early issue discovery and ensures you are developing in the right direction, both technically and functionally. Do NOT work independently on your task without providing any updates to your team. Example Imagine you are working on adding UWP (Universal Windows Platform) application building functionality for existing continuous integration service which already has Android/iOS support. Bad Approach After six weeks of work you created PR with all required functionality, including portal UI (build settings), backend REST API (UWP build functionality), telemetry, unit and integration tests, etc. Good Approach You divided your feature into smaller user stories (which in turn were divided into multiple tasks) and started working on them one by one: As a user I can successfully build UWP apps using current service As a user I can see telemetry when building the apps As a user I have the ability to select build configuration (debug, release) As a user I have the ability to select target platform (arm, x86, x64) ... You also divided your stories into smaller tasks and sent PRs based on those tasks. E.g. you have the following tasks for the first user story above: Enable UWP platform on backend Add build button to the UI (build first solution file found) Add select solution file dropdown to the UI Implement unit tests Implement integration tests to verify build succeeded Update documentation ... Resources Minimalism Rules","title":"Minimal Slices"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#minimal-slices","text":"","title":"Minimal Slices"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#always-deliver-your-work-using-minimal-valuable-slices","text":"Split your work item into small chunks that are contributed in incremental commits. 
Contribute your chunks frequently. Follow an iterative approach by regularly providing updates and changes to the team. This allows for instant feedback and early issue discovery and ensures you are developing in the right direction, both technically and functionally. Do NOT work independently on your task without providing any updates to your team.","title":"Always Deliver Your Work Using Minimal Valuable Slices"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#example","text":"Imagine you are working on adding UWP (Universal Windows Platform) application building functionality for existing continuous integration service which already has Android/iOS support.","title":"Example"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#bad-approach","text":"After six weeks of work you created PR with all required functionality, including portal UI (build settings), backend REST API (UWP build functionality), telemetry, unit and integration tests, etc.","title":"Bad Approach"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#good-approach","text":"You divided your feature into smaller user stories (which in turn were divided into multiple tasks) and started working on them one by one: As a user I can successfully build UWP apps using current service As a user I can see telemetry when building the apps As a user I have the ability to select build configuration (debug, release) As a user I have the ability to select target platform (arm, x86, x64) ... You also divided your stories into smaller tasks and sent PRs based on those tasks. E.g. you have the following tasks for the first user story above: Enable UWP platform on backend Add build button to the UI (build first solution file found) Add select solution file dropdown to the UI Implement unit tests Implement integration tests to verify build succeeded Update documentation ...","title":"Good Approach"},{"location":"agile-development/advanced-topics/backlog-management/minimal-slices/#resources","text":"Minimalism Rules","title":"Resources"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/","text":"Risk Management Agile methodologies are conceived to be driven by risk management principles, but no methodology can eliminate all risks. Goal Anticipation is a key aspect of software project management, involving the proactive identification and assessment of potential risks and challenges to enable effective planning and mitigation strategies. The following guidance aims to provide decision-makers with the information needed to make informed choices, understanding trade-offs, costs, and project timelines throughout the project. General Guidance Identify risks in every activity such as a planning meetings, design and code reviews, or daily standups. All team members are responsible for identifying relevant risks. Assess risks in terms of their likelihood and potential impact on the project. Use the issues to report and track risks. Issues represent unplanned activities. Prioritize them based on their severity and likelihood, focusing on addressing the most critical ones first. Mitigate or reduce the impact and likelihood of the risks. Monitor continuously to ensure the effectiveness of the mitigation strategies. Prepare contingency plans for high-impact risks that may still materialize. Communicate and report risks to keep all stakeholders informed. 
Opportunity Management The same process can be applied to opportunities, but while risk management involves applying mitigation actions to decrease the likelihood of a risk, in opportunity management, you enhance actions to increase the likelihood of a positive outcome.","title":"Risk Management"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#risk-management","text":"Agile methodologies are conceived to be driven by risk management principles, but no methodology can eliminate all risks.","title":"Risk Management"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#goal","text":"Anticipation is a key aspect of software project management, involving the proactive identification and assessment of potential risks and challenges to enable effective planning and mitigation strategies. The following guidance aims to provide decision-makers with the information needed to make informed choices, understanding trade-offs, costs, and project timelines throughout the project.","title":"Goal"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#general-guidance","text":"Identify risks in every activity such as a planning meetings, design and code reviews, or daily standups. All team members are responsible for identifying relevant risks. Assess risks in terms of their likelihood and potential impact on the project. Use the issues to report and track risks. Issues represent unplanned activities. Prioritize them based on their severity and likelihood, focusing on addressing the most critical ones first. Mitigate or reduce the impact and likelihood of the risks. Monitor continuously to ensure the effectiveness of the mitigation strategies. Prepare contingency plans for high-impact risks that may still materialize. Communicate and report risks to keep all stakeholders informed.","title":"General Guidance"},{"location":"agile-development/advanced-topics/backlog-management/risk-management/#opportunity-management","text":"The same process can be applied to opportunities, but while risk management involves applying mitigation actions to decrease the likelihood of a risk, in opportunity management, you enhance actions to increase the likelihood of a positive outcome.","title":"Opportunity Management"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/","text":"How to Add a Pairing Custom Field in Azure DevOps User Stories This document outlines the benefits of adding a custom field of type Identity in Azure DevOps user stories, prerequisites, and a step-by-step guide. Benefits of Adding a Custom Field Having the names of both individuals pairing on a story visible on the Azure DevOps cards can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. For example, it is easier to keep track of the individuals assigned stories as part of a pair during sprint planning by using the \"pairing names\" field. During stand-up it can also help the Process Lead filter stories assigned to the individual (both as an owner or as a pairing assignee) and show these on the board. Furthermore, the pairing field can provide an additional data point for reports and burndown rates. Prerequisites Prior to customizing Azure DevOps, review Configure and customize Azure Boards . In order to add a custom field to user stories in Azure DevOps changes must be made as an Organization setting . 
This document therefore assumes use of an existing Organization in Azure DevOps and that the user account used to make these changes is a member of the Project Collection Administrators Group . Change the Organization Settings Duplicate the process currently in use. Navigate to the Organization Settings , within the Boards / Process tab. Select the Process type, click on the icon with three dots ... and click Create inherited process . Click on the newly created inherited process. As you can see in the example below, we called it 'Pairing'. Click on the work item type User Story . Click New Field . Give it a Name and select Identity in Type. Click on Add Field . This completes the change in Organization settings. The rest of the instructions must be completed under Project Settings. Change the Project Settings Go to the Project that is to be modified, select Project Settings . Select Project configuration . Click on process customization page . Click on Projects then click on Change process . Change the target process to Pairing then click Save. Go to Boards . Click on the Gear icon to open Settings. Add field to card. Click on the green + icon to add select the Pairing field. Check the box to display fields, even when they are empty. Save and close . View the modified the card. Notice the new Pairing field. The Story can now be assigned an Owner and a Pairing assignee!","title":"How to Add a Pairing Custom Field in Azure DevOps User Stories"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#how-to-add-a-pairing-custom-field-in-azure-devops-user-stories","text":"This document outlines the benefits of adding a custom field of type Identity in Azure DevOps user stories, prerequisites, and a step-by-step guide.","title":"How to Add a Pairing Custom Field in Azure DevOps User Stories"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#benefits-of-adding-a-custom-field","text":"Having the names of both individuals pairing on a story visible on the Azure DevOps cards can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. For example, it is easier to keep track of the individuals assigned stories as part of a pair during sprint planning by using the \"pairing names\" field. During stand-up it can also help the Process Lead filter stories assigned to the individual (both as an owner or as a pairing assignee) and show these on the board. Furthermore, the pairing field can provide an additional data point for reports and burndown rates.","title":"Benefits of Adding a Custom Field"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#prerequisites","text":"Prior to customizing Azure DevOps, review Configure and customize Azure Boards . In order to add a custom field to user stories in Azure DevOps changes must be made as an Organization setting . This document therefore assumes use of an existing Organization in Azure DevOps and that the user account used to make these changes is a member of the Project Collection Administrators Group .","title":"Prerequisites"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#change-the-organization-settings","text":"Duplicate the process currently in use. Navigate to the Organization Settings , within the Boards / Process tab. Select the Process type, click on the icon with three dots ... and click Create inherited process . 
Click on the newly created inherited process. As you can see in the example below, we called it 'Pairing'. Click on the work item type User Story . Click New Field . Give it a Name and select Identity in Type. Click on Add Field . This completes the change in Organization settings. The rest of the instructions must be completed under Project Settings.","title":"Change the Organization Settings"},{"location":"agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/#change-the-project-settings","text":"Go to the Project that is to be modified, select Project Settings . Select Project configuration . Click on process customization page . Click on Projects then click on Change process . Change the target process to Pairing then click Save. Go to Boards . Click on the Gear icon to open Settings. Add field to card. Click on the green + icon to add select the Pairing field. Check the box to display fields, even when they are empty. Save and close . View the modified the card. Notice the new Pairing field. The Story can now be assigned an Owner and a Pairing assignee!","title":"Change the Project Settings"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/","text":"Effortless Pair Programming with GitHub Codespaces and VSCode Pair programming used to be a software development technique in which two programmers work together on a single computer, sharing one keyboard and mouse, to jointly design, code, test, and debug software. It is one of the patterns explored in the section why collaboration? of this playbook, however with teams that work mostly remotely, sharing a physical computer became a challenge, but opened the door to a more efficient approach of pair programming. Through the effective utilization of a range of tools and techniques, we have successfully implemented both pair and swarm programming methodologies. As such, we are eager to share some of the valuable insights and knowledge gained from this experience. How to Make Pair Programming a Painless Experience? Working Sessions In order to enhance pair programming capabilities, you can create regular working sessions that are open to all team members. This facilitates smooth and efficient collaboration as everyone can simply join in and work together before branching off into smaller groups. This approach has proven particularly beneficial for new team members who may otherwise feel overwhelmed by a large codebase. It emulates the concept of the \" humble water cooler ,\" which fosters a sense of connectedness among team members through their shared work. Additionally, scheduling these working sessions in advance ensures intentional collaboration and provides clarity on user story responsibilities. To this end, assign a single person to each user story to ensure clear ownership and eliminate ambiguity. By doing so, this could eliminate the common problem of engineers being hesitant to modify code outside of their assigned tasks due to the sentiment of lack of ownership. These working sessions are instrumental in promoting a cohesive team dynamic, allowing for effective knowledge sharing and collective problem-solving. GitHub Codespaces GitHub Codespaces is a vital component in an efficient development environment, particularly in the context of pair programming. Prioritize setting up a Codespace as the initial step of the project, preceding tasks such as local machine project compilation or VSCode plugin installation. 
To this end, make sure to update the Codespace documentation before incorporating any quick start instructions for local environments. Additionally, consistently run demos in the Codespaces environment to ensure its prominent integration into our workflow. With its cloud-based infrastructure, GitHub Codespaces presents a highly efficient and simplified approach to real-time collaborative coding. As a result, new team members can easily access the GitHub project and begin coding within seconds, without requiring installation on their local machines. This seamless, integrated solution for pair programming offers a streamlined workflow, allowing you to direct your attention towards producing exemplary code, free from the distractions of cumbersome setup processes. VSCode Live Share VSCode Live Share is specifically designed for pair programming and enables you to work on the same codebase, in real-time, with your team members. The arduous process of configuring complex setups, grappling with confusing configurations, straining one's eyes to work on small screens, or physically switching keyboards is not a problem with LiveShare. This innovative solution enables seamless sharing of your development environment with your team members, facilitating smooth collaborative coding experiences. Fully integrated into Visual Studio Code and Visual Studio, LiveShare offers the added benefit of terminal sharing, debug session collaboration, and host machine control. When paired with GitHub Codespaces, it presents a potent tool set for effective pair programming. Tip: Share VSCode extensions (including Live Share) using a base devcontainer.json . This ensures all team members have the same set of extensions available, and allows them to focus on solving the business needs from day one. Resources GitHub Codespaces . VSCode Live Share . Create a Dev Container . How companies have optimized the humble office water cooler .","title":"Effortless Pair Programming with GitHub Codespaces and VSCode"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#effortless-pair-programming-with-github-codespaces-and-vscode","text":"Pair programming used to be a software development technique in which two programmers work together on a single computer, sharing one keyboard and mouse, to jointly design, code, test, and debug software. It is one of the patterns explored in the section why collaboration? of this playbook; however, with teams that work mostly remotely, sharing a physical computer became a challenge, but it also opened the door to a more efficient approach to pair programming. Through the effective utilization of a range of tools and techniques, we have successfully implemented both pair and swarm programming methodologies. As such, we are eager to share some of the valuable insights and knowledge gained from this experience.","title":"Effortless Pair Programming with GitHub Codespaces and VSCode"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#how-to-make-pair-programming-a-painless-experience","text":"","title":"How to Make Pair Programming a Painless Experience?"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#working-sessions","text":"In order to enhance pair programming capabilities, you can create regular working sessions that are open to all team members. This facilitates smooth and efficient collaboration as everyone can simply join in and work together before branching off into smaller groups.
This approach has proven particularly beneficial for new team members who may otherwise feel overwhelmed by a large codebase. It emulates the concept of the \" humble water cooler ,\" which fosters a sense of connectedness among team members through their shared work. Additionally, scheduling these working sessions in advance ensures intentional collaboration and provides clarity on user story responsibilities. To this end, assign a single person to each user story to ensure clear ownership and eliminate ambiguity. By doing so, this could eliminate the common problem of engineers being hesitant to modify code outside of their assigned tasks due to the sentiment of lack of ownership. These working sessions are instrumental in promoting a cohesive team dynamic, allowing for effective knowledge sharing and collective problem-solving.","title":"Working Sessions"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#github-codespaces","text":"GitHub Codespaces is a vital component in an efficient development environment, particularly in the context of pair programming. Prioritize setting up a Codespace as the initial step of the project, preceding tasks such as local machine project compilation or VSCode plugin installation. To this end, make sure to update the Codespace documentation before incorporating any quick start instructions for local environments. Additionally, consistently run demos in the Codespaces environment to ensure its prominent integration into our workflow. With its cloud-based infrastructure, GitHub Codespaces presents a highly efficient and simplified approach to real-time collaborative coding. As a result, new team members can easily access the GitHub project and begin coding within seconds, without requiring installation on their local machines. This seamless, integrated solution for pair programming offers a streamlined workflow, allowing you to direct your attention towards producing exemplary code, free from the distractions of cumbersome setup processes.","title":"GitHub Codespaces"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#vscode-live-share","text":"VSCode Live Share is specifically designed for pair programming and enables you to work on the same codebase, in real-time, with your team members. The arduous process of configuring complex setups, grappling with confusing configurations, straining one's eyes to work on small screens, or physically switching keyboards is not a problem with LiveShare. This innovative solution enables seamless sharing of your development environment with your team members, facilitating smooth collaborative coding experiences. Fully integrated into Visual Studio Code and Visual Studio, LiveShare offers the added benefit of terminal sharing, debug session collaboration, and host machine control. When paired with GitHub Codespaces, it presents a potent tool set for effective pair programming. Tip: Share VSCode extensions (including Live Share) using a base devcontainer.json . This ensures all team members have the same set of extensions available, and allows them to focus on solving the business needs from day one.","title":"VSCode Live Share"},{"location":"agile-development/advanced-topics/collaboration/pair-programming-tools/#resources","text":"GitHub Codespaces . VSCode Live Share . Create a Dev Container .
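Building on the devcontainer.json tip above, here is a minimal sketch of a base dev container definition that pre-installs Live Share for everyone who opens the Codespace; the container name and base image are illustrative assumptions, not a prescribed setup:

```jsonc
{
  \"name\": \"team-base\",
  \"image\": \"mcr.microsoft.com/devcontainers/base:ubuntu\",
  \"customizations\": {
    \"vscode\": {
      // Extensions listed here are installed for every team member automatically.
      \"extensions\": [
        \"ms-vsliveshare.vsliveshare\"
      ]
    }
  }
}
```

Committing a file like this as .devcontainer/devcontainer.json in the repository means every Codespace (or local dev container) starts with the same extensions, so pairing sessions do not begin with setup work.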
How companies have optimized the humble office water cooler .","title":"Resources"},{"location":"agile-development/advanced-topics/collaboration/social-question/","text":"Social Question of the Day The social question of the day is an optional short question to follow the three project questions in the daily stand-up. It develops team cohesion and interpersonal trust over the course of an engagement by facilitating the sharing of personal preferences, lifestyle, or other context. The social question should be chosen before the stand-up. The facilitator should select the question either independently or from the team's asynchronous suggestions. This minimizes delays at the start of the stand-up. Tip: having the stand-up facilitator role rotate each sprint lets the facilitator choose the social question independently without burdening any one team member. Properties of a Good Question A good question has a brief answer with small optional elaboration. A yes or no answer doesn't tell you very much about someone, while knowing that their favorite fruit is a durian is informative. Good questions are low in consequence but allow controversy. Watching someone strongly exclaim that salmon and lox on cinnamon-raisin is the best bagel order is endearing. As a corollary, a good question is one someone is likely to be passionate about. You know a little more about a team member's personality if their eyes light up when describing their favorite karaoke song. Starter List of Questions Potentially good questions include: What's your Starbucks order? What's your favorite operating system? What's your favorite version of Windows? What's your favorite plant, houseplant or otherwise? What's your favorite fruit? What's your favorite fast food? What's your favorite noodle? What's your favorite text editor? Mountains or beach? DC or Marvel? Coffee with one person from history: who? What's your silliest online purchase? What's your alternate career? What's the best bagel topping? What's your guilty TV pleasure? What's your go-to karaoke song? Would you rather see the past or the future? Would you rather be able to teleport or to fly? Would you rather live underwater or in space for a year? What's your favorite phone app? What's your favorite fish, to eat or otherwise? What was your best costume? Who is someone you admire (from history, from your personal life, etc.)? Give one reason why. What's the best compliment you've ever received? What's your favorite or most used emoji right now? What was your biggest DIY project? What's a spice that you use on everything? What's your top Spotify (or just your favorite) genre/artist for this year? What was your first computer? What's your favorite kind of taco? What's your favorite decade? What's the best way to eat potatoes? What was your best vacation (stay-cations acceptable)? Favorite cartoon? Pick someone in your family and tell us something awesome about them. What was your longest road trip? What thing do you remember learning when you were young that is taught differently now? What was your favorite toy as a child?","title":"Social Question of the Day"},{"location":"agile-development/advanced-topics/collaboration/social-question/#social-question-of-the-day","text":"The social question of the day is an optional short question to follow the three project questions in the daily stand-up. It develops team cohesion and interpersonal trust over the course of an engagement by facilitating the sharing of personal preferences, lifestyle, or other context. 
The social question should be chosen before the stand-up. The facilitator should select the question either independently or from the team's asynchronous suggestions. This minimizes delays at the start of the stand-up. Tip: having the stand-up facilitator role rotate each sprint lets the facilitator choose the social question independently without burdening any one team member.","title":"Social Question of the Day"},{"location":"agile-development/advanced-topics/collaboration/social-question/#properties-of-a-good-question","text":"A good question has a brief answer with small optional elaboration. A yes or no answer doesn't tell you very much about someone, while knowing that their favorite fruit is a durian is informative. Good questions are low in consequence but allow controversy. Watching someone strongly exclaim that salmon and lox on cinnamon-raisin is the best bagel order is endearing. As a corollary, a good question is one someone is likely to be passionate about. You know a little more about a team member's personality if their eyes light up when describing their favorite karaoke song.","title":"Properties of a Good Question"},{"location":"agile-development/advanced-topics/collaboration/social-question/#starter-list-of-questions","text":"Potentially good questions include: What's your Starbucks order? What's your favorite operating system? What's your favorite version of Windows? What's your favorite plant, houseplant or otherwise? What's your favorite fruit? What's your favorite fast food? What's your favorite noodle? What's your favorite text editor? Mountains or beach? DC or Marvel? Coffee with one person from history: who? What's your silliest online purchase? What's your alternate career? What's the best bagel topping? What's your guilty TV pleasure? What's your go-to karaoke song? Would you rather see the past or the future? Would you rather be able to teleport or to fly? Would you rather live underwater or in space for a year? What's your favorite phone app? What's your favorite fish, to eat or otherwise? What was your best costume? Who is someone you admire (from history, from your personal life, etc.)? Give one reason why. What's the best compliment you've ever received? What's your favorite or most used emoji right now? What was your biggest DIY project? What's a spice that you use on everything? What's your top Spotify (or just your favorite) genre/artist for this year? What was your first computer? What's your favorite kind of taco? What's your favorite decade? What's the best way to eat potatoes? What was your best vacation (stay-cations acceptable)? Favorite cartoon? Pick someone in your family and tell us something awesome about them. What was your longest road trip? What thing do you remember learning when you were young that is taught differently now? What was your favorite toy as a child?","title":"Starter List of Questions"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/","text":"Engagement Team Development In every ISE engagement, dynamics are different so are the team requirements. Based on transfer learning among teams, we aim to build right \"code-with\" environments in every team. This documentation gives a high-level template with some suggestions by aiming to accelerate team swarming phase to achieve a high speed agility however it has no intention to provide a list of \"must-do\" items. Identification As it's stated in Tuckman's team phases , traditional team development has several stages. 
However, those phases can be extremely fast or sometimes mismatched in teams due to external factors, which applies to ISE engagements. In order to minimize the risk and set expectations the right way for all parties, an identification phase is important to understand each other. Some potential steps in this phase may be as follows (not limited to): Working agreement Identification of styles/preferences in communication, sharing, learning, decision making of each team member Talking about the necessity of pair programming Decisions on backlog management & refinement meetings, weekly design sessions, social time sessions, etc. Sync/Async communication methods, work hours/flexible times Decisions and identification of charts that will be helpful to provide transparent and true information to everyone Identification of \"Software Craftspersonship\" areas, meaning the tools and methods that will be widely used during the engagement, and taking the required actions on the team upskilling side if necessary. GitHub, VSCode LiveShare, AzDevOps, necessary development tools & libraries ... more. If upskilling on certain topic(s) is needed, identifying the areas and arranging code spikes to increase the team's knowledge on those topic(s). Identification of communication channels, feedback loops and recurrent team call slots outside of regular sprint meetings Introduction to Technical Agility Team Manifesto and planning the technical delivery by aiming to keep the technical debt risk minimal. Following the Plan and Agile Debugging The identification phase accelerates the process of building a safe environment for every individual in the team; later on, the team has the required assets to follow the plan. And it is the team's own responsibility (engineers, PO, Process Lead) to debug their agility level. In every team, stabilization takes time, and proactive agile debugging is the best accelerator to decrease distraction away from the sprint/engagement goal. The team is also responsible for keeping the plan up-to-date based on team changes/needs and debugging results. Just as an example, agility debugging activities may include: Dashboards related to the \"Goal\", such as burndown/burnout, Item/PR Aging, Mood Chart, etc., are accessible to the team and the team is always up-to-date Backlog Refinement meetings Size of stories (Too big? Too small?) Are \"User Stories\" and \"Tasks\" clear? Are Acceptance Criteria enough and right? Is everyone ready-to-go after taking the User Story/Task? Running efficient retrospectives Is the Sprint Goal clear in every iteration? Is the estimation process in the team improving over time or does it meet the delivery/workload prediction? Check the Scrum Values to get a better understanding of how to improve team commitment. Following that, the above suggestions aim to remove agile/team dysfunctions and provide a broader team understanding, potential time savings and full transparency. Resources Tuckman's Stages of Group Development Scrum Values","title":"Engagement Team Development"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#engagement-team-development","text":"In every ISE engagement, dynamics are different, and so are the team requirements. Based on transfer learning among teams, we aim to build the right \"code-with\" environments in every team.
This documentation gives a high-level template with some suggestions by aiming to accelerate team swarming phase to achieve a high speed agility however it has no intention to provide a list of \"must-do\" items.","title":"Engagement Team Development"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#identification","text":"As it's stated in Tuckman's team phases , traditional team development has several stages. However those phases can be extremely fast or sometimes mismatched in teams due to external factors, what applies to ISE engagements. In order to minimize the risk and set the expectations on the right way for all parties, an identification phase is important to understand each other. Some potential steps in this phase may be as following (not limited): Working agreement Identification of styles/preferences in communication, sharing, learning, decision making of each team member Talking about necessity of pair programming Decisions on backlog management & refinement meetings, weekly design sessions, social time sessions...etc. Sync/Async communication methods, work hours/flexible times Decisions and identifications of charts that will be helpful to provide transparent and true information to everyone Identification of \"Software Craftspersonship\" areas which means the tools and methods will be widely used during the engagement and taking the required actions on team upskilling side if necessary. GitHub, VSCode LiveShare, AzDevOps, necessary development tools & libraries ... more. If upskilling on certain topic(s) is needed, identifying the areas and arranging code spikes for increasing the team knowledge on the regarding topic(s). Identification of communication channels, feedback loops and recurrent team call slots out of regular sprint meetings Introduction to Technical Agility Team Manifesto and planning the technical delivery by aiming to keep technical debt risk minimum.","title":"Identification"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#following-the-plan-and-agile-debugging","text":"Identification phase accelerates the process of building a safe environment for every individual in the team, later on team has the required assets to follow the plan. And it is team's itself responsibility (engineers,PO,Process Lead) to debug their Agility level. In every team stabilization takes time and pro-active agile debugging is the best accelerator to decrease the distraction away from sprint/engagement goal. Team is also responsible to keep the plan up-to-date based on team changes/needs and debugging results. Just as an example, agility debugging activities may include: Dashboards related with \"Goal\" such as burndown/burnout, Item/PR Aging, Mood Chart ..etc. are accessible to the team and team is always up-to-date Backlog Refinement meetings Size of stories (Too big? Too small?) Are \"User Stories\" and \"Tasks\" clear ? Are Acceptance Criteria enough and right? Is everyone ready-to-go after taking the User Story/Task? Running efficient retrospectives Is the Sprint Goal clear in every iteration ? Is the estimation process in the team improving over time or does it meet the delivery/workload prediction? Kindly check Scrum Values to have a better understanding to improve team commitment. 
Following that, above suggestions aim to remove agile/team disfunctionalities and provide a broader team understanding, potential time savings and full transparency.","title":"Following the Plan and Agile Debugging"},{"location":"agile-development/advanced-topics/collaboration/teaming-up/#resources","text":"Tuckman's Stages of Group Development Scrum Values","title":"Resources"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/","text":"Virtual Collaboration and Pair Programming Pair programming is the de facto work method that most large engineering organizations use for \u201chands on keyboard\u201d coding. Two developers, working synchronously, looking at the same screen and attempting to code and design together, which often results in better and clearer code than either could produce individually. Pair programming works well under the correct circumstances, but it loses some of its charm when executed in a completely virtual setting. The virtual setup still involves two developers looking at the same screen and talking out their designs, but there are often logistical issues to deal with, including lag, microphone set up issues, workspace and personal considerations, and many other small, individually trivial problems that worsen the experience. Virtual work patterns are different from the in-person patterns we are accustomed to. Pair programming at its core is based on the following principles: Generating clarity through communication Producing higher quality through collaboration Creating ownership through equal contribution Pair programming is one way to achieve these results. Red Team Testing (RTT) is an alternate programming method that uses the same principles but with some of the advantages that virtual work methods provide. Red Team Testing (RTT) Red Team Testing borrows its name from the \u201cRed Team\u201d and \u201cBlue Team\u201d paradigm of penetration testing, and is a collaborative, parallel way of working virtually. In Red Team Testing, two developers jointly decide on the interface, architecture, and design of the program, and then separate for the implementation phase. One developer writes tests using the public interface, attempting to perform edge case testing, input validation, and otherwise stress testing the interface. The second developer is simultaneously writing the implementation which will eventually be tested. Red Team Testing has the same philosophy as any other Test-Driven Development lifecycle: All implementation is separated from the interface, and the interface can be tested with no knowledge of the implementation. Steps Design Phase: Both developers design the interface together. This includes: - Method signatures and names - Writing documentation or docstrings for what the methods are intended to do. - Architecture decisions that would influence testing (Factory patterns, etc.) Implementation Phase: The developers separate and parallelize work, while continuing to communicate. - Developer A will design the implementation of the methods, adhering to the previously decided design. - Developer B will concurrently write tests for the same method signatures, without knowing details of the implementation. Integration & Testing Phase: Both developers commit their code and run the tests. - Utopian Scenario: All tests run and pass correctly. - Realistic Scenario: The tests have either broken or failed due to flaws in testing. This leads to further clarification of the design and a discussion of why the tests failed. 
The developers will repeat the three phases until the code is functional and tested. When to Follow the RTT Strategy RTT works well under specific circumstances. If collaboration needs to happen virtually, and all communication is virtual, RTT reduces the need for constant communication while maintaining the benefits of a joint design session. This considers the human element: Virtual communication is more exhausting than in person communication. RTT also works well when there is complete consensus, or no consensus at all, on what purpose the code serves. Since creating the design jointly and agreeing to implement and test against it are part of the RTT method, RTT forcibly creates clarity through iteration and communication. Benefits RTT has many of the same benefits as Pair Programming and Test-Driven development but tries to update them for a virtual setting. Code implementation and testing can be done in parallel, over long distances or across time zones, which reduces the overall time taken to finish writing the code. RTT maintains the pair programming paradigm, while reducing the need for video communication or constant communication between developers. RTT allows detailed focus on design and engineering alignment before implementing any code, leading to cleaner and simpler interfaces. RTT encourages testing to be prioritized alongside implementation, instead of having testing follow or be influenced by the implementation of the code. Documentation is inherently a part of RTT, since both the implementer and the tester need correct, up to date documentation, in the implementation phase. What You Need for RTT to Work Well Demand for constant communication and good teamwork may pose a challenge; daily updates amongst team members are essential to maintain alignment on varying code requirements. Clarity of the code design and testing strategy must be established beforehand and documented as reference. Lack of an established design will cause misalignment between the two major pieces of work and a need for time-consuming refactoring. RTT does not work well if only one developer has knowledge of the overall design. Team communication is critical to ensuring that every developer involved in RTT is on the same page.","title":"Virtual Collaboration and Pair Programming"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#virtual-collaboration-and-pair-programming","text":"Pair programming is the de facto work method that most large engineering organizations use for \u201chands on keyboard\u201d coding. Two developers, working synchronously, looking at the same screen and attempting to code and design together, which often results in better and clearer code than either could produce individually. Pair programming works well under the correct circumstances, but it loses some of its charm when executed in a completely virtual setting. The virtual setup still involves two developers looking at the same screen and talking out their designs, but there are often logistical issues to deal with, including lag, microphone set up issues, workspace and personal considerations, and many other small, individually trivial problems that worsen the experience. Virtual work patterns are different from the in-person patterns we are accustomed to. 
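To make the RTT steps above concrete, here is a minimal, hypothetical Python sketch; the parse_duration function is invented for illustration, pytest is assumed to be available, and in a real engagement the implementation and the tests would live in separate files or branches owned by each developer:

```python
# Design phase: both developers agree on the signature and docstring only.
def parse_duration(value: str) -> int:
    '''Parse a duration such as '90s' or '5m' and return whole seconds.'''
    # Implementation phase: developer A fills this in, in parallel with the tests.
    units = {'s': 1, 'm': 60, 'h': 3600}
    number, unit = value[:-1], value[-1]
    if unit not in units or not number.isdigit():
        raise ValueError(f'invalid duration: {value}')
    return int(number) * units[unit]

# Implementation phase: developer B writes tests against the agreed interface
# only, with no knowledge of the implementation, covering edge cases and
# input validation.
import pytest

def test_parses_minutes():
    assert parse_duration('5m') == 300

def test_rejects_unknown_unit():
    with pytest.raises(ValueError):
        parse_duration('10x')
```

In the integration and testing phase, both halves are committed and the tests are run; any failures drive the next round of design clarification.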
Pair programming at its core is based on the following principles: Generating clarity through communication Producing higher quality through collaboration Creating ownership through equal contribution Pair programming is one way to achieve these results. Red Team Testing (RTT) is an alternate programming method that uses the same principles but with some of the advantages that virtual work methods provide.","title":"Virtual Collaboration and Pair Programming"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#red-team-testing-rtt","text":"Red Team Testing borrows its name from the \u201cRed Team\u201d and \u201cBlue Team\u201d paradigm of penetration testing, and is a collaborative, parallel way of working virtually. In Red Team Testing, two developers jointly decide on the interface, architecture, and design of the program, and then separate for the implementation phase. One developer writes tests using the public interface, attempting to perform edge case testing, input validation, and otherwise stress testing the interface. The second developer is simultaneously writing the implementation which will eventually be tested. Red Team Testing has the same philosophy as any other Test-Driven Development lifecycle: All implementation is separated from the interface, and the interface can be tested with no knowledge of the implementation.","title":"Red Team Testing (RTT)"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#steps","text":"Design Phase: Both developers design the interface together. This includes: - Method signatures and names - Writing documentation or docstrings for what the methods are intended to do. - Architecture decisions that would influence testing (Factory patterns, etc.) Implementation Phase: The developers separate and parallelize work, while continuing to communicate. - Developer A will design the implementation of the methods, adhering to the previously decided design. - Developer B will concurrently write tests for the same method signatures, without knowing details of the implementation. Integration & Testing Phase: Both developers commit their code and run the tests. - Utopian Scenario: All tests run and pass correctly. - Realistic Scenario: The tests have either broken or failed due to flaws in testing. This leads to further clarification of the design and a discussion of why the tests failed. The developers will repeat the three phases until the code is functional and tested.","title":"Steps"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#when-to-follow-the-rtt-strategy","text":"RTT works well under specific circumstances. If collaboration needs to happen virtually, and all communication is virtual, RTT reduces the need for constant communication while maintaining the benefits of a joint design session. This considers the human element: Virtual communication is more exhausting than in person communication. RTT also works well when there is complete consensus, or no consensus at all, on what purpose the code serves. Since creating the design jointly and agreeing to implement and test against it are part of the RTT method, RTT forcibly creates clarity through iteration and communication.","title":"When to Follow the RTT Strategy"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#benefits","text":"RTT has many of the same benefits as Pair Programming and Test-Driven development but tries to update them for a virtual setting. 
Code implementation and testing can be done in parallel, over long distances or across time zones, which reduces the overall time taken to finish writing the code. RTT maintains the pair programming paradigm, while reducing the need for video communication or constant communication between developers. RTT allows detailed focus on design and engineering alignment before implementing any code, leading to cleaner and simpler interfaces. RTT encourages testing to be prioritized alongside implementation, instead of having testing follow or be influenced by the implementation of the code. Documentation is inherently a part of RTT, since both the implementer and the tester need correct, up to date documentation, in the implementation phase.","title":"Benefits"},{"location":"agile-development/advanced-topics/collaboration/virtual-collaboration/#what-you-need-for-rtt-to-work-well","text":"Demand for constant communication and good teamwork may pose a challenge; daily updates amongst team members are essential to maintain alignment on varying code requirements. Clarity of the code design and testing strategy must be established beforehand and documented as reference. Lack of an established design will cause misalignment between the two major pieces of work and a need for time-consuming refactoring. RTT does not work well if only one developer has knowledge of the overall design. Team communication is critical to ensuring that every developer involved in RTT is on the same page.","title":"What You Need for RTT to Work Well"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/","text":"Why Collaboration Why is Collaboration Important In engagements, we aim to be highly collaborative because when we code together, we perform better, have a higher sprint velocity, and have a greater degree of knowledge sharing across the team. There are two common patterns we use for collaboration: Pairing and swarming. Pair programming (\u201cpairing\u201d) - two software engineers assigned to, and working on, one shared story at a time during the sprint. The Dev Lead assigns a user story to two engineers -- one primary engineer (story owner) and one secondary engineer (pairing assignee). Swarm programming (\u201cswarming\u201d) - three or more software engineers collaborating on a high-priority item to bring it to completion. How to Pair Program As mentioned, every story is intentionally assigned to a pair. The pairing assignee may be in the process of upskilling, nevertheless, they are equal partners in the development effort. Below are some general guidelines for pairing: Upon assignment of the story/product backlog item (PBI), the pair needs to be deliberate about defining how to work together and have a firm definition of the work to be completed. This information should be expressed clearly in the story\u2019s description and acceptance criteria. The expectations about this need to be communicated and agreed upon by both engineers and should be done prior to any actual working sessions. The story owner and pairing assignee do not merely split the work up and sync regularly \u2013 they actively work together on the same tasks, and might share their screens via a Teams online session. Collaborative tools like VS Live Share can be preferable to sharing screens. Not all collaboration needs to be screen-share based. During the collaborative sessions, one engineer provides the development environment while the other actively views and comments verbally. 
Engineers trade places often from one session to the next so that everyone has time in control of the keyboard. Engineers leverage feature branches for the collaboration during the development of each story to have small Pull Requests (PRs) (as opposed to a single giant PR) at the end of the sprint. Code is committed to the repository by both members of the assigned pair where and when it makes sense as tasks were completed. The pairing assignee is the voice representing the pair during the daily standup while being supported by the story owner. Having the names of both individuals (owner and pair assignee) visible on the PBI can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. An example of this using Azure DevOps cards can be found here . Why Pair Programming Helps Collaboration Pair programming helps collaboration because both engineers share equal responsibility for bringing the story to completion. This is a mutually beneficial exercise because, while the story owner often has more experience to lean on, the pairing assignee brings a fresh view that is unclouded by repetition. Some other benefits include: Fewer defects and increased accountability. Having two sets of eyes allows the engineers more opportunity to catch errors and to remember often-overlooked tasks such as writing unit and integration tests. Pairing allows engineers with different experience and expertise to learn from one another by collaborating and receiving feedback in real-time. Instead of having an engineer work alone on a task for long hours and hit an isolation breaking point, pairing allows the pair to check in with one another. Even something as simple as describing the problem out loud can help uncover issues or bugs in the code. Pairing can help brainstorming as well as validating details such as making the variable names consistent. When to Swarm Program It is important to know that not every PBI needs to use swarming. Some sprints may not even warrant swarming at all. Swarm when: The work is complex enough to have collective minds collaborating (not because the quantity of work is more than what would be completed in one sprint). The task that the swarm works on has become (or is in imminent danger of becoming) a blocker to other stories. An unknown is discovered that needs a collaborative effort to form a decision on how to move forward. The collective knowledge and expertise help move the story forward more quickly and ultimately produced better quality code. A conflict or unresolved difference of opinion arises during a pairing session. Promote the work to become a swarming session to help resolve the conflict. How to Swarm Program As soon the pair finds out that the PBI will warrant swarming, the pair brings it up to the rest of the team (via parking lot during stand-up or asynchronously). Members of the team agree or volunteer to assist. The story owner (or pairing assignee) sends Teams call invite to the interested parties. This allows the swarm to have dedicated focus time by blocking time in calendars. During a swarming session, an engineer can branch out if there is something that needs to be handled while the swarm tackles the main problem at hand, then reconnects and reports back. This allows the swarm to focus on a core aspect and to be all on the same page. The Teams call is repeated until resolution is found or alternative path forward is formulated. 
Why Swarm Programming Helps Collaboration Swarming allows the collective knowledge and expertise of the team to come together in a focused and unified way. Not only does swarming help close out the item faster, but it also helps the team understand each other\u2019s strengths and weaknesses. Allows the team to build a higher level of trust and work as a cohesive unit. When to Decide to Swarm, Pair, and/or Split While a lot of time can be spent on pair programming, it does make sense to split the work when folks understand how the work will be carried out, and the work to be done is largely prescriptive. Once the story has been jointly tasked out by both engineers, the engineers may choose to tackle some tasks separately and then combine the work together at the end. Pair programming is more helpful when the engineers do not have perfect clarity about what is needed to be done or how it can be done. Swarming is done when the two engineers assigned to the story need an additional sounding board or need expertise that other team members could provide. Benefits of Increased Collaboration Knowledge sharing and bringing ISE and customer engineers together in a \u2018code-with\u2019 manner is an important aspect of ISE engagements. This grows both our customers\u2019 and our ISE team\u2019s capability to build on Azure. We are responsible for demonstrating engineering fundamentals and leaving the customer in a better place after we disengage. This can only happen if we collaborate and engage together as a team. In addition to improved software quality, this also adds a beneficial social aspect to the engagements. Resources How to add a pairing custom field in Azure DevOps User Stories - adding a custom field of type Identity in Azure DevOps for pairing On Pair Programming - Martin Fowler Pair Programming hands-on lessons - these can be used (and adapted) to support bringing pair programming into your team (MS internal or including customers) Effortless Pair Programming with GitHub Codespaces and VSCode","title":"Why Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-collaboration","text":"","title":"Why Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-is-collaboration-important","text":"In engagements, we aim to be highly collaborative because when we code together, we perform better, have a higher sprint velocity, and have a greater degree of knowledge sharing across the team. There are two common patterns we use for collaboration: Pairing and swarming. Pair programming (\u201cpairing\u201d) - two software engineers assigned to, and working on, one shared story at a time during the sprint. The Dev Lead assigns a user story to two engineers -- one primary engineer (story owner) and one secondary engineer (pairing assignee). Swarm programming (\u201cswarming\u201d) - three or more software engineers collaborating on a high-priority item to bring it to completion.","title":"Why is Collaboration Important"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#how-to-pair-program","text":"As mentioned, every story is intentionally assigned to a pair. The pairing assignee may be in the process of upskilling, nevertheless, they are equal partners in the development effort. 
Below are some general guidelines for pairing: Upon assignment of the story/product backlog item (PBI), the pair needs to be deliberate about defining how to work together and have a firm definition of the work to be completed. This information should be expressed clearly in the story\u2019s description and acceptance criteria. The expectations about this need to be communicated and agreed upon by both engineers and should be done prior to any actual working sessions. The story owner and pairing assignee do not merely split the work up and sync regularly \u2013 they actively work together on the same tasks, and might share their screens via a Teams online session. Collaborative tools like VS Live Share can be preferable to sharing screens. Not all collaboration needs to be screen-share based. During the collaborative sessions, one engineer provides the development environment while the other actively views and comments verbally. Engineers trade places often from one session to the next so that everyone has time in control of the keyboard. Engineers leverage feature branches for the collaboration during the development of each story to have small Pull Requests (PRs) (as opposed to a single giant PR) at the end of the sprint. Code is committed to the repository by both members of the assigned pair where and when it makes sense as tasks were completed. The pairing assignee is the voice representing the pair during the daily standup while being supported by the story owner. Having the names of both individuals (owner and pair assignee) visible on the PBI can be helpful during sprint ceremonies and lead to greater accountability by the pairing assignee. An example of this using Azure DevOps cards can be found here .","title":"How to Pair Program"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-pair-programming-helps-collaboration","text":"Pair programming helps collaboration because both engineers share equal responsibility for bringing the story to completion. This is a mutually beneficial exercise because, while the story owner often has more experience to lean on, the pairing assignee brings a fresh view that is unclouded by repetition. Some other benefits include: Fewer defects and increased accountability. Having two sets of eyes allows the engineers more opportunity to catch errors and to remember often-overlooked tasks such as writing unit and integration tests. Pairing allows engineers with different experience and expertise to learn from one another by collaborating and receiving feedback in real-time. Instead of having an engineer work alone on a task for long hours and hit an isolation breaking point, pairing allows the pair to check in with one another. Even something as simple as describing the problem out loud can help uncover issues or bugs in the code. Pairing can help brainstorming as well as validating details such as making the variable names consistent.","title":"Why Pair Programming Helps Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#when-to-swarm-program","text":"It is important to know that not every PBI needs to use swarming. Some sprints may not even warrant swarming at all. Swarm when: The work is complex enough to have collective minds collaborating (not because the quantity of work is more than what would be completed in one sprint). The task that the swarm works on has become (or is in imminent danger of becoming) a blocker to other stories. 
An unknown is discovered that needs a collaborative effort to form a decision on how to move forward. The collective knowledge and expertise help move the story forward more quickly and ultimately produced better quality code. A conflict or unresolved difference of opinion arises during a pairing session. Promote the work to become a swarming session to help resolve the conflict.","title":"When to Swarm Program"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#how-to-swarm-program","text":"As soon the pair finds out that the PBI will warrant swarming, the pair brings it up to the rest of the team (via parking lot during stand-up or asynchronously). Members of the team agree or volunteer to assist. The story owner (or pairing assignee) sends Teams call invite to the interested parties. This allows the swarm to have dedicated focus time by blocking time in calendars. During a swarming session, an engineer can branch out if there is something that needs to be handled while the swarm tackles the main problem at hand, then reconnects and reports back. This allows the swarm to focus on a core aspect and to be all on the same page. The Teams call is repeated until resolution is found or alternative path forward is formulated.","title":"How to Swarm Program"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#why-swarm-programming-helps-collaboration","text":"Swarming allows the collective knowledge and expertise of the team to come together in a focused and unified way. Not only does swarming help close out the item faster, but it also helps the team understand each other\u2019s strengths and weaknesses. Allows the team to build a higher level of trust and work as a cohesive unit.","title":"Why Swarm Programming Helps Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#when-to-decide-to-swarm-pair-andor-split","text":"While a lot of time can be spent on pair programming, it does make sense to split the work when folks understand how the work will be carried out, and the work to be done is largely prescriptive. Once the story has been jointly tasked out by both engineers, the engineers may choose to tackle some tasks separately and then combine the work together at the end. Pair programming is more helpful when the engineers do not have perfect clarity about what is needed to be done or how it can be done. Swarming is done when the two engineers assigned to the story need an additional sounding board or need expertise that other team members could provide.","title":"When to Decide to Swarm, Pair, and/or Split"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#benefits-of-increased-collaboration","text":"Knowledge sharing and bringing ISE and customer engineers together in a \u2018code-with\u2019 manner is an important aspect of ISE engagements. This grows both our customers\u2019 and our ISE team\u2019s capability to build on Azure. We are responsible for demonstrating engineering fundamentals and leaving the customer in a better place after we disengage. This can only happen if we collaborate and engage together as a team. 
In addition to improved software quality, this also adds a beneficial social aspect to the engagements.","title":"Benefits of Increased Collaboration"},{"location":"agile-development/advanced-topics/collaboration/why-collaboration/#resources","text":"How to add a pairing custom field in Azure DevOps User Stories - adding a custom field of type Identity in Azure DevOps for pairing On Pair Programming - Martin Fowler Pair Programming hands-on lessons - these can be used (and adapted) to support bringing pair programming into your team (MS internal or including customers) Effortless Pair Programming with GitHub Codespaces and VSCode","title":"Resources"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/","text":"Delivery Plan Goals While Scrum does not require, and in fact discourages, planning more than one sprint at a time, most of us work in enterprises where we depend on outside teams (for example: marketing, sales, support) and need a rough assessment of whether the planned project scope is achievable within a reasonable time frame and with the available resources. The goal is to have a rough plan and estimate as a starting point, not to implement \"Agilefall.\" Note that this is just a starting point to enable planning discussions. We expect the actual schedule to evolve and shift over time and that you will update the scope and timeline as you progress. Delivery Plans help ensure your teams are aligned with your organizational goals. Benefits As you complete the assessment, you can push back on the scope, time frame or ask for more resources. As you progress in your project/product delivery, you can highlight risks to the scope, time frame, and resources. Approach One approach you can take to accomplish this is with stickies and a spreadsheet. Stack rank the features for everything in your backlog - Functional Features - Non-functional Features - User Research and Design - Testing - Documentation - Knowledge Transfer/Support Processes T-shirt size the features in terms of working weeks per person. In some scenarios, you have no idea how complex the work is. In this situation, you can ask for time to conduct a spike (timebox the effort so you can get back on time). Calculate the capacity of the team based on each person's number of working weeks between their start and end dates, minus holidays, vacation, conferences, training, and onboarding days. Also subtract time if the person is also working on defects and support. Based on your capacity, you now have these options: Ask for more resources. Caution: onboarding new resources takes time. Reduce the scope to a minimal MVP. Caution: as you trim more of the scope, it might no longer be valuable to the customer. Think of a cupcake that contains everything you need; you don't want to skim off the frosting. Ask for more time. Usually, this is the most flexible option, but if there is a marketing date that you need to hit, it might not be as flexible. Tools You can also leverage one of these tools by creating your epics and features and adding the week estimates. The Plans (Preview) feature on Azure DevOps will help you make a plan. Delivery Plans provide a schedule of stories or features your team plans to deliver. Delivery Plans show the scheduled work items by a sprint (iteration path) of selected teams against a calendar view.
Confluence, JIRA, Trello, Rally, Asana, Basecamp, and GitHub Issues are other similar tools in the market (some are free, others you pay a monthly fee, or you can install on-prem) that you can leverage.","title":"Delivery Plan"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#delivery-plan","text":"","title":"Delivery Plan"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#goals","text":"While Scrum does not require, and in fact discourages, planning more than one sprint at a time, most of us work in enterprises where we depend on outside teams (for example: marketing, sales, support) and need a rough assessment of whether the planned project scope is achievable within a reasonable time frame and with the available resources. The goal is to have a rough plan and estimate as a starting point, not to implement \"Agilefall.\" Note that this is just a starting point to enable planning discussions. We expect the actual schedule to evolve and shift over time and that you will update the scope and timeline as you progress. Delivery Plans help ensure your teams are aligned with your organizational goals.","title":"Goals"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#benefits","text":"As you complete the assessment, you can push back on the scope, time frame or ask for more resources. As you progress in your project/product delivery, you can highlight risks to the scope, time frame, and resources.","title":"Benefits"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#approach","text":"One approach you can take to accomplish this is with stickies and a spreadsheet. Stack rank the features for everything in your backlog - Functional Features - Non-functional Features - User Research and Design - Testing - Documentation - Knowledge Transfer/Support Processes T-shirt size the features in terms of working weeks per person. In some scenarios, you have no idea how complex the work is. In this situation, you can ask for time to conduct a spike (timebox the effort so you can get back on time). Calculate the capacity of the team based on each person's number of working weeks between their start and end dates, minus holidays, vacation, conferences, training, and onboarding days. Also subtract time if the person is also working on defects and support. Based on your capacity, you now have these options: Ask for more resources. Caution: onboarding new resources takes time. Reduce the scope to a minimal MVP. Caution: as you trim more of the scope, it might no longer be valuable to the customer. Think of a cupcake that contains everything you need; you don't want to skim off the frosting. Ask for more time. Usually, this is the most flexible option, but if there is a marketing date that you need to hit, it might not be as flexible.","title":"Approach"},{"location":"agile-development/advanced-topics/effective-organization/delivery-plan/#tools","text":"You can also leverage one of these tools by creating your epics and features and adding the week estimates. The Plans (Preview) feature on Azure DevOps will help you make a plan. Delivery Plans provide a schedule of stories or features your team plans to deliver. Delivery Plans show the scheduled work items by a sprint (iteration path) of selected teams against a calendar view.
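The capacity arithmetic described in the Approach section above is simple enough to sketch in a few lines. The following Python snippet is only an illustration of that calculation; the team members, numbers, and field names are hypothetical assumptions, not values from the playbook.

```python
from dataclasses import dataclass

@dataclass
class TeamMember:
    name: str
    weeks_on_project: float    # working weeks between the person's start and end date
    weeks_off: float           # holidays, vacation, conferences, training, onboarding
    support_allocation: float  # fraction of time spent on defects/support (0.0 - 1.0)

    @property
    def capacity_weeks(self) -> float:
        available = max(self.weeks_on_project - self.weeks_off, 0.0)
        return available * (1.0 - self.support_allocation)

def team_capacity(members: list[TeamMember]) -> float:
    """Total person-weeks the team can spend on the planned scope."""
    return sum(m.capacity_weeks for m in members)

# Example: compare capacity against the T-shirt-sized backlog estimate.
team = [
    TeamMember("Dev A", weeks_on_project=12, weeks_off=2, support_allocation=0.2),
    TeamMember("Dev B", weeks_on_project=10, weeks_off=1, support_allocation=0.0),
]
estimated_scope_weeks = 25  # sum of per-feature estimates in working weeks

capacity = team_capacity(team)
print(f"Capacity: {capacity:.1f} person-weeks, scope: {estimated_scope_weeks}")
if capacity < estimated_scope_weeks:
    print("Gap detected: cut scope, ask for more time, or ask for more people.")
```

Comparing the resulting capacity to the estimated scope is what lets you push back on scope, time frame, or resources early in the engagement.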
Confluence JIRA, Trello, Rally, Asana, Basecamp, and GitHub Issues are other similar tools in the market (some are free, others you pay a monthly fee, or you can install on-prem) that you can leverage.","title":"Tools"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/","text":"Scrum of Scrums Scrum of scrums is a technique used to scale Scrum to a larger group working towards the same project goal. In Scrum, we consider a team being too big when going over 10-12 individuals. This should be decided on a case by case basis. If the project is set up in multiple work streams that contain a fixed group of people and a common stand-up meeting is slowing down productivity: scrum of scrums should be considered. The team would identify the different subgroups that would act as a separate scrum teams with their own backlog, board and stand-up. Goals The goal of the scrum of scrums ceremony is to give sub-teams the agility they need while not loosing visibility and coordination. It also helps to ensure that the sub-teams are achieving their sprint goals, and they are going in the right direction to achieve the overall project goal. The scrum of scrums ceremony happens every day and can be seen as a regular stand-up: What was done the day before by the sub-team. What will be done today by the sub-team. What are blockers or other issues for the sub-team. What are the blockers or issues that may impact other sub-teams. The outcome of the meeting will result in a list of impediments related to coordination of the whole project. Solutions could be: agreeing on interfaces between teams, discussing architecture changes, evolving responsibility boundaries, etc. This list of impediments is usually managed in a separate backlog but does not have to. Participation The common guideline is to have on average one person per sub-team to participate in the scrum of scrums. Ideally, the Process Lead of each sub-team would represent them in this ceremony. In some instances, the representative for the day is selected at the end of each sub-team daily stand-up and could change every day. In practice, having a fixed representative tends to be more efficient in the long term. Impact This practice is helpful in cases of longer projects and with a larger scope, requiring more people. When having more people, it is usually easier to divide the project in sub-teams. Having a daily scrum of scrums improves communication, lowers the risk of integration issues and increases the project chances of success. When choosing to implement Scrum of Scrums, you need to keep in mind that some team members will have additional meetings to coordinate and participate in. Also: all team members for each sub-team need to be updated on the decisions at a later point to ensure a good flow of information. Measures The easiest way to measure the impact is by tracking the time to resolve issues in the scrum of scrums backlog. You can also track issues reported during the retrospective related to global coordination (is it well done? can it be improved?). Facilitation Guidance This should be facilitated like a regular stand-up.","title":"Scrum of Scrums"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#scrum-of-scrums","text":"Scrum of scrums is a technique used to scale Scrum to a larger group working towards the same project goal. In Scrum, we consider a team being too big when going over 10-12 individuals. This should be decided on a case by case basis. 
If the project is set up in multiple work streams that contain a fixed group of people and a common stand-up meeting is slowing down productivity: scrum of scrums should be considered. The team would identify the different subgroups that would act as a separate scrum teams with their own backlog, board and stand-up.","title":"Scrum of Scrums"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#goals","text":"The goal of the scrum of scrums ceremony is to give sub-teams the agility they need while not loosing visibility and coordination. It also helps to ensure that the sub-teams are achieving their sprint goals, and they are going in the right direction to achieve the overall project goal. The scrum of scrums ceremony happens every day and can be seen as a regular stand-up: What was done the day before by the sub-team. What will be done today by the sub-team. What are blockers or other issues for the sub-team. What are the blockers or issues that may impact other sub-teams. The outcome of the meeting will result in a list of impediments related to coordination of the whole project. Solutions could be: agreeing on interfaces between teams, discussing architecture changes, evolving responsibility boundaries, etc. This list of impediments is usually managed in a separate backlog but does not have to.","title":"Goals"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#participation","text":"The common guideline is to have on average one person per sub-team to participate in the scrum of scrums. Ideally, the Process Lead of each sub-team would represent them in this ceremony. In some instances, the representative for the day is selected at the end of each sub-team daily stand-up and could change every day. In practice, having a fixed representative tends to be more efficient in the long term.","title":"Participation"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#impact","text":"This practice is helpful in cases of longer projects and with a larger scope, requiring more people. When having more people, it is usually easier to divide the project in sub-teams. Having a daily scrum of scrums improves communication, lowers the risk of integration issues and increases the project chances of success. When choosing to implement Scrum of Scrums, you need to keep in mind that some team members will have additional meetings to coordinate and participate in. Also: all team members for each sub-team need to be updated on the decisions at a later point to ensure a good flow of information.","title":"Impact"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#measures","text":"The easiest way to measure the impact is by tracking the time to resolve issues in the scrum of scrums backlog. You can also track issues reported during the retrospective related to global coordination (is it well done? can it be improved?).","title":"Measures"},{"location":"agile-development/advanced-topics/effective-organization/scrum-of-scrums/#facilitation-guidance","text":"This should be facilitated like a regular stand-up.","title":"Facilitation Guidance"},{"location":"agile-development/team-agreements/definition-of-done/","text":"Definition of Done To close a user story, a sprint, or a milestone it is important to verify that the tasks are complete. The development team should decide together what their Definition of Done is and document this in the project. 
Below are some examples of checks to verify that the user story, sprint, task is completed. Feature/User Story Acceptance criteria are met Refactoring is complete Code builds with no error Unit tests are written and pass Existing Unit Tests pass Sufficient diagnostics/telemetry are logged Code review is complete UX review is complete (if applicable) Documentation is updated The feature is merged into the develop branch The feature is signed off by the product owner Sprint Goal Definition of Done for all user stories included in the sprint are met Product backlog is updated Functional and Integration tests pass Performance tests pass End 2 End tests pass All bugs are fixed The sprint is signed off from developers, software architects, project manager, product owner etc. Release/Milestone Code Complete (goals of sprints are met) Release is marked as ready for production deployment by product owner","title":"Definition of Done"},{"location":"agile-development/team-agreements/definition-of-done/#definition-of-done","text":"To close a user story, a sprint, or a milestone it is important to verify that the tasks are complete. The development team should decide together what their Definition of Done is and document this in the project. Below are some examples of checks to verify that the user story, sprint, task is completed.","title":"Definition of Done"},{"location":"agile-development/team-agreements/definition-of-done/#featureuser-story","text":"Acceptance criteria are met Refactoring is complete Code builds with no error Unit tests are written and pass Existing Unit Tests pass Sufficient diagnostics/telemetry are logged Code review is complete UX review is complete (if applicable) Documentation is updated The feature is merged into the develop branch The feature is signed off by the product owner","title":"Feature/User Story"},{"location":"agile-development/team-agreements/definition-of-done/#sprint-goal","text":"Definition of Done for all user stories included in the sprint are met Product backlog is updated Functional and Integration tests pass Performance tests pass End 2 End tests pass All bugs are fixed The sprint is signed off from developers, software architects, project manager, product owner etc.","title":"Sprint Goal"},{"location":"agile-development/team-agreements/definition-of-done/#releasemilestone","text":"Code Complete (goals of sprints are met) Release is marked as ready for production deployment by product owner","title":"Release/Milestone"},{"location":"agile-development/team-agreements/definition-of-ready/","text":"Definition of Ready When the development team picks a user story from the top of the backlog, the user story needs to have enough detail to estimate the work needed to complete the story within the sprint. If it has enough detail to estimate, it is Ready to be developed. If a user story is not Ready in the beginning of the Sprint it increases the chance that the story will not be done at the end of this sprint. What it is Definition of Ready is the agreement made by the scrum team around how complete a user story should be in order to be selected as candidate for estimation in the sprint planning. These can be codified as a checklist in user stories using GitHub Issue Templates or Azure DevOps Work Item Templates . It can be understood as a checklist that helps the Product Owner to ensure that the user story they wrote contains all the necessary details for the scrum team to understand the work to be done. 
Examples of Ready Checklist Items Does the description have the details including any input values required to implement the user story? Does the user story have clear and complete acceptance criteria? Does the user story address the business need? Can we measure the acceptance criteria? Is the user story small enough to be implemented in a short amount of time, but large enough to provide value to the customer? Is the user story blocked? For example, does it depend on any of the following: The completion of unfinished work A deliverable provided by another team (code artifact, data, etc...) Who Writes it The ready checklist can be written by a Product Owner in agreement with the development team and the Process Lead. When Should a Definition of Ready be Updated Update or change the definition of ready anytime the scrum team observes that there are missing information in the user stories that recurrently impacts the planning. What Should be Avoided The ready checklist should contain items that apply broadly. Don't include items or details that only apply to one or two user stories. This may become an overhead when writing the user stories. How to get Stories Ready In the case that the highest priority work is not yet ready, it still may be possible to make forward progress. Here are some strategies that may help: Backlog Refinement sessions are a good time to validate that high priority user stories are verified to have a clear description, acceptance criteria and demonstrable business value. It is also a good time to breakdown large stories will likely not be completable in a single sprint. Prioritization sessions are a good time to prioritize user stories that unblock other blocked high priority work. Blocked user stories can often be broken down in a way that unblocks a portion of the original stories scope. This is a good way to make forward progress even when some work is blocked.","title":"Definition of Ready"},{"location":"agile-development/team-agreements/definition-of-ready/#definition-of-ready","text":"When the development team picks a user story from the top of the backlog, the user story needs to have enough detail to estimate the work needed to complete the story within the sprint. If it has enough detail to estimate, it is Ready to be developed. If a user story is not Ready in the beginning of the Sprint it increases the chance that the story will not be done at the end of this sprint.","title":"Definition of Ready"},{"location":"agile-development/team-agreements/definition-of-ready/#what-it-is","text":"Definition of Ready is the agreement made by the scrum team around how complete a user story should be in order to be selected as candidate for estimation in the sprint planning. These can be codified as a checklist in user stories using GitHub Issue Templates or Azure DevOps Work Item Templates . It can be understood as a checklist that helps the Product Owner to ensure that the user story they wrote contains all the necessary details for the scrum team to understand the work to be done.","title":"What it is"},{"location":"agile-development/team-agreements/definition-of-ready/#examples-of-ready-checklist-items","text":"Does the description have the details including any input values required to implement the user story? Does the user story have clear and complete acceptance criteria? Does the user story address the business need? Can we measure the acceptance criteria? 
Is the user story small enough to be implemented in a short amount of time, but large enough to provide value to the customer? Is the user story blocked? For example, does it depend on any of the following: The completion of unfinished work A deliverable provided by another team (code artifact, data, etc...)","title":"Examples of Ready Checklist Items"},{"location":"agile-development/team-agreements/definition-of-ready/#who-writes-it","text":"The ready checklist can be written by a Product Owner in agreement with the development team and the Process Lead.","title":"Who Writes it"},{"location":"agile-development/team-agreements/definition-of-ready/#when-should-a-definition-of-ready-be-updated","text":"Update or change the definition of ready anytime the scrum team observes that there are missing information in the user stories that recurrently impacts the planning.","title":"When Should a Definition of Ready be Updated"},{"location":"agile-development/team-agreements/definition-of-ready/#what-should-be-avoided","text":"The ready checklist should contain items that apply broadly. Don't include items or details that only apply to one or two user stories. This may become an overhead when writing the user stories.","title":"What Should be Avoided"},{"location":"agile-development/team-agreements/definition-of-ready/#how-to-get-stories-ready","text":"In the case that the highest priority work is not yet ready, it still may be possible to make forward progress. Here are some strategies that may help: Backlog Refinement sessions are a good time to validate that high priority user stories are verified to have a clear description, acceptance criteria and demonstrable business value. It is also a good time to breakdown large stories will likely not be completable in a single sprint. Prioritization sessions are a good time to prioritize user stories that unblock other blocked high priority work. Blocked user stories can often be broken down in a way that unblocks a portion of the original stories scope. This is a good way to make forward progress even when some work is blocked.","title":"How to get Stories Ready"},{"location":"agile-development/team-agreements/team-manifesto/","text":"Team Manifesto Introduction ISE teams work with a new development team in each customer engagement which requires a phase of introduction & knowledge transfer before starting an engagement. Completion of this phase of ice-breakers and discussions about the standards takes time, but is required to start increasing the learning curve of the new team. A team manifesto is a light-weight one page agile document among team members which summarizes the basic principles and values of the team and aiming to provide a consensus about technical expectations from each team member in order to deliver high quality output at the end of each engagement. It aims to reduce the time on setting the right expectations without arranging longer \"team document reading\" meetings and provide a consensus among team members to answer the question - \"How does the new team develop the software?\" - by covering all engineering fundamentals and excellence topics such as release process, clean coding, testing. Another main goal of writing the manifesto is to start a conversation during the \"manifesto building session\" to detect any differences of opinion around how the team should work. It also serves in the same way when a new team member joins to the team. New joiners can quickly get up to speed on the agreed standards. 
How to Build a Team Manifesto It can be said that the best time to start building it is at the very early phase of the engagement when teams meet with each other for swarming or during the preparation phase. It is recommended to keep team manifesto as simple as possible, so preferably, one-page simple document which doesn't include any references or links is a nice format for it. If there is a need for providing knowledge on certain topics, the way to do is delivering brown-bag sessions, technical katas, team practices, documentations and others later on. A few important points about the team manifesto The team manifesto is built by the development team itself It should cover all required technical engineering points for the excellence as well as behavioral agility mindset items that the team finds relevant It aims to give a common understanding about the desired expertise, practices and/or mindset within the team Based on the needs of the team and retrospective results, it can be modified during the engagement. In ISE, we aim for quality over quantity, and well-crafted software as well as to a comfortable/transparent environment where each team member can reach their highest potential. The difference between the team manifesto and other team documents is that it is used to give a short summary of expectations around the technical way of working and supported mindset in the team, before code-with sprints starts. Below, you can find some including, but not limited, topics many teams touch during engagements, Topic What is it about ? Collective Ownership Does team own the code rather than individuals? What is the expectation? Respect Any preferred statement about it's a \"must-have\" team value Collaboration Any preferred statement about how does team want to collaborate ? Transparency A simple statement about it's a \"must-have\" team value and if preferred, how does this being provided by the team ? meetings, retrospective, feedback mechanisms etc. Craftspersonship Which tools such as Git, VS Code LiveShare, etc. are being used ? What is the definition of expected best usage of them? PR sizing What does team prefer in PRs ? Branching Team's branching strategy and standards Commit standards Preferred format in commit messages, rules and more Clean Code Does team follow clean code principles ? Pair/Mob Programming Will team apply pair/mob programming ? If yes, what programming styles are suitable for the team ? Release Process Principles around release process such as quality gates, reviewing process ...etc. Code Review Any rule for code reviewing such as min number of reviewers, team rules ...etc. Action Readiness How the backlog will be refined? How do we ensure clear Definition of Done and Acceptance Criteria ? TDD Will the team follow TDD ? Test Coverage Is there any expected number, percentage or measurement ? Dimensions in Testing Required tests for high quality software, eg : unit, integration, functional, performance, regression, acceptance Build process build for all? or not; The clear statement of where code and under what conditions code should work ? eg : OS, DevOps, tool dependency Bug fix The rules of bug fixing in the team ? eg: contact people, attaching PR to the issue etc. Technical debt How does team manage/follow it? Refactoring How does team manage/follow it? Agile Documentation Does team want to use diagrams and tables more rather than detailed KB articles ? Efficient Documentation When is it necessary ? Is it a prerequisite to complete tasks/PRs etc.? 
Definition of Fun How will we have fun for relaxing/enjoying the team spirit during the engagement? Tools Generally team sessions are enough for building a manifesto and having a consensus around it, and if there is a need for improving it in a structured way, there are many blogs and tools online, any retrospective tool can be used. Resources Technical Agility*","title":"Team Manifesto"},{"location":"agile-development/team-agreements/team-manifesto/#team-manifesto","text":"","title":"Team Manifesto"},{"location":"agile-development/team-agreements/team-manifesto/#introduction","text":"ISE teams work with a new development team in each customer engagement which requires a phase of introduction & knowledge transfer before starting an engagement. Completion of this phase of ice-breakers and discussions about the standards takes time, but is required to start increasing the learning curve of the new team. A team manifesto is a light-weight one page agile document among team members which summarizes the basic principles and values of the team and aiming to provide a consensus about technical expectations from each team member in order to deliver high quality output at the end of each engagement. It aims to reduce the time on setting the right expectations without arranging longer \"team document reading\" meetings and provide a consensus among team members to answer the question - \"How does the new team develop the software?\" - by covering all engineering fundamentals and excellence topics such as release process, clean coding, testing. Another main goal of writing the manifesto is to start a conversation during the \"manifesto building session\" to detect any differences of opinion around how the team should work. It also serves in the same way when a new team member joins to the team. New joiners can quickly get up to speed on the agreed standards.","title":"Introduction"},{"location":"agile-development/team-agreements/team-manifesto/#how-to-build-a-team-manifesto","text":"It can be said that the best time to start building it is at the very early phase of the engagement when teams meet with each other for swarming or during the preparation phase. It is recommended to keep team manifesto as simple as possible, so preferably, one-page simple document which doesn't include any references or links is a nice format for it. If there is a need for providing knowledge on certain topics, the way to do is delivering brown-bag sessions, technical katas, team practices, documentations and others later on. A few important points about the team manifesto The team manifesto is built by the development team itself It should cover all required technical engineering points for the excellence as well as behavioral agility mindset items that the team finds relevant It aims to give a common understanding about the desired expertise, practices and/or mindset within the team Based on the needs of the team and retrospective results, it can be modified during the engagement. In ISE, we aim for quality over quantity, and well-crafted software as well as to a comfortable/transparent environment where each team member can reach their highest potential. The difference between the team manifesto and other team documents is that it is used to give a short summary of expectations around the technical way of working and supported mindset in the team, before code-with sprints starts. Below, you can find some including, but not limited, topics many teams touch during engagements, Topic What is it about ? 
Collective Ownership Does team own the code rather than individuals? What is the expectation? Respect Any preferred statement about it's a \"must-have\" team value Collaboration Any preferred statement about how does team want to collaborate ? Transparency A simple statement about it's a \"must-have\" team value and if preferred, how does this being provided by the team ? meetings, retrospective, feedback mechanisms etc. Craftspersonship Which tools such as Git, VS Code LiveShare, etc. are being used ? What is the definition of expected best usage of them? PR sizing What does team prefer in PRs ? Branching Team's branching strategy and standards Commit standards Preferred format in commit messages, rules and more Clean Code Does team follow clean code principles ? Pair/Mob Programming Will team apply pair/mob programming ? If yes, what programming styles are suitable for the team ? Release Process Principles around release process such as quality gates, reviewing process ...etc. Code Review Any rule for code reviewing such as min number of reviewers, team rules ...etc. Action Readiness How the backlog will be refined? How do we ensure clear Definition of Done and Acceptance Criteria ? TDD Will the team follow TDD ? Test Coverage Is there any expected number, percentage or measurement ? Dimensions in Testing Required tests for high quality software, eg : unit, integration, functional, performance, regression, acceptance Build process build for all? or not; The clear statement of where code and under what conditions code should work ? eg : OS, DevOps, tool dependency Bug fix The rules of bug fixing in the team ? eg: contact people, attaching PR to the issue etc. Technical debt How does team manage/follow it? Refactoring How does team manage/follow it? Agile Documentation Does team want to use diagrams and tables more rather than detailed KB articles ? Efficient Documentation When is it necessary ? Is it a prerequisite to complete tasks/PRs etc.? Definition of Fun How will we have fun for relaxing/enjoying the team spirit during the engagement?","title":"How to Build a Team Manifesto"},{"location":"agile-development/team-agreements/team-manifesto/#tools","text":"Generally team sessions are enough for building a manifesto and having a consensus around it, and if there is a need for improving it in a structured way, there are many blogs and tools online, any retrospective tool can be used.","title":"Tools"},{"location":"agile-development/team-agreements/team-manifesto/#resources","text":"Technical Agility*","title":"Resources"},{"location":"agile-development/team-agreements/working-agreement/","text":"Sections of a Working Agreement A working agreement is a document, or a set of documents that describe how we work together as a team and what our expectations and principles are. The working agreement created by the team at the beginning of the project, and is stored in the repository so that it is readily available for everyone working on the project. The following are examples of sections and points that can be part of a working agreement but each team should compose their own, and adjust times, communication channels, branch naming policies etc. to fit their team needs. General We work as one team towards a common goal and clear scope We make sure everyone's voice is heard, listened to We show all team members equal respect We work as a team to have common expectations for technical delivery that are documented in a Team Manifesto . 
We make sure to spread our expertise and skills in the team, so no single person is relied on for one skill All times below are listed in CET Communication We communicate all information relevant to the team through the Project Teams channel We add all technical spikes , trade studies , and other technical documentation to the project repository through async design reviews in PRs Work-life Balance Our office hours, when we can expect to collaborate via Microsoft Teams, phone or face-to-face are Monday to Friday 10AM - 5PM We are not expected to answer emails past 6PM, on weekends or when we are on holidays or vacation. We work in different time zones and respect this, especially when setting up recurring meetings. We record meetings when possible, so that team members who could not attend live can listen later. Quality and not Quantity We agree on a Definition of Done for our user story's and sprints and live by it. We follow engineering best practices like the Engineering Fundamentals Engineering Playbook Scrum Rhythm Activity When Duration Who Accountable Goal Project Standup Tue-Fri 9AM 15 min Everyone Process Lead What has been accomplished, next steps, blockers Sprint Demo Monday 9AM 1 hour Everyone Dev Lead Present work done and sign off on user story completion Sprint Retro Monday 10AM 1 hour Everyone Process Lead Dev Teams shares learnings and what can be improved Sprint Planning Monday 11AM 1 hour Everyone PO Size and plan user stories for the sprint Task Creation After Sprint Planning - Dev Team Dev Lead Create tasks to clarify and determine velocity Backlog refinement Wednesday 2PM 1 hour Dev Lead, PO PO Prepare for next sprint and ensure that stories are ready for next sprint. Process Lead The Process Lead is responsible for leading any scrum or agile practices to enable the project to move forward. Facilitate standup meetings and hold team accountable for attendance and participation. Keep the meeting moving as described in the Project Standup page. Make sure all action items are documented and ensure each has an owner and a due date and tracks the open issues. Notes as needed after planning / stand-ups. Make sure that items are moved to the parking lot and ensure follow-up afterwards. Maintain a location showing team\u2019s work and status and removing impediments that are blocking the team. Hold the team accountable for results in a supportive fashion. Make sure that project and program documentation are up-to-date. Guarantee the tracking/following up on action items from retrospectives (iteration and release planning) and from daily standup meetings. Facilitate the sprint retrospective. Coach Product Owner and the team in the process, as needed. Backlog Management We work together on a Definition of Ready and all user stories assigned to a sprint need to follow this We communicate what we are working on through the board We assign ourselves a task when we are ready to work on it (not before) and move it to active We capture any work we do related to the project in a user story/task We close our tasks/user stories only when they are done (as described in the Definition of Done ) We work with the PM if we want to add a new user story to the sprint If we add new tasks to the board, we make sure it matches the acceptance criteria of the user story (to avoid scope creep). If it doesn't match the acceptance criteria we should discuss with the PM to see if we need a new user story for the task or if we should adjust the acceptance criteria. 
Code Management We follow the git flow branch naming convention for branches and identify the task number e.g. feature/123-add-working-agreement We merge all code into main branches through PRs All PRs are reviewed by one person from and one from Microsoft (for knowledge transfer and to ensure code and security standards are met) We always review existing PRs before starting work on a new task We look through open PRs at the end of stand-up to make sure all PRs have reviewers. We treat documentation as code and apply the same standards to Markdown as code","title":"Sections of a Working Agreement"},{"location":"agile-development/team-agreements/working-agreement/#sections-of-a-working-agreement","text":"A working agreement is a document, or a set of documents that describe how we work together as a team and what our expectations and principles are. The working agreement created by the team at the beginning of the project, and is stored in the repository so that it is readily available for everyone working on the project. The following are examples of sections and points that can be part of a working agreement but each team should compose their own, and adjust times, communication channels, branch naming policies etc. to fit their team needs.","title":"Sections of a Working Agreement"},{"location":"agile-development/team-agreements/working-agreement/#general","text":"We work as one team towards a common goal and clear scope We make sure everyone's voice is heard, listened to We show all team members equal respect We work as a team to have common expectations for technical delivery that are documented in a Team Manifesto . We make sure to spread our expertise and skills in the team, so no single person is relied on for one skill All times below are listed in CET","title":"General"},{"location":"agile-development/team-agreements/working-agreement/#communication","text":"We communicate all information relevant to the team through the Project Teams channel We add all technical spikes , trade studies , and other technical documentation to the project repository through async design reviews in PRs","title":"Communication"},{"location":"agile-development/team-agreements/working-agreement/#work-life-balance","text":"Our office hours, when we can expect to collaborate via Microsoft Teams, phone or face-to-face are Monday to Friday 10AM - 5PM We are not expected to answer emails past 6PM, on weekends or when we are on holidays or vacation. We work in different time zones and respect this, especially when setting up recurring meetings. We record meetings when possible, so that team members who could not attend live can listen later.","title":"Work-life Balance"},{"location":"agile-development/team-agreements/working-agreement/#quality-and-not-quantity","text":"We agree on a Definition of Done for our user story's and sprints and live by it. 
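Teams sometimes choose to automate parts of the code management agreement above, for example the branch naming convention. The sketch below is a hypothetical helper (the script, the regex, and the exemption for `main`/`develop` are assumptions, not playbook requirements) that could run as a pre-push hook or CI step to check that branch names follow the `feature/123-add-working-agreement` style.

```python
import re
import subprocess
import sys

# Hypothetical convention from a working agreement: <type>/<task-number>-<short-description>
BRANCH_PATTERN = re.compile(r"^(feature|bugfix|hotfix)/\d+-[a-z0-9-]+$")

def current_branch() -> str:
    """Return the name of the currently checked-out git branch."""
    return subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def main() -> int:
    branch = current_branch()
    if branch in ("main", "develop"):
        return 0  # long-lived branches are exempt from the naming rule
    if not BRANCH_PATTERN.match(branch):
        print(f"Branch '{branch}' does not match the 'feature/123-add-working-agreement' style.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```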
We follow engineering best practices like the Engineering Fundamentals Engineering Playbook","title":"Quality and not Quantity"},{"location":"agile-development/team-agreements/working-agreement/#scrum-rhythm","text":"Activity When Duration Who Accountable Goal Project Standup Tue-Fri 9AM 15 min Everyone Process Lead What has been accomplished, next steps, blockers Sprint Demo Monday 9AM 1 hour Everyone Dev Lead Present work done and sign off on user story completion Sprint Retro Monday 10AM 1 hour Everyone Process Lead Dev Teams shares learnings and what can be improved Sprint Planning Monday 11AM 1 hour Everyone PO Size and plan user stories for the sprint Task Creation After Sprint Planning - Dev Team Dev Lead Create tasks to clarify and determine velocity Backlog refinement Wednesday 2PM 1 hour Dev Lead, PO PO Prepare for next sprint and ensure that stories are ready for next sprint.","title":"Scrum Rhythm"},{"location":"agile-development/team-agreements/working-agreement/#process-lead","text":"The Process Lead is responsible for leading any scrum or agile practices to enable the project to move forward. Facilitate standup meetings and hold team accountable for attendance and participation. Keep the meeting moving as described in the Project Standup page. Make sure all action items are documented and ensure each has an owner and a due date and tracks the open issues. Notes as needed after planning / stand-ups. Make sure that items are moved to the parking lot and ensure follow-up afterwards. Maintain a location showing team\u2019s work and status and removing impediments that are blocking the team. Hold the team accountable for results in a supportive fashion. Make sure that project and program documentation are up-to-date. Guarantee the tracking/following up on action items from retrospectives (iteration and release planning) and from daily standup meetings. Facilitate the sprint retrospective. Coach Product Owner and the team in the process, as needed.","title":"Process Lead"},{"location":"agile-development/team-agreements/working-agreement/#backlog-management","text":"We work together on a Definition of Ready and all user stories assigned to a sprint need to follow this We communicate what we are working on through the board We assign ourselves a task when we are ready to work on it (not before) and move it to active We capture any work we do related to the project in a user story/task We close our tasks/user stories only when they are done (as described in the Definition of Done ) We work with the PM if we want to add a new user story to the sprint If we add new tasks to the board, we make sure it matches the acceptance criteria of the user story (to avoid scope creep). If it doesn't match the acceptance criteria we should discuss with the PM to see if we need a new user story for the task or if we should adjust the acceptance criteria.","title":"Backlog Management"},{"location":"agile-development/team-agreements/working-agreement/#code-management","text":"We follow the git flow branch naming convention for branches and identify the task number e.g. feature/123-add-working-agreement We merge all code into main branches through PRs All PRs are reviewed by one person from and one from Microsoft (for knowledge transfer and to ensure code and security standards are met) We always review existing PRs before starting work on a new task We look through open PRs at the end of stand-up to make sure all PRs have reviewers. 
We treat documentation as code and apply the same standards to Markdown as code","title":"Code Management"},{"location":"automated-testing/","text":"Testing Why Testing Tests allow us to find flaws in our software Good tests document the code by describing the intent Automated tests saves time, compared to manual tests Automated tests allow us to safely change and refactor our code without introducing regressions The Fundamentals We consider code to be incomplete if it is not accompanied by tests We write unit tests (tests without external dependencies) that can run before every PR merge to validate that we don\u2019t have regressions We write Integration tests/E2E tests that test the whole system end to end, and run them regularly We write our tests early and block any further code merging if tests fail. We run load tests/performance tests where appropriate to validate that the system performs under stress Build for Testing Testing is a critical part of the development process. It is important to build your application with testing in mind. Here are some tips to help you build for testing: Parameterize everything. Rather than hard-code any variables, consider making everything a configurable parameter with a reasonable default. This will allow you to easily change the behavior of your application during testing. Particularly during performance testing, it is common to test different values to see what impact that has on performance. If a range of defaults need to change together, consider one or more parameters which set \"modes\", changing the defaults of a group of parameters together. Document at startup. When your application starts up, it should log all parameters. This ensures the person reviewing the logs and application behavior know exactly how the application is configured. Log to console. Logging to external systems like Azure Monitor is desirable for traceability across services. This requires logs to be dispatched from the local system to the external system and that is a dependency that can fail. It is important that someone be able to console logs directly on the local system. Log to external system. In addition to console logs, logging to an external system like Azure Monitor is desirable for traceability across services and durability of logs. Log all activity. If the system is performing some activity (reading data from a database, calling an external service, etc.), it should log that activity. Ideally, there should be a log message saying the activity is starting and another log message saying the activity is complete. This allows someone reviewing the logs to understand what the application is doing and how long it is taking. Depending on how noisy this is, different messages can be associated with different log levels, but it is important to have the information available when it comes to debugging a deployed system. Correlate distributed activities. If the system is performing some activity that is distributed across multiple systems, it is important to correlate the activity across those systems. This can be done using a Correlation ID that is passed from system to system. This allows someone reviewing the logs to understand the entire flow of activity. For more information, please see Observability in Microservices . Log metadata. When logging, it is important to include metadata that is relevant to the activity. For example, a Tenant ID, Customer ID, or Order ID. 
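As a minimal sketch of the correlation and metadata tips above, a Python `logging.Filter` can attach that context to every record; the field names (`correlation_id`, `tenant_id`, `order_id`) and values are illustrative assumptions, not a required schema:

```python
import logging
import uuid


class ContextFilter(logging.Filter):
    """Injects correlation and business metadata into every log record."""

    def __init__(self, correlation_id, tenant_id, order_id):
        super().__init__()
        self.correlation_id = correlation_id
        self.tenant_id = tenant_id
        self.order_id = order_id

    def filter(self, record):
        record.correlation_id = self.correlation_id
        record.tenant_id = self.tenant_id
        record.order_id = self.order_id
        return True


# Simplified setup: a single logger whose records all carry the shared context.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s corr=%(correlation_id)s "
           "tenant=%(tenant_id)s order=%(order_id)s %(message)s",
)
logger = logging.getLogger("orders")
# In a real service the correlation ID would be read from the incoming request
# (e.g. a traceparent or x-correlation-id header) rather than generated here.
logger.addFilter(ContextFilter(str(uuid.uuid4()), "tenant-42", "order-1001"))

logger.info("Processing order")
```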
This allows someone reviewing the logs to understand the context of the activity and filter to a manageable set of logs. Log performance metrics. Even if you are using App Insights to capture how long dependency calls are taking, it is often useful to know long certain functions of your application took. It then becomes possible to evaluate the performance characteristics of your application as it is deployed on different compute platforms with different limitations on CPU, memory, and network bandwidth. For more information, please see Metrics . Map of Outcomes to Testing Techniques The table below maps outcomes (the results that you may want to achieve in your validation efforts) to one or more techniques that can be used to accomplish that outcome. When I am working on... I want to get this outcome... ...so I should consider Development Prove backward compatibility with existing callers and clients Shadow testing Development Ensure telemetry is sufficiently detailed and complete to trace and diagnose malfunction in End-to-End testing flows Distributed Debug challenges; Orphaned call chain analysis Development Ensure program logic is correct for a variety of expected, mainline, edge and unexpected inputs Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing Development Prevent regressions in logical correctness; earlier is better Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing ; Rings (each of these are expanding scopes of coverage) Development Quickly validate mainline correctness of a point of functionality (e.g. single API), manually Manual smoke testing Tools: postman, powershell, curl Development Validate interactions between components in isolation, ensuring that consumer and provider components are compatible and conform to a shared understanding documented in a contract Consumer-driven Contract Testing Development Validate that multiple components function together across multiple interfaces in a call chain, incl network hops Integration testing ; End-to-end ( End-to-End testing ) tests; Segmented end-to-end ( End-to-End testing ) Development Prove disaster recoverability \u2013 recover from corruption of data DR drills Development Find vulnerabilities in service Authentication or Authorization Scenario (security) Development Prove correct RBAC and claims interpretation of Authorization code Scenario (security) Development Document and/or enforce valid API usage Unit testing ; Functional tests; Consumer-driven Contract Testing Development Prove implementation correctness in advance of a dependency or absent a dependency Unit testing (with mocks); Unit testing (with emulators); Consumer-driven Contract Testing Development Ensure that the user interface is accessible Accessibility Development Ensure that users can operate the interface UI testing (automated) (human usability observation) Development Prevent regression in user experience UI automation; End-to-End testing Development Detect and prevent 'noisy neighbor' phenomena Load testing Development Detect availability drops Synthetic Transaction testing ; Outside-in probes Development Prevent regression in 'composite' scenario use cases / workflows (e.g. an e-commerce system might have many APIs that used together in a sequence perform a \"shop-and-buy\" scenario) End-to-End testing ; Scenario Development; Operations Prevent regressions in runtime performance metrics e.g. 
latency / cost / resource consumption; earlier is better Rings; Synthetic Transaction testing / Transaction; Rollback Watchdogs Development; Optimization Compare any given metric between 2 candidate implementations or variations in functionality Flighting; A/B testing Development; Staging Prove production system of provisioned capacity meets goals for reliability, availability, resource consumption, performance Load testing (stress) ; Spike; Soak; Performance testing Development; Staging Understand key user experience performance characteristics \u2013 latency, chattiness, resiliency to network errors Load; Performance testing ; Scenario (network partitioning) Development; Staging; Operation Discover melt points (the loads at which failure or maximum tolerable resource consumption occurs) for each individual component in the stack Squeeze; Load testing (stress) Development; Staging; Operation Discover overall system melt point (the loads at which the end-to-end system fails) and which component is the weakest link in the whole stack Squeeze; Load testing (stress) Development; Staging; Operation Measure capacity limits for given provisioning to predict or satisfy future provisioning needs Squeeze; Load testing (stress) Development; Staging; Operation Create / exercise failover runbook Failover drills Development; Staging; Operation Prove disaster recoverability \u2013 loss of data center (the meteor scenario); measure MTTR DR drills Development; Staging; Operation Understand whether observability dashboards are correct, and telemetry is complete; flowing Trace Validation; Load testing (stress) ; Scenario; End-to-End testing Development; Staging; Operation Measure impact of seasonality of traffic Load testing Development; Staging; Operation Prove Transaction and alerts correctly notify / take action Synthetic Transaction testing (negative cases); Load testing Development; Staging; Operation; Optimizing Understand scalability curve, i.e. how the system consumes resources with load Load testing (stress) ; Performance testing Operation; Optimizing Discover system behavior over long-haul time Soak Optimizing Find cost savings opportunities Squeeze Staging; Operation Measure impact of failover / scale-out (repartitioning, increasing provisioning) / scale-down Failover drills; Scale drills Staging; Operation Create/Exercise runbook for increasing/reducing provisioning Scale drills Staging; Operation Measure behavior under rapid changes in traffic Spike Staging; Optimizing Discover cost metrics per unit load volume (what factors influence cost at what load points, e.g. 
cost per million concurrent users) Load (stress) Development; Operation Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, \u2026) Chaos Development Perform unit testing on Power platform custom connectors Custom Connector Testing Technology Specific Testing Using DevTest Pattern for building containers with AzDO Using Azurite to run blob storage tests in pipeline","title":"Testing"},{"location":"automated-testing/#testing","text":"","title":"Testing"},{"location":"automated-testing/#why-testing","text":"Tests allow us to find flaws in our software Good tests document the code by describing the intent Automated tests saves time, compared to manual tests Automated tests allow us to safely change and refactor our code without introducing regressions","title":"Why Testing"},{"location":"automated-testing/#the-fundamentals","text":"We consider code to be incomplete if it is not accompanied by tests We write unit tests (tests without external dependencies) that can run before every PR merge to validate that we don\u2019t have regressions We write Integration tests/E2E tests that test the whole system end to end, and run them regularly We write our tests early and block any further code merging if tests fail. We run load tests/performance tests where appropriate to validate that the system performs under stress","title":"The Fundamentals"},{"location":"automated-testing/#build-for-testing","text":"Testing is a critical part of the development process. It is important to build your application with testing in mind. Here are some tips to help you build for testing: Parameterize everything. Rather than hard-code any variables, consider making everything a configurable parameter with a reasonable default. This will allow you to easily change the behavior of your application during testing. Particularly during performance testing, it is common to test different values to see what impact that has on performance. If a range of defaults need to change together, consider one or more parameters which set \"modes\", changing the defaults of a group of parameters together. Document at startup. When your application starts up, it should log all parameters. This ensures the person reviewing the logs and application behavior know exactly how the application is configured. Log to console. Logging to external systems like Azure Monitor is desirable for traceability across services. This requires logs to be dispatched from the local system to the external system and that is a dependency that can fail. It is important that someone be able to console logs directly on the local system. Log to external system. In addition to console logs, logging to an external system like Azure Monitor is desirable for traceability across services and durability of logs. Log all activity. If the system is performing some activity (reading data from a database, calling an external service, etc.), it should log that activity. Ideally, there should be a log message saying the activity is starting and another log message saying the activity is complete. This allows someone reviewing the logs to understand what the application is doing and how long it is taking. 
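One lightweight way to get consistent start/complete messages with durations is a small context manager. The sketch below is only illustrative; the activity names and helper functions (`fetch_customers`, `charge`) are placeholders, not part of any real API:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")


@contextmanager
def log_activity(name, level=logging.INFO):
    """Log the start and completion (with duration) of an activity."""
    logger.log(level, "%s starting", name)
    start = time.perf_counter()
    try:
        yield
        logger.log(level, "%s completed in %.3fs", name, time.perf_counter() - start)
    except Exception:
        logger.exception("%s failed after %.3fs", name, time.perf_counter() - start)
        raise


# Noisier, lower-value activities can be dropped to DEBUG without losing the data.
with log_activity("read customer records from database"):
    rows = fetch_customers()  # assumed application function

with log_activity("call payment service", level=logging.DEBUG):
    charge(rows)  # assumed application function
```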
Depending on how noisy this is, different messages can be associated with different log levels, but it is important to have the information available when it comes to debugging a deployed system. Correlate distributed activities. If the system is performing some activity that is distributed across multiple systems, it is important to correlate the activity across those systems. This can be done using a Correlation ID that is passed from system to system. This allows someone reviewing the logs to understand the entire flow of activity. For more information, please see Observability in Microservices . Log metadata. When logging, it is important to include metadata that is relevant to the activity. For example, a Tenant ID, Customer ID, or Order ID. This allows someone reviewing the logs to understand the context of the activity and filter to a manageable set of logs. Log performance metrics. Even if you are using App Insights to capture how long dependency calls are taking, it is often useful to know long certain functions of your application took. It then becomes possible to evaluate the performance characteristics of your application as it is deployed on different compute platforms with different limitations on CPU, memory, and network bandwidth. For more information, please see Metrics .","title":"Build for Testing"},{"location":"automated-testing/#map-of-outcomes-to-testing-techniques","text":"The table below maps outcomes (the results that you may want to achieve in your validation efforts) to one or more techniques that can be used to accomplish that outcome. When I am working on... I want to get this outcome... ...so I should consider Development Prove backward compatibility with existing callers and clients Shadow testing Development Ensure telemetry is sufficiently detailed and complete to trace and diagnose malfunction in End-to-End testing flows Distributed Debug challenges; Orphaned call chain analysis Development Ensure program logic is correct for a variety of expected, mainline, edge and unexpected inputs Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing Development Prevent regressions in logical correctness; earlier is better Unit testing ; Functional tests; Consumer-driven Contract Testing ; Integration testing ; Rings (each of these are expanding scopes of coverage) Development Quickly validate mainline correctness of a point of functionality (e.g. 
single API), manually Manual smoke testing Tools: postman, powershell, curl Development Validate interactions between components in isolation, ensuring that consumer and provider components are compatible and conform to a shared understanding documented in a contract Consumer-driven Contract Testing Development Validate that multiple components function together across multiple interfaces in a call chain, incl network hops Integration testing ; End-to-end ( End-to-End testing ) tests; Segmented end-to-end ( End-to-End testing ) Development Prove disaster recoverability \u2013 recover from corruption of data DR drills Development Find vulnerabilities in service Authentication or Authorization Scenario (security) Development Prove correct RBAC and claims interpretation of Authorization code Scenario (security) Development Document and/or enforce valid API usage Unit testing ; Functional tests; Consumer-driven Contract Testing Development Prove implementation correctness in advance of a dependency or absent a dependency Unit testing (with mocks); Unit testing (with emulators); Consumer-driven Contract Testing Development Ensure that the user interface is accessible Accessibility Development Ensure that users can operate the interface UI testing (automated) (human usability observation) Development Prevent regression in user experience UI automation; End-to-End testing Development Detect and prevent 'noisy neighbor' phenomena Load testing Development Detect availability drops Synthetic Transaction testing ; Outside-in probes Development Prevent regression in 'composite' scenario use cases / workflows (e.g. an e-commerce system might have many APIs that used together in a sequence perform a \"shop-and-buy\" scenario) End-to-End testing ; Scenario Development; Operations Prevent regressions in runtime performance metrics e.g. 
latency / cost / resource consumption; earlier is better Rings; Synthetic Transaction testing / Transaction; Rollback Watchdogs Development; Optimization Compare any given metric between 2 candidate implementations or variations in functionality Flighting; A/B testing Development; Staging Prove production system of provisioned capacity meets goals for reliability, availability, resource consumption, performance Load testing (stress) ; Spike; Soak; Performance testing Development; Staging Understand key user experience performance characteristics \u2013 latency, chattiness, resiliency to network errors Load; Performance testing ; Scenario (network partitioning) Development; Staging; Operation Discover melt points (the loads at which failure or maximum tolerable resource consumption occurs) for each individual component in the stack Squeeze; Load testing (stress) Development; Staging; Operation Discover overall system melt point (the loads at which the end-to-end system fails) and which component is the weakest link in the whole stack Squeeze; Load testing (stress) Development; Staging; Operation Measure capacity limits for given provisioning to predict or satisfy future provisioning needs Squeeze; Load testing (stress) Development; Staging; Operation Create / exercise failover runbook Failover drills Development; Staging; Operation Prove disaster recoverability \u2013 loss of data center (the meteor scenario); measure MTTR DR drills Development; Staging; Operation Understand whether observability dashboards are correct, and telemetry is complete; flowing Trace Validation; Load testing (stress) ; Scenario; End-to-End testing Development; Staging; Operation Measure impact of seasonality of traffic Load testing Development; Staging; Operation Prove Transaction and alerts correctly notify / take action Synthetic Transaction testing (negative cases); Load testing Development; Staging; Operation; Optimizing Understand scalability curve, i.e. how the system consumes resources with load Load testing (stress) ; Performance testing Operation; Optimizing Discover system behavior over long-haul time Soak Optimizing Find cost savings opportunities Squeeze Staging; Operation Measure impact of failover / scale-out (repartitioning, increasing provisioning) / scale-down Failover drills; Scale drills Staging; Operation Create/Exercise runbook for increasing/reducing provisioning Scale drills Staging; Operation Measure behavior under rapid changes in traffic Spike Staging; Optimizing Discover cost metrics per unit load volume (what factors influence cost at what load points, e.g. 
cost per million concurrent users) Load (stress) Development; Operation Discover points where a system is not resilient to unpredictable yet inevitable failures (network outage, hardware failure, VM host servicing, rack/switch failures, random acts of the Malevolent Divine, solar flares, sharks that eat undersea cable relays, cosmic radiation, power outages, renegade backhoe operators, wolves chewing on junction boxes, \u2026) Chaos Development Perform unit testing on Power platform custom connectors Custom Connector Testing","title":"Map of Outcomes to Testing Techniques"},{"location":"automated-testing/#technology-specific-testing","text":"Using DevTest Pattern for building containers with AzDO Using Azurite to run blob storage tests in pipeline","title":"Technology Specific Testing"},{"location":"automated-testing/cdc-testing/","text":"Consumer-Driven Contract Testing (CDC) Consumer-driven Contract Testing (or CDC for short) is a software testing methodology used to test components of a system in isolation while ensuring that provider components are compatible with the expectations that consumer components have of them. Why Consumer-Driven Contract Testing CDC tries to overcome the several painful drawbacks of automated E2E tests with components interacting together: E2E tests are slow E2E tests break easily E2E tests are expensive and hard to maintain E2E tests of larger systems may be hard or impossible to run outside a dedicated testing environment Although testing best practices suggest to write just a few E2E tests compared to the cheaper, faster and more stable integration and unit tests as pictured in the testing pyramid below, experience shows many teams end up writing too many E2E tests . A reason for this is that E2E tests give developers the highest confidence to release as they are testing the \"real\" system. CDC addresses these issues by testing interactions between components in isolation using mocks that conform to a shared understanding documented in a \"contract\". Contracts are agreed between consumer and provider, and are regularly verified against a real instance of the provider component. This effectively partitions a larger system into smaller pieces that can be tested individually in isolation of each other, leading to simpler, fast and stable tests that also give confidence to release. Some E2E tests are still required to verify the system as a whole when deployed in the real environment, but most functional interactions between components can be covered with CDC tests. CDC testing was initially developed for testing RESTful API's, but the pattern scales to all consumer-provider systems and tooling for other messaging protocols besides HTTP does exist. Consumer-Driven Contract Testing Design Blocks In a consumer-driven approach the consumer drives changes to contracts between a consumer (the client) and a provider (the server). This may sound counterintuitive, but it helps providers create APIs that fit the real requirements of the consumers rather than trying to guess these in advance. Next we describe the CDC building blocks ordered by their occurrence in the development cycle. Consumer Tests with Provider Mock The consumers start by creating integration tests against a provider mock and running them as part of their CI pipeline. Expected responses are defined in the provider mock for requests fired from the tests. Through this, the consumer essentially defines the contract they expect the provider to fulfill. 
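A rough sketch of such a consumer test, using the classic pact-python consumer API (the service names, endpoint, and payload are invented for illustration and not prescribed by Pact):

```python
import atexit

import requests
from pact import Consumer, Provider

# The provider mock runs locally; expectations registered on it become the contract.
pact = Consumer("OrderWebClient").has_pact_with(Provider("OrderService"), port=1234)
pact.start_service()
atexit.register(pact.stop_service)


def test_get_order():
    expected = {"id": 1, "status": "shipped"}

    (pact
     .given("an order with id 1 exists")
     .upon_receiving("a request for order 1")
     .with_request("GET", "/orders/1")
     .will_respond_with(200, body=expected))

    with pact:  # verifies that the expected interaction actually happened
        response = requests.get(f"{pact.uri}/orders/1")

    assert response.json() == expected
```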
Contract Contracts are generated from the expectations defined in the provider mock as a result of a successful test run. CDC frameworks like Pact provide a specification for contracts in json format consisting of the list of request/responses generated from the consumer tests plus some additional metadata. Contracts are not a replacement for a discussion between the consumer and provider team. This is the moment where this discussion should take place (if not already done before). The consumer tests and generated contract are refined with the feedback and cooperation of the provider team. Lastly the finalized contract is versioned and stored in a central place accessible by both consumer and provider. Contracts are complementary to API specification documents like OpenAPI. API specifications describe the structure and the format of the API. A contract instead specifies that for a given request, a given response is expected. An API specifications document is helpful in writing an API contract and can be used to validate that the contract conforms to the API specification. Provider Contract Verification On the provider side tests are also executed as part of a separate pipeline which verifies contracts against real responses of the provider. Contract verification fails if real responses differ from the expected responses as specified in the contract. The cause of this can be: Invalid expectations on the consumer side leading to incompatibility with the current provider implementation Broken provider implementation due to some missing functionality or a regression Either way, thanks to CDC it is easy to pinpoint integration issues down to the consumer/provider of the affected interaction. This is a big advantage compared to the debugging pain this could have been with an E2E test approach. CDC Testing Frameworks and Tools Pact is an implementation of CDC testing that allows mocking of responses in the consumer codebase, and verification of the interactions in the provider codebase, while defining a specification for contracts . It was originally written in Ruby but has available wrappers for multiple languages. Pact is the de-facto standard to use when working with CDC. Spring Cloud Contract is an implementation of CDC testing from Spring, and offers easy integration in the Spring ecosystem. Support for non-Spring and non-JVM providers and consumers also exists. Conclusion CDC has several benefits that make it an approach worth considering when dealing with systems composed of multiple components interacting together. Maintenance efforts can be reduced by testing consumer-provider interactions in isolation without the need of a complex integrated environment, specially as the interactions between components grow in number and become more complex. Additionally, a close collaboration between consumer and provider teams is strongly encouraged through the CDC development process, which can bring many other benefits. Contracts offer a formal way to document the shared understanding how components interact with each other, and serve as a base for the communication between teams. In a way, the contract repository serves as a live documentation of all consumer-provider interactions of a system. CDC has some drawbacks as well. An extra layer of testing is added requiring a proper investment in education for team members to understand and use CDC correctly. Additionally, the CDC test scope should be considered carefully to prevent blurring CDC with other higher level functional testing layers. 
Contract tests are not the place to verify internal business logic and correctness of the consumer. Resources Testing pyramid from Kent C. Dodd's blog Pact , a code-first consumer-driven contract testing tool with support for several different programming languages Consumer-driven contracts from Ian Robinson Contract test from Martin Fowler A simple example of using Pact consumer-driven contract testing in a Java client-server application Pact dotnet workshop","title":"Consumer-Driven Contract Testing (CDC)"},{"location":"automated-testing/cdc-testing/#consumer-driven-contract-testing-cdc","text":"Consumer-driven Contract Testing (or CDC for short) is a software testing methodology used to test components of a system in isolation while ensuring that provider components are compatible with the expectations that consumer components have of them.","title":"Consumer-Driven Contract Testing (CDC)"},{"location":"automated-testing/cdc-testing/#why-consumer-driven-contract-testing","text":"CDC tries to overcome the several painful drawbacks of automated E2E tests with components interacting together: E2E tests are slow E2E tests break easily E2E tests are expensive and hard to maintain E2E tests of larger systems may be hard or impossible to run outside a dedicated testing environment Although testing best practices suggest to write just a few E2E tests compared to the cheaper, faster and more stable integration and unit tests as pictured in the testing pyramid below, experience shows many teams end up writing too many E2E tests . A reason for this is that E2E tests give developers the highest confidence to release as they are testing the \"real\" system. CDC addresses these issues by testing interactions between components in isolation using mocks that conform to a shared understanding documented in a \"contract\". Contracts are agreed between consumer and provider, and are regularly verified against a real instance of the provider component. This effectively partitions a larger system into smaller pieces that can be tested individually in isolation of each other, leading to simpler, fast and stable tests that also give confidence to release. Some E2E tests are still required to verify the system as a whole when deployed in the real environment, but most functional interactions between components can be covered with CDC tests. CDC testing was initially developed for testing RESTful API's, but the pattern scales to all consumer-provider systems and tooling for other messaging protocols besides HTTP does exist.","title":"Why Consumer-Driven Contract Testing"},{"location":"automated-testing/cdc-testing/#consumer-driven-contract-testing-design-blocks","text":"In a consumer-driven approach the consumer drives changes to contracts between a consumer (the client) and a provider (the server). This may sound counterintuitive, but it helps providers create APIs that fit the real requirements of the consumers rather than trying to guess these in advance. Next we describe the CDC building blocks ordered by their occurrence in the development cycle.","title":"Consumer-Driven Contract Testing Design Blocks"},{"location":"automated-testing/cdc-testing/#consumer-tests-with-provider-mock","text":"The consumers start by creating integration tests against a provider mock and running them as part of their CI pipeline. Expected responses are defined in the provider mock for requests fired from the tests. 
Through this, the consumer essentially defines the contract they expect the provider to fulfill.","title":"Consumer Tests with Provider Mock"},{"location":"automated-testing/cdc-testing/#contract","text":"Contracts are generated from the expectations defined in the provider mock as a result of a successful test run. CDC frameworks like Pact provide a specification for contracts in json format consisting of the list of request/responses generated from the consumer tests plus some additional metadata. Contracts are not a replacement for a discussion between the consumer and provider team. This is the moment where this discussion should take place (if not already done before). The consumer tests and generated contract are refined with the feedback and cooperation of the provider team. Lastly the finalized contract is versioned and stored in a central place accessible by both consumer and provider. Contracts are complementary to API specification documents like OpenAPI. API specifications describe the structure and the format of the API. A contract instead specifies that for a given request, a given response is expected. An API specifications document is helpful in writing an API contract and can be used to validate that the contract conforms to the API specification.","title":"Contract"},{"location":"automated-testing/cdc-testing/#provider-contract-verification","text":"On the provider side tests are also executed as part of a separate pipeline which verifies contracts against real responses of the provider. Contract verification fails if real responses differ from the expected responses as specified in the contract. The cause of this can be: Invalid expectations on the consumer side leading to incompatibility with the current provider implementation Broken provider implementation due to some missing functionality or a regression Either way, thanks to CDC it is easy to pinpoint integration issues down to the consumer/provider of the affected interaction. This is a big advantage compared to the debugging pain this could have been with an E2E test approach.","title":"Provider Contract Verification"},{"location":"automated-testing/cdc-testing/#cdc-testing-frameworks-and-tools","text":"Pact is an implementation of CDC testing that allows mocking of responses in the consumer codebase, and verification of the interactions in the provider codebase, while defining a specification for contracts . It was originally written in Ruby but has available wrappers for multiple languages. Pact is the de-facto standard to use when working with CDC. Spring Cloud Contract is an implementation of CDC testing from Spring, and offers easy integration in the Spring ecosystem. Support for non-Spring and non-JVM providers and consumers also exists.","title":"CDC Testing Frameworks and Tools"},{"location":"automated-testing/cdc-testing/#conclusion","text":"CDC has several benefits that make it an approach worth considering when dealing with systems composed of multiple components interacting together. Maintenance efforts can be reduced by testing consumer-provider interactions in isolation without the need of a complex integrated environment, specially as the interactions between components grow in number and become more complex. Additionally, a close collaboration between consumer and provider teams is strongly encouraged through the CDC development process, which can bring many other benefits. 
Contracts offer a formal way to document the shared understanding how components interact with each other, and serve as a base for the communication between teams. In a way, the contract repository serves as a live documentation of all consumer-provider interactions of a system. CDC has some drawbacks as well. An extra layer of testing is added requiring a proper investment in education for team members to understand and use CDC correctly. Additionally, the CDC test scope should be considered carefully to prevent blurring CDC with other higher level functional testing layers. Contract tests are not the place to verify internal business logic and correctness of the consumer.","title":"Conclusion"},{"location":"automated-testing/cdc-testing/#resources","text":"Testing pyramid from Kent C. Dodd's blog Pact , a code-first consumer-driven contract testing tool with support for several different programming languages Consumer-driven contracts from Ian Robinson Contract test from Martin Fowler A simple example of using Pact consumer-driven contract testing in a Java client-server application Pact dotnet workshop","title":"Resources"},{"location":"automated-testing/e2e-testing/","text":"E2E Testing End-to-end (E2E) testing is a Software testing methodology to test a functional and data application flow consisting of several sub-systems working together from start to end. At times, these systems are developed in different technologies by different teams or organizations. Finally, they come together to form a functional business application. Hence, testing a single system would not suffice. Therefore, end-to-end testing verifies the application from start to end putting all its components together. Why E2E Testing In many commercial software application scenarios, a modern software system consists of its interconnection with multiple sub-systems. These sub-systems can be within the same organization or can be components of different organizations. Also, these sub-systems can have somewhat similar or different lifetime release cycle from the current system. As a result, if there is any failure or fault in any sub-system, it can adversely affect the whole software system leading to its collapse. The above illustration is a testing pyramid from Kent C. Dodd's blog which is a combination of the pyramids from Martin Fowler\u2019s blog and the Google Testing Blog . The majority of your tests are at the bottom of the pyramid. As you move up the pyramid, the number of tests gets smaller. Also, going up the pyramid, tests get slower and more expensive to write, run, and maintain. Each type of testing vary for its purpose, application and the areas it's supposed to cover. For more information on comparison analysis of different testing types, please see this ## Unit vs Integration vs System vs E2E Testing document. E2E Testing Design Blocks We will look into all the 3 categories one by one: User Functions Following actions should be performed as a part of building user functions: List user initiated functions of the software systems, and their interconnected sub-systems. For any function, keep track of the actions performed as well as Input and Output data. Find the relations, if any between different Users functions. Find out the nature of different user functions i.e. if they are independent or are reusable. Conditions Following activities should be performed as a part of building conditions based on user functions: For each and every user functions, a set of conditions should be prepared. 
Timing, data conditions and other factors that affect user functions can be considered as parameters. Test Cases Following factors should be considered for building test cases: For every scenario, one or more test cases should be created to test each and every functionality of the user functions. If possible, these test cases should be automated through the standard CI/CD build pipeline processes with the track of each successful and failed build in AzDO. Every single condition should be enlisted as a separate test case. Applying the E2E Testing Like any other testing, E2E testing also goes through formal planning, test execution, and closure phases. E2E testing is done with the following steps: Planning Business and Functional Requirement analysis Test plan development Test case development Production like Environment setup for the testing Test data setup Decide exit criteria Choose the testing methods that most applicable to your system. For the definition of the various testing methods, please see Testing Methods document. Pre-requisite System Testing should be complete for all the participating systems. All subsystems should be combined to work as a complete application. Production like test environment should be ready. Test Execution Execute the test cases Register the test results and decide on pass and failure Report the Bugs in the bug reporting tool Re-verify the bug fixes Test Closure Test report preparation Evaluation of exit criteria Test phase closure Test Metrics The tracing the quality metrics gives insight about the current status of testing. Some common metrics of E2E testing are: Test case preparation status : Number of test cases ready versus the total number of test cases. Frequent Test progress : Number of test cases executed in the consistent frequent manner, e.g. weekly, versus a target number of the test cases in the same time period. Defects Status : This metric represents the status of the defects found during testing. Defects should be logged into defect tracking tool (e.g. AzDO backlog) and resolved as per their severity and priority. Therefore, the percentage of open and closed defects as per their severity and priority should be calculated to track this metric. The AzDO Dashboard Query can be used to track this metric. Test environment availability : This metric tracks the duration of the test environment used for end-to-end testing versus its scheduled allocation duration. E2E Testing Frameworks and Tools 1. Gauge Framework Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that makes it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support. Supports data driven execution and external data sources. Helps you create maintainable test suites. Supports Visual Studio Code, Intellij IDEA, IDE Support. Supports html, json and XML reporting. Gauge Framework Website 2. Robot Framework Robot Framework is a generic open source automation framework. The framework has easy syntax, utilizing human-readable keywords. Its capabilities can be extended by libraries implemented with Python or Java. Robot shares a lot of the same \"pros\" as Gauge, except the developer tooling and the syntax. In our usage, we found the VS Code Intellisense offered with Gauge to be much more stable than the offerings for Robot. We also found the syntax to be less readable than what Gauge offered. 
While both frameworks allow for markup based test case definitions, the Gauge syntax reads much more like an English sentence than Robot. Finally, Intellisense is baked into the markup files for Gauge test cases, which will create a function stub for the actual test definition if the developer allows it. The same cannot be said of the Robot Framework. Robot Framework Website 3. TestCraft TestCraft is a codeless Selenium test automation platform. Its revolutionary AI technology and unique visual modeling allow for faster test creation and execution while eliminating test maintenance overhead. The testers create fully automated test scenarios without coding. Customers find bugs faster, release more frequently, integrate with the CI/CD approach and improve the overall quality of their digital products. This all creates a complete end-to-end testing experience. Perfecto (TestCraft) Website or get it from the Visual Studio Marketplace 4. Ranorex Studio Ranorex Studio is a complete end-to-end test automation tool for desktop, web, and mobile applications. Create reliable tests fast without any coding at all, or using the full IDE. Use external CSV or Excel files, or a SQL database as inputs to your tests. Run tests in parallel or on a Selenium Grid with built-in Selenium WebDriver. Ranorex Studio integrates with your CI/CD process to shorten your release cycles without sacrificing quality. Ranorex Studio tests also integrate with Azure DevOps (AzDO), which can be run as part of a build pipeline in AzDO. Ranorex Studio Website 5. Katalon Studio Katalon Studio is an excellent end-to-end automation solution for web, API, mobile, and desktop testing with DevOps support. With Katalon Studio, automated testing can be easily integrated into any CI/CD pipeline to release products faster while guaranteeing high quality. Katalon Studio customizes for users from beginners to experts. Robust functions such as Spying, Recording, Dual-editor interface and Custom Keywords make setting up, creating and maintaining tests possible for users. Built on top of Selenium and Appium, Katalon Studio helps standardize your end-to-end tests standardized. It also complies with the most popular frameworks to work seamlessly with other tools in the automated testing ecosystem. Katalon is endorsed by Gartner, IT professionals, and a large testing community. Note: At the time of this writing, Katalon Studio extension for AzDO was NOT available for Linux. Katalon Studio Website or read about its integration with AzDO 6. BugBug.io BugBug is an easy way to automate tests for web applications. The tool focuses on simplicity, yet allows you to cover all essential test cases without coding. It's an all-in-one solution - you can easily create tests and use the built-in cloud to run them on schedule or from your CI/CD, without changes to your own infrastructure. BugBug is an interesting alternative to Selenium because it's actually a completely different technology. It is based on a Chrome extension that allows BugBug to record and run tests faster than old-school frameworks. The biggest advantage of BugBug is its user-friendliness. Most tests created with BugBug simply work out of the box. This makes it easier for non-technical people to maintain tests - with BugBug you can save money on hiring a QA engineer. BugBug Website Conclusion Hope you learned various aspects of E2E testing like its processes, metrics, the difference between Unit, Integration and E2E testing, and the various recommended E2E test frameworks and tools. 
For any commercial release of the software, E2E test verification plays an important role as it tests the entire application in an environment that exactly imitates real-world users like network communication, middleware and backend services interaction, etc. Finally, the E2E test is often performed manually as the cost of automating such test cases is too high to be afforded by any organization. Having said that, the ultimate goal of each organization is to make the e2e testing as streamlined as possible adding full and semi-automation testing components into the process. Hence, the various E2E testing frameworks and tools listed in this article come to the rescue. Resources Wikipedia: Software testing Wikipedia: Unit testing Wikipedia: Integration testing Wikipedia: System testing","title":"E2E Testing"},{"location":"automated-testing/e2e-testing/#e2e-testing","text":"End-to-end (E2E) testing is a Software testing methodology to test a functional and data application flow consisting of several sub-systems working together from start to end. At times, these systems are developed in different technologies by different teams or organizations. Finally, they come together to form a functional business application. Hence, testing a single system would not suffice. Therefore, end-to-end testing verifies the application from start to end putting all its components together.","title":"E2E Testing"},{"location":"automated-testing/e2e-testing/#why-e2e-testing","text":"In many commercial software application scenarios, a modern software system consists of its interconnection with multiple sub-systems. These sub-systems can be within the same organization or can be components of different organizations. Also, these sub-systems can have somewhat similar or different lifetime release cycle from the current system. As a result, if there is any failure or fault in any sub-system, it can adversely affect the whole software system leading to its collapse. The above illustration is a testing pyramid from Kent C. Dodd's blog which is a combination of the pyramids from Martin Fowler\u2019s blog and the Google Testing Blog . The majority of your tests are at the bottom of the pyramid. As you move up the pyramid, the number of tests gets smaller. Also, going up the pyramid, tests get slower and more expensive to write, run, and maintain. Each type of testing vary for its purpose, application and the areas it's supposed to cover. For more information on comparison analysis of different testing types, please see this ## Unit vs Integration vs System vs E2E Testing document.","title":"Why E2E Testing"},{"location":"automated-testing/e2e-testing/#e2e-testing-design-blocks","text":"We will look into all the 3 categories one by one:","title":"E2E Testing Design Blocks"},{"location":"automated-testing/e2e-testing/#user-functions","text":"Following actions should be performed as a part of building user functions: List user initiated functions of the software systems, and their interconnected sub-systems. For any function, keep track of the actions performed as well as Input and Output data. Find the relations, if any between different Users functions. Find out the nature of different user functions i.e. if they are independent or are reusable.","title":"User Functions"},{"location":"automated-testing/e2e-testing/#conditions","text":"Following activities should be performed as a part of building conditions based on user functions: For each and every user functions, a set of conditions should be prepared. 
Timing, data conditions and other factors that affect user functions can be considered as parameters.","title":"Conditions"},{"location":"automated-testing/e2e-testing/#test-cases","text":"Following factors should be considered for building test cases: For every scenario, one or more test cases should be created to test each and every functionality of the user functions. If possible, these test cases should be automated through the standard CI/CD build pipeline processes with the track of each successful and failed build in AzDO. Every single condition should be enlisted as a separate test case.","title":"Test Cases"},{"location":"automated-testing/e2e-testing/#applying-the-e2e-testing","text":"Like any other testing, E2E testing also goes through formal planning, test execution, and closure phases. E2E testing is done with the following steps:","title":"Applying the E2E Testing"},{"location":"automated-testing/e2e-testing/#planning","text":"Business and Functional Requirement analysis Test plan development Test case development Production like Environment setup for the testing Test data setup Decide exit criteria Choose the testing methods that most applicable to your system. For the definition of the various testing methods, please see Testing Methods document.","title":"Planning"},{"location":"automated-testing/e2e-testing/#pre-requisite","text":"System Testing should be complete for all the participating systems. All subsystems should be combined to work as a complete application. Production like test environment should be ready.","title":"Pre-requisite"},{"location":"automated-testing/e2e-testing/#test-execution","text":"Execute the test cases Register the test results and decide on pass and failure Report the Bugs in the bug reporting tool Re-verify the bug fixes","title":"Test Execution"},{"location":"automated-testing/e2e-testing/#test-closure","text":"Test report preparation Evaluation of exit criteria Test phase closure","title":"Test Closure"},{"location":"automated-testing/e2e-testing/#test-metrics","text":"The tracing the quality metrics gives insight about the current status of testing. Some common metrics of E2E testing are: Test case preparation status : Number of test cases ready versus the total number of test cases. Frequent Test progress : Number of test cases executed in the consistent frequent manner, e.g. weekly, versus a target number of the test cases in the same time period. Defects Status : This metric represents the status of the defects found during testing. Defects should be logged into defect tracking tool (e.g. AzDO backlog) and resolved as per their severity and priority. Therefore, the percentage of open and closed defects as per their severity and priority should be calculated to track this metric. The AzDO Dashboard Query can be used to track this metric. Test environment availability : This metric tracks the duration of the test environment used for end-to-end testing versus its scheduled allocation duration.","title":"Test Metrics"},{"location":"automated-testing/e2e-testing/#e2e-testing-frameworks-and-tools","text":"","title":"E2E Testing Frameworks and Tools"},{"location":"automated-testing/e2e-testing/#1-gauge-framework","text":"Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that makes it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support. 
Supports data driven execution and external data sources. Helps you create maintainable test suites. Supports Visual Studio Code, Intellij IDEA, IDE Support. Supports html, json and XML reporting. Gauge Framework Website","title":"1. Gauge Framework"},{"location":"automated-testing/e2e-testing/#2-robot-framework","text":"Robot Framework is a generic open source automation framework. The framework has easy syntax, utilizing human-readable keywords. Its capabilities can be extended by libraries implemented with Python or Java. Robot shares a lot of the same \"pros\" as Gauge, except the developer tooling and the syntax. In our usage, we found the VS Code Intellisense offered with Gauge to be much more stable than the offerings for Robot. We also found the syntax to be less readable than what Gauge offered. While both frameworks allow for markup based test case definitions, the Gauge syntax reads much more like an English sentence than Robot. Finally, Intellisense is baked into the markup files for Gauge test cases, which will create a function stub for the actual test definition if the developer allows it. The same cannot be said of the Robot Framework. Robot Framework Website","title":"2. Robot Framework"},{"location":"automated-testing/e2e-testing/#3-testcraft","text":"TestCraft is a codeless Selenium test automation platform. Its revolutionary AI technology and unique visual modeling allow for faster test creation and execution while eliminating test maintenance overhead. The testers create fully automated test scenarios without coding. Customers find bugs faster, release more frequently, integrate with the CI/CD approach and improve the overall quality of their digital products. This all creates a complete end-to-end testing experience. Perfecto (TestCraft) Website or get it from the Visual Studio Marketplace","title":"3. TestCraft"},{"location":"automated-testing/e2e-testing/#4-ranorex-studio","text":"Ranorex Studio is a complete end-to-end test automation tool for desktop, web, and mobile applications. Create reliable tests fast without any coding at all, or using the full IDE. Use external CSV or Excel files, or a SQL database as inputs to your tests. Run tests in parallel or on a Selenium Grid with built-in Selenium WebDriver. Ranorex Studio integrates with your CI/CD process to shorten your release cycles without sacrificing quality. Ranorex Studio tests also integrate with Azure DevOps (AzDO), which can be run as part of a build pipeline in AzDO. Ranorex Studio Website","title":"4. Ranorex Studio"},{"location":"automated-testing/e2e-testing/#5-katalon-studio","text":"Katalon Studio is an excellent end-to-end automation solution for web, API, mobile, and desktop testing with DevOps support. With Katalon Studio, automated testing can be easily integrated into any CI/CD pipeline to release products faster while guaranteeing high quality. Katalon Studio customizes for users from beginners to experts. Robust functions such as Spying, Recording, Dual-editor interface and Custom Keywords make setting up, creating and maintaining tests possible for users. Built on top of Selenium and Appium, Katalon Studio helps standardize your end-to-end tests standardized. It also complies with the most popular frameworks to work seamlessly with other tools in the automated testing ecosystem. Katalon is endorsed by Gartner, IT professionals, and a large testing community. Note: At the time of this writing, Katalon Studio extension for AzDO was NOT available for Linux. 
Katalon Studio Website or read about its integration with AzDO","title":"5. Katalon Studio"},{"location":"automated-testing/e2e-testing/#6-bugbugio","text":"BugBug is an easy way to automate tests for web applications. The tool focuses on simplicity, yet allows you to cover all essential test cases without coding. It's an all-in-one solution - you can easily create tests and use the built-in cloud to run them on schedule or from your CI/CD, without changes to your own infrastructure. BugBug is an interesting alternative to Selenium because it's actually a completely different technology. It is based on a Chrome extension that allows BugBug to record and run tests faster than old-school frameworks. The biggest advantage of BugBug is its user-friendliness. Most tests created with BugBug simply work out of the box. This makes it easier for non-technical people to maintain tests - with BugBug you can save money on hiring a QA engineer. BugBug Website","title":"6. BugBug.io"},{"location":"automated-testing/e2e-testing/#conclusion","text":"Hope you learned various aspects of E2E testing like its processes, metrics, the difference between Unit, Integration and E2E testing, and the various recommended E2E test frameworks and tools. For any commercial release of the software, E2E test verification plays an important role as it tests the entire application in an environment that exactly imitates real-world users like network communication, middleware and backend services interaction, etc. Finally, the E2E test is often performed manually as the cost of automating such test cases is too high to be afforded by any organization. Having said that, the ultimate goal of each organization is to make the e2e testing as streamlined as possible adding full and semi-automation testing components into the process. Hence, the various E2E testing frameworks and tools listed in this article come to the rescue.","title":"Conclusion"},{"location":"automated-testing/e2e-testing/#resources","text":"Wikipedia: Software testing Wikipedia: Unit testing Wikipedia: Integration testing Wikipedia: System testing","title":"Resources"},{"location":"automated-testing/e2e-testing/testing-comparison/","text":"Unit vs Integration vs System vs E2E Testing The table below illustrates the most critical characteristics and differences among Unit, Integration, System, and End-to-End Testing, and when to apply each methodology in a project. 
| | Unit Test | Integration Test | System Testing | E2E Test |
| --- | --- | --- | --- | --- |
| Scope | Modules, APIs | Modules, interfaces | Application, system | All sub-systems, network dependencies, services and databases |
| Size | Tiny | Small to medium | Large | X-Large |
| Environment | Development | Integration test | QA test | Production-like |
| Data | Mock data | Test data | Test data | Copy of real production data |
| System Under Test | Isolated unit test | Interfaces and data flow between the modules | Particular system as a whole | Application flow from start to end |
| Scenarios | Developer perspectives | Developers and IT Pro tester perspectives | Developer and QA tester perspectives | End-user perspectives |
| When | After each build | After Unit testing | Before E2E testing and after Unit and Integration testing | After System testing |
| Automated or Manual | Automated | Manual or automated | Manual or automated | Manual |","title":"Unit vs Integration vs System vs E2E Testing"},{"location":"automated-testing/e2e-testing/testing-comparison/#unit-vs-integration-vs-system-vs-e2e-testing","text":"The table below illustrates the most critical characteristics and differences among Unit, Integration, System, and End-to-End Testing, and when to apply each methodology in a project.
| | Unit Test | Integration Test | System Testing | E2E Test |
| --- | --- | --- | --- | --- |
| Scope | Modules, APIs | Modules, interfaces | Application, system | All sub-systems, network dependencies, services and databases |
| Size | Tiny | Small to medium | Large | X-Large |
| Environment | Development | Integration test | QA test | Production-like |
| Data | Mock data | Test data | Test data | Copy of real production data |
| System Under Test | Isolated unit test | Interfaces and data flow between the modules | Particular system as a whole | Application flow from start to end |
| Scenarios | Developer perspectives | Developers and IT Pro tester perspectives | Developer and QA tester perspectives | End-user perspectives |
| When | After each build | After Unit testing | Before E2E testing and after Unit and Integration testing | After System testing |
| Automated or Manual | Automated | Manual or automated | Manual or automated | Manual |","title":"Unit vs Integration vs System vs E2E Testing"},{"location":"automated-testing/e2e-testing/testing-methods/","text":"E2E Testing Methods Horizontal Test This method is very commonly used. It occurs horizontally across the context of multiple applications. Take the example of a data ingest management system. The inbound data may be injected from various sources, but it is then \"flattened\" into a horizontal processing pipeline that may include various components, such as a gateway API, data transformation, data validation, storage, etc. Throughout the entire Extract-Transform-Load (ETL) processing, the data flow can be tracked and monitored under the horizontal spectrum, with only light coverage of optional services, such as logging, auditing and authentication, that are not essential to the overall E2E test case. Vertical Test In this method, the most critical transactions of an application are verified and evaluated from start to finish. Each individual layer of the application is tested from top to bottom. Take the example of a web-based application that uses middleware services for reaching back-end resources. In such a case, each layer (tier) is required to be fully tested in conjunction with the \"connected\" layers above and beneath, in which services \"talk\" to each other during the end-to-end data flow. All these complex testing scenarios will require proper validation and dedicated automated testing. Thus, this method is much more difficult.
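To make the horizontal method concrete, the sketch below follows a single record through an ingest pipeline like the one described above, written as a pytest case. The ingest endpoint, the environment variable names, the output container and the blob naming convention are all hypothetical placeholders to adapt to your own system; the point is that one test exercises the gateway API, the transformation steps and the storage layer in a single flow.
# Minimal sketch of a horizontal E2E test (pytest). The ingest URL, storage
# connection string, container and blob names are assumptions; adjust them
# to whatever your pipeline actually exposes.
import os
import time
import uuid

import requests
from azure.storage.blob import BlobServiceClient

INGEST_URL = os.environ["INGEST_URL"]                    # hypothetical gateway endpoint
STORAGE_CONN = os.environ["STORAGE_CONNECTION_STRING"]   # hypothetical output storage
OUTPUT_CONTAINER = "processed"                           # assumed output container name


def test_record_flows_through_etl_pipeline():
    # 1. Inject a uniquely tagged record at the start of the pipeline (gateway API).
    record_id = str(uuid.uuid4())
    response = requests.post(INGEST_URL, json={"id": record_id, "value": 42}, timeout=30)
    assert response.status_code in (200, 202)

    # 2. Follow the record horizontally: poll the storage layer until the
    #    transformed output shows up, or fail after a deadline.
    blob_client = BlobServiceClient.from_connection_string(STORAGE_CONN).get_blob_client(
        container=OUTPUT_CONTAINER, blob=f"{record_id}.json"
    )
    deadline = time.time() + 120
    while time.time() < deadline:
        if blob_client.exists():
            break
        time.sleep(5)
    else:
        raise AssertionError(f"Transformed output for {record_id} never reached storage")

    # 3. Validate the transformed content end to end.
    payload = blob_client.download_blob().readall()
    assert record_id.encode() in payload
Optional services such as logging or auditing are deliberately not asserted on here, matching the horizontal method's focus on the main data flow.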
E2E Test Cases Design Guidelines Below enlisted are few guidelines that should be kept in mind while designing the test cases for performing E2E testing: Test cases should be designed from the end user\u2019s perspective. Should focus on testing some existing features of the system. Multiple scenarios should be considered for creating multiple test cases. Different sets of test cases should be created to focus on multiple scenarios of the system.","title":"E2E Testing Methods"},{"location":"automated-testing/e2e-testing/testing-methods/#e2e-testing-methods","text":"","title":"E2E Testing Methods"},{"location":"automated-testing/e2e-testing/testing-methods/#horizontal-test","text":"This method is used very commonly. It occurs horizontally across the context of multiple applications. Take an example of a data ingest management system. The inbound data may be injected from various sources, but it then \"flatten\" into a horizontal processing pipeline that may include various components, such as a gateway API, data transformation, data validation, storage, etc... Throughout the entire Extract-Transform-Load (ETL) processing, the data flow can be tracked and monitored under the horizontal spectrum with little sprinkles of optional, and thus not important for the overall E2E test case, services, like logging, auditing, authentication.","title":"Horizontal Test"},{"location":"automated-testing/e2e-testing/testing-methods/#vertical-test","text":"In this method, all most critical transactions of any application are verified and evaluated right from the start to finish. Each individual layer of the application is tested starting from top to bottom. Take an example of a web-based application that uses middleware services for reaching back-end resources. In such case, each layer (tier) is required to be fully tested in conjunction with the \"connected\" layers above and beneath, in which services \"talk\" to each other during the end to end data flow. All these complex testing scenarios will require proper validation and dedicated automated testing. Thus, this method is much more difficult.","title":"Vertical Test"},{"location":"automated-testing/e2e-testing/testing-methods/#e2e-test-cases-design-guidelines","text":"Below enlisted are few guidelines that should be kept in mind while designing the test cases for performing E2E testing: Test cases should be designed from the end user\u2019s perspective. Should focus on testing some existing features of the system. Multiple scenarios should be considered for creating multiple test cases. Different sets of test cases should be created to focus on multiple scenarios of the system.","title":"E2E Test Cases Design Guidelines"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/","text":"Gauge Framework Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that makes it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support Extensible through plugins and hackable. Supports data driven execution and external data sources Helps you create maintainable test suites Supports Visual Studio Code, Intellij IDEA, IDE Support What is a Specification Gauge specifications are written using a Markdown syntax. 
For example # Search for the data blob ## Look for file * Goto Azure blob In this specification Search for the data blob is the specification heading , Look for file is a scenario with a step Goto Azure blob What is an Implementation You can implement the steps in a specification using a programming language, for example: from getgauge.python import step import os from step_impl.utils.driver import Driver @step ( \"Goto Azure blob\" ) def gotoAzureStorage () : URL = os.getenv ( 'STORAGE_ENDPOINT' ) Driver.driver.get ( URL ) The Gauge runner reads and runs steps and its implementation for every scenario in the specification and generates a report of passing or failing scenarios. # Search for the data blob ## Look for file \u2714 Successfully generated html-report to = > reports/html-report/index.html Specifications: 1 executed 1 passed 0 failed 0 skipped Scenarios: 1 executed 1 passed 0 failed 0 skipped Re-using Steps Gauge helps you focus on testing the flow of an application. Gauge does this by making steps as re-usable as possible. With Gauge, you don\u2019t need to build custom frameworks using a programming language. For example, Gauge steps can pass parameters to an implementation by using a text with quotes. # Search for the data blob ## Look for file * Goto Azure blob * Search for \"store_data.csv\" The implementation can now use \u201cstore_data.csv\u201d as follows from getgauge.python import step import os @step ( \"Search for <query>\" ) def searchForQuery ( query ) : write ( query ) press ( \"Enter\" ) step ( \"Search for <query>\" , ( query ) = > { write ( query ) ; press ( \"Enter\" ) ; You can then re-use this step within or across scenarios with different parameters: # Search for the data blob ## Look for Store data #1 * Goto Azure blob * Search for \"store_1.csv\" ## Look for Store data #2 * Goto Azure blob * Search for \"store_2.csv\" Or combine more than one step into concepts # Search Azure Storage for <query> * Goto Azure blob * Search for \"store_1.csv\" The concept, Search Azure Storage for <query> can be used like a step in a specification # Search for the data blob ## Look for Store data #1 * Search Azure Storage for \"store_1.csv\" ## Look for Store data #2 * Search Azure Storage for \"store_2.csv\" Data-Driven Testing Gauge also supports data driven testing using Markdown tables as well as external csv files for example # Search for the data blob | query | | --------- | | store_1 | | store_2 | | store_3 | ## Look for stores data * Search Azure Storage for <query> This will execute the scenario for all rows in the table. In the examples above, we refactored a specification to be concise and flexible without changing the implementation. Other Features This is brief introduction to a few Gauge features. Please refer to the Gauge documentation for additional features such as: Reports Tags Parallel execution Environments Screenshots Plugins And much more Installing Gauge This getting started guide takes you through the core features of Gauge. By the end of this guide, you\u2019ll be able to install Gauge and learn how to create your first Gauge test automation project. Installation Instructions for Windows OS Step 1: Installing Gauge on Windows This section gives specific instructions on setting up Gauge in a Microsoft Windows environment. Download the following installation bundle to get the latest stable release of Gauge. 
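As an aside before the editor setup: the implementation example earlier on this page imports a Driver utility from step_impl.utils.driver that is not shown. Below is a minimal, hypothetical sketch of what that helper might look like, assuming Selenium WebDriver is the browser automation layer; only the module path and class name mirror the import in the example, everything else is an assumption to adapt to your project.
# step_impl/utils/driver.py -- hypothetical helper matching the
# `from step_impl.utils.driver import Driver` import used above.
# Assumes Selenium WebDriver with a Chrome driver available on the PATH.
from getgauge.python import after_suite, before_suite
from selenium import webdriver


class Driver:
    """Holds one shared WebDriver instance for all Gauge step implementations."""

    driver = None


@before_suite
def init_driver():
    # Start a single headless browser for the whole suite.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    Driver.driver = webdriver.Chrome(options=options)


@after_suite
def close_driver():
    if Driver.driver is not None:
        Driver.driver.quit()
With a hook-managed driver like this, each step implementation can reuse Driver.driver instead of creating a new browser per step.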
Step 2: Installing Gauge Extension for Visual Studio Code Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code . Troubleshooting Installation If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user Installation Instructions for macOS Step 1: Installing Gauge on macOS This section gives specific instructions on setting up Gauge in a macOS environment. Install brew if you haven\u2019t already: Go to the brew website , and follow the directions there. Run the brew command to install Gauge > brew install gauge if HomeBrew is working properly, you should see something similar to the following: == > Fetching gauge == > Downloading https://ghcr.io/v2/homebrew/core/gauge/manifests/1.4.3 ######################################################################## 100.0% == > Downloading https://ghcr.io/v2/homebrew/core/gauge/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893 == > Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893?se = 2022 -12-13T12%3A35%3A00Z & sig = I78SuuwNgSMFoBTT ######################################################################## 100.0% == > Pouring gauge--1.4.3.ventura.bottle.tar.gz /usr/local/Cellar/gauge/1.4.3: 6 files, 18 .9MB Step 2 : Installing Gauge Extension for Visual Studio Code Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code . Post-Installation Troubleshooting If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user","title":"Gauge Framework"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#gauge-framework","text":"Gauge is a free and open source framework for writing and running E2E tests. Some key features of Gauge that makes it unique include: Simple, flexible and rich syntax based on Markdown. Consistent cross-platform/language support for writing test code. A modular architecture with plugins support Extensible through plugins and hackable. Supports data driven execution and external data sources Helps you create maintainable test suites Supports Visual Studio Code, Intellij IDEA, IDE Support","title":"Gauge Framework"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#what-is-a-specification","text":"Gauge specifications are written using a Markdown syntax. For example # Search for the data blob ## Look for file * Goto Azure blob In this specification Search for the data blob is the specification heading , Look for file is a scenario with a step Goto Azure blob","title":"What is a Specification"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#what-is-an-implementation","text":"You can implement the steps in a specification using a programming language, for example: from getgauge.python import step import os from step_impl.utils.driver import Driver @step ( \"Goto Azure blob\" ) def gotoAzureStorage () : URL = os.getenv ( 'STORAGE_ENDPOINT' ) Driver.driver.get ( URL ) The Gauge runner reads and runs steps and its implementation for every scenario in the specification and generates a report of passing or failing scenarios. 
# Search for the data blob ## Look for file \u2714 Successfully generated html-report to = > reports/html-report/index.html Specifications: 1 executed 1 passed 0 failed 0 skipped Scenarios: 1 executed 1 passed 0 failed 0 skipped","title":"What is an Implementation"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#re-using-steps","text":"Gauge helps you focus on testing the flow of an application. Gauge does this by making steps as re-usable as possible. With Gauge, you don\u2019t need to build custom frameworks using a programming language. For example, Gauge steps can pass parameters to an implementation by using a text with quotes. # Search for the data blob ## Look for file * Goto Azure blob * Search for \"store_data.csv\" The implementation can now use \u201cstore_data.csv\u201d as follows from getgauge.python import step import os @step ( \"Search for <query>\" ) def searchForQuery ( query ) : write ( query ) press ( \"Enter\" ) step ( \"Search for <query>\" , ( query ) = > { write ( query ) ; press ( \"Enter\" ) ; You can then re-use this step within or across scenarios with different parameters: # Search for the data blob ## Look for Store data #1 * Goto Azure blob * Search for \"store_1.csv\" ## Look for Store data #2 * Goto Azure blob * Search for \"store_2.csv\" Or combine more than one step into concepts # Search Azure Storage for <query> * Goto Azure blob * Search for \"store_1.csv\" The concept, Search Azure Storage for <query> can be used like a step in a specification # Search for the data blob ## Look for Store data #1 * Search Azure Storage for \"store_1.csv\" ## Look for Store data #2 * Search Azure Storage for \"store_2.csv\"","title":"Re-using Steps"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#data-driven-testing","text":"Gauge also supports data driven testing using Markdown tables as well as external csv files for example # Search for the data blob | query | | --------- | | store_1 | | store_2 | | store_3 | ## Look for stores data * Search Azure Storage for <query> This will execute the scenario for all rows in the table. In the examples above, we refactored a specification to be concise and flexible without changing the implementation.","title":"Data-Driven Testing"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#other-features","text":"This is brief introduction to a few Gauge features. Please refer to the Gauge documentation for additional features such as: Reports Tags Parallel execution Environments Screenshots Plugins And much more","title":"Other Features"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#installing-gauge","text":"This getting started guide takes you through the core features of Gauge. By the end of this guide, you\u2019ll be able to install Gauge and learn how to create your first Gauge test automation project.","title":"Installing Gauge"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#installation-instructions-for-windows-os","text":"","title":"Installation Instructions for Windows OS"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-1-installing-gauge-on-windows","text":"This section gives specific instructions on setting up Gauge in a Microsoft Windows environment. 
Download the following installation bundle to get the latest stable release of Gauge.","title":"Step 1: Installing Gauge on Windows"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-2-installing-gauge-extension-for-visual-studio-code","text":"Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code .","title":"Step 2: Installing Gauge Extension for Visual Studio Code"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#troubleshooting-installation","text":"If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user","title":"Troubleshooting Installation"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#installation-instructions-for-macos","text":"","title":"Installation Instructions for macOS"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-1-installing-gauge-on-macos","text":"This section gives specific instructions on setting up Gauge in a macOS environment. Install brew if you haven\u2019t already: Go to the brew website , and follow the directions there. Run the brew command to install Gauge > brew install gauge if HomeBrew is working properly, you should see something similar to the following: == > Fetching gauge == > Downloading https://ghcr.io/v2/homebrew/core/gauge/manifests/1.4.3 ######################################################################## 100.0% == > Downloading https://ghcr.io/v2/homebrew/core/gauge/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893 == > Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:05117bb3c0b2efeafe41e817cd3ad86307c1d2ea7e0e835655c4b51ab2472893?se = 2022 -12-13T12%3A35%3A00Z & sig = I78SuuwNgSMFoBTT ######################################################################## 100.0% == > Pouring gauge--1.4.3.ventura.bottle.tar.gz /usr/local/Cellar/gauge/1.4.3: 6 files, 18 .9MB","title":"Step 1: Installing Gauge on macOS"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#step-2-installing-gauge-extension-for-visual-studio-code_1","text":"Follow the steps to add the Gauge Visual Studio Code plugin from the IDE Install the following Gauge extension for Visual Studio Code .","title":"Step 2 : Installing Gauge Extension for Visual Studio Code"},{"location":"automated-testing/e2e-testing/recipes/gauge-framework/#post-installation-troubleshooting","text":"If, when you run your first gauge spec you receive the error of missing python packages, open the command line terminal window and run this command: python.exe -m pip install getgauge == 0 .3.7 --user","title":"Post-Installation Troubleshooting"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/","text":"Postman Testing This purpose of this document is to provide guidance on how to use Newman in your CI/CD pipeline to run End-to-end (E2E) tests defined in Postman Collections while following security best practices. First, we'll introduce Postman and Newman and then outline several Postman testing use cases that answer why you may want to go beyond local testing with Postman Collections. In the final use case, we are looking to use a shell script that references the Postman Collection file path and Environment file path as inputs to Newman. 
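Purely as an illustration of that idea (the project itself describes a shell script), a minimal Python wrapper around the Newman CLI might look like the following; the collection and environment file paths are placeholders taken from the examples later on this page.
# Minimal sketch: invoke Newman with a collection and environment file path.
# The paths are placeholders; the CLI arguments match `newman run <collection> -e <environment>`.
import subprocess
import sys

COLLECTION = "tests/e2e_Postman_collection.json"   # placeholder path
ENVIRONMENT = "qa.postman_environment.json"        # placeholder path


def run_newman(collection: str, environment: str) -> int:
    # Return Newman's own exit code so a non-zero result fails the calling pipeline step.
    result = subprocess.run(["newman", "run", collection, "-e", environment])
    return result.returncode


if __name__ == "__main__":
    sys.exit(run_newman(COLLECTION, ENVIRONMENT))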
Below is a flow diagram representing the outcome of the final use case: Postman and Newman Postman is a free API platform for testing APIs. Key features highlighted in this guidance include: Postman Collections Postman Environment Files Postman Scripts Newman is a command-line Collection Runner for Postman. It enables you to run and test a Postman Collection directly from the command line. Key features highlighted in this guidance include: Newman Run Command What is a Collection A Postman Collection is a group of executable saved requests. A collection can be exported as a json file. What is an Environment File A Postman Environment file holds environment variables that can be referenced by a valid Postman Collection. What is a Postman Script A Postman Script is Javascript hosted within a Postman Collection that can be written to execute against your Postman Collection and Environment File. What is the Newman Run Command A Newman CLI command that allows you to specify a Postman Collection to be run. Installing Postman and Newman For specific instruction on installing Postman, visit the Downloads Postman page. For specific instruction on installing Newman, visit the NPMJS Newman package page. Implementing Automated End-to-end (E2E) Tests With Postman Collections In order to provide guidance on implementing automated E2E tests with Postman, the section below begins with a use case that explains the trade-offs a dev or QA analyst might face when intending to use Postman for early testing. Each use case represents scenarios that facilitate the end goal of automated E2E tests. Use Case - Hands-on Functional Testing Of Endpoints A developer or QA analyst would like to locally test input data against API services all sharing a common oauth2 token. As a result, they use Postman to craft an API test suite of Postman Collections that can be locally executed against individual endpoints across environments. After validating that their Postman Collection works, they share it with their team. Steps may look like the following: For each of your existing API services, use the Postman IDE's import feature to import its OpenAPI Spec (Swagger) as a Postman Collection. If a service is not already using Swagger, look for language specific guidance on how to use Swagger to generate an OpenAPI Spec for your service. Finally, if your service only has a few endpoints, read Postman docs for guidance on how to manually build a Postman Collection. Provide extra clarity about a request in a Postman Collection by using Postman's Example feature to save its responses as examples. You can also simply add an example manually. Please read Postman docs for guidance on how to specify examples. Combine each Postman Collection into a centralized Postman Collection. Build Postman Environment files (local, Dev and/or QA) and parameterize all saved requests of the Postman Collection in a way that references the Postman Environment files. Use the Postman Script feature to create a shared prefetch script that automatically refreshes expired auth tokens per saved request. This would require referencing secrets from a Postman Environment file. // Please treat this as pseudocode, and adjust as necessary. /* The request to an oauth2 authorization endpoint that will issue a token based on provided credentials.*/ const oauth2Request = POST {...}; var getToken = true ; if ( pm . environment . get ( 'ACCESS_TOKEN_EXPIRY' ) <= ( new Date ()). getTime ()) { console . log ( 'Token is expired' ) } else { getToken = false ; console . 
log ( 'Token and expiry date are all good' ); } if ( getToken === true ) { pm . sendRequest ( oauth2Request , function ( _ , res ) { console . log ( 'Save the token' ) var responseJson = res . json (); pm . environment . set ( 'token' , responseJson . access_token ) console . log ( 'Save the expiry date' ) var expiryDate = new Date (); expiryDate . setSeconds ( expiryDate . getSeconds () + responseJson . expires_in ); pm . environment . set ( 'ACCESS_TOKEN_EXPIRY' , expiryDate . getTime ()); }); } Use Postman IDE to exercise endpoints. Export collection and environment files then remove any secrets before committing to your repo. Starting with this approach has the following upsides: You've set yourself up for the beginning stages of an E2E postman collection by aggregating the collections into a single file and using environment files to make it easier to switch environments. Token is refreshed automatically on every call in the collection. This saves you time normally lost from manually having to request a token that expired. Grants QA/Dev granular control of submitting combinations of input data per endpoint. Grants developers a common experience via Postman IDE features. Ending with this approach has the following downsides: Promotes unsafe sharing of secrets. Credentials needed to request JWT token in the prefetch script are being manually shared. Secrets may happen to get exposed in the git commit history for various reasons (ex. Sharing the exported Postman Environment files). Collections can only be used locally to hit APIs (local or deployed). Not CI based. Each developer has to keep both their Postman Collection and Postman environment file(s) updated in order to keep up with latest changes to deployed services. Use Case - Hands-on Functional Testing Of Endpoints with Azure Key Vault and Azure App Config A developer or QA analyst may have an existing API test suite of Postman Collections, however, they now want to discourage unsafe sharing of secrets. As a result, they build a script that connects to both Key Vault and Azure App Config in order to automatically generate Postman Environment files instead of checking them into a shared repository. Steps may look like the following: Create an Azure Key Vault and store authentication secrets per environment: - \"Key:value\" (ex. \"dev-auth-password:12345\" ) - \"Key:value\" (ex. \"qa-auth-password:12345\" ) Create a shared Azure App Configuration instance and save all your Postman environment variables. This instance will be dedicated to holding all your Postman environment variables: > NOTE: Use the Label feature to delineate between environments. - \"Key:value\" -> \"apiRoute:url\" (ex. \"servicename:https://servicename.net\" & Label = \"QA\" ) - \"Key:value\" -> \"Header:value\" (ex. \"token: \" & Label = \"QA\" ) - \"Key:value\" -> \"KeyVaultKey:KeyVaultSecret\" (ex. \"authpassword:qa-auth-password\" & Label = \"QA\" ) Install Powershell or Bash. Powershell works for both Azure Powershell and Azure CLI. Download Azure CLI, login to the appropriate subscription and ensure you have access to the appropriate resources. Some helpful commands are below: # login to the appropriate subscription az login # validate login az account show # validate access to Key Vault az keyvault secret list - -vault-name \"$KeyvaultName\" # validate access to App Configuration az appconfig kv list - -name \"$AppConfigName\" Build a script that automatically generates your environment files. 
> Note: App Configuration references Key Vault, however, your script is responsible for authenticating properly to both App Configuration and Key Vault. The two services don't communicate directly. ```powershell (CreatePostmanEnvironmentFiles.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ env = $arg1 # 1. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 2. step through envVars array to get Key Vault uris keyvaultURI = \"\" $envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 3. parse uris for Key Vault name and secret names # 4. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 5. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 6. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII -FilePath .\\$env.postman_environment.json ``` Use Postman IDE to import the Postman Environment files to be referenced by your collection. This approach has the following upsides: Inherits all the upsides of the previous case. Discourages unsafe sharing of secrets. Secrets are now pulled from Key Vault via Azure CLI. Key Vault Uri also no longer needs to be shared for access to auth tokens. Single source of truth for Postman Environment files. There's no longer a need to share them via repo. Developer only has to manage a single Postman Collection. Ending with this approach has the following downsides: Secrets may happen to get exposed in the git commit history if .gitIgnore is not updated to ignore Postman Environment files. Collections can only be used locally to hit APIs (local or deployed). Not CI based. Use Case - E2E Testing with Continuous Integration and Newman A developer or QA analyst may have an existing API test suite of local Postman Collections that follow security best practices for development, however, they now want E2E tests to run as part of automated CI pipeline. With the advent of Newman, you can now more readily use Postman to craft an API test suite executable in your CI. Steps may look like the following: Update your Postman Collection to use the Postman Test feature in order to craft test assertions that will cover all saved requests E2E. Read Postman docs for guidance on how to use the Postman Test feature. Locally use Newman to validate tests are working as intended newman run tests \\ e2e_Postman_collection . json -e qa . postman_environment . json Build a script that automatically executes Postman Test assertions via Newman and Azure CLI. > NOTE: An Azure Service Principal must be setup to continue using azure cli in this CI pipeline example. ```powershell (RunPostmanE2eTests.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ # 1. login to Azure using a Service Principal az login --service-principal -u $APP_ID -p $AZURE_SECRET --tenant $AZURE_TENANT # 2. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 3. 
step through envVars array to get Key Vault uris keyvaultURI = \"\" @envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 4. parse uris for Key Vault name and secret names # 5. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 6. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 7. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII $env.postman_environment.json # 8. install Newman npm install --save-dev newman # 9. run automated E2E tests via Newman node_modules.bin\\newman run tests\\e2e_Postman_collection.json -e $env.postman_environment.json ``` Create a yaml file and define a step that will run your test script. (ex. A yaml file targeting Azure Devops that runs a Powershell script.) # Please treat this as pseudocode, and adjust as necessary. ############################################################ displayName : 'Run Postman E2E tests' inputs : targetType : 'filePath' filePath : RunPostmanE2eTests.ps1 env : APP_ID : $(environment.appId) # credentials for az cli AZURE_SECRET : $(environment.secret) AZURE_TENANT : $(environment.tenant) This approach has the following upside: E2E tests can now be run automatically as part of a CI pipeline. Ending with this approach has the following downside: Postman Environment files are no longer being output to a local environment for hands-on manual testing. However, this can be solved by managing 2 scripts.","title":"Postman Testing"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#postman-testing","text":"This purpose of this document is to provide guidance on how to use Newman in your CI/CD pipeline to run End-to-end (E2E) tests defined in Postman Collections while following security best practices. First, we'll introduce Postman and Newman and then outline several Postman testing use cases that answer why you may want to go beyond local testing with Postman Collections. In the final use case, we are looking to use a shell script that references the Postman Collection file path and Environment file path as inputs to Newman. Below is a flow diagram representing the outcome of the final use case:","title":"Postman Testing"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#postman-and-newman","text":"Postman is a free API platform for testing APIs. Key features highlighted in this guidance include: Postman Collections Postman Environment Files Postman Scripts Newman is a command-line Collection Runner for Postman. It enables you to run and test a Postman Collection directly from the command line. Key features highlighted in this guidance include: Newman Run Command","title":"Postman and Newman"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-a-collection","text":"A Postman Collection is a group of executable saved requests. 
A collection can be exported as a json file.","title":"What is a Collection"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-an-environment-file","text":"A Postman Environment file holds environment variables that can be referenced by a valid Postman Collection.","title":"What is an Environment File"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-a-postman-script","text":"A Postman Script is Javascript hosted within a Postman Collection that can be written to execute against your Postman Collection and Environment File.","title":"What is a Postman Script"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#what-is-the-newman-run-command","text":"A Newman CLI command that allows you to specify a Postman Collection to be run.","title":"What is the Newman Run Command"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#installing-postman-and-newman","text":"For specific instruction on installing Postman, visit the Downloads Postman page. For specific instruction on installing Newman, visit the NPMJS Newman package page.","title":"Installing Postman and Newman"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#implementing-automated-end-to-end-e2e-tests-with-postman-collections","text":"In order to provide guidance on implementing automated E2E tests with Postman, the section below begins with a use case that explains the trade-offs a dev or QA analyst might face when intending to use Postman for early testing. Each use case represents scenarios that facilitate the end goal of automated E2E tests.","title":"Implementing Automated End-to-end (E2E) Tests With Postman Collections"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#use-case-hands-on-functional-testing-of-endpoints","text":"A developer or QA analyst would like to locally test input data against API services all sharing a common oauth2 token. As a result, they use Postman to craft an API test suite of Postman Collections that can be locally executed against individual endpoints across environments. After validating that their Postman Collection works, they share it with their team. Steps may look like the following: For each of your existing API services, use the Postman IDE's import feature to import its OpenAPI Spec (Swagger) as a Postman Collection. If a service is not already using Swagger, look for language specific guidance on how to use Swagger to generate an OpenAPI Spec for your service. Finally, if your service only has a few endpoints, read Postman docs for guidance on how to manually build a Postman Collection. Provide extra clarity about a request in a Postman Collection by using Postman's Example feature to save its responses as examples. You can also simply add an example manually. Please read Postman docs for guidance on how to specify examples. Combine each Postman Collection into a centralized Postman Collection. Build Postman Environment files (local, Dev and/or QA) and parameterize all saved requests of the Postman Collection in a way that references the Postman Environment files. Use the Postman Script feature to create a shared prefetch script that automatically refreshes expired auth tokens per saved request. This would require referencing secrets from a Postman Environment file. // Please treat this as pseudocode, and adjust as necessary. 
/* The request to an oauth2 authorization endpoint that will issue a token based on provided credentials.*/ const oauth2Request = POST {...}; var getToken = true ; if ( pm . environment . get ( 'ACCESS_TOKEN_EXPIRY' ) <= ( new Date ()). getTime ()) { console . log ( 'Token is expired' ) } else { getToken = false ; console . log ( 'Token and expiry date are all good' ); } if ( getToken === true ) { pm . sendRequest ( oauth2Request , function ( _ , res ) { console . log ( 'Save the token' ) var responseJson = res . json (); pm . environment . set ( 'token' , responseJson . access_token ) console . log ( 'Save the expiry date' ) var expiryDate = new Date (); expiryDate . setSeconds ( expiryDate . getSeconds () + responseJson . expires_in ); pm . environment . set ( 'ACCESS_TOKEN_EXPIRY' , expiryDate . getTime ()); }); } Use Postman IDE to exercise endpoints. Export collection and environment files then remove any secrets before committing to your repo. Starting with this approach has the following upsides: You've set yourself up for the beginning stages of an E2E postman collection by aggregating the collections into a single file and using environment files to make it easier to switch environments. Token is refreshed automatically on every call in the collection. This saves you time normally lost from manually having to request a token that expired. Grants QA/Dev granular control of submitting combinations of input data per endpoint. Grants developers a common experience via Postman IDE features. Ending with this approach has the following downsides: Promotes unsafe sharing of secrets. Credentials needed to request JWT token in the prefetch script are being manually shared. Secrets may happen to get exposed in the git commit history for various reasons (ex. Sharing the exported Postman Environment files). Collections can only be used locally to hit APIs (local or deployed). Not CI based. Each developer has to keep both their Postman Collection and Postman environment file(s) updated in order to keep up with latest changes to deployed services.","title":"Use Case - Hands-on Functional Testing Of Endpoints"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#use-case-hands-on-functional-testing-of-endpoints-with-azure-key-vault-and-azure-app-config","text":"A developer or QA analyst may have an existing API test suite of Postman Collections, however, they now want to discourage unsafe sharing of secrets. As a result, they build a script that connects to both Key Vault and Azure App Config in order to automatically generate Postman Environment files instead of checking them into a shared repository. Steps may look like the following: Create an Azure Key Vault and store authentication secrets per environment: - \"Key:value\" (ex. \"dev-auth-password:12345\" ) - \"Key:value\" (ex. \"qa-auth-password:12345\" ) Create a shared Azure App Configuration instance and save all your Postman environment variables. This instance will be dedicated to holding all your Postman environment variables: > NOTE: Use the Label feature to delineate between environments. - \"Key:value\" -> \"apiRoute:url\" (ex. \"servicename:https://servicename.net\" & Label = \"QA\" ) - \"Key:value\" -> \"Header:value\" (ex. \"token: \" & Label = \"QA\" ) - \"Key:value\" -> \"KeyVaultKey:KeyVaultSecret\" (ex. \"authpassword:qa-auth-password\" & Label = \"QA\" ) Install Powershell or Bash. Powershell works for both Azure Powershell and Azure CLI. 
Download Azure CLI, login to the appropriate subscription and ensure you have access to the appropriate resources. Some helpful commands are below: # login to the appropriate subscription az login # validate login az account show # validate access to Key Vault az keyvault secret list - -vault-name \"$KeyvaultName\" # validate access to App Configuration az appconfig kv list - -name \"$AppConfigName\" Build a script that automatically generates your environment files. > Note: App Configuration references Key Vault, however, your script is responsible for authenticating properly to both App Configuration and Key Vault. The two services don't communicate directly. ```powershell (CreatePostmanEnvironmentFiles.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ env = $arg1 # 1. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 2. step through envVars array to get Key Vault uris keyvaultURI = \"\" $envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 3. parse uris for Key Vault name and secret names # 4. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 5. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 6. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII -FilePath .\\$env.postman_environment.json ``` Use Postman IDE to import the Postman Environment files to be referenced by your collection. This approach has the following upsides: Inherits all the upsides of the previous case. Discourages unsafe sharing of secrets. Secrets are now pulled from Key Vault via Azure CLI. Key Vault Uri also no longer needs to be shared for access to auth tokens. Single source of truth for Postman Environment files. There's no longer a need to share them via repo. Developer only has to manage a single Postman Collection. Ending with this approach has the following downsides: Secrets may happen to get exposed in the git commit history if .gitIgnore is not updated to ignore Postman Environment files. Collections can only be used locally to hit APIs (local or deployed). Not CI based.","title":"Use Case - Hands-on Functional Testing Of Endpoints with Azure Key Vault and Azure App Config"},{"location":"automated-testing/e2e-testing/recipes/postman-testing/#use-case-e2e-testing-with-continuous-integration-and-newman","text":"A developer or QA analyst may have an existing API test suite of local Postman Collections that follow security best practices for development, however, they now want E2E tests to run as part of automated CI pipeline. With the advent of Newman, you can now more readily use Postman to craft an API test suite executable in your CI. Steps may look like the following: Update your Postman Collection to use the Postman Test feature in order to craft test assertions that will cover all saved requests E2E. Read Postman docs for guidance on how to use the Postman Test feature. Locally use Newman to validate tests are working as intended newman run tests \\ e2e_Postman_collection . json -e qa . postman_environment . 
json Build a script that automatically executes Postman Test assertions via Newman and Azure CLI. > NOTE: An Azure Service Principal must be setup to continue using azure cli in this CI pipeline example. ```powershell (RunPostmanE2eTests.ps1) # Please treat this as pseudocode, and adjust as necessary. ############################################################ # 1. login to Azure using a Service Principal az login --service-principal -u $APP_ID -p $AZURE_SECRET --tenant $AZURE_TENANT # 2. list app config vars for an environment envVars = az appconfig kv list --name PostmanAppConfig --label $env | ConvertFrom-Json # 3. step through envVars array to get Key Vault uris keyvaultURI = \"\" @envVars | % {if($ .key -eq 'password'){keyvaultURI = $ .value}} # 4. parse uris for Key Vault name and secret names # 5. get secret from Key Vault kvsecret = az keyvault secret show --name $secretName --vault-name $keyvaultName --query \"value\" # 6. set password value to returned Key Vault secret $envVars | % {if($ .key -eq 'password'){$ .value=$kvsecret}} # 7. create environment file envFile = @{ \"_postman_variable_scope\" = \"environment\", \"name\" = $env, values = @() } foreach($var in $envVars){ $envFile.values += @{ key = $var.key; value = $var.value; } } $envFile | ConvertTo-Json -depth 50 | Out-File -encoding ASCII $env.postman_environment.json # 8. install Newman npm install --save-dev newman # 9. run automated E2E tests via Newman node_modules.bin\\newman run tests\\e2e_Postman_collection.json -e $env.postman_environment.json ``` Create a yaml file and define a step that will run your test script. (ex. A yaml file targeting Azure Devops that runs a Powershell script.) # Please treat this as pseudocode, and adjust as necessary. ############################################################ displayName : 'Run Postman E2E tests' inputs : targetType : 'filePath' filePath : RunPostmanE2eTests.ps1 env : APP_ID : $(environment.appId) # credentials for az cli AZURE_SECRET : $(environment.secret) AZURE_TENANT : $(environment.tenant) This approach has the following upside: E2E tests can now be run automatically as part of a CI pipeline. Ending with this approach has the following downside: Postman Environment files are no longer being output to a local environment for hands-on manual testing. However, this can be solved by managing 2 scripts.","title":"Use Case - E2E Testing with Continuous Integration and Newman"},{"location":"automated-testing/fault-injection-testing/","text":"Fault Injection Testing Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability . The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time. When To Use Problem Addressed Systems need to be resilient to the conditions that caused inevitable production disruptions. Modern applications are built with an increasing number of dependencies; on infrastructure, platform, network, 3rd party software or APIs, etc. Such systems increase the risk of impact from dependency disruptions. Each dependent component may fail. Furthermore, its interactions with other components may propagate the failure. Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build-time or at run-time, with the intention of \"embracing failure\" as part of the development lifecycle. 
These methods assist engineering teams in designing and continuously validating for failure, accounting for known and unknown failure conditions, architect for redundancy, employ retry and back-off mechanisms, etc. Applicable to Software - Error handling code paths, in-process memory management. Example tests: Edge-case unit/integration tests and/or load tests (i.e. stress and soak). Protocol - Vulnerabilities in communication interfaces such as command line parameters or APIs. Example tests: Fuzzing provides invalid, unexpected, or random data as input we can assess the level of protocol stability of a component. Infrastructure - Outages, networking issues, hardware failures. Example tests: Using different methods to cause fault in the underlying infrastructure such as Shut down virtual machine (VM) instances, crash processes, expire certificates, introduce network latency, etc. This level of testing relies on statistical metrics observations over time and measuring the deviations of its observed behavior during fault, or its recovery time. How to Use Architecture Terminology Fault - The adjudged or hypothesized cause of an error. Error - That part of the system state that may cause a subsequent failure. Failure - An event that occurs when the delivered service deviates from correct state. Fault-Error-Failure cycle - A key mechanism in dependability : A fault may cause an error. An error may cause further errors within the system boundary; therefore each new error acts as a fault. When error states are observed at the system boundary, they are termed failures. Fault Injection Testing Basics Fault injection is an advanced form of testing where the system is subjected to different failure modes , and where the testing engineer may know in advance what is the expected outcome, as in the case of release validation tests, or in an exploration to find potential issues in the product, which should be mitigated. Fault Injection and Chaos Engineering Fault injection testing is a specific approach to testing one condition. It introduces a failure into a system to validate its robustness. Chaos engineering, coined by Netflix, is a practice for generating new information. There is an overlap in concerns and often in tooling between the terms, and many times chaos engineering uses fault injection to introduce the required effects to the system. High-level Step-by-Step Fault injection testing in the development cycle Fault injection is an effective way to find security bugs in software, so much so that the Microsoft Security Development Lifecycle requires fuzzing at every untrusted interface of every product and penetration testing which includes introducing faults to the system, to uncover potential vulnerabilities resulting from coding errors, system configuration faults, or other operational deployment weaknesses. Automated fault injection coverage in a CI pipeline promotes a Shift-Left approach of testing earlier in the lifecycle for potential issues. Examples of performing fault injection during the development lifecycle: Using fuzzing tools in CI. Execute existing end-to-end scenario tests (such as integration or stress tests), which are augmented with fault injection. Write regression and acceptance tests based on issues that were found and fixed or based on resolved service incidents. Ad-hoc (manual) validations of fault in the dev environment for new features. 
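As a minimal, self-contained sketch of augmenting an existing test with an injected fault (one of the examples listed above), the following pytest-style tests wrap a dependency in a deterministic transient failure and assert that the caller's retry-with-backoff logic recovers. The call_with_retry wrapper and the flaky dependency are hypothetical stand-ins for your own components; the pattern, not the names, is the point.
# Minimal sketch: inject a transient fault into a dependency and verify that
# the code under test retries and recovers. All names below are hypothetical.
import time


class TransientError(Exception):
    pass


def call_with_retry(func, attempts=3, backoff_seconds=0.1):
    """Naive retry-with-backoff wrapper representing the code under test."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)


def make_flaky_dependency(failures_before_success: int):
    """Fault injection: fail deterministically N times, then succeed."""
    state = {"calls": 0}

    def dependency():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("injected fault")
        return "ok"

    return dependency


def test_retry_logic_survives_injected_transient_fault():
    flaky = make_flaky_dependency(failures_before_success=2)
    assert call_with_retry(flaky, attempts=3) == "ok"


def test_retry_logic_gives_up_after_budget_is_exhausted():
    always_failing = make_flaky_dependency(failures_before_success=10)
    try:
        call_with_retry(always_failing, attempts=3)
    except TransientError:
        pass  # expected: the injected fault outlasts the retry budget
    else:
        raise AssertionError("expected the injected fault to surface")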
Fault Injection Testing in the Release Cycle Much like Synthetic Monitoring Tests , fault injection testing in the release cycle is a part of Shift-Right testing approach, which uses safe methods to perform tests in a production or pre-production environment. Given the nature of distributed, cloud-based applications, it is very difficult to simulate the real behavior of services outside their production environment. Testers are encouraged to run tests where it really matters, on a live system with customer traffic. Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering: Measure and define a steady (healthy) state for the system's interoperability. Create hypotheses based on predicted behavior when a fault is introduced. Introduce real-world fault-events to the system. Measure the state and compare it to the baseline state. Document the process and the observations. Identify and act on the result. Fault Injection Testing in Kubernetes With the advancement of kubernetes (k8s) as the infrastructure platform, fault injection testing in kubernetes has become inevitable to ensure that system behaves in a reliable manner in the event of a fault or failure. There could be different type of workloads running within a k8s cluster which are written in different languages. For eg. within a K8s cluster, you can run a micro service, a web app and/or a scheduled job. Hence you need to have mechanism to inject fault into any kind of workloads running within the cluster. In addition, kubernetes clusters are managed differently from traditional infrastructure. The tools used for fault injection testing within kubernetes should have compatibility with k8s infrastructure. These are the main characteristics which are required: Ease of injecting fault into kubernetes pods. Support for faster tool installation within the cluster. Support for YAML based configurations which works well with kubernetes. Ease of customization to add custom resources. Support for workflows to deploy various workloads and faults. Ease of maintainability of the tool Ease of integration with telemetry Best Practices and Advice Experimenting in production has the benefit of running tests against a live system with real user traffic, ensuring its health, or building confidence in its ability to handle errors gracefully. However, it has the potential to cause unnecessary customer pain. A test can either succeed or fail. In the event of failure, there will likely be some impact on the production environment. Thinking about the Blast Radius of the effect, should the test fail, is a crucial step to conduct beforehand. The following practices may help minimize such risk: Run tests in a non-production environment first. Understand how the system behaves in a safe environment, using synthetic workload, before introducing potential risk to customer traffic. Use fault injection as gates in different stages through the CD pipeline. Deploy and test on Blue/Green and Canary deployments. Use methods such as traffic shadowing (a.k.a. Dark Traffic ) to get customer traffic to the staging slot. Strive to achieve a balance between collecting actual result data while affecting as few production users as possible. Use defensive design principles such as circuit breaking and the bulkhead patterns. Agreed on a budget (in terms of Service Level Objective (SLO)) as an investment in chaos and fault injection. 
Grow the risk incrementally - Start with hardening the core and expand out in layers. At each point, progress should be locked in with automated regression tests. Fault Injection Testing Frameworks and Tools Fuzzing OneFuzz - is a Microsoft open-source self-hosted fuzzing-as-a-service platform which is easy to integrate into CI pipelines. AFL and WinAFL - Popular fuzz tools by Google's project zero team which is used locally to target binaries on Linux or Windows. WebScarab - A web-focused fuzzer owned by OWASP which can be found in Kali linux distributions. Chaos Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An Openshift-specific chaos tool, maintained by Redhat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. Litmus - A CNCF open source tool for chaos testing and fault injection for kubernetes cluster. This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing. Conclusion From the principals of chaos: \"The harder it is to disrupt the steady-state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large\". Fault injection techniques increase resilience and confidence in the products we ship. They are used across the industry to validate applications and platforms before and while they are delivered to customers. Fault injection is a powerful tool and should be used with caution. Cases such as the Cloudflare 30 minute global outage , which was caused due to a deployment of code that was meant to be \u201cdark launched\u201d, entail the importance of curtailing the blast radius in the system during experiments. Resources Mark Russinovich's fault injection and chaos engineering blog post Cindy Sridharan's Testing in production blog post Cindy Sridharan's Testing in production blog post cont. Fault injection in Azure Search Azure Architecture Framework - Chaos engineering Azure Architecture Framework - Testing resilience Landscape of Software Failure Cause Models","title":"Fault Injection Testing"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing","text":"Fault injection testing is the deliberate introduction of errors and faults to a system to validate and harden its stability and reliability . The goal is to improve the system's design for resiliency and performance under intermittent failure conditions over time.","title":"Fault Injection Testing"},{"location":"automated-testing/fault-injection-testing/#when-to-use","text":"","title":"When To Use"},{"location":"automated-testing/fault-injection-testing/#problem-addressed","text":"Systems need to be resilient to the conditions that caused inevitable production disruptions. Modern applications are built with an increasing number of dependencies; on infrastructure, platform, network, 3rd party software or APIs, etc. Such systems increase the risk of impact from dependency disruptions. Each dependent component may fail. Furthermore, its interactions with other components may propagate the failure. 
Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build-time or at run-time, with the intention of \"embracing failure\" as part of the development lifecycle. These methods assist engineering teams in designing and continuously validating for failure: accounting for known and unknown failure conditions, architecting for redundancy, employing retry and back-off mechanisms, etc.","title":"Problem Addressed"},{"location":"automated-testing/fault-injection-testing/#applicable-to","text":"Software - Error handling code paths, in-process memory management. Example tests: Edge-case unit/integration tests and/or load tests (i.e. stress and soak). Protocol - Vulnerabilities in communication interfaces such as command line parameters or APIs. Example tests: Fuzzing provides invalid, unexpected, or random data as input so we can assess the level of protocol stability of a component. Infrastructure - Outages, networking issues, hardware failures. Example tests: Using different methods to cause faults in the underlying infrastructure, such as shutting down virtual machine (VM) instances, crashing processes, expiring certificates, introducing network latency, etc. This level of testing relies on statistical metrics observations over time, measuring the deviations of the system's observed behavior during a fault, or its recovery time.","title":"Applicable to"},{"location":"automated-testing/fault-injection-testing/#how-to-use","text":"","title":"How to Use"},{"location":"automated-testing/fault-injection-testing/#architecture","text":"","title":"Architecture"},{"location":"automated-testing/fault-injection-testing/#terminology","text":"Fault - The adjudged or hypothesized cause of an error. Error - That part of the system state that may cause a subsequent failure. Failure - An event that occurs when the delivered service deviates from the correct state. Fault-Error-Failure cycle - A key mechanism in dependability: A fault may cause an error. An error may cause further errors within the system boundary; therefore each new error acts as a fault. When error states are observed at the system boundary, they are termed failures.","title":"Terminology"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-basics","text":"Fault injection is an advanced form of testing where the system is subjected to different failure modes, and where the testing engineer may know in advance what the expected outcome is, as in the case of release validation tests, or may be exploring to find potential issues in the product, which should then be mitigated.","title":"Fault Injection Testing Basics"},{"location":"automated-testing/fault-injection-testing/#fault-injection-and-chaos-engineering","text":"Fault injection testing is a specific approach to testing one condition. It introduces a failure into a system to validate its robustness. Chaos engineering, coined by Netflix, is a practice for generating new information.
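As a small illustration of the fuzzing idea mentioned under the protocol-level examples above, the sketch below (assumptions: Python, the standard library json parser as the interface under test, and a purely random byte generator rather than a coverage-guided fuzzer such as OneFuzz or AFL) feeds random input to a parser and treats anything other than a controlled error as a robustness bug:

```python
import json
import random


def fuzz_json_parser(iterations: int = 10_000) -> None:
    """Feed random byte strings to a parser; only controlled errors are acceptable."""
    random.seed(42)  # reproducible corpus for regression runs
    for i in range(iterations):
        payload = bytes(random.randrange(256) for _ in range(random.randrange(64)))
        try:
            json.loads(payload)
        except ValueError:
            # Expected, controlled failure (includes JSONDecodeError and UnicodeDecodeError).
            pass
        except Exception as exc:
            # Anything else indicates an error-handling gap worth investigating.
            print(f"iteration {i}: unexpected {type(exc).__name__} for {payload!r}")
            raise


if __name__ == "__main__":
    fuzz_json_parser()
```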
There is an overlap in concerns and often in tooling between the terms, and many times chaos engineering uses fault injection to introduce the required effects to the system.","title":"Fault Injection and Chaos Engineering"},{"location":"automated-testing/fault-injection-testing/#high-level-step-by-step","text":"","title":"High-level Step-by-Step"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-in-the-development-cycle","text":"Fault injection is an effective way to find security bugs in software, so much so that the Microsoft Security Development Lifecycle requires fuzzing at every untrusted interface of every product and penetration testing which includes introducing faults to the system, to uncover potential vulnerabilities resulting from coding errors, system configuration faults, or other operational deployment weaknesses. Automated fault injection coverage in a CI pipeline promotes a Shift-Left approach of testing earlier in the lifecycle for potential issues. Examples of performing fault injection during the development lifecycle: Using fuzzing tools in CI. Execute existing end-to-end scenario tests (such as integration or stress tests), which are augmented with fault injection. Write regression and acceptance tests based on issues that were found and fixed or based on resolved service incidents. Ad-hoc (manual) validations of fault in the dev environment for new features.","title":"Fault injection testing in the development cycle"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-in-the-release-cycle","text":"Much like Synthetic Monitoring Tests , fault injection testing in the release cycle is a part of Shift-Right testing approach, which uses safe methods to perform tests in a production or pre-production environment. Given the nature of distributed, cloud-based applications, it is very difficult to simulate the real behavior of services outside their production environment. Testers are encouraged to run tests where it really matters, on a live system with customer traffic. Fault injection tests rely on metrics observability and are usually statistical; The following high-level steps provide a sample of practicing fault injection and chaos engineering: Measure and define a steady (healthy) state for the system's interoperability. Create hypotheses based on predicted behavior when a fault is introduced. Introduce real-world fault-events to the system. Measure the state and compare it to the baseline state. Document the process and the observations. Identify and act on the result.","title":"Fault Injection Testing in the Release Cycle"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-in-kubernetes","text":"With the advancement of kubernetes (k8s) as the infrastructure platform, fault injection testing in kubernetes has become inevitable to ensure that system behaves in a reliable manner in the event of a fault or failure. There could be different type of workloads running within a k8s cluster which are written in different languages. For eg. within a K8s cluster, you can run a micro service, a web app and/or a scheduled job. Hence you need to have mechanism to inject fault into any kind of workloads running within the cluster. In addition, kubernetes clusters are managed differently from traditional infrastructure. The tools used for fault injection testing within kubernetes should have compatibility with k8s infrastructure. 
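Before looking at tool characteristics, it helps to see how small the core fault primitive is. The sketch below is an assumption-heavy illustration: it shells out to kubectl (which must be installed and authenticated against the target cluster), and the namespace and label selector are hypothetical. It deletes a random pod to simulate an infrastructure fault; dedicated tools such as Litmus or Azure Chaos Studio wrap this kind of primitive with scheduling, workflows, and telemetry.

```python
import random
import subprocess


def delete_random_pod(namespace: str, label_selector: str) -> None:
    """Simulate an infrastructure fault by deleting one pod matching the selector."""
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", label_selector,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if not pods:
        raise RuntimeError("no pods matched the selector")
    victim = random.choice(pods)
    subprocess.run(["kubectl", "delete", "pod", victim, "-n", namespace], check=True)
    print(f"injected fault: deleted pod {victim}")


if __name__ == "__main__":
    # Hypothetical workload; observe recovery through your telemetry afterwards.
    delete_random_pod(namespace="orders", label_selector="app=orders-api")
```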
These are the main characteristics which are required: Ease of injecting fault into kubernetes pods. Support for faster tool installation within the cluster. Support for YAML based configurations which works well with kubernetes. Ease of customization to add custom resources. Support for workflows to deploy various workloads and faults. Ease of maintainability of the tool Ease of integration with telemetry","title":"Fault Injection Testing in Kubernetes"},{"location":"automated-testing/fault-injection-testing/#best-practices-and-advice","text":"Experimenting in production has the benefit of running tests against a live system with real user traffic, ensuring its health, or building confidence in its ability to handle errors gracefully. However, it has the potential to cause unnecessary customer pain. A test can either succeed or fail. In the event of failure, there will likely be some impact on the production environment. Thinking about the Blast Radius of the effect, should the test fail, is a crucial step to conduct beforehand. The following practices may help minimize such risk: Run tests in a non-production environment first. Understand how the system behaves in a safe environment, using synthetic workload, before introducing potential risk to customer traffic. Use fault injection as gates in different stages through the CD pipeline. Deploy and test on Blue/Green and Canary deployments. Use methods such as traffic shadowing (a.k.a. Dark Traffic ) to get customer traffic to the staging slot. Strive to achieve a balance between collecting actual result data while affecting as few production users as possible. Use defensive design principles such as circuit breaking and the bulkhead patterns. Agreed on a budget (in terms of Service Level Objective (SLO)) as an investment in chaos and fault injection. Grow the risk incrementally - Start with hardening the core and expand out in layers. At each point, progress should be locked in with automated regression tests.","title":"Best Practices and Advice"},{"location":"automated-testing/fault-injection-testing/#fault-injection-testing-frameworks-and-tools","text":"","title":"Fault Injection Testing Frameworks and Tools"},{"location":"automated-testing/fault-injection-testing/#fuzzing","text":"OneFuzz - is a Microsoft open-source self-hosted fuzzing-as-a-service platform which is easy to integrate into CI pipelines. AFL and WinAFL - Popular fuzz tools by Google's project zero team which is used locally to target binaries on Linux or Windows. WebScarab - A web-focused fuzzer owned by OWASP which can be found in Kali linux distributions.","title":"Fuzzing"},{"location":"automated-testing/fault-injection-testing/#chaos","text":"Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An Openshift-specific chaos tool, maintained by Redhat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. Litmus - A CNCF open source tool for chaos testing and fault injection for kubernetes cluster. 
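Whichever tool is chosen, the experiment itself follows the hypothesis-driven loop described earlier. A tool-agnostic sketch in Python (the helper callables are hypothetical placeholders for your chaos tooling and telemetry, and the SLO threshold is illustrative):

```python
import time


def run_experiment(inject_fault, remove_fault, measure_error_rate,
                   slo_error_rate=0.01, soak_seconds=300):
    """Hypothesis-driven fault injection: steady state -> fault -> compare -> act."""
    baseline = measure_error_rate()            # 1. measure the steady (healthy) state
    assert baseline <= slo_error_rate, "system unhealthy before the experiment"
    inject_fault()                             # 2-3. hypothesis + real-world fault event
    try:
        time.sleep(soak_seconds)
        observed = measure_error_rate()        # 4. measure and compare to the baseline
    finally:
        remove_fault()                         # always roll the fault back
    passed = observed <= slo_error_rate
    print(f"baseline={baseline:.4f} observed={observed:.4f} passed={passed}")
    return passed                              # 5-6. document and act on the result
```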
This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing.","title":"Chaos"},{"location":"automated-testing/fault-injection-testing/#conclusion","text":"From the principles of chaos: \"The harder it is to disrupt the steady-state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large\". Fault injection techniques increase resilience and confidence in the products we ship. They are used across the industry to validate applications and platforms before and while they are delivered to customers. Fault injection is a powerful tool and should be used with caution. Cases such as the Cloudflare 30 minute global outage, which was caused by a deployment of code that was meant to be \u201cdark launched\u201d, underline the importance of curtailing the blast radius in the system during experiments.","title":"Conclusion"},{"location":"automated-testing/fault-injection-testing/#resources","text":"Mark Russinovich's fault injection and chaos engineering blog post Cindy Sridharan's Testing in production blog post Cindy Sridharan's Testing in production blog post cont. Fault injection in Azure Search Azure Architecture Framework - Chaos engineering Azure Architecture Framework - Testing resilience Landscape of Software Failure Cause Models","title":"Resources"},{"location":"automated-testing/integration-testing/","text":"Integration Testing Integration testing is a software testing methodology used to determine how well individually developed components, or modules of a system, communicate with each other. This method of testing confirms that an aggregate of a system, or sub-system, works together correctly or otherwise exposes erroneous behavior between two or more units of code. Why Integration Testing Because one component of a system may be developed independently or in isolation from another, it is important to verify the interaction of some or all components. A complex system may be composed of databases, APIs, interfaces, and more, that all interact with each other or additional external systems. Integration tests expose system-level issues such as broken database schemas or faulty third-party API integration. It ensures higher test coverage and serves as an important feedback loop throughout development. Integration Testing Design Blocks Consider a banking application with three modules: login, transfers, and current balance, all developed independently. An integration test may verify that when a user logs in they are re-directed to their current balance with the correct amount for the specific mock user. Another integration test may perform a transfer of a specified amount of money. The test may confirm there are sufficient funds in the account to perform the transfer, and that after the transfer the current balance is updated appropriately for the mock user. The login page may be mocked with a test user and mock credentials if this module is not completed when testing the transfers module. Integration testing is done by the developer or QA tester. In the past, integration testing always happened after unit testing and before system and E2E testing. Compared to unit tests, integration tests are fewer in quantity, usually run slower, and are more expensive to set up and develop.
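For the banking scenario above, an integration test that exercises the transfers and balance modules together while stubbing the unfinished login module might look like the following pytest sketch (the bank.* module names and functions are hypothetical, and the mocking shown is only one of several reasonable approaches):

```python
# All bank.* modules and functions are hypothetical names for the banking scenario above.
from decimal import Decimal
from unittest.mock import patch

import pytest

from bank import accounts, transfers  # assumed application modules


@pytest.fixture
def mock_user():
    # The login module is not finished yet, so stub the user lookup it would provide.
    with patch("bank.auth.get_current_user", return_value="test-user-001"):
        yield "test-user-001"


def test_transfer_updates_balance(mock_user):
    accounts.set_balance(mock_user, Decimal("100.00"))

    # Exercise two independently developed modules together.
    transfers.transfer(from_user=mock_user, to_account="savings", amount=Decimal("40.00"))

    # The modules must agree on the post-transfer state.
    assert accounts.get_balance(mock_user) == Decimal("60.00")
```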
Now, if a team is following agile principles, integration tests can be performed before or after unit tests, early and often, as there is no need to wait for sequential processes. Additionally, integration tests can utilize mock data in order to simulate a complete system. There is an abundance of language-specific testing frameworks that can be used throughout the entire development lifecycle. It is important to note the difference between integration and acceptance testing. Integration testing confirms a group of components work together as intended from a technical perspective, while acceptance testing confirms a group of components work together as intended from a business scenario. Applying Integration Testing Prior to writing integration tests, the engineers must identify the different components of the system, and their intended behaviors, inputs, and outputs. The architecture of the project must be fully documented or specified somewhere that can be readily referenced (e.g., the architecture diagram). There are two main techniques for integration testing. Big Bang Big Bang integration testing is when all components are tested as a single unit. This is best for a small system, as in a system that is too large it may be difficult to localize the potential errors behind failed tests. This approach also requires all components in the system under test to be completed, which may delay when testing begins. Incremental Testing Incremental testing is when two or more components that are logically related are tested as a unit. After testing the unit, additional components are combined and tested all together. This process repeats until all necessary components are tested. Top Down Top down testing is when higher level components are tested following the control flow of a software system. In this scenario, what are commonly referred to as stubs are used to emulate the behavior of lower level modules not yet complete or merged into the integration test. Bottom Up Bottom up testing is when lower level modules are tested together. In this scenario, what are commonly referred to as drivers are used to emulate the behavior of higher level modules not yet complete or included in the integration test. A third approach known as the sandwich or hybrid model combines the bottom up and top down approaches to test lower and higher level components at the same time. Things to Avoid There is a tradeoff a developer must make between integration test code coverage and engineering cycles. With mock dependencies, test data, and multiple environments under test, too many integration tests are infeasible to maintain and become increasingly less meaningful. Too much mocking will slow down the test suite, make scaling difficult, and may be a sign the developer should consider other tests for the scenario, such as acceptance or E2E tests. Integration tests of complex systems require high maintenance. Avoid testing business logic in integration tests by keeping test suites separate. Do not test beyond the acceptance criteria of the task and be sure to clean up any resources created for a given test. Additionally, avoid writing tests in a production environment. Instead, write them in a scaled-down copy environment. Integration Testing Frameworks and Tools Many tools and frameworks can be used to write both unit and integration tests. The following tools are for automating integration tests.
JUnit Robot Framework moq Cucumber Selenium Behave (Python) Conclusion Integration testing demonstrates how one module of a system, or external system, interfaces with another. This can be a test of two components, a sub-system, a whole system, or a collection of systems. Tests should be written frequently and throughout the entire development lifecycle using an appropriate amount of mocked dependencies and test data. Because integration tests prove that independently developed modules interface as technically designed, it increases confidence in the development cycle providing a path for a system that deploys and scales. Resources Integration testing approaches Integration testing pros and cons Integration tests mocks and stubs Software Testing: Principles and Practices Integration testing Behave test quick start","title":"Integration Testing"},{"location":"automated-testing/integration-testing/#integration-testing","text":"Integration testing is a software testing methodology used to determine how well individually developed components, or modules of a system communicate with each other. This method of testing confirms that an aggregate of a system, or sub-system, works together correctly or otherwise exposes erroneous behavior between two or more units of code.","title":"Integration Testing"},{"location":"automated-testing/integration-testing/#why-integration-testing","text":"Because one component of a system may be developed independently or in isolation of another it is important to verify the interaction of some or all components. A complex system may be composed of databases, APIs, interfaces, and more, that all interact with each other or additional external systems. Integration tests expose system-level issues such as broken database schemas or faulty third-party API integration. It ensures higher test coverage and serves as an important feedback loop throughout development.","title":"Why Integration Testing"},{"location":"automated-testing/integration-testing/#integration-testing-design-blocks","text":"Consider a banking application with three modules: login, transfers, and current balance, all developed independently. An integration test may verify when a user logs in they are re-directed to their current balance with the correct amount for the specific mock user. Another integration test may perform a transfer of a specified amount of money. The test may confirm there are sufficient funds in the account to perform the transfer, and after the transfer the current balance is updated appropriately for the mock user. The login page may be mocked with a test user and mock credentials if this module is not completed when testing the transfers module. Integration testing is done by the developer or QA tester. In the past, integration testing always happened after unit and before system and E2E testing. Compared to unit-tests, integration tests are fewer in quantity, usually run slower, and are more expensive to set up and develop. Now, if a team is following agile principles, integration tests can be performed before or after unit tests, early and often, as there is no need to wait for sequential processes. Additionally, integration tests can utilize mock data in order to simulate a complete system. There is an abundance of language-specific testing frameworks that can be used throughout the entire development lifecycle. It is important to note the difference between integration and acceptance testing. 
Integration testing confirms a group of components work together as intended from a technical perspective, while acceptance testing confirms a group of components work together as intended from a business scenario.","title":"Integration Testing Design Blocks"},{"location":"automated-testing/integration-testing/#applying-integration-testing","text":"Prior to writing integration tests, the engineers must identify the different components of the system, and their intended behaviors and inputs and outputs. The architecture of the project must be fully documented or specified somewhere that can be readily referenced (e.g., the architecture diagram). There are two main techniques for integration testing.","title":"Applying Integration Testing"},{"location":"automated-testing/integration-testing/#big-bang","text":"Big Bang integration testing is when all components are tested as a single unit. This is best for small system as a system too large may be difficult to localize for potential errors from failed tests. This approach also requires all components in the system under test to be completed which may delay when testing begins.","title":"Big Bang"},{"location":"automated-testing/integration-testing/#incremental-testing","text":"Incremental testing is when two or more components that are logically related are tested as a unit. After testing the unit, additional components are combined and tested all together. This process repeats until all necessary components are tested.","title":"Incremental Testing"},{"location":"automated-testing/integration-testing/#top-down","text":"Top down testing is when higher level components are tested following the control flow of a software system. In the scenario, what is commonly referred to as stubs are used to emulate the behavior of lower level modules not yet complete or merged in the integration test.","title":"Top Down"},{"location":"automated-testing/integration-testing/#bottom-up","text":"Bottom up testing is when lower level modules are tested together. In the scenario, what is commonly referred to as drivers are used to emulate the behavior of higher level modules not yet complete or included in the integration test. A third approach known as the sandwich or hybrid model combines the bottom up and town down approaches to test lower and higher level components at the same time.","title":"Bottom Up"},{"location":"automated-testing/integration-testing/#things-to-avoid","text":"There is a tradeoff a developer must make between integration test code coverage and engineering cycles. With mock dependencies, test data, and multiple environments at test, too many integration tests are infeasible to maintain and become increasingly less meaningful. Too much mocking will slow down the test suite, make scaling difficult, and may be a sign the developer should consider other tests for the scenario such as acceptance or E2E. Integration tests of complex systems require high maintenance. Avoid testing business logic in integration tests by keeping test suites separate. Do not test beyond the acceptance criteria of the task and be sure to clean up any resources created for a given test. Additionally, avoid writing tests in a production environment. Instead, write them in a scaled-down copy environment.","title":"Things to Avoid"},{"location":"automated-testing/integration-testing/#integration-testing-frameworks-and-tools","text":"Many tools and frameworks can be used to write both unit and integration tests. The following tools are for automating integration tests. 
JUnit Robot Framework moq Cucumber Selenium Behave (Python) Conclusion Integration testing demonstrates how one module of a system, or external system, interfaces with another. This can be a test of two components, a sub-system, a whole system, or a collection of systems. Tests should be written frequently and throughout the entire development lifecycle using an appropriate amount of mocked dependencies and test data. Because integration tests prove that independently developed modules interface as technically designed, it increases confidence in the development cycle, providing a path for a system that deploys and scales. Resources Integration testing approaches Integration testing pros and cons Integration tests mocks and stubs Software Testing: Principles and Practices Integration testing Behave test quick start","title":"Integration Testing"},{"location":"automated-testing/integration-testing/#integration-testing","text":"Integration testing is a software testing methodology used to determine how well individually developed components, or modules of a system, communicate with each other. This method of testing confirms that an aggregate of a system, or sub-system, works together correctly or otherwise exposes erroneous behavior between two or more units of code.","title":"Integration Testing"},{"location":"automated-testing/integration-testing/#why-integration-testing","text":"Because one component of a system may be developed independently or in isolation from another, it is important to verify the interaction of some or all components. A complex system may be composed of databases, APIs, interfaces, and more, that all interact with each other or additional external systems. Integration tests expose system-level issues such as broken database schemas or faulty third-party API integration. It ensures higher test coverage and serves as an important feedback loop throughout development.","title":"Why Integration Testing"},{"location":"automated-testing/integration-testing/#integration-testing-design-blocks","text":"Consider a banking application with three modules: login, transfers, and current balance, all developed independently. An integration test may verify that when a user logs in they are re-directed to their current balance with the correct amount for the specific mock user. Another integration test may perform a transfer of a specified amount of money. The test may confirm there are sufficient funds in the account to perform the transfer, and that after the transfer the current balance is updated appropriately for the mock user. The login page may be mocked with a test user and mock credentials if this module is not completed when testing the transfers module. Integration testing is done by the developer or QA tester. In the past, integration testing always happened after unit testing and before system and E2E testing. Compared to unit tests, integration tests are fewer in quantity, usually run slower, and are more expensive to set up and develop.
There are several categories of tests as well: Load Testing This is the subcategory of performance testing that focuses on validating the performance characteristics of a system when the system faces the load volumes expected during production operation. An Endurance Test or a Soak Test is a load test carried over a long duration ranging from several hours to days. Stress Testing This is the subcategory of performance testing that focuses on validating the performance characteristics of a system when the system faces extreme load. The goal is to evaluate how the system handles being pushed to its limits: does it recover (i.e., scale out) or does it just break and fail? Endurance Testing The goal of endurance testing is to make sure that the system can maintain good performance under extended periods of load. Spike Testing The goal of Spike testing is to validate that a software system can respond well to large and sudden spikes in load. Chaos Testing Chaos testing or Chaos engineering is the practice of experimenting on a system to build confidence that the system can withstand turbulent conditions in production. Its goal is to identify weaknesses before they manifest system wide. Developers often implement fallback procedures for service failure. Chaos testing arbitrarily shuts down different parts of the system to validate that fallback procedures function correctly. Best Practices Consider the following best practices for performance testing: Make one change at a time. Don't make multiple changes to the system between tests. If you do, you won't know which change caused the performance to improve or degrade. Automate testing. Strive to automate the setup and teardown of resources for a performance run as much as possible. Manual execution can lead to misconfigurations. Use different IP addresses. Some systems will throttle requests from a single IP address. If you are testing a system that has this type of restriction, you can use different IP addresses to simulate multiple users. Performance Monitor Metrics When executing the various types of testing approaches, whether it is stress, endurance, spike, or chaos testing, it is important to capture various metrics to see how the system performs. At the basic hardware level, there are four areas to consider. Physical disk Memory Processor Network These four areas are inextricably linked, meaning that poor performance in one area will lead to poor performance in another area. Engineers concerned with understanding application performance should focus on these four core areas. The classic example of how performance in one area can affect performance in another area is memory pressure. If an application's available memory is running low, the operating system will try to compensate for shortages in memory by transferring pages of data from memory to disk, thus freeing up memory. But this work requires help from the CPU and the physical disk. This means that when you look at performance when there are low amounts of memory, you will also notice spikes in disk activity as well as CPU. Physical Disk Almost all software systems are dependent on the performance of the physical disk. This is especially true for the performance of databases. More modern approaches to using SSDs for physical disk storage can dramatically improve the performance of applications. Here are some of the metrics that you can capture and analyze: Counter Description Avg.
Disk Queue Length This value is derived using the (Disk Transfers/sec)*(Disk sec/Transfer) counters. This metric describes the disk queue over time, smoothing out any quick spikes. Having any physical disk with an average queue length over 2 for prolonged periods of time can be an indication that your disk is a bottleneck. % Idle Time This is a measure of the percentage of time that the disk was idle, i.e., there are no pending disk requests from the operating system waiting to be completed. A low number here is a positive sign that the disk has excess capacity to service read or write requests from the operating system. Avg. Disk sec/Read and Avg. Disk sec/Write These both measure the latency of your disks. Latency is defined as the average time it takes for a disk transfer to complete. You obviously want numbers as low as possible, but you need to be careful to account for inherent speed differences between SSD and traditional spinning disks. For this counter it is important to define a baseline after the hardware is installed. Then use this value going forward to determine if you are experiencing any latency issues related to the hardware. Disk Reads/sec and Disk Writes/sec These counters each measure the total number of IO requests completed per second. Similar to the latency counters, good and bad values for these counters depend on your disk hardware, but values higher than your initial baseline don't normally point to a hardware issue in this case. This counter can be useful to identify spikes in disk I/O. Processor It is important to understand the amount of time spent in kernel or privileged mode. In general, if code is spending too much time executing operating system calls, that could be an area of concern because it will not allow you to run your user mode applications, such as your databases, Web servers/services, etc. The guideline is that the CPU should only spend about 20% of the total processor time running in kernel mode. Counter Description % Processor time This is the percentage of total elapsed time that the processor was busy executing. This counter can either be too high or too low. If your processor time is consistently below 40%, then there is a question as to whether you have over-provisioned your CPU. 70% is generally considered a good target number, and if you start going higher than 70%, you may want to explore why there is high CPU pressure. % Privileged (Kernel Mode) time This measures the percentage of elapsed time the processor spent executing in kernel mode. Since this counter takes into account only kernel operations, a high percentage of privileged time (greater than 25%) may indicate a driver or hardware issue that should be investigated. % User time The percentage of elapsed time the processor spent executing in user mode (your application code). A good guideline is to be consistently below 65%, as you want to have some buffer for both the kernel operations mentioned above as well as any other bursts of CPU required by other applications. Queue Length This is the number of threads that are ready to execute but waiting for a core to become available. On single core machines, a sustained value greater than 2-3 can mean that you have some CPU pressure. Similarly, for a multicore machine, divide the queue length by the number of cores, and if that is continuously greater than 2-3 there might be CPU pressure. Network Adapter Network speed is often a hidden culprit of poor performance. Finding the root cause of poor network performance is often difficult.
The source of issues can originate from bandwidth hogs such as videoconferencing, transaction data, network backups, and recreational videos. In fact, the three most common reasons for a network slowdown are: Congestion Data corruption Collisions Some of the tools that can help include: ifconfig netstat iperf tcpretrans tcpdump WireShark Troubleshooting network performance usually begins with checking the hardware. Typical things to explore are whether there are any loose wires and whether all routers are powered up. It is not always possible to do so, but sometimes a simple case of power cycling the modem or router can solve many problems. Network specialists often perform the following sequence of troubleshooting steps: Check the hardware Use IP config Use ping and tracert Perform DNS Check More advanced approaches often involve looking at some of the networking performance counters, as explained below. Network Counters The table above gives you some reference points to better understand what you can expect out of your network. Here are some counters that can help you understand where the bottlenecks might exist: Counter Description Bytes Received/sec The rate at which bytes are received over each network adapter. Bytes Sent/sec The rate at which bytes are sent over each network adapter. Bytes Total/sec The number of bytes sent and received over the network. Segments Received/sec The rate at which segments are received for the protocol. Segments Sent/sec The rate at which segments are sent. % Interrupt Time The percentage of time the processor spends receiving and servicing hardware interrupts. This value is an indirect indicator of the activity of devices that generate interrupts, such as network adapters. There is an important distinction between latency and throughput. Latency measures the time it takes for a packet to be transferred across the network, either in terms of a one-way transmission or a round-trip transmission. Throughput is different and attempts to measure the quantity of data being sent and received within a unit of time. Memory Counter Description Available MBs This counter represents the amount of memory that is available to applications that are executing. Low memory can trigger Page Faults, whereby additional pressure is put on the CPU to swap memory to and from the disk. If the amount of available memory dips below 10%, more memory should be obtained. Pages/sec This is actually the sum of the \"Pages Input/sec\" and \"Pages Output/sec\" counters, which is the rate at which pages are being read and written as a result of page faults. Small spikes with this value do not mean there is an issue, but sustained values of greater than 50 can mean that system memory is a bottleneck. Paging File(_Total)\\% Usage The percentage of the system page file that is currently in use. This is not directly related to performance, but you can run into serious application issues if the page file does become completely full and additional memory is still being requested by applications. Key Performance Testing Activities Performance testing activities vary depending on the subcategory of performance testing and the system's requirements and constraints. For specific guidance you can follow the link to the subcategory of performance tests listed above.
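Most of the counters described above have cross-platform equivalents that can be sampled programmatically during a test run. As a hedged example, the sketch below uses the third-party psutil library (an assumption, not something this playbook prescribes) to record the four core areas to a CSV file for later comparison against a baseline:

```python
import csv
import time

import psutil  # assumed third-party dependency: pip install psutil


def sample_counters(duration_s: int = 60, interval_s: int = 5, out_path: str = "perf_run.csv"):
    """Sample the four core areas (CPU, memory, disk, network) during a test run."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts", "cpu_pct", "mem_available_mb", "disk_read_bytes", "net_bytes_recv"])
        end = time.time() + duration_s
        while time.time() < end:
            disk = psutil.disk_io_counters()
            net = psutil.net_io_counters()
            writer.writerow([
                round(time.time(), 1),
                psutil.cpu_percent(interval=interval_s),     # blocks for interval_s
                psutil.virtual_memory().available // 2**20,  # available memory in MB
                disk.read_bytes,                             # cumulative disk reads
                net.bytes_recv,                              # cumulative network receive
            ])


if __name__ == "__main__":
    sample_counters()
```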
The following activities might be included depending on the performance test subcategory: Identify the Acceptance Criteria for the Tests This will generally include identifying the goals and constraints for the performance characteristics of the system. Plan and Design the Tests In general we need to consider the following points: Defining the load the application should be tested with Establishing the metrics to be collected Establish what tools will be used for the tests Establish the performance test frequency: whether the performance tests will be done as a part of the feature development sprints, or only prior to release to a major environment. Implementation Implement the performance tests according to the designed approach. Instrument the system and ensure that it is emitting the needed performance metrics. Test Execution Execute the tests and collect performance metrics. Result Analysis and Re-testing Analyze the results/performance metrics from the tests. Identify needed changes to tweak the system (i.e., code, infrastructure) to better accommodate the test objectives. Then test again. This cycle continues until the test objective is achieved. The Iterative Performance Test Template can be used to capture details about the test result for every iteration. Resources Patterns and Practices: Performance Testing Guidance for Web Applications","title":"Performance Testing"},{"location":"automated-testing/performance-testing/#performance-testing","text":"Performance Testing is an overloaded term that is used to refer to several subcategories of performance-related testing, each of which has a different purpose. A good description of overall performance testing is as follows: Performance testing is a type of testing intended to determine the responsiveness, throughput, reliability, and/or scalability of a system under a given workload. Performance Testing Guidance for Web Applications. Before getting into the different subcategories of performance tests let us understand why performance testing is typically done.","title":"Performance Testing"},{"location":"automated-testing/performance-testing/#why-performance-testing","text":"Performance testing is commonly conducted to accomplish one or more of the following: Tune the system's performance Identifying bottlenecks and issues with the system at different load levels. Comparing performance characteristics of the system for different system configurations. Come up with a scaling strategy for the system. Assist in capacity planning Capacity planning is the process of determining what type of hardware and software resources are required to run an application to support pre-defined performance goals. Capacity planning involves identifying business expectations, the periodic fluctuations of application usage, and considering the cost of running the hardware and software infrastructure. Assess the system's readiness for release: Evaluating the system's performance characteristics (response time, throughput) in a production-like environment. The goal is to ensure that performance goals can be achieved upon release.
Evaluate the performance impact of application changes Comparing the performance characteristics of an application after a change to the values of performance characteristics during previous runs (or baseline values), can provide an indication of performance issues (performance regression) or enhancements introduced due to a change","title":"Why Performance Testing"},{"location":"automated-testing/performance-testing/#key-performance-testing-categories","text":"Performance testing is a broad topic. There are many areas where you can perform tests. In broad strokes you can perform tests on the backend and on the front end. You can test the performance of individual components as well as testing the end-to-end functionality. There are several categories of tests as well:","title":"Key Performance Testing Categories"},{"location":"automated-testing/performance-testing/#load-testing","text":"This is the subcategory of performance testing that focuses on validating the performance characteristics of a system, when the system faces the load volumes which are expected during production operation. An Endurance Test or a Soak Test is a load test carried over a long duration ranging from several hours to days.","title":"Load Testing"},{"location":"automated-testing/performance-testing/#stress-testing","text":"This is the subcategory of performance testing that focuses on validating the performance characteristics of a system when the system faces extreme load. The goal is to evaluate how does the system handles being pressured to its limits, does it recover (i.e., scale-out) or does it just break and fail?","title":"Stress Testing"},{"location":"automated-testing/performance-testing/#endurance-testing","text":"The goal of endurance testing is to make sure that the system can maintain good performance under extended periods of load.","title":"Endurance Testing"},{"location":"automated-testing/performance-testing/#spike-testing","text":"The goal of Spike testing is to validate that a software system can respond well to large and sudden spikes.","title":"Spike Testing"},{"location":"automated-testing/performance-testing/#chaos-testing","text":"Chaos testing or Chaos engineering is the practice of experimenting on a system to build confidence that the system can withstand turbulent conditions in production. Its goal is to identify weaknesses before they manifest system wide. Developers often implement fallback procedures for service failure. Chaos testing arbitrarily shuts down different parts of the system to validate that fallback procedures function correctly.","title":"Chaos Testing"},{"location":"automated-testing/performance-testing/#best-practices","text":"Consider the following best practices for performance testing: Make one change at a time. Don't make multiple changes to the system between tests. If you do, you won't know which change caused the performance to improve or degrade. Automate testing. Strive to automate the setup and teardown of resources for a performance run as much as possible. Manual execution can lead to misconfigurations. Use different IP addresses. Some systems will throttle requests from a single IP address. 
If you are testing a system that has this type of restriction, you can use different IP addresses to simulate multiple users.","title":"Best Practices"},{"location":"automated-testing/performance-testing/#performance-monitor-metrics","text":"When executing the various types of testing approaches, whether it is stress, endurance, spike, or chaos testing, it is important to capture various metrics to see how the system performs. At the basic hardware level, there are four areas to consider. Physical disk Memory Processor Network These four areas are inextricably linked, meaning that poor performance in one area will lead to poor performance in another area. Engineers concerned with understanding application performance, should focus on these four core areas. The classic example of how performance in one area can affect performance in another area is memory pressure. If an application's available memory is running low, the operating system will try to compensate for shortages in memory by transferring pages of data from memory to disk, thus freeing up memory. But this work requires help from the CPU and the physical disk. This means that when you look at performance when there are low amounts of memory, you will also notice spikes in disk activity as well as CPU.","title":"Performance Monitor Metrics"},{"location":"automated-testing/performance-testing/#physical-disk","text":"Almost all software systems are dependent on the performance of the physical disk. This is especially true for the performance of databases. More modern approaches to using SSDs for physical disk storage can dramatically improve the performance of applications. Here are some of the metrics that you can capture and analyze: Counter Description Avg. Disk Queue Length This value is derived using the (Disk Transfers/sec)*(Disk sec/Transfer) counters. This metric describes the disk queue over time, smoothing out any quick spikes. Having any physical disk with an average queue length over 2 for prolonged periods of time can be an indication that your disk is a bottleneck. % Idle Time This is a measure of the percentage of time that the disk was idle. ie. there are no pending disk requests from the operating system waiting to be completed. A low number here is a positive sign that disk has excess capacity to service or write requests from the operating system. Avg. Disk sec/Read and Avg. Disk sec/Write These both measure the latency of your disks. Latency is defined as the average time it takes for a disk transfer to complete. You obviously want is low numbers as possible but need to be careful to account for inherent speed differences between SSD and traditional spinning disks. For this counter is important to define a baseline after the hardware is installed. Then use this value going forward to determine if you are experiencing any latency issues related to the hardware. Disk Reads/sec and Disk Writes/sec These counters each measure the total number of IO requests completed per second. Similar to the latency counters, good and bad values for these counters depend on your disk hardware but values higher than your initial baseline don't normally point to a hardware issue in this case. This counter can be useful to identify spikes in disk I/O.","title":"Physical Disk"},{"location":"automated-testing/performance-testing/#processor","text":"It is important to understand the amount of time spent in kernel or privileged mode. 
In general, if code is spending too much time executing operating system calls, that could be an area of concern because it will not allow you to run your user mode applications, such as your databases, Web servers/services, etc. The guideline is that the CPU should only spend about 20% of the total processor time running in kernel mode. Counter Description % Processor time This is the percentage of total elapsed time that the processor was busy executing. This counter can either be too high or too low. If your processor time is consistently below 40%, then there is a question as to whether you have over provisioned your CPU. 70% is generally considered a good target number and if you start going higher than 70%, you may want to explore why there is high CPU pressure. % Privileged (Kernel Mode) time This measures the percentage of elapsed time the processor spent executing in kernel mode. Since this counter takes into account only kernel operations a high percentage of privileged time (greater than 25%) may indicate driver or hardware issue that should be investigated. % User time The percentage of elapsed time the processor spent executing in user mode (your application code). A good guideline is to be consistently below 65% as you want to have some buffer for both the kernel operations mentioned above as well as any other bursts of CPU required by other applications. Queue Length This is the number of threads that are ready to execute but waiting for a core to become available. On single core machines a sustained value greater than 2-3 can mean that you have some CPU pressure. Similarly, for a multicore machine divide the queue length by the number of cores and if that is continuously greater than 2-3 there might be CPU pressure.","title":"Processor"},{"location":"automated-testing/performance-testing/#network-adapter","text":"Network speed is often a hidden culprit of poor performance. Finding the root cause to poor network performance is often difficult. The source of issues can originate from bandwidth hogs such as videoconferencing, transaction data, network backups, recreational videos. In fact, the three most common reasons for a network slow down are: Congestion Data corruption Collisions Some of the tools that can help include: ifconfig netstat iperf tcpretrans tcpdump WireShark Troubleshooting network performance usually begins with checking the hardware. Typical things to explore is whether there are any loose wires or checking that all routers are powered up. It is not always possible to do so, but sometimes a simple case of power recycling of the modem or router can solve many problems. Network specialists often perform the following sequence of troubleshooting steps: Check the hardware Use IP config Use ping and tracert Perform DNS Check More advanced approaches often involve looking at some of the networking performance counters, as explained below.","title":"Network Adapter"},{"location":"automated-testing/performance-testing/#network-counters","text":"The table above gives you some reference points to better understand what you can expect out of your network. Here are some counters that can help you understand where the bottlenecks might exist: Counter Description Bytes Received/sec The rate at which bytes are received over each network adapter. Bytes Sent/sec The rate at which bytes are sent over each network adapter. Bytes Total/sec The number of bytes sent and received over the network. 
Segments Received/sec The rate at which segments are received for the protocol Segments Sent/sec The rate at which segments are sent. % Interrupt Time The percentage of time the processor spends receiving and servicing hardware interrupts. This value is an indirect indicator of the activity of devices that generate interrupts, such as network adapters. There is an important distinction between latency and throughput . Latency measures the time it takes for a packet to be transferred across the network, either in terms of a one-way transmission or a round-trip transmission. Throughput is different and attempts to measure the quantity of data being sent and received within a unit of time.","title":"Network Counters"},{"location":"automated-testing/performance-testing/#memory","text":"Counter Description Available MBs This counter represents the amount of memory that is available to applications that are executing. Low memory can trigger Page Faults, whereby additional pressure is put on the CPU to swap memory to and from the disk. if the amount of available memory dips below 10%, more memory should be obtained. Pages/sec This is actually the sum of \"Pages Input/sec\" and \"Pages Output/sec\" counters which is the rate at which pages are being read and written as a result of pages faults. Small spikes with this value do not mean there is an issue but sustained values of greater than 50 can mean that system memory is a bottleneck. Paging File(_Total)\\% Usage The percentage of the system page file that is currently in use. This is not directly related to performance, but you can run into serious application issues if the page file does become completely full and additional memory is still being requested by applications.","title":"Memory"},{"location":"automated-testing/performance-testing/#key-performance-testing-activities","text":"Performance testing activities vary depending on the subcategory of performance testing and the system's requirements and constraints. For specific guidance you can follow the link to the subcategory of performance tests listed above. The following activities might be included depending on the performance test subcategory:","title":"Key Performance Testing Activities"},{"location":"automated-testing/performance-testing/#identify-the-acceptance-criteria-for-the-tests","text":"This will generally include identifying the goals and constraints for the performance characteristics of the system","title":"Identify the Acceptance Criteria for the Tests"},{"location":"automated-testing/performance-testing/#plan-and-design-the-tests","text":"In general we need to consider the following points: Defining the load the application should be tested with Establishing the metrics to be collected Establish what tools will be used for the tests Establish the performance test frequency: whether the performance tests be done as a part of the feature development sprints, or only prior to release to a major environment?","title":"Plan and Design the Tests"},{"location":"automated-testing/performance-testing/#implementation","text":"Implement the performance tests according to the designed approach. 
Instrument the system and ensure that it is emitting the needed performance metrics.","title":"Implementation"},{"location":"automated-testing/performance-testing/#test-execution","text":"Execute the tests and collect performance metrics.","title":"Test Execution"},{"location":"automated-testing/performance-testing/#result-analysis-and-re-testing","text":"Analyze the results/performance metrics from the tests. Identify needed changes to tweak the system (i.e., code, infrastructure) to better accommodate the test objectives. Then test again. This cycle continues until the test objective is achieved. The Iterative Performance Test Template can be used to capture details about the test result for every iteration.","title":"Result Analysis and Re-testing"},{"location":"automated-testing/performance-testing/#resources","text":"Patterns and Practices: Performance Testing Guidance for Web Applications","title":"Resources"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/","text":"Performance Test Iteration Template This document provides a template for capturing the results of performance tests. Performance tests are done in iterations and each iteration should have a clear goal. The results of any iteration are immutable, regardless of whether the goal was achieved or not. If the iteration failed or the goal is not achieved then a new iteration of testing is carried out with appropriate fixes. It is recommended to keep track of the recorded iterations to maintain a timeline of how the system evolved and which changes affected the performance in what way. Feel free to modify this template as needed. Iteration Template Goal Mention in bullet points the goal for this iteration of testing. The goal should be small and measurable within this iteration. Test Details Date: Date and time when this iteration started and ended. Duration: Time it took to complete this iteration. Application Code: Commit ID and link to the commit for the code(s) which are being tested in this iteration. Benchmarking Configuration: Application Configuration: In bullet points mention the configuration for the application that should be recorded. System Configuration: In bullet points mention the configuration of the infrastructure. Record different types of configurations. Usually application-specific configuration changes between iterations, whereas system or infrastructure configurations rarely change. Work Items List of links to relevant work items (task, story, bug) being tested in this iteration. Results In bullet points document the results from the test. - Attach any documents supporting the test results. - Add links to the dashboard for metrics and logs such as Application Insights. - Capture screenshots for metrics and include them in the results. Good candidates for this are CPU/Memory/Disk usage. Observations Observations are insights derived from test results. Keep the observations brief and as bullet points. Mention outcomes supporting the goal of the iteration. If any of the observations results in a work item (task, story, bug) then add the link to the work item together with the observation.","title":"Performance Test Iteration Template"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#performance-test-iteration-template","text":"This document provides a template for capturing the results of performance tests. Performance tests are done in iterations and each iteration should have a clear goal. The results of any iteration are immutable, regardless of whether the goal was achieved or not.
If the iteration failed or the goal is not achieved then a new iteration of testing is carried out with appropriate fixes. It is recommended to keep track of the recorded iterations to maintain a timeline of how system evolved and which changes affected the performance in what way. Feel free to modify this template as needed.","title":"Performance Test Iteration Template"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#iteration-template","text":"","title":"Iteration Template"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#goal","text":"Mention in bullet points the goal for this iteration of test. The goal should be small and measurable within this iteration.","title":"Goal"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#test-details","text":"Date : Date and time when this iteration started and ended Duration : Time it took to complete this iteration. Application Code : Commit id and link to the commit for the code(s) which are being tested in this iteration Benchmarking Configuration: Application Configuration: In bullet points mention the configuration for application that should be recorded System Configuration: In bullet points mention the configuration of the infrastructure Record different types of configurations. Usually application specific configuration changes between iterations whereas system or infrastructure configurations rarely change","title":"Test Details"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#work-items","text":"List of links to relevant work items (task, story, bug) being tested in this iteration.","title":"Work Items"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#results","text":"In bullet points document the results from the test. - Attach any documents supporting the test results. - Add links to the dashboard for metrics and logs such as Application Insights. - Capture screenshots for metrics and include it in the results. Good candidate for this is CPU/Memory/Disk usage.","title":"Results"},{"location":"automated-testing/performance-testing/iterative-perf-test-template/#observations","text":"Observations are insights derived from test results. Keep the observations brief and as bullet points. Mention outcomes supporting the goal of the iteration. If any of the observation results in a work item (task, story, bug) then add the link to the work item together with the observation.","title":"Observations"},{"location":"automated-testing/performance-testing/load-testing/","text":"Load Testing \" Load testing is performed to determine a system's behavior under both normal and anticipated peak load conditions. \" - Load testing - Wikipedia A load test is designed to determine how a system behaves under expected normal and peak workloads. Specifically its main purpose is to confirm if a system can handle the expected load level. Depending on the target system this could be concurrent users, requests per second or data size. Why Load Testing The main objective is to prove the system can behave normally under the expected normal load before releasing it to production. The criteria that define \"behave normally\" will depend on your target, this may be as simple as \"the system remains available\", but it could also include meeting a response time SLA or error rate. Additionally, the results of a load test can also be used as data to help with capacity planning and calculating scalability. 
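To make the notion of an expected load level concrete, the sketch below shows how a user-activity scenario can be expressed with Locust , one of the Python-based tools listed later on this page. The host, endpoint paths, task weights and wait times are hypothetical placeholders, not part of the original guidance.

```python
# Minimal Locust scenario sketch (hypothetical e-commerce endpoints).
# Example non-interactive run:
#   locust -f loadtest.py --headless --host https://staging.example.com \
#          --users 200 --spawn-rate 20 --run-time 10m
from locust import HttpUser, task, between


class ShopUser(HttpUser):
    # Simulated "think time" between user actions (1-5 seconds).
    wait_time = between(1, 5)

    @task(3)  # browsing is weighted higher than purchasing
    def browse_products(self):
        self.client.get("/products")

    @task(1)
    def add_to_cart(self):
        self.client.post("/cart", json={"productId": 42, "quantity": 1})
```

The number of concurrent users and the spawn rate are supplied at run time, which makes it straightforward to run the same scenario at both the expected normal and peak load levels.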
Load Testing Design Blocks There are a number of basic components that are required to carry out a load test. In order to have meaningful results the system needs to be tested in a production-like environment with a network and hardware which closely resembles the expected deployment environment. The load test will consist of a module which simulates user activity. Of course the composition of this \"user activity\" will vary based on the type of application being tested. For example, an e-commerce website might simulate user browsing and purchasing items, but an IoT data ingestion pipeline would simulate a stream of device readings. Please ensure the simulation is as close to real activity as possible, and consider not just volume but also patterns and variability. For example, if the simulator data is too uniform or predictable, then cache/hit ratios may impact your results. The load test will be initiated from a component external to the target system which can control the amount of load applied. This can be a single agent, but may need to scaled to multiple agents in order to achieve higher levels of activity. Although not required to run a load test, it is advisable to have monitoring and/or logging in place to be able to measure the impact of the test and discover potential bottlenecks. Applying the Load Testing Planning Identify key scenarios to measure - Gather these scenarios from Product Owner, they should provide a representative sample of real world traffic. The key activity of this phase is to agree on and define the load test cases. Determine expected normal and peak load for the scenarios - Determine a load level such as concurrent users or requests per second to find the size of the load test you will run. Identify success criteria metrics - These may be on testing side such as response time and error rate, or they may be on the system side such as CPU and memory usage. Agree on test matrix - Which load test cases should be run for which combinations of input parameters. Select the right tool - Many frameworks exist for load testing so consider if features and limitations are suitable for your needs (Some popular tools are listed below). This may also include development of a custom load test client, see Preparation phase below. Observability - Determine which metrics need to gathered to gain insight into throughput, latency, resource utilization, etc. Scalability - Determine the amount of scale needed by load generator, workload application, CPU, Memory, and network components needed to achieve testing goals. The use of kubernetes on the cloud can be used to make testing infinitely scalable. Preparation The key activity is to replace the end user client with a test bench that simulates one or more instances of the original client. For standard 3rd party tools it may suffice to configure the existing test UI before initiating the load tests. If a custom client is used, code development will be required: Custom development - Design for minimal impact/overhead. Be sure to capture only those features of the production client that are relevant from a load perspective. Does it matter if the same test is duplicated, or must the workload be unique for each test? Can all tests be run under the same user context? Test environment - Create test environment that resembles production environment. This includes the platform as well as external systems, e.g., data sources. Security contexts - Be sure to have all requisite security contexts for the test environment. 
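For non-interactive automation this typically means an OAuth2 client credentials flow rather than an interactive login (as elaborated below). A minimal sketch of acquiring such a token against the Microsoft identity platform is shown here; the tenant, client id, secret and scope values are hypothetical placeholders that would normally come from a secret store rather than source code.

```python
# Minimal sketch: acquire an access token via the OAuth2 client credentials
# flow, suitable for non-interactive load test automation.
# TENANT_ID, CLIENT_ID, CLIENT_SECRET and SCOPE are hypothetical placeholders.
import os
import requests

TENANT_ID = os.environ["TENANT_ID"]
CLIENT_ID = os.environ["CLIENT_ID"]
CLIENT_SECRET = os.environ["CLIENT_SECRET"]
SCOPE = "api://my-load-test-target/.default"

token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
response = requests.post(
    token_url,
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": SCOPE,
    },
    timeout=30,
)
response.raise_for_status()
access_token = response.json()["access_token"]

# The token can then be attached to every synthetic request, for example:
# headers={"Authorization": f"Bearer {access_token}"}
```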
Automation like pipelines may require special setup, e.g., OAuth2 client credential flow instead of auth code flow, because interactive login is replaced by non-interactive. Allow planning leeway in case admin approval is required for new security contexts. Test data strategy - Make sure that output data format (ascii/binary/...) is compatible with whatever analysis tool is used in the analysis phase. This also includes storage areas (local/cloud/...), which may trigger new security contexts. Bear in mind that it may be necessary to collect data from sources external to the application to correlate potential performance issues with the application behavior. This includes platform and network metrics. Make sure to collect data that covers analysis needs (statistical measures, distributions, graphs, etc.). Automation - Repeatability is critical. It must be possible to re-run a given test multiple times to verify consistency and resilience of the application itself and the underlying platform. Pipelines are recommended whenever possible. Evaluate whether load tests should be run as part of the PR strategy. Test client debugging - All test modules should be carefully debugged to ensure that the execution phase progresses smoothly. Test client validation - All test modules should be validated for extreme values of the input parameters. This reduces the risk of running into unexpected difficulties when stepping through the full test matrix during the execution phase. Execution It is recommended to use an existing testing framework (see below). These tools will provide a method of both specifying the user activity scenarios and how to execute those at load. Depending on the situation, it may be advisable to coordinate testing activities with the platform operations team. It is common to slowly ramp up to your desired load to better replicate real world behavior. Once you have reached your defined workload, maintain this level long enough to see if your system stabilizes. To finish up the test you should also ramp to see record how the system slows down as well. You should also consider the origin of your load test traffic. Depending on the scope of the target system you may want to initiate from a different location to better replicate real world traffic such as from a different region. Note: Before starting please be aware of any restrictions on your network such as DDOS protection where you may need to notify a network administrator or apply for an exemption. Note: In general, the preferred approach to load testing would be the usage of a standard test framework such as the ones discussed below. There are cases, however, where a custom test client may be advantageous. Examples include batch oriented workloads that can be run under a single security context and the same test data can be re-used for multiple load tests. In such a scenario it may be beneficial to develop a custom script that can be used interactively as well as non-interactively. Analysis The analysis phase represents the work that brings all previous activities together: Set aside time to allow for collection of new test data based on the analysis of the load tests. Correlate application metrics and platform metrics to identify potential pitfalls and bottlenecks. Include business stakeholders early in the analysis phase to validate application findings. Include platform operations to validate platform findings. Report Writing Summarize your findings from the analysis phase. 
Be sure to include application and platform enhancement suggestions, if any. Further Testing After completing your load test you should be set up to continue on to additional related testing such as; Soak Testing - Also known as Endurance Testing . Performing a load test over an extended period of time to ensure long term stability. Stress Testing - Gradually increasing the load to find the limits of the system and identify the maximum capacity. Spike Testing - Introduce a sharp short-term increase into the load scenarios. Scalability Testing - Re-testing of a system as your expand horizontally or vertically to measure how it scales. Distributed Testing - Distributed testing allows you to leverage the power of multiple machines to perform larger or more in-depth tests faster. Is necessary when a fully optimized node cannot produce the load required by your extremely large test. Load Generation Testing Frameworks and Tools Here are a few popular load testing frameworks you may consider, and the languages used to define your scenarios. Azure Load Testing ( https://learn.microsoft.com/en-us/azure/load-testing/ ) - Managed platform for running load tests on Azure. It allows to run and monitor tests automatically, source secrets from the KeyVault, generate traffic at scale, and load test Azure private endpoints. In the simple case, it executes load tests with HTTP GET traffic to a given endpoint. For the more complex cases, you can upload your own JMeter scenarios . JMeter ( https://github.com/apache/jmeter ) - Has built in patterns to test without coding, but can be extended with Java. Artillery ( https://artillery.io/ ) - Write your scenarios in Javascript, executes a node application. Gatling ( https://gatling.io/ ) - Write your scenarios in Scala with their DSL. Locust ( https://locust.io/ ) - Write your scenarios in Python using the concept of concurrent user activity. K6 ( https://k6.io/ ) - Write your test scenarios in Javascript, available as open source kubernetes operator, open source Docker image, or as SaaS. Particularly useful for distributed load testing. Integrates easily with prometheus. NBomber ( https://nbomber.com/ ) - Write your test scenarios in C# or F#, available integration with test runners (NUnit/xUnit). WebValidate ( https://github.com/microsoft/webvalidate ) - Web request validation tool used to run end-to-end tests and long-running performance and availability tests. Sample Workload Applications In the case where a specific workload application is not being provided and the focus is instead on the system, here are a few popular sample workload applications you may consider. HttpBin ( Python , GoLang ) - Supports variety of endpoint types and language implementations. Can echo data used in request. NGSA ( Java , C# ) - Intended for Kubernetes Platform and Monitoring Testing. Built on top of IMDB data store with many CRUD endpoints available. Does not need to have a live database connection. MockBin ( https://github.com/Kong/mockbin ) - Allows you to generate custom endpoints to test, mock, and track HTTP requests & responses between libraries, sockets and APIs. Conclusion A load test is critical step to understand if a target system will be reliable under the expected real world traffic. Of course, it's only as good as your ability to predict the expected load, so it's important to follow up with other further testing to truly understand how your system behaves in different situations. 
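As a small illustration of the analysis phase, the sketch below computes the kind of success-criteria metrics mentioned earlier (response time percentiles and error rate) from a set of collected request results. The in-line sample data is hypothetical; real results would be exported from the load testing tool or the monitoring backend.

```python
# Minimal sketch: derive success-criteria metrics from collected results.
# Each entry is (latency_ms, status_code).
import math
import statistics

results = [(120, 200), (135, 200), (980, 500), (140, 200), (133, 200)]

latencies = sorted(latency for latency, _ in results)
errors = sum(1 for _, status in results if status >= 400)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


print(f"p50 latency : {percentile(latencies, 50)} ms")
print(f"p95 latency : {percentile(latencies, 95)} ms")
print(f"mean latency: {statistics.mean(latencies):.1f} ms")
print(f"error rate  : {errors / len(results):.1%}")
```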
Resources List additional readings about this test type for those that would like to dive deeper. Microsoft Azure Well-Architected Framework > Load Testing","title":"Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#load-testing","text":"\" Load testing is performed to determine a system's behavior under both normal and anticipated peak load conditions. \" - Load testing - Wikipedia A load test is designed to determine how a system behaves under expected normal and peak workloads. Specifically its main purpose is to confirm if a system can handle the expected load level. Depending on the target system this could be concurrent users, requests per second or data size.","title":"Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#why-load-testing","text":"The main objective is to prove the system can behave normally under the expected normal load before releasing it to production. The criteria that define \"behave normally\" will depend on your target, this may be as simple as \"the system remains available\", but it could also include meeting a response time SLA or error rate. Additionally, the results of a load test can also be used as data to help with capacity planning and calculating scalability.","title":"Why Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#load-testing-design-blocks","text":"There are a number of basic components that are required to carry out a load test. In order to have meaningful results the system needs to be tested in a production-like environment with a network and hardware which closely resembles the expected deployment environment. The load test will consist of a module which simulates user activity. Of course the composition of this \"user activity\" will vary based on the type of application being tested. For example, an e-commerce website might simulate user browsing and purchasing items, but an IoT data ingestion pipeline would simulate a stream of device readings. Please ensure the simulation is as close to real activity as possible, and consider not just volume but also patterns and variability. For example, if the simulator data is too uniform or predictable, then cache/hit ratios may impact your results. The load test will be initiated from a component external to the target system which can control the amount of load applied. This can be a single agent, but may need to scaled to multiple agents in order to achieve higher levels of activity. Although not required to run a load test, it is advisable to have monitoring and/or logging in place to be able to measure the impact of the test and discover potential bottlenecks.","title":"Load Testing Design Blocks"},{"location":"automated-testing/performance-testing/load-testing/#applying-the-load-testing","text":"","title":"Applying the Load Testing"},{"location":"automated-testing/performance-testing/load-testing/#planning","text":"Identify key scenarios to measure - Gather these scenarios from Product Owner, they should provide a representative sample of real world traffic. The key activity of this phase is to agree on and define the load test cases. Determine expected normal and peak load for the scenarios - Determine a load level such as concurrent users or requests per second to find the size of the load test you will run. Identify success criteria metrics - These may be on testing side such as response time and error rate, or they may be on the system side such as CPU and memory usage. 
Agree on test matrix - Which load test cases should be run for which combinations of input parameters. Select the right tool - Many frameworks exist for load testing so consider if features and limitations are suitable for your needs (Some popular tools are listed below). This may also include development of a custom load test client, see Preparation phase below. Observability - Determine which metrics need to gathered to gain insight into throughput, latency, resource utilization, etc. Scalability - Determine the amount of scale needed by load generator, workload application, CPU, Memory, and network components needed to achieve testing goals. The use of kubernetes on the cloud can be used to make testing infinitely scalable.","title":"Planning"},{"location":"automated-testing/performance-testing/load-testing/#preparation","text":"The key activity is to replace the end user client with a test bench that simulates one or more instances of the original client. For standard 3rd party tools it may suffice to configure the existing test UI before initiating the load tests. If a custom client is used, code development will be required: Custom development - Design for minimal impact/overhead. Be sure to capture only those features of the production client that are relevant from a load perspective. Does it matter if the same test is duplicated, or must the workload be unique for each test? Can all tests be run under the same user context? Test environment - Create test environment that resembles production environment. This includes the platform as well as external systems, e.g., data sources. Security contexts - Be sure to have all requisite security contexts for the test environment. Automation like pipelines may require special setup, e.g., OAuth2 client credential flow instead of auth code flow, because interactive login is replaced by non-interactive. Allow planning leeway in case admin approval is required for new security contexts. Test data strategy - Make sure that output data format (ascii/binary/...) is compatible with whatever analysis tool is used in the analysis phase. This also includes storage areas (local/cloud/...), which may trigger new security contexts. Bear in mind that it may be necessary to collect data from sources external to the application to correlate potential performance issues with the application behavior. This includes platform and network metrics. Make sure to collect data that covers analysis needs (statistical measures, distributions, graphs, etc.). Automation - Repeatability is critical. It must be possible to re-run a given test multiple times to verify consistency and resilience of the application itself and the underlying platform. Pipelines are recommended whenever possible. Evaluate whether load tests should be run as part of the PR strategy. Test client debugging - All test modules should be carefully debugged to ensure that the execution phase progresses smoothly. Test client validation - All test modules should be validated for extreme values of the input parameters. This reduces the risk of running into unexpected difficulties when stepping through the full test matrix during the execution phase.","title":"Preparation"},{"location":"automated-testing/performance-testing/load-testing/#execution","text":"It is recommended to use an existing testing framework (see below). These tools will provide a method of both specifying the user activity scenarios and how to execute those at load. 
Depending on the situation, it may be advisable to coordinate testing activities with the platform operations team. It is common to slowly ramp up to your desired load to better replicate real world behavior. Once you have reached your defined workload, maintain this level long enough to see if your system stabilizes. To finish up the test you should also ramp to see record how the system slows down as well. You should also consider the origin of your load test traffic. Depending on the scope of the target system you may want to initiate from a different location to better replicate real world traffic such as from a different region. Note: Before starting please be aware of any restrictions on your network such as DDOS protection where you may need to notify a network administrator or apply for an exemption. Note: In general, the preferred approach to load testing would be the usage of a standard test framework such as the ones discussed below. There are cases, however, where a custom test client may be advantageous. Examples include batch oriented workloads that can be run under a single security context and the same test data can be re-used for multiple load tests. In such a scenario it may be beneficial to develop a custom script that can be used interactively as well as non-interactively.","title":"Execution"},{"location":"automated-testing/performance-testing/load-testing/#analysis","text":"The analysis phase represents the work that brings all previous activities together: Set aside time to allow for collection of new test data based on the analysis of the load tests. Correlate application metrics and platform metrics to identify potential pitfalls and bottlenecks. Include business stakeholders early in the analysis phase to validate application findings. Include platform operations to validate platform findings.","title":"Analysis"},{"location":"automated-testing/performance-testing/load-testing/#report-writing","text":"Summarize your findings from the analysis phase. Be sure to include application and platform enhancement suggestions, if any.","title":"Report Writing"},{"location":"automated-testing/performance-testing/load-testing/#further-testing","text":"After completing your load test you should be set up to continue on to additional related testing such as; Soak Testing - Also known as Endurance Testing . Performing a load test over an extended period of time to ensure long term stability. Stress Testing - Gradually increasing the load to find the limits of the system and identify the maximum capacity. Spike Testing - Introduce a sharp short-term increase into the load scenarios. Scalability Testing - Re-testing of a system as your expand horizontally or vertically to measure how it scales. Distributed Testing - Distributed testing allows you to leverage the power of multiple machines to perform larger or more in-depth tests faster. Is necessary when a fully optimized node cannot produce the load required by your extremely large test.","title":"Further Testing"},{"location":"automated-testing/performance-testing/load-testing/#load-generation-testing-frameworks-and-tools","text":"Here are a few popular load testing frameworks you may consider, and the languages used to define your scenarios. Azure Load Testing ( https://learn.microsoft.com/en-us/azure/load-testing/ ) - Managed platform for running load tests on Azure. It allows to run and monitor tests automatically, source secrets from the KeyVault, generate traffic at scale, and load test Azure private endpoints. 
In the simple case, it executes load tests with HTTP GET traffic to a given endpoint. For the more complex cases, you can upload your own JMeter scenarios . JMeter ( https://github.com/apache/jmeter ) - Has built in patterns to test without coding, but can be extended with Java. Artillery ( https://artillery.io/ ) - Write your scenarios in Javascript, executes a node application. Gatling ( https://gatling.io/ ) - Write your scenarios in Scala with their DSL. Locust ( https://locust.io/ ) - Write your scenarios in Python using the concept of concurrent user activity. K6 ( https://k6.io/ ) - Write your test scenarios in Javascript, available as open source kubernetes operator, open source Docker image, or as SaaS. Particularly useful for distributed load testing. Integrates easily with prometheus. NBomber ( https://nbomber.com/ ) - Write your test scenarios in C# or F#, available integration with test runners (NUnit/xUnit). WebValidate ( https://github.com/microsoft/webvalidate ) - Web request validation tool used to run end-to-end tests and long-running performance and availability tests.","title":"Load Generation Testing Frameworks and Tools"},{"location":"automated-testing/performance-testing/load-testing/#sample-workload-applications","text":"In the case where a specific workload application is not being provided and the focus is instead on the system, here are a few popular sample workload applications you may consider. HttpBin ( Python , GoLang ) - Supports variety of endpoint types and language implementations. Can echo data used in request. NGSA ( Java , C# ) - Intended for Kubernetes Platform and Monitoring Testing. Built on top of IMDB data store with many CRUD endpoints available. Does not need to have a live database connection. MockBin ( https://github.com/Kong/mockbin ) - Allows you to generate custom endpoints to test, mock, and track HTTP requests & responses between libraries, sockets and APIs.","title":"Sample Workload Applications"},{"location":"automated-testing/performance-testing/load-testing/#conclusion","text":"A load test is critical step to understand if a target system will be reliable under the expected real world traffic. Of course, it's only as good as your ability to predict the expected load, so it's important to follow up with other further testing to truly understand how your system behaves in different situations.","title":"Conclusion"},{"location":"automated-testing/performance-testing/load-testing/#resources","text":"List additional readings about this test type for those that would like to dive deeper. Microsoft Azure Well-Architected Framework > Load Testing","title":"Resources"},{"location":"automated-testing/shadow-testing/","text":"Shadow Testing Shadow testing is one approach to reduce risks before going to production. Shadow testing is also known as \"Shadow Deployment\" or \"Shadowing Traffic\" and similarities with \"Dark launching\". When to Use Shadow Testing reduces risks when you consider replacing the current environment (V-Current) with candidate environment with new feature (V-Next). This approach is monitoring and capturing differences between two environments then compare and reduces all risks before you introduce a new feature/release. In our test cases, code coverage is very important however sometimes providing code coverage can be tricky to replicate real-life combinations and possibilities. 
In this approach, to test the V-Next environment we use a side-by-side deployment: we replicate the traffic that goes to the V-Current environment and direct the same traffic to the V-Next environment. The only difference is that we don't return any response from the V-Next environment to users; instead we collect those responses to compare with the V-Current responses. One of the Principles of Chaos Engineering mentions the importance of sampling real traffic: Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic. With this shadow testing approach we leverage real customer behavior in the V-Next environment by sampling real traffic, mitigating the risks which users may face in production. At the same time we test the V-Next environment infrastructure for scaling with real sampled traffic: V-Next should scale the same way V-Current does. We test the actual behavior of the product, and this causes zero impact to production when testing new features, since traffic is replicated to the V-Next environment. There are some similarities with Dark Launching , which proposes to integrate the new feature into production code while users can't use the feature. On the backend you can test your feature and improve its performance until it's acceptable. It is also similar to Feature Toggles which provide the ability to enable/disable your new feature in production at the UI level. With this approach your new feature will be visible to users, and you can collect feedback. Using Dark Launching with Feature Toggles can be very useful for introducing a new feature. Applicable to Production deployments : V-Next in shadow testing always works separately and does not affect production. Users are not affected by this test. Infrastructure : Shadow testing replicates the same traffic, so the test environment receives the same traffic as production. This helps to produce real-life test scenarios. Handling Scale : All traffic is replicated, and you have a chance to see how your system scales. Shadow Testing Frameworks and Tools There are some tools to implement shadow testing. The main purpose of these tools is to compare the responses of V-Current and V-Next and then find the differences. Diffy Envoy McRouter Scientist Keploy One of the most popular tools is Diffy . It was created and used at Twitter. Now the original author and a former Twitter employee maintains their own version of this project, called Opendiffy . Twitter announced this tool on their engineering blog as \" Testing services without writing tests \". As of today, Diffy is used in production by companies such as Twitter, Airbnb, Baidu and ByteDance. Diffy explains the shadow testing feature like this: Diffy finds potential bugs in your service using running instances of your new code, and your old code side by side. Diffy behaves as a proxy and multicasts whatever requests it receives to each of the running instances. It then compares the responses, and reports any regressions that may surface from those comparisons.
The premise for Diffy is that if two implementations of the service return \u201csimilar\u201d responses for a sufficiently large and diverse set of requests, then the two implementations can be treated as equivalent, and the newer implementation is regression-free. Diffy architecture Conclusion Shadow Testing is a useful approach to reduce risks when you consider replacing the current environment with candidate environment using new feature(s). Shadow testing replicates traffic of the production to candidate environment for testing, so you get same production use case scenarios in the test environment. You can compare differences on both environments and validate your candidate environment to be ready for releasing. Some advantages of shadow testing are: Zero impact to production environment No need to generate test scenarios and test data We can test real-life scenarios with real-life data. We can simulate scale with replicated production traffic. Resources Martin Fowler - Dark Launching Martin Fowler - Feature Toggle Traffic Shadowing/Mirroring","title":"Shadow Testing"},{"location":"automated-testing/shadow-testing/#shadow-testing","text":"Shadow testing is one approach to reduce risks before going to production. Shadow testing is also known as \"Shadow Deployment\" or \"Shadowing Traffic\" and similarities with \"Dark launching\".","title":"Shadow Testing"},{"location":"automated-testing/shadow-testing/#when-to-use","text":"Shadow Testing reduces risks when you consider replacing the current environment (V-Current) with candidate environment with new feature (V-Next). This approach is monitoring and capturing differences between two environments then compare and reduces all risks before you introduce a new feature/release. In our test cases, code coverage is very important however sometimes providing code coverage can be tricky to replicate real-life combinations and possibilities. In this approach, to test V-Next environment we have side by side deployment, we're replicating the same traffic with V-Current environment and directing same traffic to V-Next environment, the only difference is we don't return any response from V-Next environment to users, but we collect those responses to compare with V-Current responses. Referencing back to one of the Principles of Chaos Engineering, mentions importance of sampling real traffic like below: Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic. With this Shadow Testing approach we're leveraging real customer behavior in V-Next environment with sampling real traffic and mitigating the risks which users may face on production. At the same time we're testing V-Next environment infrastructure for scaling with real sampled traffic. V-Next should scale with the same way V-Current does. We're testing actual behavior of the product and this cause zero impact to production to test new features since traffic is replicated to V-next environment. There are some similarities with Dark Launching , Dark Launching proposes to integrate new feature into production code, but users can't use the feature. On the backend you can test your feature and improve the performance until it's acceptable. 
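To make the "proxy and compare" mechanism described above concrete, here is a minimal Python sketch of a mirroring proxy: it forwards each incoming request to both V-Current and V-Next, returns only the V-Current response to the caller, and logs any differences. The upstream URLs are hypothetical, and a production-grade setup would use one of the listed tools rather than hand-rolled code.

```python
# Minimal shadow-traffic sketch (not production-grade): mirror each request
# to V-Current and V-Next, respond from V-Current only, log any differences.
# V_CURRENT and V_NEXT are hypothetical upstream URLs.
import logging

import requests
from flask import Flask, Response, request

V_CURRENT = "http://v-current.internal"
V_NEXT = "http://v-next.internal"

app = Flask(__name__)
log = logging.getLogger("shadow")


@app.route("/", defaults={"path": ""}, methods=["GET", "POST"])
@app.route("/<path:path>", methods=["GET", "POST"])
def mirror(path):
    kwargs = dict(params=request.args, data=request.get_data(), timeout=10)
    current = requests.request(request.method, f"{V_CURRENT}/{path}", **kwargs)
    try:
        candidate = requests.request(request.method, f"{V_NEXT}/{path}", **kwargs)
        if (candidate.status_code, candidate.text) != (current.status_code, current.text):
            log.warning("Diff on %s /%s: current=%s candidate=%s",
                        request.method, path, current.status_code, candidate.status_code)
    except requests.RequestException:
        log.exception("V-Next call failed for %s /%s", request.method, path)
    # Only the V-Current response is ever returned to the user.
    return Response(current.content, status=current.status_code)
```

In practice the candidate call should be fired asynchronously so it cannot add latency or failures to the user-facing path, which is exactly the kind of concern tools such as Diffy or Envoy's traffic mirroring handle for you.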
It is also similar to Feature Toggles which provides you with an ability to enable/disable your new feature in production on a UI level. With this approach your new feature will be visible to users, and you can collect feedback. Using Dark Launching with Feature Toggles can be very useful for introducing a new feature.","title":"When to Use"},{"location":"automated-testing/shadow-testing/#applicable-to","text":"Production deployments : V-Next in Shadow testing always working separately and not effecting production. Users are not effected with this test. Infrastructure : Shadow testing replicating the same traffic, in test environment you can have the same traffic on the production. It helps to produce real life test scenarios Handling Scale : All traffic is replicated, and you have a chance to see how your system scaling.","title":"Applicable to"},{"location":"automated-testing/shadow-testing/#shadow-testing-frameworks-and-tools","text":"There are some tools to implement shadow testing. The main purpose of these tools is to compare responses of V-Current and V-Next then find the differences. Diffy Envoy McRouter Scientist Keploy One of the most popular tools is Diffy . It was created and used at Twitter. Now the original author and a former Twitter employee maintains their own version of this project, called Opendiffy . Twitter announced this tool on their engineering blog as \" Testing services without writing tests \". As of today Diffy is used in production by Twitter, Airbnb, Baidu and Bytedance companies. Diffy explains the shadow testing feature like this: Diffy finds potential bugs in your service using running instances of your new code, and your old code side by side. Diffy behaves as a proxy and multicasts whatever requests it receives to each of the running instances. It then compares the responses, and reports any regressions that may surface from those comparisons. The premise for Diffy is that if two implementations of the service return \u201csimilar\u201d responses for a sufficiently large and diverse set of requests, then the two implementations can be treated as equivalent, and the newer implementation is regression-free. Diffy architecture","title":"Shadow Testing Frameworks and Tools"},{"location":"automated-testing/shadow-testing/#conclusion","text":"Shadow Testing is a useful approach to reduce risks when you consider replacing the current environment with candidate environment using new feature(s). Shadow testing replicates traffic of the production to candidate environment for testing, so you get same production use case scenarios in the test environment. You can compare differences on both environments and validate your candidate environment to be ready for releasing. Some advantages of shadow testing are: Zero impact to production environment No need to generate test scenarios and test data We can test real-life scenarios with real-life data. We can simulate scale with replicated production traffic.","title":"Conclusion"},{"location":"automated-testing/shadow-testing/#resources","text":"Martin Fowler - Dark Launching Martin Fowler - Feature Toggle Traffic Shadowing/Mirroring","title":"Resources"},{"location":"automated-testing/smoke-testing/","text":"Smoke Testing Smoke tests, sometimes named Sanity , Acceptance , or Build/Release Verification tests, are a sub-type of system/functional tests that are usually used as gates that verify the application's readiness as a preliminary step. 
If an application passes the smoke tests, it is acceptable, or in a stable-enough state, for the next stages of testing or deployment. When To Use Problem Addressed Smoke tests are meant to find, as early as possible, if an application is working or not. The goal of smoke tests is to save time; if the current version of the application does not pass smoke tests, then the rest of the integration or deployment chain for it can be abandoned. Smoke tests do not aim to provide full functionality coverage but instead focus on a few quick acceptance invocations for which the application should, at all times, respond correctly to. ROI Tipping Point Smoke tests cover only the most critical application path, and should not be used to actually test the application's behavior, keeping execution time and complexity to minimum. The tests can be formed of a subset of the application's integration or e2e tests, and they cover as much of the functionality with as little depth as required. The golden rule of a good smoke test is that it saves time on validating that the application is acceptable to a stage where better, more thorough testing will begin. Applicable to Local dev desktop - Example: Applying manual smoke testing to verify that the application is OK. Build pipelines - Example: Running a small set of the integration test suite before running the full coverage of tests, which may take a long time. Non-production and Production deployments - Example: Running a curl command to the product's API and asserting the response is 200 before running load test which consume resources. PR Validation - Example: - Deploying the application chart to a test namespace and validating the release is successful and no immediate regressions are merged. Conclusion Smoke testing is a low-effort, high-impact step to ship more reliable software. It should be considered amongst the first stages to implement when planning continuously integrated and delivered systems. Resources Wikipedia - Smoke Testing Google SRE Book - System Tests","title":"Smoke Testing"},{"location":"automated-testing/smoke-testing/#smoke-testing","text":"Smoke tests, sometimes named Sanity , Acceptance , or Build/Release Verification tests, are a sub-type of system/functional tests that are usually used as gates that verify the application's readiness as a preliminary step. If an application passes the smoke tests, it is acceptable, or in a stable-enough state, for the next stages of testing or deployment.","title":"Smoke Testing"},{"location":"automated-testing/smoke-testing/#when-to-use","text":"","title":"When To Use"},{"location":"automated-testing/smoke-testing/#problem-addressed","text":"Smoke tests are meant to find, as early as possible, if an application is working or not. The goal of smoke tests is to save time; if the current version of the application does not pass smoke tests, then the rest of the integration or deployment chain for it can be abandoned. Smoke tests do not aim to provide full functionality coverage but instead focus on a few quick acceptance invocations for which the application should, at all times, respond correctly to.","title":"Problem Addressed"},{"location":"automated-testing/smoke-testing/#roi-tipping-point","text":"Smoke tests cover only the most critical application path, and should not be used to actually test the application's behavior, keeping execution time and complexity to minimum. 
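As a concrete illustration of the curl-style gate mentioned in the Applicable to examples above, a minimal smoke test could look like the following sketch; the base URL and health path are hypothetical placeholders typically injected by the pipeline.

```python
# Minimal smoke test sketch: fail fast if the deployed API is not healthy.
# BASE_URL is a hypothetical placeholder supplied by the deployment pipeline.
import os
import sys

import requests

BASE_URL = os.environ.get("BASE_URL", "https://staging.example.com")


def main() -> int:
    try:
        response = requests.get(f"{BASE_URL}/health", timeout=5)
    except requests.RequestException as exc:
        print(f"Smoke test failed: {exc}")
        return 1
    if response.status_code != 200:
        print(f"Smoke test failed: expected 200, got {response.status_code}")
        return 1
    print("Smoke test passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```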
The tests can be formed of a subset of the application's integration or e2e tests, and they cover as much of the functionality with as little depth as required. The golden rule of a good smoke test is that it saves time on validating that the application is acceptable to a stage where better, more thorough testing will begin.","title":"ROI Tipping Point"},{"location":"automated-testing/smoke-testing/#applicable-to","text":"Local dev desktop - Example: Applying manual smoke testing to verify that the application is OK. Build pipelines - Example: Running a small set of the integration test suite before running the full coverage of tests, which may take a long time. Non-production and Production deployments - Example: Running a curl command to the product's API and asserting the response is 200 before running load test which consume resources. PR Validation - Example: - Deploying the application chart to a test namespace and validating the release is successful and no immediate regressions are merged.","title":"Applicable to"},{"location":"automated-testing/smoke-testing/#conclusion","text":"Smoke testing is a low-effort, high-impact step to ship more reliable software. It should be considered amongst the first stages to implement when planning continuously integrated and delivered systems.","title":"Conclusion"},{"location":"automated-testing/smoke-testing/#resources","text":"Wikipedia - Smoke Testing Google SRE Book - System Tests","title":"Resources"},{"location":"automated-testing/synthetic-monitoring-tests/","text":"Synthetic Monitoring Tests Synthetic Monitoring Tests are a set of functional tests that target a live system in production. The focus of these tests, which are sometimes named \"watchdog\", \"active monitoring\" or \"synthetic transactions\", is to verify the product's health and resilience continuously. Why Synthetic Monitoring Tests Traditionally, software providers rely on testing through CI/CD stages in the well known testing pyramid (unit, integration, e2e) to validate that the product is healthy and without regressions. Such tests will run on the build agent or in the test/stage environment before being deployed to production and released to live user traffic. During the services' lifetime in the production environment, they are safeguarded by monitoring and alerting tools that rely on Real User Metrics/Monitoring ( RUM ). However, as more organizations today provide highly-available (99.9+ SLA) products, they find that the nature of long-lived distributed applications, which typically rely on several hardware and software components, is to fail. Frequent releases (sometimes multiple times per day) of various components of the system can create further instability. This rapid rate of change to the production environment tends to make testing during CI/CD stages not hermetic and actually not representative of the end user experience and how the production system actually behaves. For such systems, the ambition of service engineering teams is to reduce to a minimum the time it takes to fix errors, or the MTTR - Mean Time To Repair . It is a continuous effort, performed on the live/production system. Synthetic Monitors can be used to detect the following issues: Availability - Is the system or specific region available. Transactions and customer journeys - Known good requests should work, while known bad requests should error. Performance - How fast are actions and is that performance maintained through high loads and through version releases. 
3rd Party components - Cloud or software components used by the system may fail. Shift-Right Testing Synthetic Monitoring tests are a subset of tests that run in production, sometimes named Test-in-Production or Shift-Right tests. With Shift-Left paradigms that are so popular, the approach is to perform testing as early as possible in the application development lifecycle (i.e., moved left on the project timeline). Shift right compliments and adds on top of Shift-Left. It refers to running tests late in the cycle, during deployment, release, and post-release when the product is serving production traffic. They provide modern engineering teams a broader set of tools to assure high SLAs over time. Synthetic Monitoring Tests Design Blocks A synthetic monitoring test is a test that uses synthetic data and real testing accounts to inject user behaviors to the system and validates their effect, usually by passively relying on existing monitoring and alerting capabilities. Components of synthetic monitoring tests include Probes , test code/ accounts which generates data, and Monitoring tools placed to validate both the system's behavior under test and the health of the probes themselves. Probes Probes are the source of synthetic user actions that drive testing. They target the product's front-end or publicly-facing APIs and are running on their own production environment. A Synthetic Monitoring test is, in fact, very related to black-box tests and would usually focus on end-to-end scenarios from a user's perspective. It is not uncommon for the same code for e2e or integration tests to be used to implement the probe. Monitoring Given that Synthetic Monitoring tests are continuously running, at intervals, in a production environment, the assertion of system behavior through analysis relies on existing monitoring pillars used in live system (Logging, Metrics, Distributed Tracing). There would usually be a finite set of tests, and key metrics that are used to build monitors and alerts to assert against the known SLO , and verify that the OKR for that system are maintained. The monitoring tools are effectively capturing both RUMs and synthetic data generated by the probes. Applying Synthetic Monitoring Tests Asserting the System under Test Synthetic monitoring tests are usually statistical. Test metrics are compared against some historical or running average with a time dimension (Example: Over the last 30 days, for this time of day, the mean average response time is 250ms for AddToCart operation with a standard deviation from the mean of +/- 32ms) . So if an observed measurement is within a deviation of the norm at any time, the services are probably healthy. Building a Synthetic Monitoring Solution At a high level, building synthetic monitors usually consists of the following steps: Determine the metric to be validated (functional result, latency, etc.) Build a piece of automation that measures that metric against the system, and gathers telemetry into the system's existing monitoring infrastructure. Set up monitoring alarms/actions/responses that detect the failure of the system to meet the desired goal of the metric. Run the test case automation continuously at an appropriate interval. Monitoring the Health of Tests Probes runtime is a production environment on its own, and the health of tests is critical. Many providers offer cloud-based systems that host such runtimes, while some organizations use existing production environments to run these tests on. 
In either way, a monitor-the-monitor strategy should be a first-class citizen of the production environment's alerting systems. Synthetic Monitoring and Real User Monitoring Synthetic monitoring does not replace the need for RUM. Probes are predictable code that verifies specific scenarios, and they do not 100% completely and truly represent how a user session is handled. On the other hand, prefer not to use RUMs to test for site reliability because: As the name implies, RUM requires user traffic. The site may be down, but since no user visited the monitored path, no alerts were triggered yet. Inconsistent Traffic and usage patterns make it hard to gauge for benchmarks. Risks Testing in production, in general, has a risk factor attached to it, which does not exist tests executed during CI/CD stages. Specifically, in synthetic monitoring tests, the following may affect the production environment: Corrupted or invalid data - Tests inject test data which may be in some ways corrupt. Consider using a testing schema. Protected data leakage - Tests run in a production environment and emit logs or trace that may contain protected data. Overloaded systems - Synthetic tests may cause errors or overload the system. Unintended side effects or impacts on other production systems. Skewed analytics (traffic funnels, A/B test results, etc.) Auth/AuthZ - Tests are required to run in production where access to tokens and secrets may be restricted or more challenging to retrieve. Synthetic Monitoring Tests Frameworks and Tools Most key monitoring/APM players have an enterprise product that supports synthetic monitoring built into their systems (see list below). Such offerings make some of the risks raised above irrelevant as the integration and runtime aspects of the solution are OOTB. However, such solutions are typically pricey. Some organizations prefer running probes on existing infrastructure using known tools such as Postman , Wrk , JMeter , Selenium or even custom code to generate the synthetic data. Such solutions must account for isolating and decoupling the probe's production environment from the core product's as well as provide monitoring, geo-distribution, and maintaining test health. Application Insights availability - Simple availability tests that allow some customization using Multi-step web test DataDog Synthetics Dynatrace Synthetic Monitoring New Relic Synthetics Checkly Conclusion The value of production tests, in general, and specifically Synthetic monitoring, is only there for particular engagement types, and there is associated risk and cost to them. However, when applicable, they provide continuous assurance that there are no system failures from a user's perspective. When developing a PaaS/SaaS solution, Synthetic monitoring is key to the success of service reliability teams, and they are becoming an integral part of the quality assurance stack of highly available products. Resources Google SRE book - Testing Reliability Microsoft DevOps Architectures - Shift Right to Test in Production Martin Fowler - Synthetic Monitoring","title":"Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-tests","text":"Synthetic Monitoring Tests are a set of functional tests that target a live system in production. 
The focus of these tests, which are sometimes named \"watchdog\", \"active monitoring\" or \"synthetic transactions\", is to verify the product's health and resilience continuously.","title":"Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#why-synthetic-monitoring-tests","text":"Traditionally, software providers rely on testing through CI/CD stages in the well known testing pyramid (unit, integration, e2e) to validate that the product is healthy and without regressions. Such tests will run on the build agent or in the test/stage environment before being deployed to production and released to live user traffic. During the services' lifetime in the production environment, they are safeguarded by monitoring and alerting tools that rely on Real User Metrics/Monitoring ( RUM ). However, as more organizations today provide highly-available (99.9+ SLA) products, they find that the nature of long-lived distributed applications, which typically rely on several hardware and software components, is to fail. Frequent releases (sometimes multiple times per day) of various components of the system can create further instability. This rapid rate of change to the production environment tends to make testing during CI/CD stages not hermetic and actually not representative of the end user experience and how the production system actually behaves. For such systems, the ambition of service engineering teams is to reduce to a minimum the time it takes to fix errors, or the MTTR - Mean Time To Repair . It is a continuous effort, performed on the live/production system. Synthetic Monitors can be used to detect the following issues: Availability - Is the system or specific region available. Transactions and customer journeys - Known good requests should work, while known bad requests should error. Performance - How fast are actions and is that performance maintained through high loads and through version releases. 3rd Party components - Cloud or software components used by the system may fail.","title":"Why Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#shift-right-testing","text":"Synthetic Monitoring tests are a subset of tests that run in production, sometimes named Test-in-Production or Shift-Right tests. With Shift-Left paradigms that are so popular, the approach is to perform testing as early as possible in the application development lifecycle (i.e., moved left on the project timeline). Shift right compliments and adds on top of Shift-Left. It refers to running tests late in the cycle, during deployment, release, and post-release when the product is serving production traffic. They provide modern engineering teams a broader set of tools to assure high SLAs over time.","title":"Shift-Right Testing"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-tests-design-blocks","text":"A synthetic monitoring test is a test that uses synthetic data and real testing accounts to inject user behaviors to the system and validates their effect, usually by passively relying on existing monitoring and alerting capabilities. 
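As a minimal illustration of such a test, the sketch below runs one synthetic transaction against a public endpoint, measures latency, and emits the result as telemetry. The endpoint, payload and the emit_metric helper are hypothetical placeholders for whatever monitoring stack is already in place (Application Insights, Prometheus, Datadog, etc.).

```python
# Minimal synthetic probe sketch: run one synthetic transaction and record
# latency and outcome into the existing monitoring pipeline.
# ENDPOINT and emit_metric() are hypothetical placeholders.
import time

import requests


def emit_metric(name: str, value: float, tags: dict) -> None:
    # Placeholder: a real probe would push this to the monitoring backend.
    print(f"{name}={value:.1f} tags={tags}")


def run_probe(endpoint: str = "https://shop.example.com/api/cart") -> None:
    start = time.perf_counter()
    try:
        response = requests.post(endpoint, json={"productId": 42}, timeout=10)
        success = response.status_code == 200
    except requests.RequestException:
        success = False
    elapsed_ms = (time.perf_counter() - start) * 1000
    emit_metric("synthetic.add_to_cart.latency_ms", elapsed_ms,
                {"success": success, "region": "westeurope"})


if __name__ == "__main__":
    run_probe()  # in production this would run continuously, on a schedule
```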
Components of synthetic monitoring tests include Probes , test code/ accounts which generates data, and Monitoring tools placed to validate both the system's behavior under test and the health of the probes themselves.","title":"Synthetic Monitoring Tests Design Blocks"},{"location":"automated-testing/synthetic-monitoring-tests/#probes","text":"Probes are the source of synthetic user actions that drive testing. They target the product's front-end or publicly-facing APIs and are running on their own production environment. A Synthetic Monitoring test is, in fact, very related to black-box tests and would usually focus on end-to-end scenarios from a user's perspective. It is not uncommon for the same code for e2e or integration tests to be used to implement the probe.","title":"Probes"},{"location":"automated-testing/synthetic-monitoring-tests/#monitoring","text":"Given that Synthetic Monitoring tests are continuously running, at intervals, in a production environment, the assertion of system behavior through analysis relies on existing monitoring pillars used in live system (Logging, Metrics, Distributed Tracing). There would usually be a finite set of tests, and key metrics that are used to build monitors and alerts to assert against the known SLO , and verify that the OKR for that system are maintained. The monitoring tools are effectively capturing both RUMs and synthetic data generated by the probes.","title":"Monitoring"},{"location":"automated-testing/synthetic-monitoring-tests/#applying-synthetic-monitoring-tests","text":"","title":"Applying Synthetic Monitoring Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#asserting-the-system-under-test","text":"Synthetic monitoring tests are usually statistical. Test metrics are compared against some historical or running average with a time dimension (Example: Over the last 30 days, for this time of day, the mean average response time is 250ms for AddToCart operation with a standard deviation from the mean of +/- 32ms) . So if an observed measurement is within a deviation of the norm at any time, the services are probably healthy.","title":"Asserting the System under Test"},{"location":"automated-testing/synthetic-monitoring-tests/#building-a-synthetic-monitoring-solution","text":"At a high level, building synthetic monitors usually consists of the following steps: Determine the metric to be validated (functional result, latency, etc.) Build a piece of automation that measures that metric against the system, and gathers telemetry into the system's existing monitoring infrastructure. Set up monitoring alarms/actions/responses that detect the failure of the system to meet the desired goal of the metric. Run the test case automation continuously at an appropriate interval.","title":"Building a Synthetic Monitoring Solution"},{"location":"automated-testing/synthetic-monitoring-tests/#monitoring-the-health-of-tests","text":"Probes runtime is a production environment on its own, and the health of tests is critical. Many providers offer cloud-based systems that host such runtimes, while some organizations use existing production environments to run these tests on. In either way, a monitor-the-monitor strategy should be a first-class citizen of the production environment's alerting systems.","title":"Monitoring the Health of Tests"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-and-real-user-monitoring","text":"Synthetic monitoring does not replace the need for RUM. 
Probes are predictable code that verifies specific scenarios, and they do not fully represent how a real user session is handled. On the other hand, prefer not to use RUMs to test for site reliability because: As the name implies, RUM requires user traffic. The site may be down, but since no user visited the monitored path, no alerts were triggered yet. Inconsistent traffic and usage patterns make it hard to establish benchmarks.","title":"Synthetic Monitoring and Real User Monitoring"},{"location":"automated-testing/synthetic-monitoring-tests/#risks","text":"Testing in production, in general, has a risk factor attached to it which does not exist for tests executed during CI/CD stages. Specifically, in synthetic monitoring tests, the following may affect the production environment: Corrupted or invalid data - Tests inject test data which may be in some way corrupt. Consider using a testing schema. Protected data leakage - Tests run in a production environment and emit logs or traces that may contain protected data. Overloaded systems - Synthetic tests may cause errors or overload the system. Unintended side effects or impacts on other production systems. Skewed analytics (traffic funnels, A/B test results, etc.) Auth/AuthZ - Tests are required to run in production where access to tokens and secrets may be restricted or more challenging to retrieve.","title":"Risks"},{"location":"automated-testing/synthetic-monitoring-tests/#synthetic-monitoring-tests-frameworks-and-tools","text":"Most key monitoring/APM players have an enterprise product that supports synthetic monitoring built into their systems (see list below). Such offerings make some of the risks raised above irrelevant, as the integration and runtime aspects of the solution are available out of the box (OOTB). However, such solutions are typically pricey. Some organizations prefer running probes on existing infrastructure using known tools such as Postman , Wrk , JMeter , Selenium or even custom code to generate the synthetic data. Such solutions must account for isolating and decoupling the probe's production environment from the core product's, as well as provide monitoring, geo-distribution, and maintaining test health. Application Insights availability - Simple availability tests that allow some customization using Multi-step web test DataDog Synthetics Dynatrace Synthetic Monitoring New Relic Synthetics Checkly","title":"Synthetic Monitoring Tests Frameworks and Tools"},{"location":"automated-testing/synthetic-monitoring-tests/#conclusion","text":"The value of production tests in general, and of synthetic monitoring specifically, is only realized for particular engagement types, and there is associated risk and cost to them. However, when applicable, they provide continuous assurance that there are no system failures from a user's perspective.
When developing a PaaS/SaaS solution, synthetic monitoring is key to the success of service reliability teams, and it is becoming an integral part of the quality assurance stack of highly available products.","title":"Conclusion"},{"location":"automated-testing/synthetic-monitoring-tests/#resources","text":"Google SRE book - Testing Reliability Microsoft DevOps Architectures - Shift Right to Test in Production Martin Fowler - Synthetic Monitoring","title":"Resources"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/","text":"Building Containers with Azure DevOps Using the DevTest Pattern In this document, we highlight learnings from applying the DevTest pattern to container development in Azure DevOps through pipelines. The pattern enabled us to build containers for development and testing, and to release them for further reuse (production ready). We will dive into the tools needed to build, test and push a container, describe our environment, and go through each step separately. Follow this link to dive deeper or revisit the DevTest pattern . Build the Container The first step in container development, after creating the necessary Dockerfiles and source code, is building the container. Even the Dockerfile itself can include some basic testing. Code tests are performed when pushing the code to the repository origin, where it is then used to build the container. The first step in our pipeline is to run the docker build command with a temporary tag and the required build arguments: - task : Bash@3 name : BuildImage displayName : 'Build the image via docker' inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)${{ parameters.buildDirectory }}\" targetType : 'inline' script : | docker build -t ${{ parameters.imageName }} --build-arg YOUR_BUILD_ARG -f ${{ parameters.dockerfileName }} . env : PredefinedPassword : $(Password) NewVariable : \"newVariableValue\" This task includes the parameters buildDirectory , imageName and dockerfileName , which have to be set beforehand. This task can, for example, be used in a template for multiple containers to improve code reuse. It is also possible to pass environment variables directly to the Dockerfile through the env section of the task. If this task succeeds, the Dockerfile was built without errors and we can continue with testing the container itself. Test the Container To test the container, we are using the tox environment. For more details on tox please visit the tox section of this repository or visit the official tox documentation page . Before we test the container, we check for exposed credentials in the docker image history. If known passwords, used to access our internal resources, are exposed here, the build step will fail: - task: Bash@3 name: CheckIfPasswordInDockerHistory displayName: 'Check for password in docker history' inputs: workingDirectory: \"$(System.DefaultWorkingDirectory)\" targetType: 'inline' failOnStdErr: true script: | if docker image history --no-trunc ${{ parameters.imageName }} | grep -qF $PredefinedPassword; then exit 1; fi exit 0; env: PredefinedPassword: $(Password) After the credential test, the container is tested through the pytest extension testinfra . Testinfra is a Python-based tool which can be used to start a container, gather prerequisites, test the container and shut it down again, without any effort besides writing the tests.
These tests can for example include: if files exist if environment variables are set correctly if certain processes are running if the correct host environment is used For a complete collection of capabilities and requirements, please visit the testinfra project on GitHub . A few methods of a Linux-based container test can look like this: def test_dependencies ( host ): ''' Check all files needed to run the container properly. ''' env_file = \"/app/environment.sh.env\" assert host . file ( env_file ) . exists activate_sh_path = \"/app/start.sh\" assert host . file ( activate_sh_path ) . exists def test_container_running ( host ): process = host . process . get ( comm = \"start.sh\" ) assert process . user == \"root\" def test_host_system ( host ): system_type = 'linux' distribution = 'ubuntu' release = '18.04' assert system_type == host . system_info . type assert distribution == host . system_info . distribution assert release == host . system_info . release def extract_env_var ( file_content ): import re regex = r \"ENV_VAR= \\\" (?P<s>[^ \\\" ]*) \\\" \" match = re . match ( regex , file_content ) return match . group ( 's' ) def test_ports_exposed ( host ): port1 = \"9010\" st1 = f \"grep -q { port1 } /app/Dockerfile && echo 'true' || echo 'false'\" cmd1 = host . run ( st1 ) assert cmd1 . stdout def test_listening_simserver_sockets ( host ): assert host . socket ( \"tcp://0.0.0.0:32512\" ) . is_listening assert host . socket ( \"tcp://0.0.0.0:32513\" ) . is_listening To start the test, a pytest command is executed through tox. A task containing the tox command can look like this: - task : Bash@3 name : RunTestCommands displayName : \"Test - Run test commands\" inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)\" targetType : 'inline' script : | tox -e testinfra-${{ parameters.makeTarget }} -- ${{ parameters.imageName }} failOnStderr : true Which could trigger the following pytest code, which is contained in the tox.ini file: pytest -vv tests/ { env:CONTEXT } --container-image ={ posargs: { env:IMAGE_TAG }} --volume ={ env:VOLUME } As a last task of this pipeline to build and test the container, we set a variable called testsPassed which is only true , if the previous tasks succeeded: - task: Bash@3 name: UpdateTestResultVariable condition: succeeded() inputs: targetType: 'inline' script: | echo '##vso[task.setvariable variable=testsPassed]true' Push the Container After building and testing, if our container runs as expected, we want to release it to our Azure Container Registry (ACR) to be used by our larger application. Before that, we want to automate the push behavior and define a meaningful tag. As a developer it is often helpful to have containers pushed to ACR, even if they are failing. This can be done by checking for the testsPassed variable we introduced at the end of our testing. 
If the test failed, we want to add a failed suffix at the end of the tag: - task: Bash@3 name: SetFailedSuffixTag displayName: \"Set failed suffix, if the tests failed.\" condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> retag the image to add failedSuffix inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix) The condition checks, if the value of testsPassed is false and also if we are not on the main branch, as we don't want to push failed containers from main. This helps us to keep our production environment clean. The value for imageRepository was defined in another template, along with the failedSuffix and testsPassed : parameters: - name: component variables: testsPassed: false failedSuffix: \"-failed\" # the imageRepo will changed based on dev or release ${{ if eq( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'stable/${{ parameters.component }}' ${{ if ne( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'dev/${{ parameters.component }}' The imageTag is open to discussion, as it depends highly on how your team wants to use the container. We went for Build.SourceVersion which is the commit ID of the branch the container was developed in. This allows you to easily track the origin of the container and aids debugging. A link to Azure DevOps predefined variables can be found in the Azure Docs on Azure DevOps After a tag was added to the container, the image must be pushed. This can be done with the following task: - task: Docker@1 name: pushFailedDockerImage displayName: 'Pushes failed image via Docker' condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> push the image with the failed tag inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix)' Similarly, these are the steps to publish the container to the ACR, if the tests succeeded: - task: Bash@3 name: SetLatestSuffixTag displayName: \"Set latest suffix, if the tests succeed.\" condition: eq(variables['testsPassed'], true) inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:latest - task: Docker@1 name: pushSuccessfulDockerImageSha displayName: 'Pushes successful image via Docker' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}' - task: Docker@1 name: pushSuccessfulDockerImageLatest displayName: 'Pushes successful image as latest' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} 
command: 'Push an image' imageName: '${{ parameters.imageRepository }}:latest' If you don't want to include the latest tag, you can also remove the steps involving latest (SetLatestSuffixTag & pushSuccessfulDockerImageLatest). Resources DevTest pattern Azure Docs on Azure DevOps official tox documentation page Testinfra Testinfra project on GitHub pytest","title":"Building Containers with Azure DevOps Using the DevTest Pattern"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#building-containers-with-azure-devops-using-the-devtest-pattern","text":"In this documents, we highlight learnings from applying the DevTest pattern to container development in Azure DevOps through pipelines. The pattern enabled as to build container for development, testing and releasing the container for further reuse (production ready). We will dive into tools needed to build, test and push a container, our environment and go through each step separately. Follow this link to dive deeper or revisit the DevTest pattern .","title":"Building Containers with Azure DevOps Using the DevTest Pattern"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#build-the-container","text":"The first step in container development, after creating the necessary Dockerfiles and source code, is building the container. Even the Dockerfile itself can include some basic testing. Code tests are performed when pushing the code to the repository origin, where it is then used to build the container. The first step in our pipeline is to run the docker build command with a temporary tag and the required build arguments: - task : Bash@3 name : BuildImage displayName : 'Build the image via docker' inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)${{ parameters.buildDirectory }}\" targetType : 'inline' script : | docker build -t ${{ parameters.imageName }} --build-arg YOUR_BUILD_ARG -f ${{ parameters.dockerfileName }} . env : PredefinedPassword : $(Password) NewVariable : \"newVariableValue\" This task includes the parameters buildDirectory , imageName and dockerfileName , which have to be set beforehand. This task can for example be used in a template for multiple containers to improve code reuse. It is also possible to pass environment variables directly to the Dockerfile through the env section of the task. If this task succeeds, the Dockerfile was build without errors and we can continue to testing the container itself.","title":"Build the Container"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#test-the-container","text":"To test the container, we are using the tox environment. For more details on tox please visit the tox section of this repository or visit the official tox documentation page . Before we test the container, we are checking for exposed credentials in the docker image history. If known passwords, used to access our internal resources, are exposed here, the build step will fail: - task: Bash@3 name: CheckIfPasswordInDockerHistory displayName: 'Check for password in docker history' inputs: workingDirectory: \"$(System.DefaultWorkingDirectory)\" targetType: 'inline' failOnStdErr: true script: | if docker image history --no-trunc ${{ parameters.imageName }} | grep -qF $PredefinedPassword; then exit 1; fi exit 0; env: PredefinedPassword: $(Password) After the credential test, the container is tested through the pytest extension testinfra . 
Testinfra is a Python-based tool which can be used to start a container, gather prerequisites, test the container and shut it down again, without any effort besides writing the tests. These tests can for example include: if files exist if environment variables are set correctly if certain processes are running if the correct host environment is used For a complete collection of capabilities and requirements, please visit the testinfra project on GitHub . A few methods of a Linux-based container test can look like this: def test_dependencies ( host ): ''' Check all files needed to run the container properly. ''' env_file = \"/app/environment.sh.env\" assert host . file ( env_file ) . exists activate_sh_path = \"/app/start.sh\" assert host . file ( activate_sh_path ) . exists def test_container_running ( host ): process = host . process . get ( comm = \"start.sh\" ) assert process . user == \"root\" def test_host_system ( host ): system_type = 'linux' distribution = 'ubuntu' release = '18.04' assert system_type == host . system_info . type assert distribution == host . system_info . distribution assert release == host . system_info . release def extract_env_var ( file_content ): import re regex = r \"ENV_VAR= \\\" (?P<s>[^ \\\" ]*) \\\" \" match = re . match ( regex , file_content ) return match . group ( 's' ) def test_ports_exposed ( host ): port1 = \"9010\" st1 = f \"grep -q { port1 } /app/Dockerfile && echo 'true' || echo 'false'\" cmd1 = host . run ( st1 ) assert cmd1 . stdout def test_listening_simserver_sockets ( host ): assert host . socket ( \"tcp://0.0.0.0:32512\" ) . is_listening assert host . socket ( \"tcp://0.0.0.0:32513\" ) . is_listening To start the test, a pytest command is executed through tox. A task containing the tox command can look like this: - task : Bash@3 name : RunTestCommands displayName : \"Test - Run test commands\" inputs : workingDirectory : \"$(System.DefaultWorkingDirectory)\" targetType : 'inline' script : | tox -e testinfra-${{ parameters.makeTarget }} -- ${{ parameters.imageName }} failOnStderr : true Which could trigger the following pytest code, which is contained in the tox.ini file: pytest -vv tests/ { env:CONTEXT } --container-image ={ posargs: { env:IMAGE_TAG }} --volume ={ env:VOLUME } As a last task of this pipeline to build and test the container, we set a variable called testsPassed which is only true , if the previous tasks succeeded: - task: Bash@3 name: UpdateTestResultVariable condition: succeeded() inputs: targetType: 'inline' script: | echo '##vso[task.setvariable variable=testsPassed]true'","title":"Test the Container"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#push-the-container","text":"After building and testing, if our container runs as expected, we want to release it to our Azure Container Registry (ACR) to be used by our larger application. Before that, we want to automate the push behavior and define a meaningful tag. As a developer it is often helpful to have containers pushed to ACR, even if they are failing. This can be done by checking for the testsPassed variable we introduced at the end of our testing. 
If the test failed, we want to add a failed suffix at the end of the tag: - task: Bash@3 name: SetFailedSuffixTag displayName: \"Set failed suffix, if the tests failed.\" condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> retag the image to add failedSuffix inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix) The condition checks, if the value of testsPassed is false and also if we are not on the main branch, as we don't want to push failed containers from main. This helps us to keep our production environment clean. The value for imageRepository was defined in another template, along with the failedSuffix and testsPassed : parameters: - name: component variables: testsPassed: false failedSuffix: \"-failed\" # the imageRepo will changed based on dev or release ${{ if eq( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'stable/${{ parameters.component }}' ${{ if ne( variables['Build.SourceBranchName'], 'main' ) }}: imageRepository: 'dev/${{ parameters.component }}' The imageTag is open to discussion, as it depends highly on how your team wants to use the container. We went for Build.SourceVersion which is the commit ID of the branch the container was developed in. This allows you to easily track the origin of the container and aids debugging. A link to Azure DevOps predefined variables can be found in the Azure Docs on Azure DevOps After a tag was added to the container, the image must be pushed. This can be done with the following task: - task: Docker@1 name: pushFailedDockerImage displayName: 'Pushes failed image via Docker' condition: and(eq(variables['testsPassed'], false), ne(variables['Build.SourceBranchName'], 'main')) # if this is not a release and failed -> push the image with the failed tag inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}$(failedSuffix)' Similarly, these are the steps to publish the container to the ACR, if the tests succeeded: - task: Bash@3 name: SetLatestSuffixTag displayName: \"Set latest suffix, if the tests succeed.\" condition: eq(variables['testsPassed'], true) inputs: targetType: inline script: | docker tag ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:${{ parameters.imageTag }} ${{ parameters.containerRegistry }}/${{ parameters.imageRepository }}:latest - task: Docker@1 name: pushSuccessfulDockerImageSha displayName: 'Pushes successful image via Docker' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} command: 'Push an image' imageName: '${{ parameters.imageRepository }}:${{ parameters.imageTag }}' - task: Docker@1 name: pushSuccessfulDockerImageLatest displayName: 'Pushes successful image as latest' condition: eq(variables['testsPassed'], true) inputs: containerregistrytype: 'Azure Container Registry' azureSubscriptionEndpoint: ${{ parameters.serviceConnection }} azureContainerRegistry: ${{ parameters.containerRegistry }} 
command: 'Push an image' imageName: '${{ parameters.imageRepository }}:latest' If you don't want to include the latest tag, you can also remove the steps involving latest (SetLatestSuffixTag & pushSuccessfulDockerImageLatest).","title":"Push the Container"},{"location":"automated-testing/tech-specific-samples/building-containers-with-azure-devops/#resources","text":"DevTest pattern Azure Docs on Azure DevOps official tox documentation page Testinfra Testinfra project on GitHub pytest","title":"Resources"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/","text":"Using Azurite to Run Blob Storage Tests in a Pipeline This document determines the approach for writing automated tests with a short feedback loop (i.e. unit tests) against security considerations (private endpoints) for the Azure Blob Storage functionality. Once private endpoints are enabled for the Azure Storage accounts, the current tests will fail when executed locally or as part of a pipeline because this connection will be blocked. Utilize an Azure Storage Emulator - Azurite To emulate a local Azure Blob Storage, we can use Azure Storage Emulator . The Storage Emulator currently runs only on Windows. If you need a Storage Emulator for Linux, one option is the community maintained, open-source Storage Emulator Azurite . The Azure Storage Emulator is no longer being actively developed. Azurite is the Storage Emulator platform going forward. Azurite supersedes the Azure Storage Emulator. Azurite will continue to be updated to support the latest versions of Azure Storage APIs. For more information, see Use the Azurite emulator for local Azure Storage development . Some differences in functionality exist between the Storage Emulator and Azure storage services. For more information about these differences, see the Differences between the Storage Emulator and Azure Storage . There are several ways to install and run Azurite on your local system as listed here . In this document we will cover Install and run Azurite using NPM and Install and run the Azurite Docker image . 1. Install and Run Azurite a. Using NPM In order to run Azurite V3 you need Node.js >= 8.0 installed on your system. Azurite works cross-platform on Windows, Linux, and OS X. After the Node.js installation, you can install Azurite simply with npm which is the Node.js package management tool included with every Node.js installation. # Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log If you want to avoid any disk persistence and destroy the test data when the Azurite process terminates, you can pass the --inMemoryPersistence option, as of Azurite 3.28.0. The output will be: Azurite Blob service is starting at http://127.0.0.1:10000 Azurite Blob service is successfully listening at http://127.0.0.1:10000 Azurite Queue service is starting at http://127.0.0.1:10001 Azurite Queue service is successfully listening at http://127.0.0.1:10001 b. Using a Docker Image Another way to run Azurite is using docker, using default HTTP endpoint docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 Docker Compose is another option and can run the same docker image using the docker-compose.yml file below. 
version : '3.4' services : azurite : image : mcr.microsoft.com/azure-storage/azurite hostname : azurite volumes : - ./cert/azurite:/data command : \"azurite-blob --blobHost 0.0.0.0 -l /data --cert /data/127.0.0.1.pem --key /data/127.0.0.1-key.pem --oauth basic\" ports : - \"10000:10000\" - \"10001:10001\" 2. Run Tests on Your Local Machine Python 3.8.7 is used for this, but it should be fine on other 3.x versions as well. Install and run Azurite for local tests: Option 1: using npm: # Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log Option 2: using docker docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 In Azure Storage Explorer, select Attach to a local emulator Provide a Display name and port number, then your connection will be ready, and you can use Storage Explorer to manage your local blob storage. To test and see how these endpoints are running you can attach your local blob storage to the Azure Storage Explorer . Create a virtual python environment python -m venv .venv Container name and initialize env variables: Use conftest.py for test integration. from azure.storage.blob import BlobServiceClient import os def pytest_generate_tests ( metafunc ): os . environ [ 'STORAGE_CONNECTION_STRING' ] = 'DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;' os . environ [ 'STORAGE_CONTAINER' ] = 'test-container' # Crete container for Azurite for the first run blob_service_client = BlobServiceClient . from_connection_string ( os . environ . get ( \"STORAGE_CONNECTION_STRING\" )) try : blob_service_client . create_container ( os . environ . get ( \"STORAGE_CONTAINER\" )) except Exception as e : print ( e ) * Note: value for STORAGE_CONNECTION_STRING is default value for Azurite, it's not a private key Install the dependencies pip install -r requirements_tests.txt Run tests: python -m pytest ./tests After running tests, you can see the files in your local blob storage 3. Run Tests on Azure Pipelines After running tests locally we need to make sure these tests pass on Azure Pipelines too. We have 2 options here, we can use docker image as hosted agent on Azure or install an npm package in the Pipeline steps. 
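Before looking at the pipeline definition, it may help to see the kind of test these steps execute against the emulator. The sketch below is a minimal example that assumes the conftest.py shown earlier; the blob name and content are illustrative only.

```python
import os
import uuid

from azure.storage.blob import BlobServiceClient


def test_upload_and_read_blob():
    # Relies on STORAGE_CONNECTION_STRING and STORAGE_CONTAINER set in conftest.py
    client = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
    container = client.get_container_client(os.environ["STORAGE_CONTAINER"])

    blob_name = f"test-{uuid.uuid4()}.txt"
    blob = container.get_blob_client(blob_name)

    blob.upload_blob(b"hello azurite", overwrite=True)
    downloaded = blob.download_blob().readall()

    assert downloaded == b"hello azurite"

    blob.delete_blob()  # clean up so repeated runs stay consistent
```

Because the connection string points at Azurite, the same test runs unchanged on a local machine and inside a pipeline.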
trigger: - master steps: - task: UsePythonVersion@0 displayName: 'Use Python 3.7' inputs: versionSpec: 3 .7 - bash: | pip install -r requirements_tests.txt displayName: 'Setup requirements for tests' - bash: | sudo npm install -g azurite sudo mkdir azurite sudo azurite --silent --location azurite --debug azurite \\d ebug.log & displayName: 'Install and Run Azurite' - bash: | python -m pytest --junit-xml = unit_tests_report.xml --cov = tests --cov-report = html --cov-report = xml ./tests displayName: 'Run Tests' - task: PublishCodeCoverageResults@1 inputs: codeCoverageTool: Cobertura summaryFileLocation: '$(System.DefaultWorkingDirectory)/**/coverage.xml' reportDirectory: '$(System.DefaultWorkingDirectory)/**/htmlcov' - task: PublishTestResults@2 inputs: testResultsFormat: 'JUnit' testResultsFiles: '**/*_tests_report.xml' failTaskOnFailedTests: true Once we set up our pipeline in Azure Pipelines, result will be like below","title":"Using Azurite to Run Blob Storage Tests in a Pipeline"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#using-azurite-to-run-blob-storage-tests-in-a-pipeline","text":"This document determines the approach for writing automated tests with a short feedback loop (i.e. unit tests) against security considerations (private endpoints) for the Azure Blob Storage functionality. Once private endpoints are enabled for the Azure Storage accounts, the current tests will fail when executed locally or as part of a pipeline because this connection will be blocked.","title":"Using Azurite to Run Blob Storage Tests in a Pipeline"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#utilize-an-azure-storage-emulator-azurite","text":"To emulate a local Azure Blob Storage, we can use Azure Storage Emulator . The Storage Emulator currently runs only on Windows. If you need a Storage Emulator for Linux, one option is the community maintained, open-source Storage Emulator Azurite . The Azure Storage Emulator is no longer being actively developed. Azurite is the Storage Emulator platform going forward. Azurite supersedes the Azure Storage Emulator. Azurite will continue to be updated to support the latest versions of Azure Storage APIs. For more information, see Use the Azurite emulator for local Azure Storage development . Some differences in functionality exist between the Storage Emulator and Azure storage services. For more information about these differences, see the Differences between the Storage Emulator and Azure Storage . There are several ways to install and run Azurite on your local system as listed here . In this document we will cover Install and run Azurite using NPM and Install and run the Azurite Docker image .","title":"Utilize an Azure Storage Emulator - Azurite"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#1-install-and-run-azurite","text":"","title":"1. Install and Run Azurite"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#a-using-npm","text":"In order to run Azurite V3 you need Node.js >= 8.0 installed on your system. Azurite works cross-platform on Windows, Linux, and OS X. After the Node.js installation, you can install Azurite simply with npm which is the Node.js package management tool included with every Node.js installation. 
# Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log If you want to avoid any disk persistence and destroy the test data when the Azurite process terminates, you can pass the --inMemoryPersistence option, as of Azurite 3.28.0. The output will be: Azurite Blob service is starting at http://127.0.0.1:10000 Azurite Blob service is successfully listening at http://127.0.0.1:10000 Azurite Queue service is starting at http://127.0.0.1:10001 Azurite Queue service is successfully listening at http://127.0.0.1:10001","title":"a. Using NPM"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#b-using-a-docker-image","text":"Another way to run Azurite is using docker, using default HTTP endpoint docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 Docker Compose is another option and can run the same docker image using the docker-compose.yml file below. version : '3.4' services : azurite : image : mcr.microsoft.com/azure-storage/azurite hostname : azurite volumes : - ./cert/azurite:/data command : \"azurite-blob --blobHost 0.0.0.0 -l /data --cert /data/127.0.0.1.pem --key /data/127.0.0.1-key.pem --oauth basic\" ports : - \"10000:10000\" - \"10001:10001\"","title":"b. Using a Docker Image"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#2-run-tests-on-your-local-machine","text":"Python 3.8.7 is used for this, but it should be fine on other 3.x versions as well. Install and run Azurite for local tests: Option 1: using npm: # Install Azurite npm install -g azurite # Create azurite directory mkdir c:/azurite # Launch Azurite for Windows azurite --silent --location c: \\a zurite --debug c: \\a zurite \\d ebug.log Option 2: using docker docker run -p 10000 :10000 mcr.microsoft.com/azure-storage/azurite azurite-blob --blobHost 0 .0.0.0 In Azure Storage Explorer, select Attach to a local emulator Provide a Display name and port number, then your connection will be ready, and you can use Storage Explorer to manage your local blob storage. To test and see how these endpoints are running you can attach your local blob storage to the Azure Storage Explorer . Create a virtual python environment python -m venv .venv Container name and initialize env variables: Use conftest.py for test integration. from azure.storage.blob import BlobServiceClient import os def pytest_generate_tests ( metafunc ): os . environ [ 'STORAGE_CONNECTION_STRING' ] = 'DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;' os . environ [ 'STORAGE_CONTAINER' ] = 'test-container' # Crete container for Azurite for the first run blob_service_client = BlobServiceClient . from_connection_string ( os . environ . get ( \"STORAGE_CONNECTION_STRING\" )) try : blob_service_client . create_container ( os . environ . get ( \"STORAGE_CONTAINER\" )) except Exception as e : print ( e ) * Note: value for STORAGE_CONNECTION_STRING is default value for Azurite, it's not a private key Install the dependencies pip install -r requirements_tests.txt Run tests: python -m pytest ./tests After running tests, you can see the files in your local blob storage","title":"2. 
Run Tests on Your Local Machine"},{"location":"automated-testing/tech-specific-samples/blobstorage-unit-tests/#3-run-tests-on-azure-pipelines","text":"After running tests locally, we need to make sure these tests pass on Azure Pipelines too. We have two options here: we can use a Docker image as a hosted agent on Azure, or install an npm package in the pipeline steps. trigger: - master steps: - task: UsePythonVersion@0 displayName: 'Use Python 3.7' inputs: versionSpec: 3 .7 - bash: | pip install -r requirements_tests.txt displayName: 'Setup requirements for tests' - bash: | sudo npm install -g azurite sudo mkdir azurite sudo azurite --silent --location azurite --debug azurite \\d ebug.log & displayName: 'Install and Run Azurite' - bash: | python -m pytest --junit-xml = unit_tests_report.xml --cov = tests --cov-report = html --cov-report = xml ./tests displayName: 'Run Tests' - task: PublishCodeCoverageResults@1 inputs: codeCoverageTool: Cobertura summaryFileLocation: '$(System.DefaultWorkingDirectory)/**/coverage.xml' reportDirectory: '$(System.DefaultWorkingDirectory)/**/htmlcov' - task: PublishTestResults@2 inputs: testResultsFormat: 'JUnit' testResultsFiles: '**/*_tests_report.xml' failTaskOnFailedTests: true Once we set up our pipeline in Azure Pipelines, the result will look like the one below","title":"3. Run Tests on Azure Pipelines"},{"location":"automated-testing/templates/case-study-template/","text":"Case study template [Customer Project] Case Study Background Describe the customer and business requirements with the explicit problem statement. System Under Test (SUT) Include the system's conceptual architecture and highlight the architecture components that were included in the E2E testing. Problems and Limitations Describe the problems of the overall SUT solution that prevented testing specific (or any) parts of the solution. Describe limitations of the testing tools and framework(s) used in this implementation. E2E Testing Framework and Tools Describe what testing framework and/or tools were used to implement E2E testing in the SUT. Test Cases Describe the E2E test cases that were created to E2E test the SUT. Test Metrics Describe any architecture solution that was used to monitor, observe and track the various service states that were used as the E2E testing metrics. Also, include the list of test cases that were built to measure the progress of E2E testing. E2E Testing Architecture Describe any testing architecture that was built to run E2E testing. E2E Testing Implementation (Code Samples) Include sample test cases and their implementation in the programming language of choice. Include any common reusable code implementation blocks that could be leveraged in a future project's E2E testing implementation.
E2E Testing Reporting and Results Include sample of E2E testing reports and results obtained from the E2E testing runs in this project.","title":"Case study template"},{"location":"automated-testing/templates/case-study-template/#case-study-template","text":"[Customer Project] Case Study","title":"Case study template"},{"location":"automated-testing/templates/case-study-template/#background","text":"Describe the customer and business requirements with the explicit problem statement.","title":"Background"},{"location":"automated-testing/templates/case-study-template/#system-under-test-sut","text":"Include the system's conceptual architecture and highlight the architecture components that were included in the E2E testing.","title":"System Under Test (SUT)"},{"location":"automated-testing/templates/case-study-template/#problems-and-limitations","text":"Describe about the problems of the overall SUT solution that prevented from testing specific (or any) part of the solution. Describe limitation of the testing tools and framework(s) used in this implementation","title":"Problems and Limitations"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-framework-and-tools","text":"Describe what testing framework and/or tools were used to implement E2E testing in the SUT.","title":"E2E Testing Framework and Tools"},{"location":"automated-testing/templates/case-study-template/#test-cases","text":"Describe the E2E test cases were created to E2E test the SUT","title":"Test Cases"},{"location":"automated-testing/templates/case-study-template/#test-metrics","text":"Describe any architecture solution were used to monitor, observe and track the various service states that were used as the E2E testing metrics. Also, include the list of test cases were build to measure the progress of E2E testing.","title":"Test Metrics"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-architecture","text":"Describe any testing architecture were built to run E2E testing.","title":"E2E Testing Architecture"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-implementation-code-samples","text":"Include sample test cases and their implementation in the programming language of choice. Include any common reusable code implementation blocks that could be leveraged in the future project's E2E testing implementation.","title":"E2E Testing Implementation (Code Samples)"},{"location":"automated-testing/templates/case-study-template/#e2e-testing-reporting-and-results","text":"Include sample of E2E testing reports and results obtained from the E2E testing runs in this project.","title":"E2E Testing Reporting and Results"},{"location":"automated-testing/templates/test-type-template/","text":"Test Type Template [Test Technique Name Here] Put a 2-3 sentence overview about the test technique here. When To Use Problem Addressed Describing the problem that this test type addresses, this should focus on the motivation behind the test type/technique to help the reader correlate this technique to their problem. When to Avoid Describe when NOT to use, if applicable. ROI Tipping Point How much is enough? For example, some opine that unit test ROI drops significantly at 80% block coverage and when the codebase is well-exercised by real traffic in production. 
Applicable to Local dev 'desktop' Build pipelines Non-production deployments Production deployments NOTE: If there is great (clear, succinct) documentation for the technique on the web, supply a pointer and skip the rest of this template. No need to re-type content How to Use Architecture Describe the components of the technique and how they interact with each other and the subject of the test technique. Add a simple diagram of how the technique's parts are organized, if helpful to illustrate. Pre-requisites Anything required in advance? High-level Step-by-Step 1. 1. 1. Best Practices and Advice Describe what good testing looks like for this technique, best practices, pitfalls. Anti patterns e.g. unit tests should never require off-box or even out-of-process dependencies. Are there similar things to avoid when applying this technique? Frameworks, Tools, Templates Describe known good (i.e. actually used and known to provide good results) frameworks, tools, templates, their pros and cons, with links. Resources Provide links to further readings about this technique to dive deeper.","title":"Test Type Template"},{"location":"automated-testing/templates/test-type-template/#test-type-template","text":"[Test Technique Name Here] Put a 2-3 sentence overview about the test technique here.","title":"Test Type Template"},{"location":"automated-testing/templates/test-type-template/#when-to-use","text":"","title":"When To Use"},{"location":"automated-testing/templates/test-type-template/#problem-addressed","text":"Describing the problem that this test type addresses, this should focus on the motivation behind the test type/technique to help the reader correlate this technique to their problem.","title":"Problem Addressed"},{"location":"automated-testing/templates/test-type-template/#when-to-avoid","text":"Describe when NOT to use, if applicable.","title":"When to Avoid"},{"location":"automated-testing/templates/test-type-template/#roi-tipping-point","text":"How much is enough? For example, some opine that unit test ROI drops significantly at 80% block coverage and when the codebase is well-exercised by real traffic in production.","title":"ROI Tipping Point"},{"location":"automated-testing/templates/test-type-template/#applicable-to","text":"Local dev 'desktop' Build pipelines Non-production deployments Production deployments","title":"Applicable to"},{"location":"automated-testing/templates/test-type-template/#note-if-there-is-great-clear-succinct-documentation-for-the-technique-on-the-web-supply-a-pointer-and-skip-the-rest-of-this-template-no-need-to-re-type-content","text":"","title":"NOTE: If there is great (clear, succinct) documentation for the technique on the web, supply a pointer and skip the rest of this template. No need to re-type content"},{"location":"automated-testing/templates/test-type-template/#how-to-use","text":"","title":"How to Use"},{"location":"automated-testing/templates/test-type-template/#architecture","text":"Describe the components of the technique and how they interact with each other and the subject of the test technique. Add a simple diagram of how the technique's parts are organized, if helpful to illustrate.","title":"Architecture"},{"location":"automated-testing/templates/test-type-template/#pre-requisites","text":"Anything required in advance?","title":"Pre-requisites"},{"location":"automated-testing/templates/test-type-template/#high-level-step-by-step","text":"1. 1. 
1.","title":"High-level Step-by-Step"},{"location":"automated-testing/templates/test-type-template/#best-practices-and-advice","text":"Describe what good testing looks like for this technique, best practices, pitfalls.","title":"Best Practices and Advice"},{"location":"automated-testing/templates/test-type-template/#anti-patterns","text":"e.g. unit tests should never require off-box or even out-of-process dependencies. Are there similar things to avoid when applying this technique?","title":"Anti patterns"},{"location":"automated-testing/templates/test-type-template/#frameworks-tools-templates","text":"Describe known good (i.e. actually used and known to provide good results) frameworks, tools, templates, their pros and cons, with links.","title":"Frameworks, Tools, Templates"},{"location":"automated-testing/templates/test-type-template/#resources","text":"Provide links to further readings about this technique to dive deeper.","title":"Resources"},{"location":"automated-testing/ui-testing/","text":"User Interface Testing This section is primarily geared towards web-based UIs, but the guidance is similar for mobile and OS based applications. Applicability UI Testing is not always going to be applicable, for example applications without a UI or parts of an application that require no human interaction. In those cases unit, functional and integration/e2e testing would be the primary means. UI Testing is going to be mainly applicable when dealing with a public facing UI that is used in a diverse environment or in a mission critical UI that requires higher fidelity. With something like an admin UI that is used by just a handful of people, UI Testing is still valuable but not as high priority. Goals UI testing provides the ability to ensure that users have a consistent visual user experience across a variety of means of access and that the user interaction is consistent with the function requirements. Ensure the UI appearance and interaction satisfy the functional and non-functional requirements Detect changes in the UI both across devices and delivery platforms and between code changes Provide confidence to designers and developers the user experience is consistent Support fast code evolution and refactoring while reducing the risk of regressions Evidence and Measures Integrating UI Tests in to your CI/CD is necessary but more challenging than unit tests. The increased challenge is that UI tests either need to run in headless mode with something like Puppeteer or there needs to be more extensive orchestration with Azure DevOps or GitHub that would handle the full testing integration for you like BrowserStack Integrations like BrowserStack are nice since they provide Azure DevOps reports as part of the test run. That said, Azure DevOps supports a variety of test adapters, so you can use any UI Testing framework that supports outputting the test results to one of the output formats listed at Publish Test Results task . If you're using an Azure DevOps pipeline to run UI tests, consider using a self hosted agent in order to manage framework versions and avoid unexpected updates. General Guidance The scope of UI testing should be strategic. UI tests can take a significant amount of time to both implement and run, and it's challenging to test every type of user interaction in a production application due to the large number of possible interactions. Designing the UI tests around the functional tests makes sense. 
For example, given an input form, a UI test would ensure that the visual representation is consistent across devices, is accessible and easy to interact with, and is consistent across code changes. UI Tests will catch 'runtime' bugs that unit and functional tests won't. For example if the submit button for an input form is rendered but not clickable due to a positioning bug in the UI, then this could be considered a runtime bug that would not have been caught by unit or functional tests. UI Tests can run on mock data or snapshots of production data, like in QA or staging. Writing Tests Good UI tests follow a few general principles: Choose a UI testing framework that enables quick feedback and is easy to use Design the UI to be easily testable. For example, add CSS selectors or set the id on elements in a web page to allow easier selecting. Test on all primary devices that the user uses, don't just test on a single device or OS. When a test mutates data ensure that data is created on demand and cleaned up after. The consequence of not doing this would be inconsistent testing. Common Issues UI Testing can get very challenging at the lower level, especially with a testing framework like Selenium. If you choose to go this route, then you'll likely encounter timeouts, missing elements, and you'll have significant friction with the testing framework itself. Due to many issues with UI testing there have been a number of free and paid solutions that help alleviate certain issues with frameworks like Selenium. This is why you'll find Cypress in the recommended frameworks as it solves many of the known issues with Selenium. This is an important point though. Depending on the UI testing framework you choose will result in either a smoother test creation experience, or a very frustrating and time-consuming one. If you were to choose just Selenium the development costs and time costs would likely be very high. It's better to use either a framework built on top of Selenium or one that attempts to solve many of the problems with something like Selenium. Note there that there are further considerations as when running in headless mode the UI can render differently than what you may see on your development machine, particularly with web applications. Furthermore, note that when rendering in different page dimensions elements may disappear on the page due to CSS rules, therefore not be selectable by certain frameworks with default options out of the box. All of these issues can be resolved and worked around, but the rendering demonstrates another particular challenge of UI testing. Specific Guidance Recommended testing frameworks: Web BrowserStack Cypress Jest Selenium Appium OS/Mobile Applications Coded UI tests (CUITs) Xamarin.UITest BrowserStack Appium Note that the framework listed above that is paid is BrowserStack, it's listed as it's an industry standard, the rest are open source and free.","title":"User Interface Testing"},{"location":"automated-testing/ui-testing/#user-interface-testing","text":"This section is primarily geared towards web-based UIs, but the guidance is similar for mobile and OS based applications.","title":"User Interface Testing"},{"location":"automated-testing/ui-testing/#applicability","text":"UI Testing is not always going to be applicable, for example applications without a UI or parts of an application that require no human interaction. In those cases unit, functional and integration/e2e testing would be the primary means. 
UI Testing is going to be mainly applicable when dealing with a public facing UI that is used in a diverse environment or in a mission critical UI that requires higher fidelity. With something like an admin UI that is used by just a handful of people, UI Testing is still valuable but not as high priority.","title":"Applicability"},{"location":"automated-testing/ui-testing/#goals","text":"UI testing provides the ability to ensure that users have a consistent visual user experience across a variety of means of access and that the user interaction is consistent with the function requirements. Ensure the UI appearance and interaction satisfy the functional and non-functional requirements Detect changes in the UI both across devices and delivery platforms and between code changes Provide confidence to designers and developers the user experience is consistent Support fast code evolution and refactoring while reducing the risk of regressions","title":"Goals"},{"location":"automated-testing/ui-testing/#evidence-and-measures","text":"Integrating UI Tests in to your CI/CD is necessary but more challenging than unit tests. The increased challenge is that UI tests either need to run in headless mode with something like Puppeteer or there needs to be more extensive orchestration with Azure DevOps or GitHub that would handle the full testing integration for you like BrowserStack Integrations like BrowserStack are nice since they provide Azure DevOps reports as part of the test run. That said, Azure DevOps supports a variety of test adapters, so you can use any UI Testing framework that supports outputting the test results to one of the output formats listed at Publish Test Results task . If you're using an Azure DevOps pipeline to run UI tests, consider using a self hosted agent in order to manage framework versions and avoid unexpected updates.","title":"Evidence and Measures"},{"location":"automated-testing/ui-testing/#general-guidance","text":"The scope of UI testing should be strategic. UI tests can take a significant amount of time to both implement and run, and it's challenging to test every type of user interaction in a production application due to the large number of possible interactions. Designing the UI tests around the functional tests makes sense. For example, given an input form, a UI test would ensure that the visual representation is consistent across devices, is accessible and easy to interact with, and is consistent across code changes. UI Tests will catch 'runtime' bugs that unit and functional tests won't. For example if the submit button for an input form is rendered but not clickable due to a positioning bug in the UI, then this could be considered a runtime bug that would not have been caught by unit or functional tests. UI Tests can run on mock data or snapshots of production data, like in QA or staging.","title":"General Guidance"},{"location":"automated-testing/ui-testing/#writing-tests","text":"Good UI tests follow a few general principles: Choose a UI testing framework that enables quick feedback and is easy to use Design the UI to be easily testable. For example, add CSS selectors or set the id on elements in a web page to allow easier selecting. Test on all primary devices that the user uses, don't just test on a single device or OS. When a test mutates data ensure that data is created on demand and cleaned up after. 
The consequence of not doing this would be inconsistent testing.","title":"Writing Tests"},{"location":"automated-testing/ui-testing/#common-issues","text":"UI Testing can get very challenging at the lower level, especially with a testing framework like Selenium. If you choose to go this route, then you'll likely encounter timeouts, missing elements, and you'll have significant friction with the testing framework itself. Due to many issues with UI testing there have been a number of free and paid solutions that help alleviate certain issues with frameworks like Selenium. This is why you'll find Cypress in the recommended frameworks as it solves many of the known issues with Selenium. This is an important point though. Depending on the UI testing framework you choose will result in either a smoother test creation experience, or a very frustrating and time-consuming one. If you were to choose just Selenium the development costs and time costs would likely be very high. It's better to use either a framework built on top of Selenium or one that attempts to solve many of the problems with something like Selenium. Note there that there are further considerations as when running in headless mode the UI can render differently than what you may see on your development machine, particularly with web applications. Furthermore, note that when rendering in different page dimensions elements may disappear on the page due to CSS rules, therefore not be selectable by certain frameworks with default options out of the box. All of these issues can be resolved and worked around, but the rendering demonstrates another particular challenge of UI testing.","title":"Common Issues"},{"location":"automated-testing/ui-testing/#specific-guidance","text":"Recommended testing frameworks: Web BrowserStack Cypress Jest Selenium Appium OS/Mobile Applications Coded UI tests (CUITs) Xamarin.UITest BrowserStack Appium Note that the framework listed above that is paid is BrowserStack, it's listed as it's an industry standard, the rest are open source and free.","title":"Specific Guidance"},{"location":"automated-testing/ui-testing/teams-tests/","text":"Automated UI Tests for a Teams Application Overview This is an overview on how you can implement UI tests for a custom Teams application. The insights provided can also be applied to automated end-to-end testing. General Observations Testing in a web browser is easier than on a native app. Testing a Teams app on a mobile device in an automated way is more challenging due to the fact that you are testing an app within an app: There is no Android Application Package (APK) / iOS App Store Package (IPA) publicly available for Microsoft Teams app itself. Mobile testing frameworks are designed with the assumption that you own the APK/IPA of the app under test. Workarounds need to be found to first automate the installation of Teams. Should you choose working with emulators, testing in a local Windows box is more stable than in a CI/CD. The latter involves a CI/CD agent and an emulator in a VM. When deciding whether to implement such tests, consider the project requirements as well as the advantages and disadvantages. Manual UI tests are often an acceptable solution due to their low effort requirements. The following are learnings from various engagements: Web Based UI Tests To implement web-based UI tests for your Teams application, follow the same approach as you would for testing any other web application with a UI. UI testing provides valuable guidance in this regard. 
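As a concrete illustration of that approach, the sketch below uses Selenium's Python bindings; the CSS selector for the custom app and the sign-in handling are hypothetical placeholders that depend on your app and test tenant.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def open_custom_teams_app() -> None:
    # Launches a local Edge session; in CI this would typically target a remote
    # Selenium server instead. Signing in to the test tenant is omitted here.
    driver = webdriver.Edge()
    try:
        driver.get("https://teams.microsoft.com")
        # The selector below is a hypothetical placeholder for the entry point
        # of your custom app inside the Teams web client.
        WebDriverWait(driver, 60).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "[data-app-id='your-app-id']"))
        ).click()
    finally:
        driver.quit()
```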
Your starting point for the test would be to automatically launch a browser (using Selenium or similar frameworks) and navigate to https://teams.microsoft.com . If you want to test a Teams app that hasn\u2019t been published in the Teams store yet or if you\u2019d like to test the DEV/QA version of your app, you can use the Teams Toolkit and package your app based on the manifest.json . npx teamsfx package -- env dev -- manifest - path ... Once the app is installed, implement selectors to access your custom app and to perform various actions within the app. Pipeline If you are using Selenium and Edge as the browser, consider leveraging the selenium/standalone-edge Docker image which contains a standalone Selenium server with the Microsoft Edge browser installed. By default, it will run in headless mode, but by setting START_XVFB variable to True , you can control whether to start a virtual framebuffer server (Xvfb) that allows GUI applications to run without a display. Below is a code snippet which illustrates the usage of the image in a Gitlab pipeline: ... run-tests-dev: allow_failure: false image: ... environment: name: dev stage: tests services: - name: selenium/standalone-edge:latest alias: selenium variables: START_XVFB: \"true\" description: \"Start Xvfb server\" ... When running a test, you need to use the Selenium server URL for remote execution. With the definition from above, the URL is: http://selenium:4444/wd/hub . The code snippet below illustrates how you can initialize the Selenium driver to point to the remote Selenium server using JavaScript: var { Builder } = require ( \"selenium-webdriver\" ); const edge = require ( \"selenium-webdriver/edge\" ); var buildEdgeDriver = function () { let builder = new Builder (). forBrowser ( \"MicrosoftEdge\" ); builder = builder . usingServer ( \"http://selenium:4444/wd/hub\" ); builder . setEdgeOptions ( new edge . Options (). addArguments ( \"--inprivate\" )); return builder . build (); }; Mobile Based UI Tests Testing your custom Teams application on mobile devices is a bit more difficult than using the web-based approach as it requires usage of actual or simulated devices. Running such tests in a CI/CD pipeline can be more difficult and resource-intensive. One approach is to use real devices or cloud-based emulators from vendors such as BrowserStack which requires a license. Alternatively, you can use virtual devices hosted in Azure Virtual Machines. Option 1: Using Android Virtual Devices (AVD) This approach enables the creation of Android UI tests using virtual devices. It comes with the advantage of not requiring paid licenses to certain vendors. However, due to the nature of emulators, compared to real devices, it may prove to be less stable. Always choose the solution that best fits your project requirements and resources. Overall setup: AVD - Android Virtual Devices - which are virtual representation of physical Android devices. Appium is an open-source project designed to facilitate UI automation of many app platforms, including mobile. Appium is based on the W3C WebDriver specification . Note: If you look at these commands in the WebDriver specification, you will notice that they are not defined in terms of any particular programming language. They are not Java commands, or JavaScript commands, or Python commands. Instead, they form part of an HTTP API which can be accessed from within any programming language. 
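Because WebDriver commands are just HTTP calls, the remote Selenium server from the pipeline definition above can be driven from any language. As a hedged illustration, a minimal C# sketch (assuming the Selenium.WebDriver NuGet package and the selenium service hostname shown earlier):

```csharp
using System;
using OpenQA.Selenium.Edge;
using OpenQA.Selenium.Remote;

public static class RemoteEdgeSession
{
    public static void Run()
    {
        // Mirror the JavaScript example above: Microsoft Edge in InPrivate mode.
        var options = new EdgeOptions();
        options.AddArgument("--inprivate");

        // The hostname "selenium" and port 4444 come from the pipeline service
        // definition above; adjust them for your environment.
        using var driver = new RemoteWebDriver(
            new Uri("http://selenium:4444/wd/hub"), options.ToCapabilities());

        driver.Navigate().GoToUrl("https://teams.microsoft.com");
        Console.WriteLine(driver.Title);
    }
}
```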
Appium implements a client-server architecture: The server (consisting of Appium itself along with any drivers or plugins you are using for automation) is connected to the devices under test, and is actually responsible for making automation happen on those devices. UiAutomator driver is compatible with Android platform. The client is responsible for sending commands to the server over the network, and receiving responses from the server as a result. You can choose the language of your choice to write the commands. For example, for Javascript WebDriverIO can be used as client. Here you can read more about Appium ecosystem The advantage of this architecture is that it opens the possibility of running the server in a VM, and the client in a pipeline, enabling the tests to be ran automatically on scheduled basis as part of CI/CD pipelines. How to Run Mobile Tests Locally on a Windows Machine Using AVD? This approach involves: An emulator ( AVD - Android Virtual Devices ), which will represent the physical device. Appium server , which will redirect the commands from the test to your virtual device. Creating an Android Virtual Device Install Android Studio from official link . Note: At the time of writing the documentation, the latest version available was Android Studio Giraffe, 2022.3.1 Patch 2 for Window. Set ANDROID_HOME environment variable to point to the installation path of Android SDK. i.e. C:Users\\<user-name>\\AppData\\Local\\Android\\Sdk Install Java Development Kit (JDK) from official link . For the most recent devices JDK 9 is required, otherwise JDK 8 is required. Make sure you get the JDK and not the JRE. Set JAVA_HOME environment variable to the installation path, i.e. C:\\Program Files\\Java\\jdk-11 Create an AVD (Android Virtual Device): - Open Android Studio. From the Android Studio welcome screen, select More Action -> Virtual Device Manager , as instructed here - Click Create Device . - Choose a device definition with Play Store enabled . This is important, otherwise Teams cannot be installed on the device. - Choose a System image from the Recommended tab which includes access to Google Play services. You may need to install it before selecting it. - Start the emulator by clicking on the Run button from the Device Manage screen. - Manually install Microsoft Teams from Google Playstore on the device. Setting up Appium Install appium : Download NodeJs, if it is not already installed on your machine: Download | Node.js (nodejs.org) Install Appium globally: Install Appium - Appium Documentation Install the UiAutomator2 driver: Install the UiAutomator2 Driver - Appium Documentation . Go through the Set up Android automation requirements in the documentation, to make sure you have set up everything correctly. Read more about Appium Drivers here . - Start appium server by running appium command in a command prompt. Useful commands List emulators that you have previously created, without opening Android Studio: emulator -list-avds How to run Teams mobile tests in a pipeline using an Azure VM? This approach leverages the fact that Appium implements a client-server architecture. In this approach, the Appium server as well as the AVD run on an Azure VM, while the client operates within a pipeline and sends commands to be executed on the device. Configure the VM This approach involves hosting a virtual device within a virtual machine. To set up the emulator (Android Virtual Device) in an Azure VM, the VM must support nested virtualization . 
Azure VM configuration which, at the time of writing the documentation, worked successfully with AVD and appium: Operating system: Windows (Windows-10 Pro) VM generation: V1 Size: Standard D4ds v5 16 GiB memory Enable connection from outside to Appium server on the VM Note: By default, the Appium server runs on port 4723. The rest of the steps will assume that this is the port where your Appium server runs. In order to be able to reach the Appium server which runs on the VM from outside: Create an Inbound Rule for port 4723 from within the VM. Create an Inbound Security Rule in the NSG (Network Security Group) of the VM to be able to connect from that IP address to port 4723: - Find out the IP of the machine on which the tests will run. - Replace the Source IP Address with the IP of your machine. Installing Android Studio and create AVD inside the VM Follow the instructions under the end to end tests on a Windows machine section to install Android Studio and create an Android Virtual Device. When you launch the emulator, it may show a warning as below and will eventually crash: Solution to fix it: 1. Enable Windows Hypervisor Platform 1. Enable Hyper-V (if not enabled by default) 1. Restart the VM. 1. Restart the AVD. How to inspect the Teams app in an Azure Virtual Device (AVD)? Inspecting the app is highly valuable when writing new tests, as it enables you to identify the unique IDs of various elements displayed on the screen. This process is similar to using DevTools, which allows you to navigate through the Document Object Model (DOM) of a web page. Appium Inspector is a very useful tool that allows you to inspect an app running on an emulator. Note: This section assumes that you have already performed the prerequisites from How to run mobile tests locally on a Windows machine using AVD? Steps Run the appium server with the --allow-cors flag by running the following command in a terminal: appium --allow-cors Go to https://inspector.appiumpro.com and type in the following properties: { \"appium:deviceName\" : \"your-emulator-name\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"UiAutomator2\" , \"platformName\" : \"Android\" } \"appium:deviceName\" - the name of your emulator. In the Useful commands section above, you can see how to get the name of your AVD. \"appium:appPackage\" - the name of the package; it should be kept to \" com.microsoft.teams \". \"appium:appActivity\" - the name of the activity in the app that you want to launch; it should be kept to \" com.microsoft.skype.teams.Launcher \" \"appium:automationName\" - the name of the driver you are using, in this case \" UiAutomator2 \" If the Appium server runs on your local machine at the default port, then Remote Host and Remote Port can be kept to the default values. The configuration should look similar to the screenshot below: Press Start Session . - In the browser, you should see a view similar to the one below: You can do any action on the emulator, and if you press the \"Refresh\" button in the browser, the left-hand side of the Appium Inspector will reflect your app. In the App Source you will be able to see the IDs of the elements, so you can write relevant selectors in your tests. Connecting to Appium server Below is an outline of how this can be achieved with JavaScript. A similar approach can be followed for other languages.
Assuming you are using webdriverio as the client, you would need to initialize the remote connection as follows: const opts = { port : 4723 , hostname : \"your-hostname\" , capabilities : { platformName : \"android\" , \"appium:deviceName\" : \"the-name-of-the-virtual-device\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"the-name-of-the-driver\" , }, }; // Create a new WebDriverIO instance with the Appium server URL and capabilities await wdio . remote ( opts ); \"port\": the port on which the Appium server runs on. By default, it is 4723. \"hostname\": the IP of the machine where the Appium sever runs on. If it is running locally, that is 127.0.0.1. If it runs in an Azure VM, it would be the public IP address of the VM. Note: ensure you have followed the steps from 2. Enable connection from outside to Appium server on the VM . \"platformName\": Appium can be used to connect to different platforms (Windows, iOS, Android). In our case, it would be \"android\". \"appium:deviceName\": the name of the Android Virtual Device. See Useful commands on how to find the name of the device. \"appium:appPackage\": the name of the app's package that you would like to launch. Teams' package name is \"com.microsoft.teams\". \"appium:appActivity\": the activity within Teams that you would like to launch on the device. In our case, we would like just to launch the app. The activity name for launching Teams is called \"com.microsoft.skype.teams.Launcher\". \"appium:automationName\": the name of the driver you are using. Note: Appium can communicate to different platforms. This is achieved by installing a dedicated driver, designed for each platform. In our case, it would be UiAutomator2 or Espresso , since they are both designed for Android platform. Option 2: Using BrowserStack BrowserStack serves as a cloud-based platform that enables developers to test both the web and mobile application across various browsers, operating systems, and real mobile devices. This can be seen as an alternative solution to the approach described earlier. The specific insights provided below relate to implementing such tests for a custom Microsoft Teams application: BrowserStack does not support out of the box the installation of Teams from the App Store or Play Store. However, there is a workaround, described in their documentation . Therefore, if you choose to go this way, you would first need to implement a step that installs Teams on the cloud-based device, by implementing the workaround described above. You may encounter issues with Google login, as it requires a newly created Google account, in order to log in to the store. To overcome this, make sure to disable 2FA from Google, further described in Troubleshooting Google login issues .","title":"Automated UI Tests for a Teams Application"},{"location":"automated-testing/ui-testing/teams-tests/#automated-ui-tests-for-a-teams-application","text":"","title":"Automated UI Tests for a Teams Application"},{"location":"automated-testing/ui-testing/teams-tests/#overview","text":"This is an overview on how you can implement UI tests for a custom Teams application. The insights provided can also be applied to automated end-to-end testing.","title":"Overview"},{"location":"automated-testing/ui-testing/teams-tests/#general-observations","text":"Testing in a web browser is easier than on a native app. 
Testing a Teams app on a mobile device in an automated way is more challenging due to the fact that you are testing an app within an app: There is no Android Application Package (APK) / iOS App Store Package (IPA) publicly available for Microsoft Teams app itself. Mobile testing frameworks are designed with the assumption that you own the APK/IPA of the app under test. Workarounds need to be found to first automate the installation of Teams. Should you choose working with emulators, testing in a local Windows box is more stable than in a CI/CD. The latter involves a CI/CD agent and an emulator in a VM. When deciding whether to implement such tests, consider the project requirements as well as the advantages and disadvantages. Manual UI tests are often an acceptable solution due to their low effort requirements. The following are learnings from various engagements:","title":"General Observations"},{"location":"automated-testing/ui-testing/teams-tests/#web-based-ui-tests","text":"To implement web-based UI tests for your Teams application, follow the same approach as you would for testing any other web application with a UI. UI testing provides valuable guidance in this regard. Your starting point for the test would be to automatically launch a browser (using Selenium or similar frameworks) and navigate to https://teams.microsoft.com . If you want to test a Teams app that hasn\u2019t been published in the Teams store yet or if you\u2019d like to test the DEV/QA version of your app, you can use the Teams Toolkit and package your app based on the manifest.json . npx teamsfx package -- env dev -- manifest - path ... Once the app is installed, implement selectors to access your custom app and to perform various actions within the app.","title":"Web Based UI Tests"},{"location":"automated-testing/ui-testing/teams-tests/#pipeline","text":"If you are using Selenium and Edge as the browser, consider leveraging the selenium/standalone-edge Docker image which contains a standalone Selenium server with the Microsoft Edge browser installed. By default, it will run in headless mode, but by setting START_XVFB variable to True , you can control whether to start a virtual framebuffer server (Xvfb) that allows GUI applications to run without a display. Below is a code snippet which illustrates the usage of the image in a Gitlab pipeline: ... run-tests-dev: allow_failure: false image: ... environment: name: dev stage: tests services: - name: selenium/standalone-edge:latest alias: selenium variables: START_XVFB: \"true\" description: \"Start Xvfb server\" ... When running a test, you need to use the Selenium server URL for remote execution. With the definition from above, the URL is: http://selenium:4444/wd/hub . The code snippet below illustrates how you can initialize the Selenium driver to point to the remote Selenium server using JavaScript: var { Builder } = require ( \"selenium-webdriver\" ); const edge = require ( \"selenium-webdriver/edge\" ); var buildEdgeDriver = function () { let builder = new Builder (). forBrowser ( \"MicrosoftEdge\" ); builder = builder . usingServer ( \"http://selenium:4444/wd/hub\" ); builder . setEdgeOptions ( new edge . Options (). addArguments ( \"--inprivate\" )); return builder . build (); };","title":"Pipeline"},{"location":"automated-testing/ui-testing/teams-tests/#mobile-based-ui-tests","text":"Testing your custom Teams application on mobile devices is a bit more difficult than using the web-based approach as it requires usage of actual or simulated devices. 
Running such tests in a CI/CD pipeline can be more difficult and resource-intensive. One approach is to use real devices or cloud-based emulators from vendors such as BrowserStack which requires a license. Alternatively, you can use virtual devices hosted in Azure Virtual Machines.","title":"Mobile Based UI Tests"},{"location":"automated-testing/ui-testing/teams-tests/#option-1-using-android-virtual-devices-avd","text":"This approach enables the creation of Android UI tests using virtual devices. It comes with the advantage of not requiring paid licenses to certain vendors. However, due to the nature of emulators, compared to real devices, it may prove to be less stable. Always choose the solution that best fits your project requirements and resources. Overall setup: AVD - Android Virtual Devices - which are virtual representation of physical Android devices. Appium is an open-source project designed to facilitate UI automation of many app platforms, including mobile. Appium is based on the W3C WebDriver specification . Note: If you look at these commands in the WebDriver specification, you will notice that they are not defined in terms of any particular programming language. They are not Java commands, or JavaScript commands, or Python commands. Instead, they form part of an HTTP API which can be accessed from within any programming language. Appium implements a client-server architecture: The server (consisting of Appium itself along with any drivers or plugins you are using for automation) is connected to the devices under test, and is actually responsible for making automation happen on those devices. UiAutomator driver is compatible with Android platform. The client is responsible for sending commands to the server over the network, and receiving responses from the server as a result. You can choose the language of your choice to write the commands. For example, for Javascript WebDriverIO can be used as client. Here you can read more about Appium ecosystem The advantage of this architecture is that it opens the possibility of running the server in a VM, and the client in a pipeline, enabling the tests to be ran automatically on scheduled basis as part of CI/CD pipelines.","title":"Option 1: Using Android Virtual Devices (AVD)"},{"location":"automated-testing/ui-testing/teams-tests/#how-to-run-mobile-tests-locally-on-a-windows-machine-using-avd","text":"This approach involves: An emulator ( AVD - Android Virtual Devices ), which will represent the physical device. Appium server , which will redirect the commands from the test to your virtual device.","title":"How to Run Mobile Tests Locally on a Windows Machine Using AVD?"},{"location":"automated-testing/ui-testing/teams-tests/#creating-an-android-virtual-device","text":"Install Android Studio from official link . Note: At the time of writing the documentation, the latest version available was Android Studio Giraffe, 2022.3.1 Patch 2 for Window. Set ANDROID_HOME environment variable to point to the installation path of Android SDK. i.e. C:Users\\<user-name>\\AppData\\Local\\Android\\Sdk Install Java Development Kit (JDK) from official link . For the most recent devices JDK 9 is required, otherwise JDK 8 is required. Make sure you get the JDK and not the JRE. Set JAVA_HOME environment variable to the installation path, i.e. C:\\Program Files\\Java\\jdk-11 Create an AVD (Android Virtual Device): - Open Android Studio. 
From the Android Studio welcome screen, select More Action -> Virtual Device Manager , as instructed here - Click Create Device . - Choose a device definition with Play Store enabled . This is important, otherwise Teams cannot be installed on the device. - Choose a System image from the Recommended tab which includes access to Google Play services. You may need to install it before selecting it. - Start the emulator by clicking on the Run button from the Device Manage screen. - Manually install Microsoft Teams from Google Playstore on the device.","title":"Creating an Android Virtual Device"},{"location":"automated-testing/ui-testing/teams-tests/#setting-up-appium","text":"Install appium : Download NodeJs, if it is not already installed on your machine: Download | Node.js (nodejs.org) Install Appium globally: Install Appium - Appium Documentation Install the UiAutomator2 driver: Install the UiAutomator2 Driver - Appium Documentation . Go through the Set up Android automation requirements in the documentation, to make sure you have set up everything correctly. Read more about Appium Drivers here . - Start appium server by running appium command in a command prompt.","title":"Setting up Appium"},{"location":"automated-testing/ui-testing/teams-tests/#useful-commands","text":"List emulators that you have previously created, without opening Android Studio: emulator -list-avds","title":"Useful commands"},{"location":"automated-testing/ui-testing/teams-tests/#how-to-run-teams-mobile-tests-in-a-pipeline-using-an-azure-vm","text":"This approach leverages the fact that Appium implements a client-server architecture. In this approach, the Appium server as well as the AVD run on an Azure VM, while the client operates within a pipeline and sends commands to be executed on the device.","title":"How to run Teams mobile tests in a pipeline using an Azure VM?"},{"location":"automated-testing/ui-testing/teams-tests/#configure-the-vm","text":"This approach involves hosting a virtual device within a virtual machine. To set up the emulator (Android Virtual Device) in an Azure VM, the VM must support nested virtualization . Azure VM configuration which, at the time of writing the documentation, worked successfully with AVD and appium: Operating system: Windows (Windows-10 Pro) VM generation: V1 Size: Standard D4ds v5 16 GiB memory","title":"Configure the VM"},{"location":"automated-testing/ui-testing/teams-tests/#enable-connection-from-outside-to-appium-server-on-the-vm","text":"Note: By default appium server runs on port 4723. The rest of the steps will assume that this is the port where your appium server runs. In order to be able to reach appium server which runs on the VM from outside: Create an Inbound Rule for port 4723 from within the VM. Create an Inbound Security Rule in the NSG (Network Security Group) of the VM to be able to connect from that IP address to port 4723: - Find out the IP of the machine on which the tests will run on. - Replace the Source IP Address with the IP of your machine.","title":"Enable connection from outside to Appium server on the VM"},{"location":"automated-testing/ui-testing/teams-tests/#installing-android-studio-and-create-avd-inside-the-vm","text":"Follow the instructions under the end to end tests on a Windows machine section to install Android Studio and create an Android Virtual Device. When you launch the emulator, it may show a warning as below and will eventually crash: Solution to fix it: 1. Enable Windows Hypervisor Platform 1. 
Enable Hyper-V (if not enabled by default) 1. Restart the VM. 1. Restart the AVD.","title":"Installing Android Studio and create AVD inside the VM"},{"location":"automated-testing/ui-testing/teams-tests/#how-to-inspect-the-teams-app-in-an-azure-virtual-device-avd","text":"Inspecting the app is highly valuable when writing new tests, as it enables you to identify the unique IDs of various elements displayed on the screen. This process is similar to using DevTools, which allows you to navigate through the Document Object Model (DOM) of a web page. Appium Inspector is a very useful tool that allows you to inspect an app running on an emulator. Note: This section assumes that you have already performed the prerequisites from How to run mobile tests locally on a Windows machine using AVD?","title":"How to inspect the Teams app in an Azure Virtual Device (AVD)?"},{"location":"automated-testing/ui-testing/teams-tests/#steps","text":"Run the appium server with the --allow-cors flag by running the following command in a terminal: appium --allow-cors Go to https://inspector.appiumpro.com and type in the following properties: { \"appium:deviceName\" : \"your-emulator-name\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"UiAutomator2\" , \"platformName\" : \"Android\" } \"appium:deviceName\" - the name of your emulator. In the Useful commands section above, you can see how to get the name of your AVD. \"appium:appPackage\" - the name of the package; it should be kept to \" com.microsoft.teams \". \"appium:appActivity\" - the name of the activity in the app that you want to launch; it should be kept to \" com.microsoft.skype.teams.Launcher \" \"appium:automationName\" - the name of the driver you are using, in this case \" UiAutomator2 \" If the Appium server runs on your local machine at the default port, then Remote Host and Remote Port can be kept to the default values. The configuration should look similar to the screenshot below: Press Start Session . - In the browser, you should see a view similar to the one below: You can do any action on the emulator, and if you press the \"Refresh\" button in the browser, the left-hand side of the Appium Inspector will reflect your app. In the App Source you will be able to see the IDs of the elements, so you can write relevant selectors in your tests. Connecting to Appium server Below is an outline of how this can be achieved with JavaScript. A similar approach can be followed for other languages. Assuming you are using webdriverio as the client, you would need to initialize the remote connection as follows: const opts = { port : 4723 , hostname : \"your-hostname\" , capabilities : { platformName : \"android\" , \"appium:deviceName\" : \"the-name-of-the-virtual-device\" , \"appium:appPackage\" : \"com.microsoft.teams\" , \"appium:appActivity\" : \"com.microsoft.skype.teams.Launcher\" , \"appium:automationName\" : \"the-name-of-the-driver\" , }, }; // Create a new WebDriverIO instance with the Appium server URL and capabilities await wdio . remote ( opts ); \"port\": the port on which the Appium server runs. By default, it is 4723. \"hostname\": the IP of the machine where the Appium server runs. If it is running locally, that is 127.0.0.1. If it runs in an Azure VM, it would be the public IP address of the VM. Note: ensure you have followed the steps from 2. Enable connection from outside to Appium server on the VM .
\"platformName\": Appium can be used to connect to different platforms (Windows, iOS, Android). In our case, it would be \"android\". \"appium:deviceName\": the name of the Android Virtual Device. See Useful commands on how to find the name of the device. \"appium:appPackage\": the name of the app's package that you would like to launch. Teams' package name is \"com.microsoft.teams\". \"appium:appActivity\": the activity within Teams that you would like to launch on the device. In our case, we would like just to launch the app. The activity name for launching Teams is called \"com.microsoft.skype.teams.Launcher\". \"appium:automationName\": the name of the driver you are using. Note: Appium can communicate to different platforms. This is achieved by installing a dedicated driver, designed for each platform. In our case, it would be UiAutomator2 or Espresso , since they are both designed for Android platform.","title":"Steps"},{"location":"automated-testing/ui-testing/teams-tests/#option-2-using-browserstack","text":"BrowserStack serves as a cloud-based platform that enables developers to test both the web and mobile application across various browsers, operating systems, and real mobile devices. This can be seen as an alternative solution to the approach described earlier. The specific insights provided below relate to implementing such tests for a custom Microsoft Teams application: BrowserStack does not support out of the box the installation of Teams from the App Store or Play Store. However, there is a workaround, described in their documentation . Therefore, if you choose to go this way, you would first need to implement a step that installs Teams on the cloud-based device, by implementing the workaround described above. You may encounter issues with Google login, as it requires a newly created Google account, in order to log in to the store. To overcome this, make sure to disable 2FA from Google, further described in Troubleshooting Google login issues .","title":"Option 2: Using BrowserStack"},{"location":"automated-testing/unit-testing/","text":"Unit Testing Unit testing is a fundamental tool in every developer's toolbox. Unit tests not only help us test our code, they encourage good design practices, reduce the chances of bugs reaching production, and can even serve as examples or documentation on how code functions. Properly written unit tests can also improve developer efficiency. Unit testing also is one of the most commonly misunderstood forms of testing. Unit testing refers to a very specific type of testing; a unit test should be: Provably reliable - should be 100% reliable so failures indicate a bug in the code Fast - should run in milliseconds, a whole unit testing suite shouldn't take longer than a couple seconds Isolated - removing all external dependencies ensures reliability and speed Why Unit Testing It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we write them? Unit tests reduce costs by catching bugs earlier and preventing regressions increase developer confidence in changes speed up the developer inner loop act as documentation as code For more details, see all the detailed descriptions of the points above . Unit Testing Design Blocks Unit testing is the lowest level of testing and as such generally has few components and dependencies. The system under test (abbreviated SUT) is the \"unit\" we are testing. 
Generally these are methods or functions, but depending on the language these could be different. In general, you want the unit to be as small as possible though. Most languages also have a wide suite of unit testing frameworks and test runners. These test frameworks have a wide range of functionality, but the base functionality should be a way to organize your tests and run them quickly. Finally, there is your unit test code ; unit test code is generally short and simple, preferring repetition to adding layers and complexity to the code. Applying the Unit Testing Getting started with writing a unit test is much easier than some other test types since it should require next to no setup and is just code. Each test framework is different in how you organize and write your tests, but the general techniques and best practices of writing a unit test are universal. Techniques These are some commonly used techniques that will help when authoring unit tests. For some examples, see the pages on using abstraction and dependency injection to author a unit test , or how to do test-driven development . Note that some of these techniques are more specific to strongly typed, object-oriented languages. Functional languages and scripting languages have similar techniques that may look different, but these terms are commonly used in all unit testing examples. Abstraction Abstraction is when we take an exact implementation detail, and we generalize it into a concept instead. This technique can be used in creating testable design and is often used, especially in object-oriented languages. For unit tests, abstraction is commonly used to break a hard dependency and replace it with an abstraction. That abstraction then allows for greater flexibility in the code and allows for a mock or simulator to be used in its place. One of the side effects of abstracting dependencies is that you may have an abstraction that has no test coverage. This is a case where unit testing is not well-suited; you cannot expect to unit test everything, and things like dependencies will always be an uncovered case. This is why even if you have a robust unit testing suite, integration or functional testing should still be used - without that, a change in the way the dependency functions would never be caught. When building wrappers around third-party dependencies, it is best to keep the implementations with as little logic as possible, using a very simple facade that calls the dependency. An example of using abstraction can be found here . Dependency Injection Dependency injection is a technique which allows us to extract dependencies from our code. In a normal use-case of a dependent class, the dependency is constructed and used within the system under test. This creates a hard dependency between the two classes, which can make it particularly hard to test in isolation. Dependencies could be things like classes wrapping a REST API, or even something as simple as file access. By injecting the dependencies into our system rather than constructing them, we have \"inverted control\" of the dependency. You may see \"Inversion of Control\" and \"Dependency Injection\" used as separate terms, but it is very hard to have one and not the other, with some arguing that Dependency Injection is a more specific way of saying inversion of control . In certain languages such as C#, not using dependency injection can lead to code that is not unit testable since there is no way to inject mocked objects.
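To make the pattern concrete, here is a minimal sketch using hypothetical names (IDataClient, ReportService); the Configuration example later in this playbook shows the same idea end to end:

```csharp
public interface IDataClient
{
    string[] Fetch();
}

// A concrete client; in a real project this might wrap a REST API or file access.
public class RestDataClient : IDataClient
{
    public string[] Fetch() => new[] { "record-1", "record-2" };
}

// Hard dependency: the service constructs its own client, so a unit test
// cannot substitute a fake implementation.
public class HardWiredReportService
{
    private readonly RestDataClient _client = new RestDataClient();
    public int CountRecords() => _client.Fetch().Length;
}

// Constructor injection: whoever creates the service decides which IDataClient
// implementation is used, so a test can pass in a simple in-memory fake.
public class ReportService
{
    private readonly IDataClient _client;

    public ReportService(IDataClient client)
    {
        _client = client;
    }

    public int CountRecords() => _client.Fetch().Length;
}
```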
Keeping testability in mind from the beginning and evaluating using dependency injection can save you from a time-intensive refactor later. One of the downsides of dependency injection is that it can easily go overboard. While there are no longer hard dependencies, there is still coupling between the interfaces, and passing around every interface implementation into every class presents just as many downsides as not using Dependency Injection. Being intentional with what dependencies get injected to what classes, is key to developing a maintainable system. Many languages include special Dependency Injection frameworks that take care of the boilerplate code and construction of the objects. Examples of this are Spring in Java or built into ASP.NET Core An example of using dependency injection can be found here . Test-Driven Development Test-Driven Development (TDD) is less a technique in how your code is designed, but a technique for writing your code that will lead you to a testable design from the start. The basic premise of test-driven development is that you write your test code first and then write the system under test to match the test you just wrote. This way all the test design is done up front and by the time you finish writing your system code, you are already at 100% test pass rate and test coverage. It also guarantees testable design is built into the system since the test was written first! For more information on TDD and an example, see the page on Test-Driven Development Best Practices Arrange/Act/Assert One common form of organizing your unit test code is called Arrange/Act/Assert. This divides up your unit test into 3 different discrete sections: Arrange - Set up all the variables, mocks, interfaces, and state you will need to run the test Act - Run the system under test, passing in any of the above objects that were created Assert - Check that with the given state that the system acted appropriately. Using this pattern to write tests makes them very readable and also familiar to future developers who would need to read your unit tests. Example Let's assume we have a class MyObject with a method TrySomething that interacts with an array of strings, but if the array has no elements, it will return false. We want to write a test that checks the case where array has no elements: [Fact] public void TrySomething_NoElements_ReturnsFalse () { // Arrange var elements = Array . Empty < string > (); var myObject = new MyObject (); // Act var myReturn = myObject . TrySomething ( elements ); // Assert Assert . False ( myReturn ); } Keep Tests Small and Test Only One Thing Unit tests should be short and test only one thing. This makes it easy to diagnose when there was a failure without needing something like which line number the test failed at. When using Arrange/Act/Assert , think of it like testing just one thing in the \"Act\" phase. There is some disagreement on whether testing one thing means \"assert one thing\" or \"test one state, with multiple asserts if needed\". Both have their advantages and disadvantages, but as with most technical disagreements there is no \"right\" answer. Consistency when writing your tests one way or the other is more important! Using a Standard Naming Convention for All Unit Tests Without having a set standard convention for unit test names, unit test names end up being either not descriptive enough, or duplicated across multiple different test classes. 
Establishing a standard is not only important for keeping your code consistent, but a good standard also improves the readability and debug-ability of a test. In this article, the convention used for all unit tests has been UnitName_StateUnderTest_ExpectedResult , but there are lots of other possible conventions as well; the important thing is to be consistent and descriptive. Having descriptive names such as the one above makes it trivial to find the test when there is a failure, and also already explains what the expectation of the test was and what state caused it to fail. This can be especially helpful when looking at failures in a CI/CD system where all you know is the name of the test that failed - instead now you know the name of the test and exactly why it failed (especially coupled with a test framework that logs helpful output on failures). Things to Avoid Some common pitfalls when writing a unit test that are important to avoid: Sleeps - A sleep can be an indicator that perhaps something is making a request to a dependency that it should not be. In general, if your code is flaky without the sleep, consider why it is failing and if you can remove the flakiness by introducing a more reliable way to communicate potential state changes. Adding sleeps to your unit tests also breaks one of our original tenets of unit testing: tests should be fast, as in on the order of milliseconds. If tests are taking on the order of seconds, they become more cumbersome to run. Reading from disk - It can be really tempting to store the expected return value of a function in a file and read that file to compare the results. This creates a dependency on the system drive, and it breaks our tenet of keeping our unit tests isolated and 100% reliable. Any outside dependency such as file system access could potentially cause intermittent failures. Additionally, this could be a sign that perhaps the test or unit under test is too complex and should be simplified. Calling third-party APIs - When you do not control a third-party library that you are calling into, it's impossible to know for sure what that is doing, and it is best to abstract it out. Otherwise, you may be making REST calls or other potential areas of failure without directly writing the code for it. This is also generally a sign that the design of the system is not entirely testable. It is best to wrap third-party API calls in interfaces or other structures so that they do not get invoked in unit tests. For more information, see the page on mocking . Unit Testing Frameworks and Tools Test Frameworks Unit test frameworks are constantly changing. For a full list of every unit testing framework see the page on Wikipedia . Frameworks have many features and should be picked based on which feature-set fits best for the particular project. Mock Frameworks Many projects start with a unit test framework and also add a mock framework. While mocking frameworks have their uses and sometimes can be a requirement, it should not be something that is added without considering the broader implications and risks associated with heavy usage of mocks. To see if mocking is right for your project, or if a mock-free approach is more appropriate, see the page on mocking .
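As a small illustration of the mock-free alternative, a hand-rolled fake behind an interface you own is often enough; the names below (IExchangeRateClient, PriceConverter) are hypothetical:

```csharp
using Xunit;

// Wrap the third-party or external call behind an interface you own.
public interface IExchangeRateClient
{
    decimal GetRate(string fromCurrency, string toCurrency);
}

// A hand-rolled fake: no mock framework required, and the behavior is explicit.
public class FakeExchangeRateClient : IExchangeRateClient
{
    public decimal GetRate(string fromCurrency, string toCurrency) => 2.0m;
}

public class PriceConverter
{
    private readonly IExchangeRateClient _client;
    public PriceConverter(IExchangeRateClient client) => _client = client;

    public decimal Convert(decimal amount, string from, string to) =>
        amount * _client.GetRate(from, to);
}

public class PriceConverterTests
{
    [Fact]
    public void Convert_UsesRateFromClient_ReturnsConvertedAmount()
    {
        // Arrange
        var converter = new PriceConverter(new FakeExchangeRateClient());

        // Act
        var result = converter.Convert(10m, "USD", "EUR");

        // Assert
        Assert.Equal(20m, result);
    }
}
```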
Tools These tools allow for constant running of your unit tests with in-line code coverage, making the dev inner loop extremely fast and allows for easy TDD: Visual Studio Live Unit Testing Wallaby.js Infinitest for Java PyCrunch for Python Things to Consider Transferring Responsibility to Integration Tests In some situations it is worth considering to include the integration tests in the inner development loop to provide a sufficient code coverage to ensure the system is working properly. The prerequisite for this approach to be successful is to have integration tests being able to execute at a speed comparable to that of unit tests both locally and in a CI environment. Modern application frameworks like .NET or Spring Boot combined with the right mocking or stubbing approach for external dependencies offer excellent capabilities to enable such scenarios for testing. Usually, integration tests only prove that independently developed modules connect together as designed. The test coverage of integration tests can be extended to verify the correct behavior of the system as well. The responsibility of providing a sufficient branch and line code coverage can be transferred from unit tests to integration tests. Instead of several unit tests needed to test a specific case of functionality of the system, one integration scenario is created that covers the entire flow. For example in case of an API, the received HTTP responses and their content are verified for each request in test. This covers both the integration between components of the API and the correctness of its business logic. With this approach efficient integration tests can be treated as an extension of unit testing, taking over the responsibility of validating happy/failure path scenarios. It has the advantage of testing the system as a black box without any knowledge of its internals. Code refactoring has no impact on tests. Common testing techniques as TDD can be applied at a higher level which results in a development process that is driven by acceptance tests. Depending on the project specifics unit tests still play an important role. They can be used to help dictate a testable design at a lower level or to test complex business logic and corner cases if necessary. Conclusion Unit testing is extremely important, but it is also not the silver bullet; having proper unit tests is just a part of a well-tested system. However, writing proper unit tests will help with the design of your system as well as help catch regressions, bugs, and increase developer velocity. Resources Unit Testing Best Practices","title":"Unit Testing"},{"location":"automated-testing/unit-testing/#unit-testing","text":"Unit testing is a fundamental tool in every developer's toolbox. Unit tests not only help us test our code, they encourage good design practices, reduce the chances of bugs reaching production, and can even serve as examples or documentation on how code functions. Properly written unit tests can also improve developer efficiency. Unit testing also is one of the most commonly misunderstood forms of testing. 
Unit testing refers to a very specific type of testing; a unit test should be: Provably reliable - should be 100% reliable so failures indicate a bug in the code Fast - should run in milliseconds, a whole unit testing suite shouldn't take longer than a couple seconds Isolated - removing all external dependencies ensures reliability and speed","title":"Unit Testing"},{"location":"automated-testing/unit-testing/#why-unit-testing","text":"It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we write them? Unit tests reduce costs by catching bugs earlier and preventing regressions increase developer confidence in changes speed up the developer inner loop act as documentation as code For more details, see all the detailed descriptions of the points above .","title":"Why Unit Testing"},{"location":"automated-testing/unit-testing/#unit-testing-design-blocks","text":"Unit testing is the lowest level of testing and as such generally has few components and dependencies. The system under test (abbreviated SUT) is the \"unit\" we are testing. Generally these are methods or functions, but depending on the language these could be different. In general, you want the unit to be as small as possible though. Most languages also have a wide suite of unit testing frameworks and test runners. These test frameworks have a wide range of functionality, but the base functionality should be a way to organize your tests and run them quickly. Finally, there is your unit test code ; unit test code is generally short and simple, preferring repetition to adding layers and complexity to the code.","title":"Unit Testing Design Blocks"},{"location":"automated-testing/unit-testing/#applying-the-unit-testing","text":"Getting started with writing a unit test is much easier than some other test types since it should require next to no setup and is just code. Each test framework is different in how you organize and write your tests, but the general techniques and best practices of writing a unit test are universal.","title":"Applying the Unit Testing"},{"location":"automated-testing/unit-testing/#techniques","text":"These are some commonly used techniques that will help when authoring unit tests. For some examples, see the pages on using abstraction and dependency injection to author a unit test , or how to do test-driven development . Note that some of these techniques are more specific to strongly typed, object-oriented languages. Functional languages and scripting languages have similar techniques that may look different, but these terms are commonly used in all unit testing examples.","title":"Techniques"},{"location":"automated-testing/unit-testing/#abstraction","text":"Abstraction is when we take an exact implementation detail, and we generalize it into a concept instead. This technique can be used in creating testable design and is used often especially in object-oriented languages. For unit tests, abstraction is commonly used to break a hard dependency and replace it with an abstraction. That abstraction then allows for greater flexibility in the code and allows for the a mock or simulator to be used in its place. One of the side effects of abstracting dependencies is that you may have an abstraction that has no test coverage. This is case where unit testing is not well-suited, you can not expect to unit test everything, things like dependencies will always be an uncovered case. 
This is why even if you have a robust unit testing suite, integration or functional testing should still be used - without that, a change in the way the dependency functions would never be caught. When building wrappers around third-party dependencies, it is best to keep the implementations with as little logic as possible, using a very simple facade that calls the dependency. An example of using abstraction can be found here .","title":"Abstraction"},{"location":"automated-testing/unit-testing/#dependency-injection","text":"Dependency injection is a technique which allows us to extract dependencies from our code. In a normal use-case of a dependant class, the dependency is constructed and used within the system under test. This creates a hard dependency between the two classes, which can make it particularly hard to test in isolation. Dependencies could be things like classes wrapping a REST API, or even something as simple as file access. By injecting the dependencies into our system rather than constructing them, we have \"inverted control\" of the dependency. You may see \"Inversion of Control\" and \"Dependency Injection\" used as separate terms, but it is very hard to have one and not the other, with some arguing that Dependency Injection is a more specific way of saying inversion of control . In certain languages such as C#, not using dependency injection can lead to code that is not unit testable since there is no way to inject mocked objects. Keeping testability in mind from the beginning and evaluating using dependency injection can save you from a time-intensive refactor later. One of the downsides of dependency injection is that it can easily go overboard. While there are no longer hard dependencies, there is still coupling between the interfaces, and passing around every interface implementation into every class presents just as many downsides as not using Dependency Injection. Being intentional with what dependencies get injected to what classes, is key to developing a maintainable system. Many languages include special Dependency Injection frameworks that take care of the boilerplate code and construction of the objects. Examples of this are Spring in Java or built into ASP.NET Core An example of using dependency injection can be found here .","title":"Dependency Injection"},{"location":"automated-testing/unit-testing/#test-driven-development","text":"Test-Driven Development (TDD) is less a technique in how your code is designed, but a technique for writing your code that will lead you to a testable design from the start. The basic premise of test-driven development is that you write your test code first and then write the system under test to match the test you just wrote. This way all the test design is done up front and by the time you finish writing your system code, you are already at 100% test pass rate and test coverage. It also guarantees testable design is built into the system since the test was written first! For more information on TDD and an example, see the page on Test-Driven Development","title":"Test-Driven Development"},{"location":"automated-testing/unit-testing/#best-practices","text":"","title":"Best Practices"},{"location":"automated-testing/unit-testing/#arrangeactassert","text":"One common form of organizing your unit test code is called Arrange/Act/Assert. 
This divides up your unit test into 3 different discrete sections: Arrange - Set up all the variables, mocks, interfaces, and state you will need to run the test Act - Run the system under test, passing in any of the above objects that were created Assert - Check that with the given state that the system acted appropriately. Using this pattern to write tests makes them very readable and also familiar to future developers who would need to read your unit tests.","title":"Arrange/Act/Assert"},{"location":"automated-testing/unit-testing/#example","text":"Let's assume we have a class MyObject with a method TrySomething that interacts with an array of strings, but if the array has no elements, it will return false. We want to write a test that checks the case where array has no elements: [Fact] public void TrySomething_NoElements_ReturnsFalse () { // Arrange var elements = Array . Empty < string > (); var myObject = new MyObject (); // Act var myReturn = myObject . TrySomething ( elements ); // Assert Assert . False ( myReturn ); }","title":"Example"},{"location":"automated-testing/unit-testing/#keep-tests-small-and-test-only-one-thing","text":"Unit tests should be short and test only one thing. This makes it easy to diagnose when there was a failure without needing something like which line number the test failed at. When using Arrange/Act/Assert , think of it like testing just one thing in the \"Act\" phase. There is some disagreement on whether testing one thing means \"assert one thing\" or \"test one state, with multiple asserts if needed\". Both have their advantages and disadvantages, but as with most technical disagreements there is no \"right\" answer. Consistency when writing your tests one way or the other is more important!","title":"Keep Tests Small and Test Only One Thing"},{"location":"automated-testing/unit-testing/#using-a-standard-naming-convention-for-all-unit-tests","text":"Without having a set standard convention for unit test names, unit test names end up being either not descriptive enough, or duplicated across multiple different test classes. Establishing a standard is not only important for keeping your code consistent, but a good standard also improves the readability and debug-ability of a test. In this article, the convention used for all unit tests has been UnitName_StateUnderTest_ExpectedResult , but there are lots of other possible conventions as well, the important thing is to be consistent and descriptive. Having descriptive names such as the one above makes it trivial to find the test when there is a failure, and also already explains what the expectation of the test was and what state caused it to fail. This can be especially helpful when looking at failures in a CI/CD system where all you know is the name of the test that failed - instead now you know the name of the test and exactly why it failed (especially coupled with a test framework that logs helpful output on failures).","title":"Using a Standard Naming Convention for All Unit Tests"},{"location":"automated-testing/unit-testing/#things-to-avoid","text":"Some common pitfalls when writing a unit test that are important to avoid: Sleeps - A sleep can be an indicator that perhaps something is making a request to a dependency that it should not be. In general, if your code is flaky without the sleep, consider why it is failing and if you can remove the flakiness by introducing a more reliable way to communicate potential state changes. 
Adding sleeps to your unit tests also breaks one of our original tenets of unit testing: tests should be fast, as in on the order of milliseconds. If tests are taking on the order of seconds, they become more cumbersome to run. Reading from disk - It can be really tempting to store the expected return value of a function in a file and read that file to compare the results. This creates a dependency on the system drive, and it breaks our tenet of keeping our unit tests isolated and 100% reliable. Any outside dependency such as file system access could potentially cause intermittent failures. Additionally, this could be a sign that perhaps the test or unit under test is too complex and should be simplified. Calling third-party APIs - When you do not control a third-party library that you are calling into, it's impossible to know for sure what that is doing, and it is best to abstract it out. Otherwise, you may be making REST calls or other potential areas of failure without directly writing the code for it. This is also generally a sign that the design of the system is not entirely testable. It is best to wrap third-party API calls in interfaces or other structures so that they do not get invoked in unit tests. For more information, see the page on mocking .","title":"Things to Avoid"},{"location":"automated-testing/unit-testing/#unit-testing-frameworks-and-tools","text":"","title":"Unit Testing Frameworks and Tools"},{"location":"automated-testing/unit-testing/#test-frameworks","text":"Unit test frameworks are constantly changing. For a full list of every unit testing framework see the page on Wikipedia . Frameworks have many features and should be picked based on which feature-set fits best for the particular project.","title":"Test Frameworks"},{"location":"automated-testing/unit-testing/#mock-frameworks","text":"Many projects start with a unit test framework and also add a mock framework. While mocking frameworks have their uses and sometimes can be a requirement, it should not be something that is added without considering the broader implications and risks associated with heavy usage of mocks. To see if mocking is right for your project, or if a mock-free approach is more appropriate, see the page on mocking .","title":"Mock Frameworks"},{"location":"automated-testing/unit-testing/#tools","text":"These tools allow for constant running of your unit tests with in-line code coverage, making the dev inner loop extremely fast and allowing for easy TDD: Visual Studio Live Unit Testing Wallaby.js Infinitest for Java PyCrunch for Python","title":"Tools"},{"location":"automated-testing/unit-testing/#things-to-consider","text":"","title":"Things to Consider"},{"location":"automated-testing/unit-testing/#transferring-responsibility-to-integration-tests","text":"In some situations it is worth considering including the integration tests in the inner development loop to provide sufficient code coverage to ensure the system is working properly. The prerequisite for this approach to be successful is that the integration tests are able to execute at a speed comparable to that of unit tests both locally and in a CI environment. Modern application frameworks like .NET or Spring Boot combined with the right mocking or stubbing approach for external dependencies offer excellent capabilities to enable such scenarios for testing. Usually, integration tests only prove that independently developed modules connect together as designed.
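For an ASP.NET Core API, such a test can be hosted in memory and run at near unit-test speed. A hedged sketch using WebApplicationFactory follows; the /api/orders endpoint and the publicly visible Program entry point are assumptions for illustration:

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

// Assumes the API project uses the .NET 6+ minimal hosting model and that the
// test project references Microsoft.AspNetCore.Mvc.Testing.
public class OrdersApiTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly WebApplicationFactory<Program> _factory;

    public OrdersApiTests(WebApplicationFactory<Program> factory) => _factory = factory;

    [Fact]
    public async Task GetOrders_ReturnsOkWithJsonContent()
    {
        // Arrange: the factory hosts the API in memory, so no deployment is needed.
        var client = _factory.CreateClient();

        // Act: exercise the (hypothetical) endpoint over HTTP.
        var response = await client.GetAsync("/api/orders");

        // Assert: verify the status code and content type of the response.
        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        Assert.Equal("application/json; charset=utf-8",
            response.Content.Headers.ContentType?.ToString());
    }
}
```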
The test coverage of integration tests can be extended to verify the correct behavior of the system as well. The responsibility of providing a sufficient branch and line code coverage can be transferred from unit tests to integration tests. Instead of several unit tests needed to test a specific case of functionality of the system, one integration scenario is created that covers the entire flow. For example in case of an API, the received HTTP responses and their content are verified for each request in test. This covers both the integration between components of the API and the correctness of its business logic. With this approach efficient integration tests can be treated as an extension of unit testing, taking over the responsibility of validating happy/failure path scenarios. It has the advantage of testing the system as a black box without any knowledge of its internals. Code refactoring has no impact on tests. Common testing techniques as TDD can be applied at a higher level which results in a development process that is driven by acceptance tests. Depending on the project specifics unit tests still play an important role. They can be used to help dictate a testable design at a lower level or to test complex business logic and corner cases if necessary.","title":"Transferring Responsibility to Integration Tests"},{"location":"automated-testing/unit-testing/#conclusion","text":"Unit testing is extremely important, but it is also not the silver bullet; having proper unit tests is just a part of a well-tested system. However, writing proper unit tests will help with the design of your system as well as help catch regressions, bugs, and increase developer velocity.","title":"Conclusion"},{"location":"automated-testing/unit-testing/#resources","text":"Unit Testing Best Practices","title":"Resources"},{"location":"automated-testing/unit-testing/authoring-example/","text":"Writing a Unit Test To illustrate some unit testing techniques for an object-oriented language, let's start with an example of some code we wish to add unit tests for. In this example, we have a configuration class that contains all the startup options for an app we are writing. Normally it reads from a .config file, but we are having three problems with the current implementation: There is a bug in the Configuration class, and we have no unit tests since it relies on reading a config file We can't unit test any of the code that relies on the Configuration class reading a config file In the future, we want to allow for configuration to be saved in the cloud and accessed via REST api. The bug we are trying to fix is that if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown. Our class currently looks like this: using System.IO ; using System.Linq ; public class Configuration { // Public getter properties from configuration object public string MyProperty { get ; private set ; } public void Initialize () { var configContents = File . ReadAllLines ( \".config\" ); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } } Abstraction In our example, we have a single dependency: the file system. Rather than just abstracting the file system entirely, let us think about why we need the file system and abstract the concept rather than the implementation. 
In this case, we are using the File class to read from the config file and get the config contents. The abstraction concept here is some form of configuration reader that returns each line of the configuration in a string array. We could call it ConfigurationReader , and it has a single method, Read , which returns the contents. When creating abstractions, it can be good practice to create an interface for that abstraction, in languages that support it. In the example with C#, we can create an IConfigurationReader interface, and instead of just having a ConfigurationReader class we can be more specific and name it FileConfigurationReader to indicate that it reads from the file system: // IConfigurationReader.cs public interface IConfigurationReader { string [] Read (); } // FileConfigurationReader.cs public class FileConfigurationReader : IConfigurationReader { public string [] Read () { return File . ReadAllLines ( \".config\" ); } } Now that the file dependency has been abstracted away, we need to update our Configuration class's Initialize method to use the new abstraction instead of calling File.ReadAllLines directly: public void Initialize () { var configContents = new FileConfigurationReader (). Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } As you can see, we still have a dependency on the file system, but that dependency has been abstracted out. We will need to use other techniques to break the dependency completely. Dependency Injection In the previous section, we abstracted the file access into a FileConfigurationReader but we still had a dependency on the file system in our function. We can use dependency injection to inject the right reader into our Configuration class: using System.IO ; using System.Linq ; public class Configuration { private readonly IConfigurationReader configReader ; // Public getter properties from configuration object public string MyProperty { get ; private set ; } public Configuration ( IConfigurationReader reader ) { this . configReader = reader ; } public void Initialize () { var configContents = configReader . Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } } Above, a technique was used called Constructor Injection . This uses the object's constructor to set what our dependencies will be, which means whichever object creates the Configuration object will control which reader needs to get passed in. This is an example of \"inversion of control\": previously the Configuration object controlled the dependency, but instead we pushed the control up to whatever component creates this object. Note that we injected the interface IConfigurationReader and not the concrete class. This is what allows us to break the dependency; whereas originally we had a hard-coded dependency on the File class, now we only depend on an object that implements IConfigurationReader . Writing our first unit tests We started down this venture because we have a bug in the Configuration class that was not caught because we do not have unit tests.
Let us write some unit tests that gives us full coverage of the Configuration class, including a test that tests the scenario described by the bug (if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown). However, we still have one problem, we only have a single implementation of IConfigurationReader , and it uses the file system, meaning any unit tests we write will still have a dependency on the file system! Luckily since we used dependency injection, all we need to do is create an implementation of IConfigurationReader that does not depend on the file system. We could create a mock here, but instead let's create a concrete implementation of the interface which simply returns the passed in string[] - we can call it PassThroughConfigurationReader (for more details on why this approach may be better than mocking, see the page on mocking ) public class PassThroughConfigurationReader : IConfigurationReader { private readonly string [] contents ; public PassThroughConfigurationReader ( string [] contents ) { this . contents = contents ; } public string [] Read () { return this . contents ; } } This simple class will be used in our unit tests, so we can create different states without requiring lots of file access. Now that we have this in place, we can go ahead and write our unit tests, starting with the tests that describe the current behavior: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < KeyNotFoundException > (() => config . Initialize ()); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } } Fixing the Bug All our current tests pass, and give us 100% coverage, however as evidenced by the bug, we must not be covering all possible inputs and outputs. In the case of the bug, multiple empty lines would cause an issue. Additionally, KeyNotFoundException is not a very friendly exception and is an implementation detail, not something that makes sense when designing the Configuration API. Let's add some more tests and align the tests with how we think the Configuration class should behave: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MalformedLine_Throws () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty\" , }); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MultipleEqualSigns_PropertyContainsNoEquals () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myval1=myval2\" , }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myval1=myval2\" , config . MyProperty ); } [Fact] public void Initialize_WithBlankLines_Ignores () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" , string . Empty , }); var config = new Configuration ( reader ); config . 
Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } } Now we have 4 failing tests and 1 passing test, but we have firmly established through the use of these tests how we expect callers to user the Configuration class and what is and isn't allowed as inputs. Now we just need to fix the Configuration class so that our tests pass: public void Initialize () { var configContents = configReader . Read (); if ( configContents . Length == 0 ) { throw new InvalidOperationException ( \"Empty config\" ); } // Config is in the format: key=value var config = configContents . Where ( l => ! string . IsNullOrWhiteSpace ( l )) . Select ( l => { var splitLine = l . Split ( '=' , 2 ); if ( splitLine . Length < 2 ) { throw new InvalidOperationException ( \"Malformed line\" ); } return splitLine ; }) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } Now all our tests pass! We have fixed our bug, added unit tests to the Configuration class, and have much higher confidence in future changes. Untestable Code As described in the abstraction section , not all code can be properly unit tested. In our case we have a single class that has 0% test coverage: FileConfigurationReader . This is expected; in this case we kept FileConfigurationReader as light as possible with no additional logic other than calling into the third-party dependency. FileConfigurationReader is an example of the facade design pattern . Testable Design and Future Improvements One of our original problems described in this example is that in the future we expect to load the configuration from a web API. By doing all the work of abstracting the way we load the configuration text and breaking the dependency on the file system, we have already done all the hard work to enable this future scenario! All that needs to be done next is to create a WebApiConfigurationReader implementation and use that the construct the Configuration object, and it should just work. That is one of the benefits of testable design, in the process of writing our tests in a safe way, a side effect of that is that we already have our dependencies that might change abstracted, and will require minimal changes to implement. Another added benefit is we have multiple possibilities opened by this testable design. For example, we can have a cascading configuration set up now using all 3 IConfigurationReader implementations, including the one we wrote only for our tests! We can first check if internet access is available and if so use WebApiConfigurationReader . If no internet is available, we can fall back to the local config file on the current system using FileConfigurationReader . If for some reason the config file does not exist, we can use the PassThroughConfigurationReader as a hard-coded default configuration somewhere in the code. We have full flexibility to do whatever we may need to do in the future!","title":"Writing a Unit Test"},{"location":"automated-testing/unit-testing/authoring-example/#writing-a-unit-test","text":"To illustrate some unit testing techniques for an object-oriented language, let's start with an example of some code we wish to add unit tests for. 
In this example, we have a configuration class that contains all the startup options for an app we are writing. Normally it reads from a .config file, but we are having three problems with the current implementation: There is a bug in the Configuration class, and we have no unit tests since it relies on reading a config file We can't unit test any of the code that relies on the Configuration class reading a config file In the future, we want to allow for configuration to be saved in the cloud and accessed via REST api. The bug we are trying to fix is that if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown. Our class currently looks like this: using System.IO ; using System.Linq ; public class Configuration { // Public getter properties from configuration object public string MyProperty { get ; private set ; } public void Initialize () { var configContents = File . ReadAllLines ( \".config\" ); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } }","title":"Writing a Unit Test"},{"location":"automated-testing/unit-testing/authoring-example/#abstraction","text":"In our example, we have a single dependency: the file system. Rather than just abstracting the file system entirely, let us think about why we need the file system and abstract the concept rather than the implementation. In this case, we are using the File class to read from the config file, and the config contents. The abstraction concept here is some form or configuration reader that returns each line of the configuration in a string array. We could call it ConfigurationReader , and it has a single method, Read , which returns the contents. When creating abstractions, it can be good practice creating an interface for that abstraction, in languages that support it. In the example with C#, we can create an IConfigurationReader interface, and instead of just having a ConfigurationReader class we can be more specific and name if FileConfigurationReader to indicate that it reads from the file system: // IConfigurationReader.cs public interface IConfigurationReader { string [] Read (); } // FileConfigurationReader.cs public class FileConfigurationReader : IConfigurationReader { public string [] Read () { return File . ReadAllLines ( \".config\" ); } } Now that the file dependency has been abstracted away, we need to update our Configuration class's Initialize method to use the new abstraction instead of calling File.ReadAllLines directly: public void Initialize () { var configContents = new FileConfigurationReader (). Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } As you can see, we still have a dependency on the file system, but that dependency has been abstracted out. We will need to use other techniques to break the dependency completely.","title":"Abstraction"},{"location":"automated-testing/unit-testing/authoring-example/#dependency-injection","text":"In the previous section, we abstracted the file access into a FileConfigurationReader but we still had a dependency on the file system in our function. 
We can use dependency injection to inject the right reader into our Configuration class: using System.IO ; using System.Linq ; public class Configuration { private readonly IConfigurationReader configReader ; // Public getter properties from configuration object public string MyProperty { get ; private set ; } public Configuration ( IConfigurationReader reader ) { this . configReader = reader ; } public void Initialize () { var configContents = configReader . Read (); // Config is in the format: key=value var config = configContents . Select ( l => l . Split ( '=' )) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } } Above, a technique was used called Constructor Injection . This uses the object's constructor to set what our dependencies will be, which means whichever object creates the Configuration object will control which reader needs to get passed in. This is an example of \"inversion of control\", previously the Configuration object controlled the dependency, but instead we pushed up the control to whatever component creates this object. Note that we injected the interface IConfigurationReader and not the concrete class. This is what allows us to break the dependency; whereas originally we had a hard-coded dependency on the File class, now we only depend on an object that implements IConfigurationReader .","title":"Dependency Injection"},{"location":"automated-testing/unit-testing/authoring-example/#writing-our-first-unit-tests","text":"We started down this venture because we have a bug in the Configuration class that was not caught because we do not have unit tests. Let us write some unit tests that gives us full coverage of the Configuration class, including a test that tests the scenario described by the bug (if there are multiple empty lines in the configuration file, an IndexOutOfRangeException is being thrown). However, we still have one problem, we only have a single implementation of IConfigurationReader , and it uses the file system, meaning any unit tests we write will still have a dependency on the file system! Luckily since we used dependency injection, all we need to do is create an implementation of IConfigurationReader that does not depend on the file system. We could create a mock here, but instead let's create a concrete implementation of the interface which simply returns the passed in string[] - we can call it PassThroughConfigurationReader (for more details on why this approach may be better than mocking, see the page on mocking ) public class PassThroughConfigurationReader : IConfigurationReader { private readonly string [] contents ; public PassThroughConfigurationReader ( string [] contents ) { this . contents = contents ; } public string [] Read () { return this . contents ; } } This simple class will be used in our unit tests, so we can create different states without requiring lots of file access. Now that we have this in place, we can go ahead and write our unit tests, starting with the tests that describe the current behavior: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < KeyNotFoundException > (() => config . 
Initialize ()); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } }","title":"Writing our first unit tests"},{"location":"automated-testing/unit-testing/authoring-example/#fixing-the-bug","text":"All our current tests pass, and give us 100% coverage, however as evidenced by the bug, we must not be covering all possible inputs and outputs. In the case of the bug, multiple empty lines would cause an issue. Additionally, KeyNotFoundException is not a very friendly exception and is an implementation detail, not something that makes sense when designing the Configuration API. Let's add some more tests and align the tests with how we think the Configuration class should behave: public class ConfigurationTests { [Fact] public void Initialize_EmptyConfig_Throws () { var reader = new PassThroughConfigurationReader ( Array . Empty < string > ()); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MalformedLine_Throws () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty\" , }); var config = new Configuration ( reader ); Assert . Throws < InvalidOperationException > (() => config . Initialize ()); } [Fact] public void Initialize_MultipleEqualSigns_PropertyContainsNoEquals () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myval1=myval2\" , }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myval1=myval2\" , config . MyProperty ); } [Fact] public void Initialize_WithBlankLines_Ignores () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" , string . Empty , }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } [Fact] public void Initialize_CorrectFormat_SetsProperty () { var reader = new PassThroughConfigurationReader ( new [] { \"myproperty=myvalue\" }); var config = new Configuration ( reader ); config . Initialize (); Assert . Equal ( \"myvalue\" , config . MyProperty ); } } Now we have 4 failing tests and 1 passing test, but we have firmly established through the use of these tests how we expect callers to user the Configuration class and what is and isn't allowed as inputs. Now we just need to fix the Configuration class so that our tests pass: public void Initialize () { var configContents = configReader . Read (); if ( configContents . Length == 0 ) { throw new InvalidOperationException ( \"Empty config\" ); } // Config is in the format: key=value var config = configContents . Where ( l => ! string . IsNullOrWhiteSpace ( l )) . Select ( l => { var splitLine = l . Split ( '=' , 2 ); if ( splitLine . Length < 2 ) { throw new InvalidOperationException ( \"Malformed line\" ); } return splitLine ; }) . ToDictionary ( kv => kv [ 0 ], kv => kv [ 1 ]); // Assign all properties here this . MyProperty = config [ \"myproperty\" ]; } Now all our tests pass! We have fixed our bug, added unit tests to the Configuration class, and have much higher confidence in future changes.","title":"Fixing the Bug"},{"location":"automated-testing/unit-testing/authoring-example/#untestable-code","text":"As described in the abstraction section , not all code can be properly unit tested. 
In our case we have a single class that has 0% test coverage: FileConfigurationReader . This is expected; in this case we kept FileConfigurationReader as light as possible with no additional logic other than calling into the third-party dependency. FileConfigurationReader is an example of the facade design pattern .","title":"Untestable Code"},{"location":"automated-testing/unit-testing/authoring-example/#testable-design-and-future-improvements","text":"One of our original problems described in this example is that in the future we expect to load the configuration from a web API. By doing all the work of abstracting the way we load the configuration text and breaking the dependency on the file system, we have already done all the hard work to enable this future scenario! All that needs to be done next is to create a WebApiConfigurationReader implementation and use that the construct the Configuration object, and it should just work. That is one of the benefits of testable design, in the process of writing our tests in a safe way, a side effect of that is that we already have our dependencies that might change abstracted, and will require minimal changes to implement. Another added benefit is we have multiple possibilities opened by this testable design. For example, we can have a cascading configuration set up now using all 3 IConfigurationReader implementations, including the one we wrote only for our tests! We can first check if internet access is available and if so use WebApiConfigurationReader . If no internet is available, we can fall back to the local config file on the current system using FileConfigurationReader . If for some reason the config file does not exist, we can use the PassThroughConfigurationReader as a hard-coded default configuration somewhere in the code. We have full flexibility to do whatever we may need to do in the future!","title":"Testable Design and Future Improvements"},{"location":"automated-testing/unit-testing/custom-connector/","text":"Custom Connector Testing When developing Custom Connectors to put data into the Power Platform there are some strategies you can follow: Unit Testing There are several verifications one can do while developing custom connectors in order to be sure the code is working properly. There are two main ones: Validating the OpenAPI schema which the connector is defined. Validating if the schema also have all the information necessary for the certified connector process. (the later one is optional, but necessary in case you want to publish it as a certified connector). There are several tool to help validate the OpenAPI schema, a list of them are available in this link . A suggested tool would be swagger-cli . On the other hand, to validate if the custom connector you are building is correct to become a certified connector, use the paconn-cli , since it has a validate command that shows missing information from the custom connector definition.","title":"Custom Connector Testing"},{"location":"automated-testing/unit-testing/custom-connector/#custom-connector-testing","text":"When developing Custom Connectors to put data into the Power Platform there are some strategies you can follow:","title":"Custom Connector Testing"},{"location":"automated-testing/unit-testing/custom-connector/#unit-testing","text":"There are several verifications one can do while developing custom connectors in order to be sure the code is working properly. There are two main ones: Validating the OpenAPI schema which the connector is defined. 
Validating if the schema also has all the information necessary for the certified connector process. (the latter one is optional, but necessary in case you want to publish it as a certified connector). There are several tools to help validate the OpenAPI schema; a list of them is available in this link . A suggested tool would be swagger-cli . On the other hand, to validate if the custom connector you are building meets the requirements to become a certified connector, use the paconn-cli , since it has a validate command that shows missing information from the custom connector definition.","title":"Unit Testing"},{"location":"automated-testing/unit-testing/mocking/","text":"Mocking in Unit Tests One of the key components of writing unit tests is to remove the dependencies your system has and replace them with an implementation you control. The most common method people use as the replacement for the dependency is a mock, and mocking frameworks exist to help make this process easier. Many frameworks and articles use different terminology for the different kinds of test doubles. A test double is a generic term for any \"pretend\" object used in place of a real one. This term, as well as the others used in this page, follows the definitions provided by Martin Fowler . The most commonly used form of test double is the Mock, but there are many cases where Mocks are perhaps not the best choice and Fakes should be considered instead. Stubs A stub allows you to have predetermined behavior that substitutes real behavior. The dependency (abstract class or interface) is implemented as a stub with the logic expected by the client. Stubs can be useful when the clients of the stubs all expect the same set of responses, e.g. when you use a third party service. The key concept here is that stubs should never fail a unit or integration test where a mock can. Stubs do not require any sort of framework to run, but are usually supported by mocking frameworks to quickly build the stubs. Stubs are commonly used in combination with dependency injection frameworks or libraries, where the real object is replaced by a stub implementation. Stubs can be useful especially during early development of a system, but since nearly every test requires its own stubs (to test the different states), this quickly becomes repetitive and involves a lot of boilerplate code. Rarely will you find a codebase that uses only stubs for mocking; they are usually paired with other test doubles. # Python test example that creates an application # with a dependency injection framework and overrides # a service with a stub class StubTestCase ( TestBase ): def setUp ( self ) -> None : super ( StubTestCase , self ) . setUp () self . app . container . service_a . override ( StubService ()) def test_service ( self ): service = self . app . container . service_a () self . assertTrue ( isinstance ( service , StubService )) Upsides Do not require any framework, easy to set up. Downsides Can involve rewriting the same code many times, lots of boilerplate. Mocks Fowler describes mocks as pre-programmed objects with expectations which form a specification of the calls they are expected to receive.
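A small sketch of what such pre-programmed expectations look like in practice, assuming a Moq-style mocking API (as used in the assertion examples elsewhere on this page); the INotificationService and OrderProcessor types are hypothetical:

```csharp
// Sketch of behavioral verification with a Moq-style mock.
// INotificationService and OrderProcessor are illustrative types only.
using Moq;
using Xunit;

public interface INotificationService
{
    void Notify(string message);
}

public class OrderProcessor
{
    private readonly INotificationService notifications;

    public OrderProcessor(INotificationService notifications)
    {
        this.notifications = notifications;
    }

    public void Process(string orderId)
    {
        // ...business logic would go here...
        this.notifications.Notify($"Processed {orderId}");
    }
}

public class OrderProcessorTests
{
    [Fact]
    public void Process_SendsExactlyOneNotification()
    {
        var mockNotifications = new Mock<INotificationService>();
        var processor = new OrderProcessor(mockNotifications.Object);

        processor.Process("order-1");

        // This is the behavioral expectation: it verifies *how* the
        // dependency was called rather than what the system returned.
        mockNotifications.Verify(n => n.Notify(It.IsAny<string>()), Times.Once());
    }
}
```

The Verify call is the pre-programmed expectation; the discussion below unpacks the trade-offs of relying on this style of behavioral verification.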
In other words, mocks are a replacement object for the dependency that has certain expectations that are placed on it; those expectations might be things like validating a sub-method has been called a certain number of times or that arguments are passed down in a certain way. Mocking frameworks are abundant for every language, with some languages having mocks built into the unit test packages. They make writing unit tests easy and still encourage good unit testing practices. The main difference between a mock and most of the other test doubles is that mocks do behavioral verification , whereas other test doubles do state verification . With behavioral verification, you end up testing that the implementation of the system under test is as you expect, whereas with state verification the implementation is not tested, rather the inputs and the outputs to the system are validated. The major downside to behavioral verification is that it is tied to the implementation. One of the biggest advantages of writing unit tests is that when you make code changes you have confidence that if your unit tests continue to pass, that you are making a relatively safe change. If tests need to be updated every time because the behavior of the method has changed, then you lose that confidence because bugs could also be introduced into the test code. This also increases the development time and can be a source of frustration. For example, let's assume you have a method that you are testing that makes 5 web service calls. With mocks, one of your tests could be to check that those 5 web service calls were made. Sometime later the API is updated and only a single web service call needs to be made. Once the system code is changed, the unit test will fail because it expects 5 calls and not 1. The test needs to be updated, which results in lowered confidence in the change, as well as potentially introduces more areas for bugs to sneak in. Some would argue that in the example above, the unit test is not a good test anyway because it depends on the implementation, and that may be true; but one of the biggest problems with using mocks (and specifically mocking frameworks that allow these verifications), is that it encourages these types of tests to be written. By not using a mock framework that allows this, you never run the risk of writing tests that are validating the implementation. Upsides to Mocking Easy to write. Encourages testable design. Downsides to Mocking Behavioral testing can present problems with maintainability in unit test code. Usually requires a framework to be installed (or if no framework, lots of boilerplate code) Fakes Fake objects actually have working implementations, but usually take some shortcut which may make them not suitable for production. One of the common examples of using a Fake is an in-memory database - typically you want your database to be able to save data somewhere between application runs, but when writing unit tests if you have a fake implementation of your database APIs that are store all data in memory, you can use these for unit tests and not break abstraction as well as still keep your tests fast. Writing a fake does take more time than other test doubles, because they are full implementations, and can have their own suite of unit tests. In this sense though, they increase confidence in your code even more because your test double has been thoroughly tested for bugs before you even use it as a downstream dependency. 
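For instance, here is a minimal sketch of such an in-memory fake; the IKeyValueStore interface and InMemoryKeyValueStore class are hypothetical names used only for illustration:

```csharp
// Sketch of a fake: a working in-memory implementation that takes a
// shortcut (no persistence) but otherwise behaves like the real thing.
// IKeyValueStore and InMemoryKeyValueStore are illustrative names.
using System.Collections.Generic;

public interface IKeyValueStore
{
    void Put(string key, string value);
    bool TryGet(string key, out string value);
}

public class InMemoryKeyValueStore : IKeyValueStore
{
    private readonly Dictionary<string, string> data = new Dictionary<string, string>();

    public void Put(string key, string value)
    {
        // A real implementation might write to a database; the fake keeps
        // everything in memory so tests stay fast and isolated.
        this.data[key] = value;
    }

    public bool TryGet(string key, out string value)
    {
        return this.data.TryGetValue(key, out value);
    }
}
```

Tests can exercise real read and write behavior against this fake without ever touching a database.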
Similarly to mocks, fakes also promote testable design, but unlike mocks they do not require any frameworks to write. Writing a fake is as easy as writing any other implementation class. Fakes can be included in the test code only, but many times they end up being \"promoted\" to the product code, and in some cases can even start off in the product code since it is held to the same standard with full unit tests. Especially if writing a library or an API that other developers can use, providing a fake in the product code means those developers no longer need to write their own mock implementations, further increasing re-usability of code. Upsides to Fakes No framework needed, is just like any other implementation. Encourages testable design. Code can be \"promoted\" to product code, so it is not wasted effort. Downsides to Fakes Takes more time to implement. Best Practices To keep your mocking efficient, consider these best practices to make your code testable, save time and make your test assertions more meaningful. Dependency Injection If you don\u2019t keep testability in mind from the beginning, once you start writing your tests, you might realize you have to do a time-intensive refactor to make the code unit testable. A common problem that can lead to non-testable code in certain languages such as C# is not using dependency injection. Consider using dependency injection so that a mock can easily be injected into your Subject Under Test (SUT) during a unit test. More information on using dependency injection can be found here . Assertions When it comes to assertions in unit tests you want to make sure that you assert the right things, not necessarily lots of things. Some assertions can be inefficient and not give you the confidence you need in the test result. When you are mocking a client or configuration and your method passes the mock result directly as a return value without significant changes, consider not asserting on the return value. Because if you do, you are mainly asserting whether you set up the mock correctly. For a very simple example, look at this class: public class SearchController : ControllerBase { public ISearchClient SearchClient { get ; } public SearchController ( ISearchClient searchClient ) { SearchClient = searchClient ; } public String GetName ( string id ) { return this . SearchClient . GetName ( id ); } } When testing the GetName method, you can set up a mock search client to return a certain value. Then, it\u2019s easy to assert that the return value is, in fact, this value from the mock. mockSearchClient . Setup ( x => x . GetName ( id )) . ReturnsAsync ( \"myResult\" ); var result = searchController . GetName ( id ); Assert . Equal ( \"myResult\" , result . Value ); But now, your method could look like this, and the test would still pass: public String GetName ( string id ) { return \"myResult\" ; } Similarly, if you set up your mock wrong, the test would fail even though the logic inside the method is sound. For efficient assertions that will give you confidence in your SUT, make assertions on your logic, not mock return values. The simple example above doesn\u2019t have a lot of logic, but you want to make sure that it calls the search client to retrieve the result. For this, you can use the verify method to make sure the search client was called using the right parameters even though you don\u2019t care about the result. mockSearchClient . Verify ( mock => mock . GetName ( id ), Times . 
Once ()); This example is kept simple to visualize the principle of making meaningful assertions. In a real world application, your SUT will probably have more logic inside. Pieces of glue code that have as little logic as this example don't always have to be unit tested and might instead be covered by integration tests. If there is more logic and a unit test with mocking is required, you should apply this principle by verifying mock calls and making assertions on the part of the mock result that was modified by your SUT. Callbacks It can be time-consuming to set up mocks if you want to make sure they are being called with the right parameters, especially if the parameters are complex. To make your testing more efficient, consider using callbacks to make assertions on the parameters after a method was called. Often you don\u2019t care about all the parameters but only a few, or even only parts of them if the parameters are also objects. It\u2019s easy to make a small mistake in the creation of the parameter, like missing an attribute that the actual method sets, and then your mock won\u2019t be called, even though you might not care about this attribute at all. To avoid this, you can define only the most relevant parameters to differentiate between method calls and use an any -statement for the others. In this example, the method has a complex search options parameter which would take a lot of time to set up manually. Since you only care about 2 attributes in the search options, you use an any -statement and store the options in a callback for later assertions. var actualOptions = new SearchOptions (); mockSearchClient . Setup ( x => x . Search ( \"[This parameter is most relevant]\" , It . IsAny < SearchOptions > () ) ) . Returns ( mockResults ) . Callback < string , SearchOptions > (( query , searchOptions ) => { actualOptions = searchOptions ; } ); Since you want to test your method logic, you should care only about the parts of the parameter which are influenced by your SUT, in this example, let's say the search mode and the search query type. So, with the variable you stored in the callback, you can make assertions on only these two attributes. Assert . Equal ( SearchMode . All , actualOptions . SearchMode ); Assert . Equal ( SearchQueryType . Full , actualOptions . QueryType ); This makes the test more explicit since it shows which parts of the logic you care about. It\u2019s also more efficient since you don\u2019t have to spend a lot of time setting up the parameters for the mock. Conclusion Using test doubles in unit tests is an essential part of having a healthy test suite. When looking at mocking frameworks and using test doubles, it is important to consider the future implications of integrating with a mocking framework from the start. Sometimes certain features of mocking frameworks seem essential, but usually that is a sign that the code itself is not abstracted enough if it requires a framework. 
If possible, starting without a mocking framework and attempting to create fake implementations will lead to a more healthy code base, but when that is not possible the onus is on the technical leaders of the team to find cases where mocks may be overused, rely too much on implementation details, or end up not testing the right things.","title":"Mocking in Unit Tests"},{"location":"automated-testing/unit-testing/mocking/#mocking-in-unit-tests","text":"One of the key components of writing unit tests is to remove the dependencies your system has and replacing it with an implementation you control. The most common method people use as the replacement for the dependency is a mock, and mocking frameworks exist to help make this process easier. Many frameworks and articles use different meanings for the differences between test doubles. A test double is a generic term for any \"pretend\" object used in place of a real one. This term, as well as others used in this page are the definitions provided by Martin Fowler . The most commonly used form of test double is Mocks, but there are many cases where Mocks perhaps are not the best choice and Fakes should be considered instead.","title":"Mocking in Unit Tests"},{"location":"automated-testing/unit-testing/mocking/#stubs","text":"Stub allows you to have predetermined behavior that substitutes real behavior. The dependency (abstract class or interface) is implemented as a stub with a logic as expected by the client. Stubs can be useful when the clients of the stubs all expect the same set of responses, e.g. you use a third party service. The key concept here is that stubs should never fail a unit or integration test where a mock can. Stubs do not require any sort of framework to run, but are usually supported by mocking frameworks to quickly build the stubs. Stubs are commonly used in combination with a dependency injection frameworks or libraries, where the real object is replaced by a stub implementation. Stubs can be useful especially during early development of a system, but since nearly every test requires its own stubs (to test the different states), this quickly becomes repetitive and involves a lot of boilerplate code. Rarely will you find a codebase that uses only stubs for mocking, they are usually paired with other test doubles. Stubs do not require any sort of framework to run, but are usually supported by mocking frameworks to quickly build the stubs. # Python test example, that creates an application # with a dependency injection framework an overrides # a service with a stub class StubTestCase ( TestBase ): def setUp ( self ) -> None : super ( StubTestCase , self ) . setUp () self . app . container . service_a . override ( StubService ()) def test_service (): service = self . app . container . service_a () self . assertTrue ( isinstance ( service , StubService ))","title":"Stubs"},{"location":"automated-testing/unit-testing/mocking/#upsides","text":"Do not require any framework, easy to set up.","title":"Upsides"},{"location":"automated-testing/unit-testing/mocking/#downsides","text":"Can involve rewriting the same code many times, lots of boilerplate.","title":"Downsides"},{"location":"automated-testing/unit-testing/mocking/#mocks","text":"Fowler describes mocks as pre-programmed objects with expectations which form a specification of the calls they are expected to receive. 
In other words, mocks are a replacement object for the dependency that has certain expectations that are placed on it; those expectations might be things like validating a sub-method has been called a certain number of times or that arguments are passed down in a certain way. Mocking frameworks are abundant for every language, with some languages having mocks built into the unit test packages. They make writing unit tests easy and still encourage good unit testing practices. The main difference between a mock and most of the other test doubles is that mocks do behavioral verification , whereas other test doubles do state verification . With behavioral verification, you end up testing that the implementation of the system under test is as you expect, whereas with state verification the implementation is not tested, rather the inputs and the outputs to the system are validated. The major downside to behavioral verification is that it is tied to the implementation. One of the biggest advantages of writing unit tests is that when you make code changes you have confidence that if your unit tests continue to pass, that you are making a relatively safe change. If tests need to be updated every time because the behavior of the method has changed, then you lose that confidence because bugs could also be introduced into the test code. This also increases the development time and can be a source of frustration. For example, let's assume you have a method that you are testing that makes 5 web service calls. With mocks, one of your tests could be to check that those 5 web service calls were made. Sometime later the API is updated and only a single web service call needs to be made. Once the system code is changed, the unit test will fail because it expects 5 calls and not 1. The test needs to be updated, which results in lowered confidence in the change, as well as potentially introduces more areas for bugs to sneak in. Some would argue that in the example above, the unit test is not a good test anyway because it depends on the implementation, and that may be true; but one of the biggest problems with using mocks (and specifically mocking frameworks that allow these verifications), is that it encourages these types of tests to be written. By not using a mock framework that allows this, you never run the risk of writing tests that are validating the implementation.","title":"Mocks"},{"location":"automated-testing/unit-testing/mocking/#upsides-to-mocking","text":"Easy to write. Encourages testable design.","title":"Upsides to Mocking"},{"location":"automated-testing/unit-testing/mocking/#downsides-to-mocking","text":"Behavioral testing can present problems with maintainability in unit test code. Usually requires a framework to be installed (or if no framework, lots of boilerplate code)","title":"Downsides to Mocking"},{"location":"automated-testing/unit-testing/mocking/#fakes","text":"Fake objects actually have working implementations, but usually take some shortcut which may make them not suitable for production. One of the common examples of using a Fake is an in-memory database - typically you want your database to be able to save data somewhere between application runs, but when writing unit tests if you have a fake implementation of your database APIs that are store all data in memory, you can use these for unit tests and not break abstraction as well as still keep your tests fast. 
Writing a fake does take more time than other test doubles, because they are full implementations, and can have their own suite of unit tests. In this sense though, they increase confidence in your code even more because your test double has been thoroughly tested for bugs before you even use it as a downstream dependency. Similarly to mocks, fakes also promote testable design, but unlike mocks they do not require any frameworks to write. Writing a fake is as easy as writing any other implementation class. Fakes can be included in the test code only, but many times they end up being \"promoted\" to the product code, and in some cases can even start off in the product code since it is held to the same standard with full unit tests. Especially if writing a library or an API that other developers can use, providing a fake in the product code means those developers no longer need to write their own mock implementations, further increasing re-usability of code.","title":"Fakes"},{"location":"automated-testing/unit-testing/mocking/#upsides-to-fakes","text":"No framework needed, is just like any other implementation. Encourages testable design. Code can be \"promoted\" to product code, so it is not wasted effort.","title":"Upsides to Fakes"},{"location":"automated-testing/unit-testing/mocking/#downsides-to-fakes","text":"Takes more time to implement.","title":"Downsides to Fakes"},{"location":"automated-testing/unit-testing/mocking/#best-practices","text":"To keep your mocking efficient, consider these best practices to make your code testable, save time and make your test assertions more meaningful.","title":"Best Practices"},{"location":"automated-testing/unit-testing/mocking/#dependency-injection","text":"If you don\u2019t keep testability in mind from the beginning, once you start writing your tests, you might realize you have to do a time-intensive refactor to make the code unit testable. A common problem that can lead to non-testable code in certain languages such as C# is not using dependency injection. Consider using dependency injection so that a mock can easily be injected into your Subject Under Test (SUT) during a unit test. More information on using dependency injection can be found here .","title":"Dependency Injection"},{"location":"automated-testing/unit-testing/mocking/#assertions","text":"When it comes to assertions in unit tests you want to make sure that you assert the right things, not necessarily lots of things. Some assertions can be inefficient and not give you the confidence you need in the test result. When you are mocking a client or configuration and your method passes the mock result directly as a return value without significant changes, consider not asserting on the return value. Because if you do, you are mainly asserting whether you set up the mock correctly. For a very simple example, look at this class: public class SearchController : ControllerBase { public ISearchClient SearchClient { get ; } public SearchController ( ISearchClient searchClient ) { SearchClient = searchClient ; } public String GetName ( string id ) { return this . SearchClient . GetName ( id ); } } When testing the GetName method, you can set up a mock search client to return a certain value. Then, it\u2019s easy to assert that the return value is, in fact, this value from the mock. mockSearchClient . Setup ( x => x . GetName ( id )) . ReturnsAsync ( \"myResult\" ); var result = searchController . GetName ( id ); Assert . Equal ( \"myResult\" , result . 
Value ); But now, your method could look like this, and the test would still pass: public String GetName ( string id ) { return \"myResult\" ; } Similarly, if you set up your mock wrong, the test would fail even though the logic inside the method is sound. For efficient assertions that will give you confidence in your SUT, make assertions on your logic, not mock return values. The simple example above doesn\u2019t have a lot of logic, but you want to make sure that it calls the search client to retrieve the result. For this, you can use the verify method to make sure the search client was called using the right parameters even though you don\u2019t care about the result. mockSearchClient . Verify ( mock => mock . GetName ( id ), Times . Once ()); This example is kept simple to visualize the principle of making meaningful assertions. In a real world application, your SUT will probably have more logic inside. Pieces of glue code that have as little logic as this example don't always have to be unit tested and might instead be covered by integration tests. If there is more logic and a unit test with mocking is required, you should apply this principle by verifying mock calls and making assertions on the part of the mock result that was modified by your SUT.","title":"Assertions"},{"location":"automated-testing/unit-testing/mocking/#callbacks","text":"It can be time-consuming to set up mocks if you want to make sure they are being called with the right parameters, especially if the parameters are complex. To make your testing more efficient, consider using callbacks to make assertions on the parameters after a method was called. Often you don\u2019t care about all the parameters but only a few, or even only parts of them if the parameters are also objects. It\u2019s easy to make a small mistake in the creation of the parameter, like missing an attribute that the actual method sets, and then your mock won\u2019t be called, even though you might not care about this attribute at all. To avoid this, you can define only the most relevant parameters to differentiate between method calls and use an any -statement for the others. In this example, the method has a complex search options parameter which would take a lot of time to set up manually. Since you only care about 2 attributes in the search options, you use an any -statement and store the options in a callback for later assertions. var actualOptions = new SearchOptions (); mockSearchClient . Setup ( x => x . Search ( \"[This parameter is most relevant]\" , It . IsAny < SearchOptions > () ) ) . Returns ( mockResults ) . Callback < string , SearchOptions > (( query , searchOptions ) => { actualOptions = searchOptions ; } ); Since you want to test your method logic, you should care only about the parts of the parameter which are influenced by your SUT, in this example, let's say the search mode and the search query type. So, with the variable you stored in the callback, you can make assertions on only these two attributes. Assert . Equal ( SearchMode . All , actualOptions . SearchMode ); Assert . Equal ( SearchQueryType . Full , actualOptions . QueryType ); This makes the test more explicit since it shows which parts of the logic you care about. It\u2019s also more efficient since you don\u2019t have to spend a lot of time setting up the parameters for the mock.","title":"Callbacks"},{"location":"automated-testing/unit-testing/mocking/#conclusion","text":"Using test doubles in unit tests is an essential part of having a healthy test suite. 
When looking at mocking frameworks and using test doubles, it is important to consider the future implications of integrating with a mocking framework from the start. Sometimes certain features of mocking frameworks seem essential, but usually that is a sign that the code itself is not abstracted enough if it requires a framework. If possible, starting without a mocking framework and attempting to create fake implementations will lead to a more healthy code base, but when that is not possible the onus is on the technical leaders of the team to find cases where mocks may be overused, rely too much on implementation details, or end up not testing the right things.","title":"Conclusion"},{"location":"automated-testing/unit-testing/tdd-example/","text":"Test-Driven Development Example With this method, rather than writing all your tests up front, you write one test at a time and then switch to write the system code that would make that test pass. It's important to write the bare minimum of code necessary even if it is not actually \"correct\". Once the test passes you can refactor the code to make it maybe make more sense, but again the logic should be simple. As you write more tests, the logic gets more and more complex, but you can continue to make the minimal changes to the system code with confidence because all code that was written is covered. As an example, let's assume we are trying to write a new function that validates a string is a valid password format. The password format should be a string larger than 8 characters containing at least one number. We start with the simplest possible test; one of the easiest ways to do this is to first write tests that validate inputs into the function: // Tests.cs public class Tests { [Fact] public void ValidatePassword_NullInput_Throws () { var s = new MyClass (); Assert . Throws < ArgumentNullException > (() => s . ValidatePassword ( null )); } } // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { return false ; } } If we run this code, the test will fail as no exception was thrown since our code in ValidateString is just a stub. This is ok! This is the \"Red\" part of Red-Green-Refactor. Now we want to move onto the \"Green\" part - making the minimal change required to make this test pass: // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { throw new ArgumentNullException ( nameof ( input )); } } Our tests pass, but this function doesn't really work, it will always throw the exception. That's ok! As we continue to write tests we will slowly add the logic for this function, and it will build on itself, all while guaranteeing our tests continue to pass. We will skip the \"Refactor\" stage at this point because there isn't anything to refactor. Next let's add a test that checks that the function returns false if the password is less than size 8: [Fact] public void ValidatePassword_SmallSize_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abc\" )); } This test will pass as it still only throws an ArgumentNullException , but again, that is an expected failure. Fixing our function should see it pass: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } return false ; } Finally, some code that looks real! Note how it wasn't the test that checked for null that had us add the if statement for the null-check, but rather the subsequent test which unlocked a whole new branch. 
By adding that if statement, we made the bare minimum change necessary in order to get both tests to pass, but we still have work to do. In general, working in the order of adding a negative test first before adding a positive test will ensure that both cases get covered by the code in a way that can be tested. Red-Green-Refactor makes that process super easy by requiring the bare minimum change - since we only want to make the bare minimum changes, we simply return false here, knowing full well that we will be adding logic later that will expand on this. Speaking of which, let's add the positive test now: [Fact] public void ValidatePassword_RightSize_ReturnsTrue () { var s = new MyClass (); Assert . True ( s . ValidatePassword ( \"abcdefgh1\" )); } Again, this test will fail at the start. One thing to note here is that it's important that we try to make our tests resilient to future changes. When we write the code under test, we act very naively, only trying to make the current tests we have pass; when you write tests though, you want to ensure that everything you are doing is a valid case in the future. In this case, we could have written the input string as abcdefgh and when we eventually write the function it would pass, but later when we add tests that validate the function has the rest of the proper inputs it would fail incorrectly. Anyway, the next code change is: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length > 8 ) { return true ; } return false ; } Here we now have a passing test! However, the logic doesn't actually make much sense. We did the bare minimum change which was adding a new condition that passed for longer strings, but thinking forward we know this won't work as soon as we add additional validations. So let's use our first \"Refactor\" step in the Red-Green-Refactor flow! public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } return true ; } That looks better. Note how, for every input our current tests cover, inverting the if-statement does not change what the function returns. This is an important part of the refactor flow, maintaining the logic by doing provably safe refactors, usually through the use of tooling and automated refactors from your IDE. Finally, we have one last requirement for our ValidatePassword method and that is that it needs to check that there is a number in the password. Let's again start with the negative test and validate that, for a string of valid length, the function returns false if it does not contain a number: [Fact] public void ValidatePassword_ValidLength_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abcdefghij\" )); } Of course the test fails as it is only checking length requirements. Let's fix the method to check for numbers: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } if ( ! input . Any ( char . IsDigit )) { return false ; } return true ; } Here we use a handy LINQ method to check if any of the char s in the string are a digit, and if not, return false. Tests now pass, and we can refactor. 
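As an aside that is not part of the original walkthrough: once several of these input/expected-result pairs exist, xUnit's [Theory] attribute could consolidate them into a single data-driven test, for example:

// Hypothetical consolidation of the earlier cases into one data-driven test.
[Theory]
[InlineData(\"abc\", false)]        // too short
[InlineData(\"abcdefghij\", false)] // long enough, but no digit
[InlineData(\"abcdefgh1\", true)]   // valid length and contains a digit
public void ValidatePassword_ReturnsExpectedResult(string input, bool expected)
{
    var s = new MyClass();
    Assert.Equal(expected, s.ValidatePassword(input));
}

With or without that consolidation, the refactoring step below works the same way.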
For readability, why not combine the if statements: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if (( input . Length < 8 ) || ( ! input . Any ( char . IsDigit ))) { return false ; } return true ; } As we refactor this code, we feel 100% confident in the changes we made as we have 100% test coverage which tests both positive and negative scenarios. In this case we actually already have a method that tests the positive case, so our function is done! Now that our code is completely tested we can make all sorts of changes and still have confidence that it works. For example, if we wanted to change the implementation of the method to use regex, all of our tests would still pass and still be valid. That is it! We finished writing our function, we have 100% test coverage, and if we had done something a little more complex, we are guaranteed that whatever we designed is already testable since the tests were written first!","title":"Test-Driven Development Example"},{"location":"automated-testing/unit-testing/tdd-example/#test-driven-development-example","text":"With this method, rather than writing all your tests up front, you write one test at a time and then switch to write the system code that would make that test pass. It's important to write the bare minimum of code necessary even if it is not actually \"correct\". Once the test passes you can refactor the code to make it maybe make more sense, but again the logic should be simple. As you write more tests, the logic gets more and more complex, but you can continue to make the minimal changes to the system code with confidence because all code that was written is covered. As an example, let's assume we are trying to write a new function that validates a string is a valid password format. The password format should be a string larger than 8 characters containing at least one number. We start with the simplest possible test; one of the easiest ways to do this is to first write tests that validate inputs into the function: // Tests.cs public class Tests { [Fact] public void ValidatePassword_NullInput_Throws () { var s = new MyClass (); Assert . Throws < ArgumentNullException > (() => s . ValidatePassword ( null )); } } // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { return false ; } } If we run this code, the test will fail as no exception was thrown since our code in ValidateString is just a stub. This is ok! This is the \"Red\" part of Red-Green-Refactor. Now we want to move onto the \"Green\" part - making the minimal change required to make this test pass: // MyClass.cs public class MyClass { public bool ValidatePassword ( string input ) { throw new ArgumentNullException ( nameof ( input )); } } Our tests pass, but this function doesn't really work, it will always throw the exception. That's ok! As we continue to write tests we will slowly add the logic for this function, and it will build on itself, all while guaranteeing our tests continue to pass. We will skip the \"Refactor\" stage at this point because there isn't anything to refactor. Next let's add a test that checks that the function returns false if the password is less than size 8: [Fact] public void ValidatePassword_SmallSize_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abc\" )); } This test will pass as it still only throws an ArgumentNullException , but again, that is an expected failure. 
Fixing our function should see it pass: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } return false ; } Finally, some code that looks real! Note how it wasn't the test that checked for null that had us add the if statement for the null-check, but rather the subsequent test which unlocked a whole new branch. By adding that if statement, we made the bare minimum change necessary in order to get both tests to pass, but we still have work to do. In general, working in the order of adding a negative test first before adding a positive test will ensure that both cases get covered by the code in a way that can get tests. Red-Green-Refactor makes that process super easy by requiring the bare minimum change - since we only want to make the bare minimum changes, we just simply return false here, knowing full well that we will be adding logic later that will expand on this. Speaking of which, let's add the positive test now: [Fact] public void ValidatePassword_RightSize_ReturnsTrue () { var s = new MyClass (); Assert . True ( s . ValidatePassword ( \"abcdefgh1\" )); } Again, this test will fail at the start. One thing to note here if that its important that we try and make our tests resilient to future changes. When we write the code under test, we act very naively, only trying to make the current tests we have pass; when you write tests though, you want to ensure that everything you are doing is a valid case in the future. In this case, we could have written the input string as abcdefgh and when we eventually write the function it would pass, but later when we add tests that validate the function has the rest of the proper inputs it would fail incorrectly. Anyways, the next code change is: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length > 8 ) { return true ; } return false ; } Here we now have a passing test! However, the logic doesn't actually make much sense. We did the bare minimum change which was adding a new condition that passed for longer strings, but thinking forward we know this won't work as soon as we add additional validations. So let's use our first \"Refactor\" step in the Red-Green-Refactor flow! public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } return true ; } That looks better. Note how from a functional perspective, inverting the if-statement does not change what the function returns. This is an important part of the refactor flow, maintaining the logic by doing provably safe refactors, usually through the use of tooling and automated refactors from your IDE. Finally, we have one last requirement for our ValidatePassword method and that is that it needs to check that there is a number in the password. Let's again start with the negative test and validate that with a string with the valid length that the function returns false if we do not pass in a number: [Fact] public void ValidatePassword_ValidLength_ReturnsFalse () { var s = new MyClass (); Assert . False ( s . ValidatePassword ( \"abcdefghij\" )); } Of course the test fails as it is only checking length requirements. Let's fix the method to check for numbers: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if ( input . Length < 8 ) { return false ; } if ( ! 
input . Any ( char . IsDigit )) { return false ; } return true ; } Here we use a handy LINQ method to check if any of the char s in the string are a digit, and if not, return false. Tests now pass, and we can refactor. For readability, why not combine the if statements: public bool ValidatePassword ( string input ) { if ( input == null ) { throw new ArgumentNullException ( nameof ( input )); } if (( input . Length < 8 ) || ( ! input . Any ( char . IsDigit ))) { return false ; } return true ; } As we refactor this code, we feel 100% confident in the changes we made as we have 100% test coverage which tests both positive and negative scenarios. In this case we actually already have a method that tests the positive case, so our function is done! Now that our code is completely tested we can make all sorts of changes and still have confidence that it works. For example, if we wanted to change the implementation of the method to use regex, all of our tests would still pass and still be valid. That is it! We finished writing our function, we have 100% test coverage, and if we had done something a little more complex, we are guaranteed that whatever we designed is already testable since the tests were written first!","title":"Test-Driven Development Example"},{"location":"automated-testing/unit-testing/why-unit-tests/","text":"Why Unit Tests It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we bother writing them? Reduce Costs There is no question that the later a bug is found, the more expensive it is to fix; especially so if the bug makes it into production. A 2008 research study by IBM estimates that a bug caught in production could cost 6 times as much as if it was caught during implementation. Increase Developer Confidence Many changes that developers make are not big features or something that requires an entire testing suite. A strong unit test suite helps increase the confidence of the developer that their change is not going to cause any downstream bugs. Having unit tests also helps with making safe, mechanical refactors that are provably safe; using things like refactoring tools to do mechanical refactoring and running unit tests that cover the refactored code should be enough to increase confidence in the commit. Speed Up Development Unit tests take time to write, but they also speed up development? While this may seem like an oxymoron, it is one of the strengths of a unit testing suite - over time it continues to grow and evolve until the tests become an essential part of the developer workflow. If the only testing available to a developer is a long-running system test, integration tests that require a deployment, or manual testing, it will increase the amount of time taken to write a feature. These types of tests should be a part of the \"Outer loop\"; tests that may take some time to run and validate more than just the code you are writing. Usually these types of outer loop tests get run at the PR stage or even later during merges into branches. The Developer Inner Loop is the process that developers go through as they are authoring code. This varies from developer to developer and language to language but typically is something like code -> build -> run -> repeat. When unit tests are inserted into the inner loop, developers can get early feedback and results from the code they are writing. 
Since unit tests execute really quickly, running tests shouldn't be seen as a barrier to entry for this loop. Tooling such as Visual Studio Live Unit Testing also help to shorten the inner loop even more. Documentation as Code Writing unit tests is a great way to show how the units of code you are writing are supposed to be used. In some ways, unit tests are better than any documentation or samples because they are (or at least should be) executed with every build so there is confidence that they are not out of date. Unit tests also should be so simple that they are easy to follow.","title":"Why Unit Tests"},{"location":"automated-testing/unit-testing/why-unit-tests/#why-unit-tests","text":"It is no secret that writing unit tests is hard, and even harder to write well. Writing unit tests also increases the development time for every feature. So why should we bother writing them?","title":"Why Unit Tests"},{"location":"automated-testing/unit-testing/why-unit-tests/#reduce-costs","text":"There is no question that the later a bug is found, the more expensive it is to fix; especially so if the bug makes it into production. A 2008 research study by IBM estimates that a bug caught in production could cost 6 times as much as if it was caught during implementation.","title":"Reduce Costs"},{"location":"automated-testing/unit-testing/why-unit-tests/#increase-developer-confidence","text":"Many changes that developers make are not big features or something that requires an entire testing suite. A strong unit test suite helps increase the confidence of the developer that their change is not going to cause any downstream bugs. Having unit tests also helps with making safe, mechanical refactors that are provably safe; using things like refactoring tools to do mechanical refactoring and running unit tests that cover the refactored code should be enough to increase confidence in the commit.","title":"Increase Developer Confidence"},{"location":"automated-testing/unit-testing/why-unit-tests/#speed-up-development","text":"Unit tests take time to write, but they also speed up development? While this may seem like an oxymoron, it is one of the strengths of a unit testing suite - over time it continues to grow and evolve until the tests become an essential part of the developer workflow. If the only testing available to a developer is a long-running system test, integration tests that require a deployment, or manual testing, it will increase the amount of time taken to write a feature. These types of tests should be a part of the \"Outer loop\"; tests that may take some time to run and validate more than just the code you are writing. Usually these types of outer loop tests get run at the PR stage or even later during merges into branches. The Developer Inner Loop is the process that developers go through as they are authoring code. This varies from developer to developer and language to language but typically is something like code -> build -> run -> repeat. When unit tests are inserted into the inner loop, developers can get early feedback and results from the code they are writing. Since unit tests execute really quickly, running tests shouldn't be seen as a barrier to entry for this loop. Tooling such as Visual Studio Live Unit Testing also help to shorten the inner loop even more.","title":"Speed Up Development"},{"location":"automated-testing/unit-testing/why-unit-tests/#documentation-as-code","text":"Writing unit tests is a great way to show how the units of code you are writing are supposed to be used. 
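For example, a small test along the lines of the following sketch (reusing the ValidatePassword method from the TDD example in this playbook) doubles as a usage sample for the next reader:

[Fact]
public void ValidatePassword_ShowsExpectedUsage()
{
    var validator = new MyClass();

    // A valid password is longer than 8 characters and contains at least one number.
    Assert.True(validator.ValidatePassword(\"s3curePassword\"));

    // Too short, or missing a digit, is rejected.
    Assert.False(validator.ValidatePassword(\"short1\"));
    Assert.False(validator.ValidatePassword(\"longbutnodigits\"));
}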
In some ways, unit tests are better than any documentation or samples because they are (or at least should be) executed with every build so there is confidence that they are not out of date. Unit tests also should be so simple that they are easy to follow.","title":"Documentation as Code"},{"location":"code-reviews/","text":"Code Reviews Developers working on projects should conduct peer code reviews on every pull request (or check-in to a shared branch). Goals Code review is a way to have a conversation about the code where participants will: Improve code quality by identifying and removing defects before they can be introduced into shared code branches. Learn and grow by having others review the code, we get exposed to unfamiliar design patterns or languages among other topics, and even break some bad habits. Shared understanding between the developers over the project's code. Resources Code review tools Google's Engineering Practices documentation: How to do a code review Best Kept Secrets of Peer Code Review","title":"Code Reviews"},{"location":"code-reviews/#code-reviews","text":"Developers working on projects should conduct peer code reviews on every pull request (or check-in to a shared branch).","title":"Code Reviews"},{"location":"code-reviews/#goals","text":"Code review is a way to have a conversation about the code where participants will: Improve code quality by identifying and removing defects before they can be introduced into shared code branches. Learn and grow by having others review the code, we get exposed to unfamiliar design patterns or languages among other topics, and even break some bad habits. Shared understanding between the developers over the project's code.","title":"Goals"},{"location":"code-reviews/#resources","text":"Code review tools Google's Engineering Practices documentation: How to do a code review Best Kept Secrets of Peer Code Review","title":"Resources"},{"location":"code-reviews/faq/","text":"FAQ This is a list of questions / frequently occurring issues when working with code reviews and answers how you can possibly tackle them. What Makes a Code Review Different from a PR? A pull request (PR) is a way to notify a task is finished and ready to be merged into the main working branch (source of truth). A code review is having someone go over the code in a PR and validate it before it is merged, but, in general, code reviews can take place outside PRs too. Code Review Pull Request Source code focused Intended to enhance and enable code reviews. Includes both source code but can have a broader scope (e.g., docs, integration tests, compiles) Intended for early feedback before submitting a PR Not intended for early feedback . Created when author is ready to merge Usually a synchronous review with faster feedback cycles (draft PRs as an exception). Examples: scheduled meetings, over-the-shoulder review, pair programming Usually a tool assisted asynchronous review but can be elevated to a synchronous meeting when needed Why do we Need Code Reviews? Our peer code reviews are structured around best practices, to find specific kinds of errors. Much like you would still run a linter over mobbed code, you would still ask someone to make the last pass to make sure the code conforms to expected standards and avoids common pitfalls. PRs are Too Large, How can we Fix This? Make sure you size the work items into small clear chunks, so the reviewer will be able to understand the code on their own. 
The team is instructed to commit early, before the full product backlog item / user story is complete, but rather when an individual item is done. If the work would result in an incomplete feature, make sure it can be turned off, until the full feature is delivered. More information can be found in Pull Requests - Size Guidance . How can we Expedite Code Reviews? Slow code reviews might cause delays in delivering features and cause frustration amongst team members. Possible Actions you can Take Add a rule for PR turnaround time to your work agreement. Set up a slot after the standup to go through pending PRs and assign the ones that are inactive. Dedicate a PR review manager who will be responsible to keep things flowing by assigning or notifying people when PR got stale. Use tools to better indicate stale reviews - Customize ADO - Task Boards . Which Tools can I use to Review a Complex PR? Checkout the Tools for help on how to perform reviews out of Visual Studio or Visual Studio Code. How can we Enforce the Code Review Policies? By configuring Branch Policies , you can easily enforce code reviews rules. We Pair or Mob. How Should This Reflect in our Code Reviews? There are two ways to perform a code review: Pair - Someone outside the pair should perform the code review. One of the other major benefits of code reviews is spreading knowledge about the code base to other members of the team that don't usually work in the part of the codebase under review. Mob - A member of the mob who spent less (or no) time at the keyboard should perform the code review.","title":"FAQ"},{"location":"code-reviews/faq/#faq","text":"This is a list of questions / frequently occurring issues when working with code reviews and answers how you can possibly tackle them.","title":"FAQ"},{"location":"code-reviews/faq/#what-makes-a-code-review-different-from-a-pr","text":"A pull request (PR) is a way to notify a task is finished and ready to be merged into the main working branch (source of truth). A code review is having someone go over the code in a PR and validate it before it is merged, but, in general, code reviews can take place outside PRs too. Code Review Pull Request Source code focused Intended to enhance and enable code reviews. Includes both source code but can have a broader scope (e.g., docs, integration tests, compiles) Intended for early feedback before submitting a PR Not intended for early feedback . Created when author is ready to merge Usually a synchronous review with faster feedback cycles (draft PRs as an exception). Examples: scheduled meetings, over-the-shoulder review, pair programming Usually a tool assisted asynchronous review but can be elevated to a synchronous meeting when needed","title":"What Makes a Code Review Different from a PR?"},{"location":"code-reviews/faq/#why-do-we-need-code-reviews","text":"Our peer code reviews are structured around best practices, to find specific kinds of errors. Much like you would still run a linter over mobbed code, you would still ask someone to make the last pass to make sure the code conforms to expected standards and avoids common pitfalls.","title":"Why do we Need Code Reviews?"},{"location":"code-reviews/faq/#prs-are-too-large-how-can-we-fix-this","text":"Make sure you size the work items into small clear chunks, so the reviewer will be able to understand the code on their own. The team is instructed to commit early, before the full product backlog item / user story is complete, but rather when an individual item is done. 
If the work would result in an incomplete feature, make sure it can be turned off, until the full feature is delivered. More information can be found in Pull Requests - Size Guidance .","title":"PRs are Too Large, How can we Fix This?"},{"location":"code-reviews/faq/#how-can-we-expedite-code-reviews","text":"Slow code reviews might cause delays in delivering features and cause frustration amongst team members.","title":"How can we Expedite Code Reviews?"},{"location":"code-reviews/faq/#possible-actions-you-can-take","text":"Add a rule for PR turnaround time to your work agreement. Set up a slot after the standup to go through pending PRs and assign the ones that are inactive. Dedicate a PR review manager who will be responsible to keep things flowing by assigning or notifying people when PR got stale. Use tools to better indicate stale reviews - Customize ADO - Task Boards .","title":"Possible Actions you can Take"},{"location":"code-reviews/faq/#which-tools-can-i-use-to-review-a-complex-pr","text":"Checkout the Tools for help on how to perform reviews out of Visual Studio or Visual Studio Code.","title":"Which Tools can I use to Review a Complex PR?"},{"location":"code-reviews/faq/#how-can-we-enforce-the-code-review-policies","text":"By configuring Branch Policies , you can easily enforce code reviews rules.","title":"How can we Enforce the Code Review Policies?"},{"location":"code-reviews/faq/#we-pair-or-mob-how-should-this-reflect-in-our-code-reviews","text":"There are two ways to perform a code review: Pair - Someone outside the pair should perform the code review. One of the other major benefits of code reviews is spreading knowledge about the code base to other members of the team that don't usually work in the part of the codebase under review. Mob - A member of the mob who spent less (or no) time at the keyboard should perform the code review.","title":"We Pair or Mob. How Should This Reflect in our Code Reviews?"},{"location":"code-reviews/inclusion-in-code-review/","text":"Inclusion in Code Review Below are some points which emphasize why inclusivity in code reviews is important: Code reviews are an important part of our job as software professionals. In ISE we work with cross cultural teams from across the globe. How we communicate affects team morale. Inclusive code reviews welcome new developers and make them comfortable with the team. Rude or personal attacks doing code reviews alienate - people can unknowingly make rude comments when reviewing pull requests (PRs). Types and Examples of Non-Inclusive Code Review Behavior Inequitable review assignments. Example: Assigning most reviews to few people and dismissing some members of the team altogether. Negative interpersonal interactions. Example: Long arguments over subjective topics such as code style. Biased decision making. Example: Comments about the developer and not the code. Assuming code from developer X will always be good and hence not reviewing it properly and vice versa. Examples of Inclusive Code Reviews Anyone and everyone in the team should be assigned PRs to review. Reviewer should be clear about what is an opinion, their personal preference, best practice or a fact. Arguments over personal preferences and opinions are mostly avoidable. Using inclusive language and tone in the code review comments. For example, being suggestive rather being prescriptive in the review comments is a good way to get the point across the table. 
It's a good practice for the author of a PR to thank the reviewer for the review when they have contributed to improving the code or you have learnt something new. Using the sandwich method for recommending a code change to a new developer or a new customer: sandwich the suggestion between two compliments. For example: \"Great work so far, but I would recommend a few changes here. Btw, I loved the use of XYZ here, nice job!\" Guidelines for the Author Aim to write code that is easy to read, review and maintain. It\u2019s important to ensure that whoever is looking at the code, whether that be the reviewer or a future engineer, can understand the motivations and how your code achieves its goals. Proactively ask for targeted help or feedback. Respond clearly to questions asked by the reviewers. Avoid huge commits by submitting incremental changes. Commits which are large and contain changes to multiple files will lead to unfair review of the code. Biased behavior of reviewers may kick in while reviewing such PRs. For example, a huge commit from a senior developer may get approved without thorough review whereas a huge commit from a junior developer may never get reviewed and approved. Guidelines for the Reviewer Assume positive intent from the author. Write clear and elaborate comments. Identify what is an opinion, a personal coding preference, or a best practice. It is good to discuss coding style and subjective coding choices in some other forum and not in the PR. A PR should not become a place to discuss subjective coding choices or to have long arguments over them. If you do not understand the code properly, refrain from comments such as \"This code is incomprehensible\". It is better to have a call with the author and get a basic understanding of their work. Be suggestive and not prescriptive. A reviewer should suggest changes and not prescribe them; let the author decide if they really want to accept the changes proposed. Culture and Code Reviews In ISE, we may come across situations in which code reviews are not ideal and we observe non-inclusive code review behaviors. It's important to be aware that the culture and communication style of a particular geography also influence how people interact over pull requests. In such cases, assuming positive intent of the author and reviewer is a good starting point for analyzing the quality of code reviews. Dealing with the Impostor Phenomenon Impostor phenomenon is a psychological pattern in which an individual doubts their skills, talents, or accomplishments and has a persistent internalized fear of being exposed as a \"fraud\" - Wikipedia . Someone experiencing impostor phenomenon may find submitting code for a review particularly stressful. It is important to realize that everybody can have meaningful contributions and not to let perceived weaknesses prevent contributions. Some tips for overcoming the impostor phenomenon for authors: Review the guidelines highlighted above and make sure your code change adheres to them. Ask for help from a colleague - pair program with an experienced colleague that you can learn from. Some tips for overcoming the impostor phenomenon for reviewers: Anyone can have valuable insights. A fresh pair of eyes is always welcome. Study the review until you have clearly understood it, check the corner cases and look for ways to improve it. If something is not clear, ask a simple, specific question. If you have learnt something, you can always compliment the author. 
If possible, pair with someone to review the code so that you can establish a personal connection and have a more profound discussion about the code. Tools Below are some tools which may help in establishing inclusive code review culture within our teams. Anonymous GitHub Blind Code Reviews Gitmask inclusivelint","title":"Inclusion in Code Review"},{"location":"code-reviews/inclusion-in-code-review/#inclusion-in-code-review","text":"Below are some points which emphasize why inclusivity in code reviews is important: Code reviews are an important part of our job as software professionals. In ISE we work with cross cultural teams from across the globe. How we communicate affects team morale. Inclusive code reviews welcome new developers and make them comfortable with the team. Rude or personal attacks doing code reviews alienate - people can unknowingly make rude comments when reviewing pull requests (PRs).","title":"Inclusion in Code Review"},{"location":"code-reviews/inclusion-in-code-review/#types-and-examples-of-non-inclusive-code-review-behavior","text":"Inequitable review assignments. Example: Assigning most reviews to few people and dismissing some members of the team altogether. Negative interpersonal interactions. Example: Long arguments over subjective topics such as code style. Biased decision making. Example: Comments about the developer and not the code. Assuming code from developer X will always be good and hence not reviewing it properly and vice versa.","title":"Types and Examples of Non-Inclusive Code Review Behavior"},{"location":"code-reviews/inclusion-in-code-review/#examples-of-inclusive-code-reviews","text":"Anyone and everyone in the team should be assigned PRs to review. Reviewer should be clear about what is an opinion, their personal preference, best practice or a fact. Arguments over personal preferences and opinions are mostly avoidable. Using inclusive language and tone in the code review comments. For example, being suggestive rather being prescriptive in the review comments is a good way to get the point across the table. It's a good practice for the author of a PR to thank the reviewer for the review, when they have contributed in improving the code or you have learnt something new. Using the sandwich method for recommending a code change to a new developer or a new customer: Sandwich the suggestion between 2 compliments. For example: \"Great work so far, but I would recommend a few changes here. Btw, I loved the use of XYZ here, nice job!\"","title":"Examples of Inclusive Code Reviews"},{"location":"code-reviews/inclusion-in-code-review/#guidelines-for-the-author","text":"Aim to write a code that is easy to read, review and maintain. It\u2019s important to ensure that whoever is looking at the code, whether that be the reviewer or a future engineer, can understand the motivations and how your code achieves its goals. Proactively asking for targeted help or feedback. Respond clearly to questions asked by the reviewers. Avoid huge commits by submitting incremental changes. Commits which are large and contain changes to multiple files will lead to unfair review of the code. Biased behavior of reviewers may kick in while reviewing such PRs. For e.g. 
a huge commit from a senior developer may get approved without thorough review whereas a huge commit from a junior developer may never get reviewed and approved.","title":"Guidelines for the Author"},{"location":"code-reviews/inclusion-in-code-review/#guidelines-for-the-reviewer","text":"Assume positive intent from the author. Write clear and elaborate comments. Identify subjectivity, choice of coding and best practice. It is good to discuss coding style and subjective coding choices in some other forum and not in the PR. A PR should not become a ground to discuss subjective coding choices and having long arguments over it. If you do not understand the code properly, refrain from commenting e.g., \"This code is incomprehensible\". It is better to have a call with the author and get a basic understanding of their work. Be suggestive and not prescriptive. A reviewer should suggest changes and not prescribe changes, let the author decide if they really want to accept the changes proposed.","title":"Guidelines for the Reviewer"},{"location":"code-reviews/inclusion-in-code-review/#culture-and-code-reviews","text":"We in ISE, may come across situations in which code reviews are not ideal and often we are observing non inclusive code review behaviors. Its important to be aware of the fact that culture and communication style of a particular geography also influences how people interact over pull requests. In such cases, assuming positive intent of the author and reviewer is a good start to start analyzing quality of code reviews.","title":"Culture and Code Reviews"},{"location":"code-reviews/inclusion-in-code-review/#dealing-with-the-impostor-phenomenon","text":"Impostor phenomenon is a psychological pattern in which an individual doubts their skills, talents, or accomplishments and has a persistent internalized fear of being exposed as a \"fraud\" - Wikipedia . Someone experiencing impostor phenomenon may find submitting code for a review particularly stressful. It is important to realize that everybody can have meaningful contributions and not to let the perceived weaknesses prevent contributions. Some tips for overcoming the impostor phenomenon for authors: Review the guidelines highlighted above and make sure your code change adhere to them. Ask for help from a colleague - pair program with an experienced colleague that you can learn from. Some tips for overcoming the impostor phenomenon for reviewers: Anyone can have valuable insights. A fresh new pair of eyes are always welcome. Study the review until you have clearly understood it, check the corner cases and look for ways to improve it. If something is not clear, a simple specific question should be asked. If you have learnt something, you can always compliment the author. If possible, pair with someone to review the code so that you can establish a personal connection and have a more profound discussion about the code.","title":"Dealing with the Impostor Phenomenon"},{"location":"code-reviews/inclusion-in-code-review/#tools","text":"Below are some tools which may help in establishing inclusive code review culture within our teams. 
Anonymous GitHub Blind Code Reviews Gitmask inclusivelint","title":"Tools"},{"location":"code-reviews/pull-request-template/","text":"Pull Request Template # [Work Item ID](./link-to-the-work-item) For more information about how to contribute to this repo, visit this [ page ]( https://github.com/microsoft/code-with-engineering-playbook/blob/main/CONTRIBUTING.md ) ## Description --- > Should include a concise description of the changes (bug or feature), it's impact, along with a summary of the solution ## Steps to Reproduce Bug and Validate Solution --- > Only applicable if the work is to address a bug. Please remove this section if the work is for a feature or story > Provide details on the environment the bug is found, and detailed steps to recreate the bug. > This should be detailed enough for a team member to confirm that the bug no longer occurs ## PR Checklist --- > Use the check-list below to ensure your branch is ready for PR. If the item is not applicable, leave it blank. - [ ] I have updated the documentation accordingly. - [ ] I have added tests to cover my changes. - [ ] All new and existing tests passed. - [ ] My code follows the code style of this project. - [ ] I ran the lint checks which produced no new errors nor warnings for my changes. - [ ] I have checked to ensure there aren't other open Pull Requests for the same update/change. ## Does This Introduce a Breaking Change? --- - [ ] Yes - [ ] No > If this introduces a breaking change, please describe the impact and migration path for existing applications below. ## Testing --- > - Instructions for testing and validation of your code: > - What OS was used for testing. > - Which test sets were used. > - Description of test scenarios that you have tried. ## Any Relevant Logs or Outputs --- > - Use this section to attach pictures that demonstrates your changes working / healthy > - If you are printing something show a screenshot > - When you want to share long logs upload to: > `(StorageAccount)/pr-support/attachments/(PR Number)/(yourFiles) using [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/)` or [portal.azure.com](https://portal.azure.com) and insert the link here. ## Other Information or Known Dependencies --- > - Any other information or known dependencies that is important to this PR. > - TODO that are to be done after this PR.","title":"Pull Request Template"},{"location":"code-reviews/pull-request-template/#pull-request-template","text":"# [Work Item ID](./link-to-the-work-item) For more information about how to contribute to this repo, visit this [ page ]( https://github.com/microsoft/code-with-engineering-playbook/blob/main/CONTRIBUTING.md ) ## Description --- > Should include a concise description of the changes (bug or feature), it's impact, along with a summary of the solution ## Steps to Reproduce Bug and Validate Solution --- > Only applicable if the work is to address a bug. Please remove this section if the work is for a feature or story > Provide details on the environment the bug is found, and detailed steps to recreate the bug. > This should be detailed enough for a team member to confirm that the bug no longer occurs ## PR Checklist --- > Use the check-list below to ensure your branch is ready for PR. If the item is not applicable, leave it blank. - [ ] I have updated the documentation accordingly. - [ ] I have added tests to cover my changes. - [ ] All new and existing tests passed. - [ ] My code follows the code style of this project. 
- [ ] I ran the lint checks which produced no new errors nor warnings for my changes. - [ ] I have checked to ensure there aren't other open Pull Requests for the same update/change. ## Does This Introduce a Breaking Change? --- - [ ] Yes - [ ] No > If this introduces a breaking change, please describe the impact and migration path for existing applications below. ## Testing --- > - Instructions for testing and validation of your code: > - What OS was used for testing. > - Which test sets were used. > - Description of test scenarios that you have tried. ## Any Relevant Logs or Outputs --- > - Use this section to attach pictures that demonstrates your changes working / healthy > - If you are printing something show a screenshot > - When you want to share long logs upload to: > `(StorageAccount)/pr-support/attachments/(PR Number)/(yourFiles) using [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/)` or [portal.azure.com](https://portal.azure.com) and insert the link here. ## Other Information or Known Dependencies --- > - Any other information or known dependencies that is important to this PR. > - TODO that are to be done after this PR.","title":"Pull Request Template"},{"location":"code-reviews/pull-requests/","text":"Pull Requests Changes to any main codebase - main branch in Git repository, for example - must be done using pull requests (PR). Pull requests enable: Code inspection - see Code Reviews Running automated qualification of the code Linters Compilation Unit tests Integration tests etc. The requirements of pull requests can and should be enforced by policies, which can be set in the most modern version control and work item tracking systems. See Evidence and Measures section for more information. General Process Implement changes based on the well-defined description and acceptance criteria of the task at hand Then, before creating a new pull request: * Make sure the code conforms with the agreed coding conventions * This can be partially automated using linters * Ensure the code compiles and runs without errors or warnings * Write and/or update tests to cover the changes and make sure all new and existing tests pass * Write and/or update the documentation to match the changes Once convinced the criteria above are met, create and submit a new pull request adhering to the pull request template Follow the code review process to merge the changes to the main codebase The following diagram illustrates this approach. sequenceDiagram New branch->>+Pull request: New PR creation Pull request->>+Code review: Review process Code review->>+Pull request: Code updates Pull request->>+New branch: Merge Pull Request Pull request-->>-New branch: Delete branch Pull request ->>+ Main branch: Merge after completion New branch->>+Main branch: Goal of the Pull request Size Guidance We should always aim to keep pull requests small. Small PRs have multiple advantages: They are easier to review; a clear benefit for the reviewers. They are easier to deploy; this is aligned with the strategy of release fast and release often. Minimizes possible conflicts and stale PRs. However, we should keep PRs focused - for example around a functional feature, optimization or code readability and avoid having PRs that include code that is without context or loosely coupled. There is no right size, but keep in mind that a code review is a collaborative process, a big PRs could be difficult and therefore slower to review. 
We should always strive to keep PRs as small as possible while still adding value. Best Practices Beyond the size, remember that every PR should: be consistent, not break the build, and include related tests as part of the PR. Being consistent means that all the changes included in the PR should aim to solve one goal (e.g. one user story) and be intrinsically related. Think of this as the Single-responsibility principle in terms of the whole project: the PR should have only one reason to change the project. Start small; it is easier to create a small PR from the start than to break up a bigger one. There are several strategies to keep PRs small depending on the \"cause\" of the large size: you could break the PR into self-contained changes which still add value, release features that are hidden (see feature flags, feature toggling or canary releases), or break the PR into different layers (for example using design patterns like MVC or Observer/Subject). No matter the strategy you choose, keep each PR focused on a single goal. Pull Request Description Well-written PR descriptions help maintain a clean, well-structured change history. While every team need not conform to the same specification, it is important that the convention is agreed upon at the start of the project. One popular specification for open-source projects and others is the Conventional Commits specification , which is structured as: <type>[optional scope]: <description> [optional body] [optional footer] The <type> in this message can be selected from a list of types defined by the team, but many projects use the list of commit types from the Angular open-source project . It should be clear that scope , body and footer elements are optional , but having a required type and short description enables the features mentioned above. See also Pull Request Template Resources Writing a great pull request description Review code-with pull requests (Azure DevOps) Collaborating with issues and pull requests (GitHub) Google approach to PR size Feature Flags Facebook approach to hidden features Conventional Commits specification Angular Commit types","title":"Pull Requests"},{"location":"code-reviews/pull-requests/#pull-requests","text":"Changes to any main codebase - main branch in Git repository, for example - must be done using pull requests (PR). Pull requests enable: Code inspection - see Code Reviews Running automated qualification of the code Linters Compilation Unit tests Integration tests etc. The requirements of pull requests can and should be enforced by policies, which can be set in the most modern version control and work item tracking systems. See Evidence and Measures section for more information.","title":"Pull Requests"},{"location":"code-reviews/pull-requests/#general-process","text":"Implement changes based on the well-defined description and acceptance criteria of the task at hand Then, before creating a new pull request: * Make sure the code conforms with the agreed coding conventions * This can be partially automated using linters * Ensure the code compiles and runs without errors or warnings * Write and/or update tests to cover the changes and make sure all new and existing tests pass * Write and/or update the documentation to match the changes Once convinced the criteria above are met, create and submit a new pull request adhering to the pull request template Follow the code review process to merge the changes to the main codebase The following diagram illustrates this approach. 
sequenceDiagram New branch->>+Pull request: New PR creation Pull request->>+Code review: Review process Code review->>+Pull request: Code updates Pull request->>+New branch: Merge Pull Request Pull request-->>-New branch: Delete branch Pull request ->>+ Main branch: Merge after completion New branch->>+Main branch: Goal of the Pull request","title":"General Process"},{"location":"code-reviews/pull-requests/#size-guidance","text":"We should always aim to keep pull requests small. Small PRs have multiple advantages: They are easier to review; a clear benefit for the reviewers. They are easier to deploy; this is aligned with the strategy of release fast and release often. Minimizes possible conflicts and stale PRs. However, we should keep PRs focused - for example around a functional feature, optimization or code readability and avoid having PRs that include code that is without context or loosely coupled. There is no right size, but keep in mind that a code review is a collaborative process, a big PRs could be difficult and therefore slower to review. We should always strive to have as small PRs as possible that still add value.","title":"Size Guidance"},{"location":"code-reviews/pull-requests/#best-practices","text":"Beyond the size, remember that every PR should: be consistent, not break the build, and include related tests as part of the PR. Be consistent means that all the changes included on the PR should aim to solve one goal (ex. one user story) and be intrinsically related. Think of this as the Single-responsibility principle in terms of the whole project, the PR should have only one reason to change the project. Start small, it is easier to create a small PR from the start than to break up a bigger one. These are some strategies to keep PRs small depending on the \"cause\" of the inevitability, you could break the PR into self-container changes which still add value, release features that are hidden (see feature flag, feature toggling or canary releases) or break the PR into different layers (for example using design patterns like MVC or Observer/Subject). No matter the strategy.","title":"Best Practices"},{"location":"code-reviews/pull-requests/#pull-request-description","text":"Well written PR descriptions helps maintain a clean, well-structured change history. While every team need not conform to the same specification, it is important that the convention is agreed upon at the start of the project. One popular specification for open-source projects and others is the Conventional Commits specification , which is structured as: <type>[optional scope]: <description> [optional body] [optional footer] The <type> in this message can be selected from a list of types defined by the team, but many projects use the list of commit types from the Angular open-source project . It should be clear that scope , body and footer elements are optional , but having a required type and short description enables the features mentioned above. 
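As an illustration (the type, scope and wording here are made up), a PR title and description following this convention could look like:

fix(search): handle empty query strings

Return an empty result set instead of throwing when the query is empty.

BREAKING CHANGE: the search endpoint no longer accepts a null query parameter.

Here fix is the type, search is the optional scope, the paragraph is the optional body, and the last line is an optional footer.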
See also Pull Request Template","title":"Pull Request Description"},{"location":"code-reviews/pull-requests/#resources","text":"Writing a great pull request description Review code-with pull requests (Azure DevOps) Collaborating with issues and pull requests (GitHub) Google approach to PR size Feature Flags Facebook approach to hidden features Conventional Commits specification Angular Commit types","title":"Resources"},{"location":"code-reviews/tools/","text":"Code Review Tools Customize ADO Task Boards AzDO: Customize cards AzDO: Add columns on task board Reviewer Policies Setting required reviewer group in AzDO - Automatically include code reviewers Configuring Branch Policies AzDO: Configure branch policies AzDO: Configuring branch policies with the CLI tool: Create a policy configuration file Approval count policy GitHub: Configuring protected branches VSCode GitHub: GitHub Pull Requests Supports processing GitHub pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience Azure DevOps: Azure DevOps Pull Requests Supports processing Azure DevOps pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience Visual Studio The following extensions can be used to create an integrated code review experience in Visual Studio working with either GitHub or Azure DevOps. GitHub: GitHub Extension for Visual Studio Provides extended functionality for working with pull requests on GitHub directly out of Visual Studio. View -> Other Windows -> GitHub Click on the Pull Requests icon in the task bar Double click on a pending pull request Azure DevOps: Pull Requests for Visual Studio Work with pull requests on Azure DevOps directly out of Visual Studio. Open Team Explorer Click on Pull Requests Double-click a pull request - the Pull Request Details open Click on Checkout if you want to have the full change locally and have a more integrated experience Go through the changes and make comments Web Reviewable: Seamless multi-round GitHub reviews Supports multi-round GitHub code reviews, with keyboard shortcuts and more. VS Code extension is in-progress. Visit the Review Dashboard to see reviews awaiting your action, that have new comments for you, and more. Select a Pull Request from that list. Open any file in your browser, in Visual Studio Code, or any editor you've configured by clicking on your profile photo in the top-right Select an editor under \"External editor link template\". VS Code is an option, but so is any editor that supports URI's. 
Review the diff on an overall or per-file basis, leaving comments, code suggestions, and more","title":"Code Review Tools"},{"location":"code-reviews/tools/#code-review-tools","text":"","title":"Code Review Tools"},{"location":"code-reviews/tools/#customize-ado","text":"","title":"Customize ADO"},{"location":"code-reviews/tools/#task-boards","text":"AzDO: Customize cards AzDO: Add columns on task board","title":"Task Boards"},{"location":"code-reviews/tools/#reviewer-policies","text":"Setting required reviewer group in AzDO - Automatically include code reviewers","title":"Reviewer Policies"},{"location":"code-reviews/tools/#configuring-branch-policies","text":"AzDO: Configure branch policies AzDO: Configuring branch policies with the CLI tool: Create a policy configuration file Approval count policy GitHub: Configuring protected branches","title":"Configuring Branch Policies"},{"location":"code-reviews/tools/#vscode","text":"","title":"VSCode"},{"location":"code-reviews/tools/#github-github-pull-requests","text":"Supports processing GitHub pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience","title":"GitHub: GitHub Pull Requests"},{"location":"code-reviews/tools/#azure-devops-azure-devops-pull-requests","text":"Supports processing Azure DevOps pull requests inside VS Code. Open the plugin from the Activity Bar Select Assigned To Me Select a PR Under Description you can choose to Check Out the branch and get into Review Mode and get a more integrated experience","title":"Azure DevOps: Azure DevOps Pull Requests"},{"location":"code-reviews/tools/#visual-studio","text":"The following extensions can be used to create an integrated code review experience in Visual Studio working with either GitHub or Azure DevOps.","title":"Visual Studio"},{"location":"code-reviews/tools/#github-github-extension-for-visual-studio","text":"Provides extended functionality for working with pull requests on GitHub directly out of Visual Studio. View -> Other Windows -> GitHub Click on the Pull Requests icon in the task bar Double click on a pending pull request","title":"GitHub: GitHub Extension for Visual Studio"},{"location":"code-reviews/tools/#azure-devops-pull-requests-for-visual-studio","text":"Work with pull requests on Azure DevOps directly out of Visual Studio. Open Team Explorer Click on Pull Requests Double-click a pull request - the Pull Request Details open Click on Checkout if you want to have the full change locally and have a more integrated experience Go through the changes and make comments","title":"Azure DevOps: Pull Requests for Visual Studio"},{"location":"code-reviews/tools/#web","text":"","title":"Web"},{"location":"code-reviews/tools/#reviewable-seamless-multi-round-github-reviews","text":"Supports multi-round GitHub code reviews, with keyboard shortcuts and more. VS Code extension is in-progress. Visit the Review Dashboard to see reviews awaiting your action, that have new comments for you, and more. Select a Pull Request from that list. Open any file in your browser, in Visual Studio Code, or any editor you've configured by clicking on your profile photo in the top-right Select an editor under \"External editor link template\". VS Code is an option, but so is any editor that supports URI's. 
Review the diff on an overall or per-file basis, leaving comments, code suggestions, and more","title":"Reviewable: Seamless multi-round GitHub reviews"},{"location":"code-reviews/evidence-and-measures/","text":"Evidence and Measures Evidence Many of the code quality assurance items can be automated or enforced by policies in modern version control and work item tracking systems. Verification of the policies on the main branch in Azure DevOps (AzDO) or GitHub , for example, may be sufficient evidence that a project team is conducting code reviews. The main branches in all repositories have branch policies. - Configure branch policies All builds produced out of project repositories include appropriate linters, run unit tests. Every bug work item should include a link to the pull request that introduced it, once the error has been diagnosed. This helps with learning. Each bug work item should include a note on how the bug might (or might not have) been caught in a code review. The project team regularly updates their code review checklists to reflect common issues they have encountered. Dev Leads should review a sample of pull requests and/or be co-reviewers with other developers to help everyone improve their skills as code reviewers. Measures The team can collect metrics of code reviews to measure their efficiency. Some useful metrics include: Defect Removal Efficiency (DRE) - a measure of the development team's ability to remove defects prior to release Time metrics: Time used preparing for code inspection sessions Time used in review sessions Lines of code (LOC) inspected per time unit/meeting It is a perfectly reasonable solution to track these metrics manually e.g. in an Excel sheet. It is also possible to utilize the features of project management platforms - for example, AzDO enables dashboards for metrics including tracking bugs . You may find ready-made plugins for various platforms - see GitHub Marketplace for instance - or you can choose to implement these features yourself. Remember that since defects removed thanks to reviews is far less costly compared to finding them in production, the cost of doing code reviews is actually negative! Resources A Guide to Code Inspections","title":"Evidence and Measures"},{"location":"code-reviews/evidence-and-measures/#evidence-and-measures","text":"","title":"Evidence and Measures"},{"location":"code-reviews/evidence-and-measures/#evidence","text":"Many of the code quality assurance items can be automated or enforced by policies in modern version control and work item tracking systems. Verification of the policies on the main branch in Azure DevOps (AzDO) or GitHub , for example, may be sufficient evidence that a project team is conducting code reviews. The main branches in all repositories have branch policies. - Configure branch policies All builds produced out of project repositories include appropriate linters, run unit tests. Every bug work item should include a link to the pull request that introduced it, once the error has been diagnosed. This helps with learning. Each bug work item should include a note on how the bug might (or might not have) been caught in a code review. The project team regularly updates their code review checklists to reflect common issues they have encountered. 
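One piece of evidence listed here, branch policies on the main branch, can be captured as a script instead of portal clicks, which makes the setting reviewable and repeatable. A minimal sketch, assuming the Azure CLI with the azure-devops extension (`az extension add --name azure-devops`) and a signed-in session; the organization URL, project, and repository names are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholders - replace with your own organization, project and repository.
ORG_URL="https://dev.azure.com/<your-org>"
PROJECT="<your-project>"
REPO_NAME="<your-repo>"

# Look up the repository ID that the policy commands expect.
REPO_ID=$(az repos show --repository "$REPO_NAME" \
  --org "$ORG_URL" --project "$PROJECT" --query id -o tsv)

# Require two approvers on main and reset votes when new commits are pushed.
az repos policy approver-count create \
  --org "$ORG_URL" --project "$PROJECT" \
  --repository-id "$REPO_ID" --branch main \
  --blocking true --enabled true \
  --minimum-approver-count 2 \
  --creator-vote-counts false \
  --allow-downvotes false \
  --reset-on-source-push true
```

Keeping this script in the repository also doubles as documentation of the policy the team agreed on.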
Dev Leads should review a sample of pull requests and/or be co-reviewers with other developers to help everyone improve their skills as code reviewers.","title":"Evidence"},{"location":"code-reviews/evidence-and-measures/#measures","text":"The team can collect metrics of code reviews to measure their efficiency. Some useful metrics include: Defect Removal Efficiency (DRE) - a measure of the development team's ability to remove defects prior to release Time metrics: Time used preparing for code inspection sessions Time used in review sessions Lines of code (LOC) inspected per time unit/meeting It is a perfectly reasonable solution to track these metrics manually e.g. in an Excel sheet. It is also possible to utilize the features of project management platforms - for example, AzDO enables dashboards for metrics including tracking bugs . You may find ready-made plugins for various platforms - see GitHub Marketplace for instance - or you can choose to implement these features yourself. Remember that since defects removed thanks to reviews is far less costly compared to finding them in production, the cost of doing code reviews is actually negative!","title":"Measures"},{"location":"code-reviews/evidence-and-measures/#resources","text":"A Guide to Code Inspections","title":"Resources"},{"location":"code-reviews/process-guidance/","text":"Process Guidance General Guidance Code reviews should be part of the software engineering team process regardless of the development model. Furthermore, the team should learn to execute reviews in a timely manner. Pull requests (PRs) left hanging can cause additional merge problems and go stale resulting in lost work. Qualified PRs are expected to reflect well-defined, concise tasks, and thus be compact in content. Reviewing a single task should then take relatively little time to complete. To ensure that the code review process is healthy, inclusive and meets the goals stated above, consider following these guidelines: Establish a service-level agreement (SLA) for code reviews and add it to your teams working agreement. Although modern DevOps environments incorporate tools for managing PRs, it can be useful to label tasks pending for review or to have a dedicated place for them on the task board - Customize AzDO task boards In the daily standup meeting check tasks pending for review and make sure they have reviewers assigned. Junior teams and teams new to the process can consider creating separate tasks for reviews together with the tasks themselves. Utilize tools to streamline the review process - Code review tools Foster inclusive code reviews - Inclusion in Code Review Measuring Code Review Process If the team is finding that code reviews are taking a significant time to merge, and it is becoming a blocker, consider the following additional recommendations: Measure the average time it takes to merge a PR per sprint cycle. Review during retrospective how the time to merge can be improved and prioritized. Assess the time to merge across sprints to see if the process is improving. Ping required approvers directly as a reminder. Code Reviews Shouldn't Include too Many Lines of Code It's easy to say a developer can review few hundred lines of code, but when the code surpasses certain amount of lines, the effectiveness of defects discovery will decrease and there is a lesser chance of doing a good review. It's not a matter of setting a code line limit, but rather using common sense. 
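To put a number on the time-to-merge metric mentioned a little earlier, a small script can be enough before reaching for dashboards. A minimal sketch, assuming a GitHub-hosted repository with the GitHub CLI (`gh`) and `jq` installed; the 50-PR window is an arbitrary example:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Average time from PR creation to merge, in hours, over the last 50 merged PRs.
gh pr list --state merged --limit 50 --json createdAt,mergedAt |
  jq -r 'map((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601))
         | add / length / 3600
         | "Average time to merge: \(. * 100 | round / 100) hours"'
```

Run it at the end of each sprint and track the trend in the retrospective rather than judging any single number.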
The more code there is to review, the higher the chance of letting a bug sneak through. See PR size guidance . Automate Whenever Reasonable Use automation (linting, code analysis etc.) to avoid the need for \" nits \" and allow the reviewer to focus more on the functional aspects of the PR. By configuring automated builds, tests and checks (something achievable in the CI process ), teams can save human reviewers some time and let them focus on areas like design and functionality for proper evaluation. This will ensure higher chances of success as the team is focusing on the things that matter. Role specific guidance Author Guidance Reviewer Guidance","title":"Process Guidance"},{"location":"code-reviews/process-guidance/#process-guidance","text":"","title":"Process Guidance"},{"location":"code-reviews/process-guidance/#general-guidance","text":"Code reviews should be part of the software engineering team process regardless of the development model. Furthermore, the team should learn to execute reviews in a timely manner. Pull requests (PRs) left hanging can cause additional merge problems and go stale, resulting in lost work. Qualified PRs are expected to reflect well-defined, concise tasks, and thus be compact in content. Reviewing a single task should then take relatively little time to complete. To ensure that the code review process is healthy, inclusive and meets the goals stated above, consider following these guidelines: Establish a service-level agreement (SLA) for code reviews and add it to your team's working agreement. Although modern DevOps environments incorporate tools for managing PRs, it can be useful to label tasks pending review or to have a dedicated place for them on the task board - Customize AzDO task boards In the daily standup meeting, check tasks pending review and make sure they have reviewers assigned. Junior teams and teams new to the process can consider creating separate tasks for reviews together with the tasks themselves. Utilize tools to streamline the review process - Code review tools Foster inclusive code reviews - Inclusion in Code Review","title":"General Guidance"},{"location":"code-reviews/process-guidance/#measuring-code-review-process","text":"If the team is finding that code reviews are taking a significant time to merge, and it is becoming a blocker, consider the following additional recommendations: Measure the average time it takes to merge a PR per sprint cycle. Review during retrospective how the time to merge can be improved and prioritized. Assess the time to merge across sprints to see if the process is improving. Ping required approvers directly as a reminder.","title":"Measuring Code Review Process"},{"location":"code-reviews/process-guidance/#code-reviews-shouldnt-include-too-many-lines-of-code","text":"It's easy to say a developer can review a few hundred lines of code, but when the code surpasses a certain number of lines, the effectiveness of defect discovery decreases and there is a lesser chance of doing a good review. It's not a matter of setting a code line limit, but rather using common sense. The more code there is to review, the higher the chance of letting a bug sneak through. See PR size guidance .","title":"Code Reviews Shouldn't Include too Many Lines of Code"},{"location":"code-reviews/process-guidance/#automate-whenever-reasonable","text":"Use automation (linting, code analysis etc.) to avoid the need for \" nits \" and allow the reviewer to focus more on the functional aspects of the PR. 
By configuring automated builds, tests and checks (something achievable in the CI process ), teams can save human reviewers some time and let them focus in areas like design and functionality for proper evaluation. This will ensure higher chances of success as the team is focusing on the things that matter.","title":"Automate Whenever Reasonable"},{"location":"code-reviews/process-guidance/#role-specific-guidance","text":"Author Guidance Reviewer Guidance","title":"Role specific guidance"},{"location":"code-reviews/process-guidance/author-guidance/","text":"Author Guidance Properly Describe Your Pull Request (PR) Give the PR a descriptive title, so that other members can easily (in one short sentence) understand what a PR is about. Every PR should have a proper description, that shows the reviewer what has been changed and why. Add Relevant Reviewers Add one or more reviewers (depending on your project's guidelines) to the PR. Ideally, you would add at least someone who has expertise and is familiar with the project, or the language used Adding someone less familiar with the project or the language can aid in verifying the changes are understandable, easy to read, and increases the expertise within the team In ISE code-with projects with a customer team, it is important to include reviewers from both organizations for knowledge transfer - Customize Reviewers Policy Be Open to Receive Feedback Discuss design/code logic and address all comments as follows: Resolve a comment, if the requested change has been made. Mark the comment as \"won't fix\", if you are not going to make the requested changes and provide a clear reasoning If the requested change is within the scope of the task, \"I'll do it later\" is not an acceptable reason! If the requested change is out of scope, create a new work item (task or bug) for it If you don't understand a comment, ask questions in the review itself as opposed to a private chat If a thread gets bloated without a conclusion, have a meeting with the reviewer (call them or knock on door) Use Checklists When creating a PR, it is a good idea to add a checklist of objectives of the PR in the description. This helps the reviewers to focus on the key areas of the code changes. Link a Task to Your PR Link the corresponding work items/tasks to the PR. There is no need to duplicate information between the work item and the PR, but if some details are missing in either one, together they provide more context to the reviewer. Code Should Have Annotations Before the Review If you can't avoid large PRs, include explanations of the changes in order to make it easier for the reviewer to review the code, with clear comments the reviewer can identify the goal of every code block.","title":"Author Guidance"},{"location":"code-reviews/process-guidance/author-guidance/#author-guidance","text":"","title":"Author Guidance"},{"location":"code-reviews/process-guidance/author-guidance/#properly-describe-your-pull-request-pr","text":"Give the PR a descriptive title, so that other members can easily (in one short sentence) understand what a PR is about. Every PR should have a proper description, that shows the reviewer what has been changed and why.","title":"Properly Describe Your Pull Request (PR)"},{"location":"code-reviews/process-guidance/author-guidance/#add-relevant-reviewers","text":"Add one or more reviewers (depending on your project's guidelines) to the PR. 
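The author guidance on descriptive titles, checklists, and named reviewers can be applied straight from the terminal. A minimal sketch, assuming a GitHub-hosted repository and the GitHub CLI; the reviewer handles, issue number, and wording are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Open a PR with a descriptive title, a body that links the work item and
# carries a small objective checklist, and explicitly named reviewers.
gh pr create \
  --base main \
  --title "Add retry policy to the ingestion client" \
  --reviewer alice-contoso,bob-fabrikam \
  --body "$(cat <<'EOF'
Closes #123

## What changed and why
Adds an exponential back-off retry policy around the ingestion client calls.

## Reviewer checklist
- [ ] Retry configuration is read from settings, not hard-coded
- [ ] New behaviour is covered by unit tests
- [ ] Docs/README updated
EOF
)"
```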
Ideally, you would add at least someone who has expertise and is familiar with the project, or the language used Adding someone less familiar with the project or the language can aid in verifying the changes are understandable, easy to read, and increases the expertise within the team In ISE code-with projects with a customer team, it is important to include reviewers from both organizations for knowledge transfer - Customize Reviewers Policy","title":"Add Relevant Reviewers"},{"location":"code-reviews/process-guidance/author-guidance/#be-open-to-receive-feedback","text":"Discuss design/code logic and address all comments as follows: Resolve a comment, if the requested change has been made. Mark the comment as \"won't fix\", if you are not going to make the requested changes and provide a clear reasoning If the requested change is within the scope of the task, \"I'll do it later\" is not an acceptable reason! If the requested change is out of scope, create a new work item (task or bug) for it If you don't understand a comment, ask questions in the review itself as opposed to a private chat If a thread gets bloated without a conclusion, have a meeting with the reviewer (call them or knock on door)","title":"Be Open to Receive Feedback"},{"location":"code-reviews/process-guidance/author-guidance/#use-checklists","text":"When creating a PR, it is a good idea to add a checklist of objectives of the PR in the description. This helps the reviewers to focus on the key areas of the code changes.","title":"Use Checklists"},{"location":"code-reviews/process-guidance/author-guidance/#link-a-task-to-your-pr","text":"Link the corresponding work items/tasks to the PR. There is no need to duplicate information between the work item and the PR, but if some details are missing in either one, together they provide more context to the reviewer.","title":"Link a Task to Your PR"},{"location":"code-reviews/process-guidance/author-guidance/#code-should-have-annotations-before-the-review","text":"If you can't avoid large PRs, include explanations of the changes in order to make it easier for the reviewer to review the code, with clear comments the reviewer can identify the goal of every code block.","title":"Code Should Have Annotations Before the Review"},{"location":"code-reviews/process-guidance/reviewer-guidance/","text":"Reviewer Guidance Since parts of reviews can be automated via linters and such, human reviewers can focus on architectural and functional correctness. Human reviewers should focus on: The correctness of the business logic embodied in the code. The correctness of any new or changed tests. The \"readability\" and maintainability of the overall design decisions reflected in the code. The checklist of common errors that the team maintains for each programming language. Code reviews should use the below guidance and checklists to ensure positive and effective code reviews. General Guidance Understand the Code You are Reviewing Read every line changed. If we have a stakeholder review, it\u2019s not necessary to run the PR unless it aids your understanding of the code. AzDO orders the files for you, but you should read the code in some logical sequence to aid understanding. If you don\u2019t fully understand a change in a file because you don\u2019t have context, click to view the whole file and read through the surrounding code or checkout the changes and view them in IDE. Ask the author to clarify. 
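When the web diff alone is not enough context, the "check out the changes and view them in an IDE" advice above is a few commands. A minimal sketch, assuming the GitHub CLI; `42` is a placeholder PR number:

```bash
#!/usr/bin/env bash
set -euo pipefail

gh pr view 42       # read the description and discussion first
gh pr diff 42       # skim the full diff in the terminal
gh pr checkout 42   # check the branch out locally to browse it in your IDE
gh pr checks 42     # confirm CI is green before spending review time
```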
Take Your Time and Keep Focus on Scope You shouldn't review code hastily but neither take too long in one sitting. If you have many pull requests (PRs) to review or if the complexity of code is demanding, the recommendation is to take a break between the reviews to recover and focus on the ones you are most experienced with. Always remember that a goal of a code review is to verify that the goals of the corresponding task have been achieved. If you have concerns about the related, adjacent code that isn't in the scope of the PR, address those as separate tasks (e.g., bugs, technical debt). Don't block the current PR due to issues that are out of scope. Foster a Positive Code Review Culture Code reviews play a critical role in product quality and it should not represent an arena for long discussions or even worse a battle of egos. What matters is a bug caught, not who made it, not who found it, not who fixed it. The only thing that matters is having the best possible product. Be Considerate Be positive \u2013 encouraging, appreciation for good practices. Prefix a \u201cpoint of polish\u201d with \u201cNit:\u201d. Avoid language that points fingers like \u201cyou\u201d but rather use \u201cwe\u201d or \u201cthis line\u201d -- code reviews are not personal and language matters. Prefer asking questions above making statements. There might be a good reason for the author to do something. If you make a direct comment, explain why the code needs to be changed, preferably with an example. Talking about changes, you can suggest changes to a PR by using the suggestion feature (available in GitHub and Azure DevOps) or by creating a PR to the author branch. If a few back-and-forth comments don't resolve a disagreement, have a quick talk with each other (in-person or call) or create a group discussion this can lead to an array of improvements for upcoming PRs. Don't forget to update the PR with what you agreed on and why. First Design Pass Pull Request Overview Does the PR description make sense? Do all the changes logically fit in this PR, or are there unrelated changes? If necessary, are the changes made reflected in updates to the README or other docs? Especially if the changes affect how the user builds code. User Facing Changes If the code involves a user-facing change, is there a GIF/photo that explains the functionality? If not, it might be key to validate the PR to ensure the change does what is expected. Ensure UI changes look good without unexpected behavior. Design Do the interactions of the various pieces of code in the PR make sense? Does the code recognize and incorporate architectures and coding patterns? Code Quality Pass Complexity Are functions too complex? Is the single responsibility principle followed? Function or class should do one \u2018thing\u2019. Should a function be broken into multiple functions? If a method has greater than 3 arguments, it is potentially overly complex. Does the code add functionality that isn\u2019t needed? Can the code be understood easily by code readers? Naming/Readability Did the developer pick good names for functions, variables, etc? Error Handling Are errors handled gracefully and explicitly where necessary? Functionality Is there parallel programming in this PR that could cause race conditions? Carefully read through this logic. Could the code be optimized? For example: are there more calls to the database than need be? How does the functionality fit in the bigger picture? Can it have negative effects to the overall system? 
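Related to the feedback guidance earlier in this section, review comments and verdicts can also be left from the terminal, which some reviewers find keeps them focused on the diff. A minimal sketch using the GitHub CLI; `42` and the comment text are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Leave considerate, actionable feedback without blocking unnecessarily.
gh pr review 42 --comment --body "Nit: we could extract this block into a helper; happy to pair on it."
gh pr review 42 --request-changes --body "The new endpoint is missing input validation; see the inline comments."
gh pr review 42 --approve --body "Thanks for adding the extra tests; looks good."
```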
Are there security flaws? Does a variable name reveal any customer specific information? Is PII and EUII treated correctly? Are we logging any PII information? Style Are there extraneous comments? If the code isn\u2019t clear enough to explain itself, then the code should be made simpler. Comments may be there to explain why some code exists. Does the code adhere to the style guide/conventions that we have agreed upon? We use automated styling like black and prettier. Tests Tests should always be committed in the same PR as the code itself (\u2018I\u2019ll add tests next\u2019 is not acceptable). Make sure tests are sensible and valid assumptions are made. Make sure edge cases are handled as well. Tests can be a great source to understand the changes. It can be a strategy to look at tests first to help you understand the changes better.","title":"Reviewer Guidance"},{"location":"code-reviews/process-guidance/reviewer-guidance/#reviewer-guidance","text":"Since parts of reviews can be automated via linters and such, human reviewers can focus on architectural and functional correctness. Human reviewers should focus on: The correctness of the business logic embodied in the code. The correctness of any new or changed tests. The \"readability\" and maintainability of the overall design decisions reflected in the code. The checklist of common errors that the team maintains for each programming language. Code reviews should use the below guidance and checklists to ensure positive and effective code reviews.","title":"Reviewer Guidance"},{"location":"code-reviews/process-guidance/reviewer-guidance/#general-guidance","text":"","title":"General Guidance"},{"location":"code-reviews/process-guidance/reviewer-guidance/#understand-the-code-you-are-reviewing","text":"Read every line changed. If we have a stakeholder review, it\u2019s not necessary to run the PR unless it aids your understanding of the code. AzDO orders the files for you, but you should read the code in some logical sequence to aid understanding. If you don\u2019t fully understand a change in a file because you don\u2019t have context, click to view the whole file and read through the surrounding code or checkout the changes and view them in IDE. Ask the author to clarify.","title":"Understand the Code You are Reviewing"},{"location":"code-reviews/process-guidance/reviewer-guidance/#take-your-time-and-keep-focus-on-scope","text":"You shouldn't review code hastily but neither take too long in one sitting. If you have many pull requests (PRs) to review or if the complexity of code is demanding, the recommendation is to take a break between the reviews to recover and focus on the ones you are most experienced with. Always remember that a goal of a code review is to verify that the goals of the corresponding task have been achieved. If you have concerns about the related, adjacent code that isn't in the scope of the PR, address those as separate tasks (e.g., bugs, technical debt). Don't block the current PR due to issues that are out of scope.","title":"Take Your Time and Keep Focus on Scope"},{"location":"code-reviews/process-guidance/reviewer-guidance/#foster-a-positive-code-review-culture","text":"Code reviews play a critical role in product quality and it should not represent an arena for long discussions or even worse a battle of egos. What matters is a bug caught, not who made it, not who found it, not who fixed it. 
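For the security and PII questions in this checklist, a rough local sweep of the diff can surface obvious problems before the review even starts. A minimal sketch; the regex patterns are illustrative only and are no substitute for a dedicated secret-scanning tool in CI:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Rough pre-review sweep of a branch diff for secret- or PII-looking strings.
BASE_BRANCH="${1:-main}"

if git diff "${BASE_BRANCH}...HEAD" |
   grep -nEi 'password|secret|api[_-]?key|connectionstring|ssn|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'; then
  echo "Potential secret/PII-looking strings found; double-check before approving." >&2
  exit 1
fi
echo "No obvious secret/PII patterns in the diff."
```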
The only thing that matters is having the best possible product.","title":"Foster a Positive Code Review Culture"},{"location":"code-reviews/process-guidance/reviewer-guidance/#be-considerate","text":"Be positive \u2013 encouraging, appreciation for good practices. Prefix a \u201cpoint of polish\u201d with \u201cNit:\u201d. Avoid language that points fingers like \u201cyou\u201d but rather use \u201cwe\u201d or \u201cthis line\u201d -- code reviews are not personal and language matters. Prefer asking questions above making statements. There might be a good reason for the author to do something. If you make a direct comment, explain why the code needs to be changed, preferably with an example. Talking about changes, you can suggest changes to a PR by using the suggestion feature (available in GitHub and Azure DevOps) or by creating a PR to the author branch. If a few back-and-forth comments don't resolve a disagreement, have a quick talk with each other (in-person or call) or create a group discussion this can lead to an array of improvements for upcoming PRs. Don't forget to update the PR with what you agreed on and why.","title":"Be Considerate"},{"location":"code-reviews/process-guidance/reviewer-guidance/#first-design-pass","text":"","title":"First Design Pass"},{"location":"code-reviews/process-guidance/reviewer-guidance/#pull-request-overview","text":"Does the PR description make sense? Do all the changes logically fit in this PR, or are there unrelated changes? If necessary, are the changes made reflected in updates to the README or other docs? Especially if the changes affect how the user builds code.","title":"Pull Request Overview"},{"location":"code-reviews/process-guidance/reviewer-guidance/#user-facing-changes","text":"If the code involves a user-facing change, is there a GIF/photo that explains the functionality? If not, it might be key to validate the PR to ensure the change does what is expected. Ensure UI changes look good without unexpected behavior.","title":"User Facing Changes"},{"location":"code-reviews/process-guidance/reviewer-guidance/#design","text":"Do the interactions of the various pieces of code in the PR make sense? Does the code recognize and incorporate architectures and coding patterns?","title":"Design"},{"location":"code-reviews/process-guidance/reviewer-guidance/#code-quality-pass","text":"","title":"Code Quality Pass"},{"location":"code-reviews/process-guidance/reviewer-guidance/#complexity","text":"Are functions too complex? Is the single responsibility principle followed? Function or class should do one \u2018thing\u2019. Should a function be broken into multiple functions? If a method has greater than 3 arguments, it is potentially overly complex. Does the code add functionality that isn\u2019t needed? Can the code be understood easily by code readers?","title":"Complexity"},{"location":"code-reviews/process-guidance/reviewer-guidance/#namingreadability","text":"Did the developer pick good names for functions, variables, etc?","title":"Naming/Readability"},{"location":"code-reviews/process-guidance/reviewer-guidance/#error-handling","text":"Are errors handled gracefully and explicitly where necessary?","title":"Error Handling"},{"location":"code-reviews/process-guidance/reviewer-guidance/#functionality","text":"Is there parallel programming in this PR that could cause race conditions? Carefully read through this logic. Could the code be optimized? For example: are there more calls to the database than need be? 
How does the functionality fit in the bigger picture? Can it have negative effects to the overall system? Are there security flaws? Does a variable name reveal any customer specific information? Is PII and EUII treated correctly? Are we logging any PII information?","title":"Functionality"},{"location":"code-reviews/process-guidance/reviewer-guidance/#style","text":"Are there extraneous comments? If the code isn\u2019t clear enough to explain itself, then the code should be made simpler. Comments may be there to explain why some code exists. Does the code adhere to the style guide/conventions that we have agreed upon? We use automated styling like black and prettier.","title":"Style"},{"location":"code-reviews/process-guidance/reviewer-guidance/#tests","text":"Tests should always be committed in the same PR as the code itself (\u2018I\u2019ll add tests next\u2019 is not acceptable). Make sure tests are sensible and valid assumptions are made. Make sure edge cases are handled as well. Tests can be a great source to understand the changes. It can be a strategy to look at tests first to help you understand the changes better.","title":"Tests"},{"location":"code-reviews/recipes/azure-pipelines-yaml/","text":"YAML(Azure Pipelines) Code Reviews Style Guide Developers should follow the YAML schema reference . Code Analysis / Linting The most popular YAML linter is YAML extension. This extension provides YAML validation, document outlining, auto-completion, hover support and formatter features. VS Code Extensions There is an Azure Pipelines for VS Code extension to add syntax highlighting and autocompletion for Azure Pipelines YAML to VS Code. It also helps you set up continuous build and deployment for Azure WebApps without leaving VS Code. YAML in Azure Pipelines Overview When the pipeline is triggered, before running the pipeline, there are a few phases such as Queue Time, Compile Time and Runtime where variables are interpreted by their runtime expression syntax . When the pipeline is triggered, all nested YAML files are expanded to run in Azure Pipelines. This checklist contains some tips and tricks for reviewing all nested YAML files. These documents may be useful when reviewing YAML files: Azure Pipelines YAML documentation . Pipeline run sequence Key concepts for new Azure Pipelines Key concepts overview A trigger tells a Pipeline to run. A pipeline is made up of one or more stages. A pipeline can deploy to one or more environments. A stage is a way of organizing jobs in a pipeline and each stage can have one or more jobs. Each job runs on one agent. A job can also be agentless. Each agent runs a job that contains one or more steps. A step can be a task or script and is the smallest building block of a pipeline. A task is a pre-packaged script that performs an action, such as invoking a REST API or publishing a build artifact. An artifact is a collection of files or packages published by a run. Code Review Checklist In addition to the Code Review Checklist you should also look for these Azure Pipelines YAML specific code review items. Pipeline Structure The steps are well understood and components are easily identifiable. Ensure that there is a proper description displayName: for every step in the pipeline. Steps/stages of the pipeline are checked in Azure Pipelines to have more understanding of components. In case you have complex nested YAML files, The pipeline in Azure Pipelines is edited to find trigger root file. 
All the template file references are visited to ensure a small change does not cause breaking changes, changing one file may affect multiple pipelines Long inline scripts in YAML file are moved into script files YAML Structure Re-usable components are split into separate YAML templates. Variables are separated per environment stored in templates or variable groups. Variable value changes in Queue Time , Compile Time and Runtime are considered. Variable syntax values used with Macro Syntax , Template Expression Syntax and Runtime Expression Syntax are considered. Variables can change during the pipeline, Parameters cannot. Unused variables/parameters are removed in pipeline. Does the pipeline meet with stage/job Conditions criteria? Permission Check & Security Secret values shouldn't be printed in pipeline, issecret is used for printing secrets for debugging If pipeline is using variable groups in Library, ensure pipeline has access to the variable groups created. If pipeline has a remote task in other repo/organization, does it have access? If pipeline is trying to access a secure file, does it have the permission? If pipeline requires approval for environment deployments, Who is the approver? Does it need to keep secrets and manage them, did you consider using Azure KeyVault? Troubleshooting Tips Consider Variable Syntax with Runtime Expressions in the pipeline. Here is a nice sample to understand Expansion of variables . When we assign variable like below it won't set during initialize time, it'll assign during runtime, then we can retrieve some errors based on when template runs. - task : AzureWebApp@1 displayName : 'Deploy Azure Web App : $(webAppName)' inputs : azureSubscription : '$(azureServiceConnectionId)' appName : '$(webAppName)' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Error: After passing these variables as parameter, it loads values properly. - template : steps-deployment.yaml parameters : azureServiceConnectionId : ${{ variables.azureServiceConnectionId }} webAppName : ${{ variables.webAppName }} - task : AzureWebApp@1 displayName : 'Deploy Azure Web App :${{ parameters.webAppName }}' inputs : azureSubscription : '${{ parameters.azureServiceConnectionId }}' appName : '${{ parameters.webAppName }}' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Use issecret for printing secrets for debugging echo \"##vso[task.setvariable variable=token;issecret=true] ${ token } \"","title":"YAML(Azure Pipelines) Code Reviews"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#yamlazure-pipelines-code-reviews","text":"","title":"YAML(Azure Pipelines) Code Reviews"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#style-guide","text":"Developers should follow the YAML schema reference .","title":"Style Guide"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#code-analysis-linting","text":"The most popular YAML linter is YAML extension. This extension provides YAML validation, document outlining, auto-completion, hover support and formatter features.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#vs-code-extensions","text":"There is an Azure Pipelines for VS Code extension to add syntax highlighting and autocompletion for Azure Pipelines YAML to VS Code. 
It also helps you set up continuous build and deployment for Azure WebApps without leaving VS Code.","title":"VS Code Extensions"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#yaml-in-azure-pipelines-overview","text":"When the pipeline is triggered, before running the pipeline, there are a few phases such as Queue Time, Compile Time and Runtime where variables are interpreted by their runtime expression syntax . When the pipeline is triggered, all nested YAML files are expanded to run in Azure Pipelines. This checklist contains some tips and tricks for reviewing all nested YAML files. These documents may be useful when reviewing YAML files: Azure Pipelines YAML documentation . Pipeline run sequence Key concepts for new Azure Pipelines Key concepts overview A trigger tells a Pipeline to run. A pipeline is made up of one or more stages. A pipeline can deploy to one or more environments. A stage is a way of organizing jobs in a pipeline and each stage can have one or more jobs. Each job runs on one agent. A job can also be agentless. Each agent runs a job that contains one or more steps. A step can be a task or script and is the smallest building block of a pipeline. A task is a pre-packaged script that performs an action, such as invoking a REST API or publishing a build artifact. An artifact is a collection of files or packages published by a run.","title":"YAML in Azure Pipelines Overview"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these Azure Pipelines YAML specific code review items.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#pipeline-structure","text":"The steps are well understood and components are easily identifiable. Ensure that there is a proper description displayName: for every step in the pipeline. Steps/stages of the pipeline are checked in Azure Pipelines to have more understanding of components. In case you have complex nested YAML files, The pipeline in Azure Pipelines is edited to find trigger root file. All the template file references are visited to ensure a small change does not cause breaking changes, changing one file may affect multiple pipelines Long inline scripts in YAML file are moved into script files","title":"Pipeline Structure"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#yaml-structure","text":"Re-usable components are split into separate YAML templates. Variables are separated per environment stored in templates or variable groups. Variable value changes in Queue Time , Compile Time and Runtime are considered. Variable syntax values used with Macro Syntax , Template Expression Syntax and Runtime Expression Syntax are considered. Variables can change during the pipeline, Parameters cannot. Unused variables/parameters are removed in pipeline. Does the pipeline meet with stage/job Conditions criteria?","title":"YAML Structure"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#permission-check-security","text":"Secret values shouldn't be printed in pipeline, issecret is used for printing secrets for debugging If pipeline is using variable groups in Library, ensure pipeline has access to the variable groups created. If pipeline has a remote task in other repo/organization, does it have access? If pipeline is trying to access a secure file, does it have the permission? If pipeline requires approval for environment deployments, Who is the approver? 
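When secrets are involved, the checklist items around this point recommend `issecret` masking and Azure Key Vault. A minimal sketch of combining the two inside a pipeline script step (an Azure CLI step with access to the vault is assumed; the vault, secret, and variable names are placeholders, and linking a Key Vault-backed variable group or using the Key Vault task is often the simpler route):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Pull a secret from Key Vault and expose it to later steps as a *secret*
# pipeline variable so that it is masked in the logs.
VAULT_NAME="my-project-kv"
SECRET_NAME="storage-connection-string"

secret_value=$(az keyvault secret show \
  --vault-name "$VAULT_NAME" \
  --name "$SECRET_NAME" \
  --query value -o tsv)

# issecret=true keeps the value masked in pipeline output.
echo "##vso[task.setvariable variable=storageConnectionString;issecret=true]${secret_value}"
```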
Does it need to keep secrets and manage them, did you consider using Azure KeyVault?","title":"Permission Check & Security"},{"location":"code-reviews/recipes/azure-pipelines-yaml/#troubleshooting-tips","text":"Consider Variable Syntax with Runtime Expressions in the pipeline. Here is a nice sample to understand Expansion of variables . When we assign variable like below it won't set during initialize time, it'll assign during runtime, then we can retrieve some errors based on when template runs. - task : AzureWebApp@1 displayName : 'Deploy Azure Web App : $(webAppName)' inputs : azureSubscription : '$(azureServiceConnectionId)' appName : '$(webAppName)' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Error: After passing these variables as parameter, it loads values properly. - template : steps-deployment.yaml parameters : azureServiceConnectionId : ${{ variables.azureServiceConnectionId }} webAppName : ${{ variables.webAppName }} - task : AzureWebApp@1 displayName : 'Deploy Azure Web App :${{ parameters.webAppName }}' inputs : azureSubscription : '${{ parameters.azureServiceConnectionId }}' appName : '${{ parameters.webAppName }}' package : $(Pipeline.Workspace)/drop/Application$(Build.BuildId).zip startUpCommand : 'gunicorn --bind=0.0.0.0 --workers=4 app:app' Use issecret for printing secrets for debugging echo \"##vso[task.setvariable variable=token;issecret=true] ${ token } \"","title":"Troubleshooting Tips"},{"location":"code-reviews/recipes/bash/","text":"Bash Code Reviews Style Guide Developers should follow Google's Bash Style Guide . Code Analysis / Linting Projects must check bash code with shellcheck as part of the CI process . Apart from linting, shfmt can be used to automatically format shell scripts. There are few vscode code extensions which are based on shfmt like shell-format which can be used to automatically format shell scripts. Project Setup vscode-shellcheck Shellcheck extension should be used in VS Code, it provides static code analysis capabilities and auto fixing linting issues. To use vscode-shellcheck in vscode do the following: Install shellcheck on Your Machine For macOS brew install shellcheck For Ubuntu: apt-get install shellcheck Install shellcheck on VSCode Find the vscode-shellcheck extension in vscode and install it. Automatic Code Formatting shell-format shell-format extension does automatic formatting of your bash scripts, docker files and several configuration files. It is dependent on shfmt which can enforce google style guide checks for bash. To use shell-format in vscode do the following: Install shfmt on Your Machine Requires Go 1.13 or Later GO111MODULE = on go get mvdan.cc/sh/v3/cmd/shfmt Install shell-format on VSCode Find the shell-format extension in vscode and install it. Build Validation To automate this process in Azure DevOps you can add the following snippet to you azure-pipelines.yaml file. This will lint any scripts in the ./scripts/ folder. - bash : | echo \"This checks for formatting and common bash errors. 
See wiki for error details and ignore options: https://github.com/koalaman/shellcheck/wiki/SC1000\" export scversion=\"stable\" wget -qO- \"https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz\" | tar -xJv sudo mv \"shellcheck-${scversion}/shellcheck\" /usr/bin/ rm -r \"shellcheck-${scversion}\" shellcheck ./scripts/*.sh displayName : \"Validate Scripts: Shellcheck\" Also, your shell scripts can be formatted in your build pipeline by using the shfmt tool. To integrate shfmt in your build pipeline do the following: - bash : | echo \"This step does auto formatting of shell scripts\" shfmt -l -w ./scripts/*.sh displayName : \"Format Scripts: shfmt\" Unit testing using shunit2 can also be added to the build pipeline, using the following block: - bash : | echo \"This step unit tests shell scripts by using shunit2\" ./shunit2 displayName : \"Format Scripts: shfmt\" Pre-Commit Hooks All developers should run shellcheck and shfmt as pre-commit hooks. Step 1- Install pre-commit Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew. Step 2- Add shellcheck and shfmt Add .pre-commit-config.yaml file to root of the go project. Run shfmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/pecigonzalo/pre-commit-fmt sha : master hooks : - id : shell-fmt args : - --indent=4 - repo : https://github.com/shellcheck-py/shellcheck-py rev : v0.7.1.1 hooks : - id : shellcheck Step 3 Run $ pre-commit install to set up the git hook scripts Dependencies Bash scripts are often used to 'glue together' other systems and tools. As such, Bash scripts can often have numerous and/or complicated dependencies. Consider using Docker containers to ensure that scripts are executed in a portable and reproducible environment that is guaranteed to contain all the correct dependencies. To ensure that dockerized scripts are nevertheless easy to execute, consider making the use of Docker transparent to the script's caller by wrapping the script in a 'bootstrap' which checks whether the script is running in Docker and re-executes itself in Docker if it's not the case. This provides the best of both worlds: easy script execution and consistent environments. if [[ \" ${ DOCKER } \" ! = \"true\" ]] ; then docker build -t my_script -f my_script.Dockerfile . > /dev/null docker run -e DOCKER = true my_script \" $@ \" exit $? fi # ... implementation of my_script here can assume that all of its dependencies exist since it's always running in Docker ... Code Review Checklist In addition to the Code Review Checklist you should also look for these bash specific code review items Does this code use Built-in Shell Options like set -o, set -e, set -u for execution control of shell scripts ? Is the code modularized? Shell scripts can be modularized like python modules. Portions of bash scripts should be sourced in complex bash projects. Are all exceptions handled correctly? Exceptions should be handled correctly using exit codes or trapping signals. Does the code pass all linting checks as per shellcheck and unit tests as per shunit2 ? Does the code uses relative paths or absolute paths? Relative paths should be avoided as they are prone to environment attacks. If relative path is needed, check that the PATH variable is set. Does the code take credentials as user input? Are the credentials masked or encrypted in the script? 
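As a companion to the checklist items on built-in shell options, signal trapping, and exit codes, here is a minimal sketch of a script skeleton that a reviewer could expect to see; the temp directory and the shellcheck dependency check are illustrative only:

```bash
#!/usr/bin/env bash
# Strict mode: the built-in shell options the checklist asks about.
set -o errexit   # stop on the first failing command
set -o nounset   # using an unset variable is an error
set -o pipefail  # a pipeline fails if any stage fails

WORK_DIR="$(mktemp -d)"
readonly WORK_DIR

# Clean up the temp directory no matter how the script exits
# (normal exit, failure, or an interrupting signal).
cleanup() {
  rm -rf "$WORK_DIR"
}
trap cleanup EXIT

main() {
  # Fail early with a distinct exit code if a required tool is missing.
  command -v shellcheck >/dev/null 2>&1 || { echo "shellcheck is required" >&2; exit 2; }
  echo "working in $WORK_DIR"
}

main "$@"
```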
S","title":"Bash Code Reviews"},{"location":"code-reviews/recipes/bash/#bash-code-reviews","text":"","title":"Bash Code Reviews"},{"location":"code-reviews/recipes/bash/#style-guide","text":"Developers should follow Google's Bash Style Guide .","title":"Style Guide"},{"location":"code-reviews/recipes/bash/#code-analysis-linting","text":"Projects must check bash code with shellcheck as part of the CI process . Apart from linting, shfmt can be used to automatically format shell scripts. There are few vscode code extensions which are based on shfmt like shell-format which can be used to automatically format shell scripts.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/bash/#project-setup","text":"","title":"Project Setup"},{"location":"code-reviews/recipes/bash/#vscode-shellcheck","text":"Shellcheck extension should be used in VS Code, it provides static code analysis capabilities and auto fixing linting issues. To use vscode-shellcheck in vscode do the following:","title":"vscode-shellcheck"},{"location":"code-reviews/recipes/bash/#install-shellcheck-on-your-machine","text":"For macOS brew install shellcheck For Ubuntu: apt-get install shellcheck","title":"Install shellcheck on Your Machine"},{"location":"code-reviews/recipes/bash/#install-shellcheck-on-vscode","text":"Find the vscode-shellcheck extension in vscode and install it.","title":"Install shellcheck on VSCode"},{"location":"code-reviews/recipes/bash/#automatic-code-formatting","text":"","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/bash/#shell-format","text":"shell-format extension does automatic formatting of your bash scripts, docker files and several configuration files. It is dependent on shfmt which can enforce google style guide checks for bash. To use shell-format in vscode do the following:","title":"shell-format"},{"location":"code-reviews/recipes/bash/#install-shfmt-on-your-machine","text":"Requires Go 1.13 or Later GO111MODULE = on go get mvdan.cc/sh/v3/cmd/shfmt","title":"Install shfmt on Your Machine"},{"location":"code-reviews/recipes/bash/#install-shell-format-on-vscode","text":"Find the shell-format extension in vscode and install it.","title":"Install shell-format on VSCode"},{"location":"code-reviews/recipes/bash/#build-validation","text":"To automate this process in Azure DevOps you can add the following snippet to you azure-pipelines.yaml file. This will lint any scripts in the ./scripts/ folder. - bash : | echo \"This checks for formatting and common bash errors. See wiki for error details and ignore options: https://github.com/koalaman/shellcheck/wiki/SC1000\" export scversion=\"stable\" wget -qO- \"https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz\" | tar -xJv sudo mv \"shellcheck-${scversion}/shellcheck\" /usr/bin/ rm -r \"shellcheck-${scversion}\" shellcheck ./scripts/*.sh displayName : \"Validate Scripts: Shellcheck\" Also, your shell scripts can be formatted in your build pipeline by using the shfmt tool. 
To integrate shfmt in your build pipeline do the following: - bash : | echo \"This step does auto formatting of shell scripts\" shfmt -l -w ./scripts/*.sh displayName : \"Format Scripts: shfmt\" Unit testing using shunit2 can also be added to the build pipeline, using the following block: - bash : | echo \"This step unit tests shell scripts by using shunit2\" ./shunit2 displayName : \"Format Scripts: shfmt\"","title":"Build Validation"},{"location":"code-reviews/recipes/bash/#pre-commit-hooks","text":"All developers should run shellcheck and shfmt as pre-commit hooks.","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/bash/#step-1-install-pre-commit","text":"Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew.","title":"Step 1- Install pre-commit"},{"location":"code-reviews/recipes/bash/#step-2-add-shellcheck-and-shfmt","text":"Add .pre-commit-config.yaml file to root of the go project. Run shfmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/pecigonzalo/pre-commit-fmt sha : master hooks : - id : shell-fmt args : - --indent=4 - repo : https://github.com/shellcheck-py/shellcheck-py rev : v0.7.1.1 hooks : - id : shellcheck","title":"Step 2- Add shellcheck and shfmt"},{"location":"code-reviews/recipes/bash/#step-3","text":"Run $ pre-commit install to set up the git hook scripts","title":"Step 3"},{"location":"code-reviews/recipes/bash/#dependencies","text":"Bash scripts are often used to 'glue together' other systems and tools. As such, Bash scripts can often have numerous and/or complicated dependencies. Consider using Docker containers to ensure that scripts are executed in a portable and reproducible environment that is guaranteed to contain all the correct dependencies. To ensure that dockerized scripts are nevertheless easy to execute, consider making the use of Docker transparent to the script's caller by wrapping the script in a 'bootstrap' which checks whether the script is running in Docker and re-executes itself in Docker if it's not the case. This provides the best of both worlds: easy script execution and consistent environments. if [[ \" ${ DOCKER } \" ! = \"true\" ]] ; then docker build -t my_script -f my_script.Dockerfile . > /dev/null docker run -e DOCKER = true my_script \" $@ \" exit $? fi # ... implementation of my_script here can assume that all of its dependencies exist since it's always running in Docker ...","title":"Dependencies"},{"location":"code-reviews/recipes/bash/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these bash specific code review items Does this code use Built-in Shell Options like set -o, set -e, set -u for execution control of shell scripts ? Is the code modularized? Shell scripts can be modularized like python modules. Portions of bash scripts should be sourced in complex bash projects. Are all exceptions handled correctly? Exceptions should be handled correctly using exit codes or trapping signals. Does the code pass all linting checks as per shellcheck and unit tests as per shunit2 ? Does the code uses relative paths or absolute paths? Relative paths should be avoided as they are prone to environment attacks. If relative path is needed, check that the PATH variable is set. Does the code take credentials as user input? Are the credentials masked or encrypted in the script? 
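Once the `.pre-commit-config.yaml` described above is in place, wiring the hooks into a fresh clone is a short, repeatable sequence; a minimal sketch:

```bash
#!/usr/bin/env bash
set -euo pipefail

# One-time setup of the pre-commit hooks, then a full run across the
# repository so existing files are checked as well, not just new commits.
pip install pre-commit          # or: brew install pre-commit
pre-commit install              # registers the git hook under .git/hooks
pre-commit run --all-files      # run shellcheck/shfmt against every file once
```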
S","title":"Code Review Checklist"},{"location":"code-reviews/recipes/csharp/","text":"C# Code Reviews Style Guide Developers should follow Microsoft's C# Coding Conventions and, where applicable, Microsoft's Secure Coding Guidelines . Code Analysis / Linting We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers / linters to enforce consistency and style rules. Project Setup We recommend using a common setup for your solution that you can refer to in all the projects that are part of the solution. Create a common.props file that contains the defaults for all of your projects: <Project> ... <ItemGroup> <PackageReference Include= \"Microsoft.CodeAnalysis.NetAnalyzers\" Version= \"5.0.3\" > <PrivateAssets> all </PrivateAssets> <IncludeAssets> runtime; build; native; contentfiles; analyzers; buildtransitive </IncludeAssets> </PackageReference> <PackageReference Include= \"StyleCop.Analyzers\" Version= \"1.1.118\" > <PrivateAssets> all </PrivateAssets> <IncludeAssets> runtime; build; native; contentfiles; analyzers; buildtransitive </IncludeAssets> </PackageReference> </ItemGroup> <PropertyGroup> <TreatWarningsAsErrors> true </TreatWarningsAsErrors> </PropertyGroup> <ItemGroup Condition= \"Exists('$(MSBuildThisFileDirectory)../.editorconfig')\" > <AdditionalFiles Include= \"$(MSBuildThisFileDirectory)../.editorconfig\" /> </ItemGroup> ... </Project> You can then reference the common.props in your other project files to ensure a consistent setup. <Project Sdk= \"Microsoft.NET.Sdk.Web\" > <Import Project= \"..\\common.props\" /> </Project> The .editorconfig allows for configuration and overrides of rules. You can have an .editorconfig file at project level to customize rules for different projects (test projects for example). Details about the configuration of different rules . .NET analyzers Microsoft's .NET analyzers has code quality rules and .NET API usage rules implemented as analyzers using the .NET Compiler Platform (Roslyn). This is the replacement for Microsoft's legacy FxCop analyzers. Enable or install first-party .NET analyzers . If you are currently using the legacy FxCop analyzers, migrate from FxCop analyzers to .NET analyzers . StyleCop Analyzer The StyleCop analyzer is a nuget package (StyleCop.Analyzers) that can be installed in any of your projects. It's mainly around code style rules and makes sure the team is following the same rules without having subjective discussions about braces and spaces. Detailed information can be found here: StyleCop Analyzers for the .NET Compiler Platform . The minimum rules set teams should adopt is the Managed Recommended Rules rule set. Automatic Code Formatting Use .editorconfig to configure code formatting rules in your project. Build Validation It's important that you enforce your code style and rules in the CI to avoid any team member merging code that does not comply with your standards into your git repo. If you are using FxCop analyzers and StyleCop analyzer, it's very simple to enable those in the CI. You have to make sure you are setting up the project using nuget and .editorconfig ( see Project setup ). Once you have this setup, you will have to configure the pipeline to build your code. That's pretty much it. The FxCop analyzers will run and report the result in your build pipeline. If there are rules that are violated, your build will be red. 
- task : DotNetCoreCLI@2 displayName : 'Style Check & Build' inputs : command : 'build' projects : '**/*.csproj' Enable Roslyn Support in VSCode The above steps also work in VS Code provided you enable Roslyn support for Omnisharp. The setting is omnisharp.enableRoslynAnalyzers and must be set to true . After enabling this setting you must \"Restart Omnisharp\" (this can be done from the Command Palette in VS Code or by restarting VS Code). Code Review Checklist In addition to the Code Review Checklist you should also look for these C# specific code review items Does this code make correct use of asynchronous programming constructs , including proper use of await and Task.WhenAll including CancellationTokens? Is the code subject to concurrency issues? Are shared objects properly protected? Is dependency injection (DI) used? Is it setup correctly? Are middleware included in this project configured correctly? Are resources released deterministically using the IDispose pattern? Are all disposable objects properly disposed ( using pattern )? Is the code creating a lot of short-lived objects. Could we optimize GC pressure? Is the code written in a way that causes boxing operations to happen? Does the code handle exceptions correctly ? Is package management being used (NuGet) instead of committing DLLs? Does this code use LINQ appropriately? Pulling LINQ into a project to replace a single short loop or in ways that do not perform well are usually not appropriate. Does this code properly validate arguments sanity (i.e. CA1062 )? Consider leveraging extensions such as Ensure.That Does this code include telemetry ( metrics, tracing and logging ) instrumentation? Does this code leverage the options design pattern by using classes to provide strongly typed access to groups of related settings? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why this is here. If the number is repetitive, is there a constant/enum or equivalent? Is proper exception handling set up? Catching the exception base class ( catch (Exception) ) is generally not the right pattern. Instead, catch the specific exceptions that can happen e.g., IOException . Is the use of #pragma fair? Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? If there is an asynchronous method, does the name of the method end with the Async suffix? If a method is asynchronous, is Task.Delay used instead of Thread.Sleep ? Task.Delay is not blocking the current thread and creates a task that will complete without blocking the thread, so in a multi-threaded, multi-task environment, this is the one to prefer. Is a cancellation token for asynchronous tasks needed rather than bool patterns? Is a minimum level of logging in place? Are the logging levels used sensible? Are internal vs private vs public classes and methods used the right way? Are auto property set and get used the right way? In a model without constructor and for deserialization, it is ok to have all accessible. For other classes usually a private set or internal set is better. Is the using pattern for streams and other disposable classes used? If not, better to have the Dispose method called explicitly. Are the classes that maintain collections in memory, thread safe? 
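Before opening a PR, the style check and build enforced in the pipeline above can be reproduced locally. A minimal sketch, assuming the .NET 6+ SDK (so `dotnet format` is built in) and a placeholder solution name:

```bash
#!/usr/bin/env bash
set -euo pipefail

SOLUTION="MySolution.sln"   # placeholder

dotnet restore "$SOLUTION"

# Fail if the code is not formatted according to .editorconfig.
dotnet format "$SOLUTION" --verify-no-changes

# Build with analyzers; TreatWarningsAsErrors in common.props turns analyzer
# violations into build failures, mirroring the CI behaviour.
dotnet build "$SOLUTION" --no-restore -warnaserror
```

Running this as a pre-push habit keeps red builds, and review nits about formatting, to a minimum.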
When used under concurrency, use lock pattern.","title":"C# Code Reviews"},{"location":"code-reviews/recipes/csharp/#c-code-reviews","text":"","title":"C# Code Reviews"},{"location":"code-reviews/recipes/csharp/#style-guide","text":"Developers should follow Microsoft's C# Coding Conventions and, where applicable, Microsoft's Secure Coding Guidelines .","title":"Style Guide"},{"location":"code-reviews/recipes/csharp/#code-analysis-linting","text":"We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers / linters to enforce consistency and style rules.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/csharp/#project-setup","text":"We recommend using a common setup for your solution that you can refer to in all the projects that are part of the solution. Create a common.props file that contains the defaults for all of your projects: <Project> ... <ItemGroup> <PackageReference Include= \"Microsoft.CodeAnalysis.NetAnalyzers\" Version= \"5.0.3\" > <PrivateAssets> all </PrivateAssets> <IncludeAssets> runtime; build; native; contentfiles; analyzers; buildtransitive </IncludeAssets> </PackageReference> <PackageReference Include= \"StyleCop.Analyzers\" Version= \"1.1.118\" > <PrivateAssets> all </PrivateAssets> <IncludeAssets> runtime; build; native; contentfiles; analyzers; buildtransitive </IncludeAssets> </PackageReference> </ItemGroup> <PropertyGroup> <TreatWarningsAsErrors> true </TreatWarningsAsErrors> </PropertyGroup> <ItemGroup Condition= \"Exists('$(MSBuildThisFileDirectory)../.editorconfig')\" > <AdditionalFiles Include= \"$(MSBuildThisFileDirectory)../.editorconfig\" /> </ItemGroup> ... </Project> You can then reference the common.props in your other project files to ensure a consistent setup. <Project Sdk= \"Microsoft.NET.Sdk.Web\" > <Import Project= \"..\\common.props\" /> </Project> The .editorconfig allows for configuration and overrides of rules. You can have an .editorconfig file at project level to customize rules for different projects (test projects for example). Details about the configuration of different rules .","title":"Project Setup"},{"location":"code-reviews/recipes/csharp/#net-analyzers","text":"Microsoft's .NET analyzers has code quality rules and .NET API usage rules implemented as analyzers using the .NET Compiler Platform (Roslyn). This is the replacement for Microsoft's legacy FxCop analyzers. Enable or install first-party .NET analyzers . If you are currently using the legacy FxCop analyzers, migrate from FxCop analyzers to .NET analyzers .","title":".NET analyzers"},{"location":"code-reviews/recipes/csharp/#stylecop-analyzer","text":"The StyleCop analyzer is a nuget package (StyleCop.Analyzers) that can be installed in any of your projects. It's mainly around code style rules and makes sure the team is following the same rules without having subjective discussions about braces and spaces. Detailed information can be found here: StyleCop Analyzers for the .NET Compiler Platform . 
The minimum rules set teams should adopt is the Managed Recommended Rules rule set.","title":"StyleCop Analyzer"},{"location":"code-reviews/recipes/csharp/#automatic-code-formatting","text":"Use .editorconfig to configure code formatting rules in your project.","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/csharp/#build-validation","text":"It's important that you enforce your code style and rules in the CI to avoid any team member merging code that does not comply with your standards into your git repo. If you are using FxCop analyzers and StyleCop analyzer, it's very simple to enable those in the CI. You have to make sure you are setting up the project using nuget and .editorconfig ( see Project setup ). Once you have this setup, you will have to configure the pipeline to build your code. That's pretty much it. The FxCop analyzers will run and report the result in your build pipeline. If there are rules that are violated, your build will be red. - task : DotNetCoreCLI@2 displayName : 'Style Check & Build' inputs : command : 'build' projects : '**/*.csproj'","title":"Build Validation"},{"location":"code-reviews/recipes/csharp/#enable-roslyn-support-in-vscode","text":"The above steps also work in VS Code provided you enable Roslyn support for Omnisharp. The setting is omnisharp.enableRoslynAnalyzers and must be set to true . After enabling this setting you must \"Restart Omnisharp\" (this can be done from the Command Palette in VS Code or by restarting VS Code).","title":"Enable Roslyn Support in VSCode"},{"location":"code-reviews/recipes/csharp/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these C# specific code review items Does this code make correct use of asynchronous programming constructs , including proper use of await and Task.WhenAll including CancellationTokens? Is the code subject to concurrency issues? Are shared objects properly protected? Is dependency injection (DI) used? Is it setup correctly? Are middleware included in this project configured correctly? Are resources released deterministically using the IDispose pattern? Are all disposable objects properly disposed ( using pattern )? Is the code creating a lot of short-lived objects. Could we optimize GC pressure? Is the code written in a way that causes boxing operations to happen? Does the code handle exceptions correctly ? Is package management being used (NuGet) instead of committing DLLs? Does this code use LINQ appropriately? Pulling LINQ into a project to replace a single short loop or in ways that do not perform well are usually not appropriate. Does this code properly validate arguments sanity (i.e. CA1062 )? Consider leveraging extensions such as Ensure.That Does this code include telemetry ( metrics, tracing and logging ) instrumentation? Does this code leverage the options design pattern by using classes to provide strongly typed access to groups of related settings? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why this is here. If the number is repetitive, is there a constant/enum or equivalent? Is proper exception handling set up? Catching the exception base class ( catch (Exception) ) is generally not the right pattern. Instead, catch the specific exceptions that can happen e.g., IOException . Is the use of #pragma fair? 
Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? If there is an asynchronous method, does the name of the method end with the Async suffix? If a method is asynchronous, is Task.Delay used instead of Thread.Sleep ? Task.Delay is not blocking the current thread and creates a task that will complete without blocking the thread, so in a multi-threaded, multi-task environment, this is the one to prefer. Is a cancellation token for asynchronous tasks needed rather than bool patterns? Is a minimum level of logging in place? Are the logging levels used sensible? Are internal vs private vs public classes and methods used the right way? Are auto property set and get used the right way? In a model without constructor and for deserialization, it is ok to have all accessible. For other classes usually a private set or internal set is better. Is the using pattern for streams and other disposable classes used? If not, better to have the Dispose method called explicitly. Are the classes that maintain collections in memory, thread safe? When used under concurrency, use lock pattern.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/go/","text":"Go Code Reviews Style Guide Developers should follow the Effective Go Style Guide. Code Analysis / Linting Project Setup Below is the project setup that you would like to have in your VS Code. VSCode go Extension Using the Go extension for Visual Studio Code, you get language features like IntelliSense, code navigation, symbol search, bracket matching, snippets, etc. This extension includes rich language support for go in VS Code. go vet go vet is a static analysis tool that checks for common go errors, such as incorrect use of range loop variables or misaligned printf arguments. Go code should be able to build with no go vet errors. This will be part of vscode-go extension. golint Note: The golint library is deprecated and archived. The linter revive (below) might be a suitable replacement. golint can be an effective tool for finding many issues, but it errors on the side of false positives. It is best used by developers when working on code, not as part of an automated build process. This is the default linter which is set up as part of the vscode-go extension. revive Revive is a linter for go, it provides a framework for development of custom rules, and lets you define a strict preset for enhancing your development & code review processes. Automatic Code Formatting gofmt gofmt is the automated code format style guide for Go. This is part of the vs-code extension, and it is enabled by default to run on save of every file. Aggregator golangci-lint golangci-lint is the replacement for the now deprecated gometalinter . It is 2-7x faster than gometalinter along with a host of other benefits . golangci-lint is a powerful, customizable aggregator of linters. By default, several are enabled but not all. A full list of linters and their usages can be found here . It will allow you to configure each linter and choose which ones you would like to enable in your project. One awesome feature of golangci-lint is that is can be easily introduced to an existing large codebase using the --new-from-rev COMMITID . With this setting only newly introduced issues are flagged, allowing a team to improve new code without having to fix all historic issues in a large codebase. This provides a great path to improving code-reviews on existing solutions. golangci-lint can also be setup as the default linter in VS Code. 
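As an illustration of the --new-from-rev workflow described above, a minimal .golangci.yml sketch could look like the following; the linter selection and base revision are examples rather than a prescribed configuration.

```yaml
# Example .golangci.yml - linter choice and base revision are illustrative only.
run:
  timeout: 5m

linters:
  enable:
    - gofmt
    - govet
    - revive

issues:
  # Only report issues introduced after this revision (adjust to your base branch).
  new-from-rev: origin/main
```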
Installation options for golangci-lint are present at golangci-lint . To use golangci-lint with VS Code, use the below recommended settings: \"go.lintTool\" : \"golangci-lint\" , \"go.lintFlags\" : [ \"--fast\" ] Pre-Commit Hooks All developers should run gofmt in a pre-commit hook to ensure standard formatting. Step 1- Install pre-commit Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew. Step 2- Add go-fmt in pre-commit Add .pre-commit-config.yaml file to root of the go project. Run go-fmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/dnephin/pre-commit-golang rev : master hooks : - id : go-fmt Step 3 Run $ pre-commit install to set up the git hook scripts Build Validation gofmt should be run as a part of every build to enforce the common standard. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will format any scripts in the ./scripts/ folder. - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" govet should be run as a part of every build to check code linting. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will check linting of any scripts in the ./scripts/ folder. - script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\" Alternatively you can use golangci-lint as a step in the pipeline to do multiple enabled validations(including go vet and go fmt) of golangci-lint. - script : golangci-lint run --enable gofmt --fix workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\" Sample Build Validation Pipeline in Azure DevOps trigger : master pool : vmImage : 'ubuntu-latest' steps : - task : GoTool@0 inputs : version : '1.13.5' - task : Go@0 inputs : command : 'get' arguments : '-d' workingDirectory : '$(System.DefaultWorkingDirectory)/scripts' - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" - script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : 'Run go vet' - task : Go@0 inputs : command : 'build' workingDirectory : '$(System.DefaultWorkingDirectory)' - task : CopyFiles@2 inputs : TargetFolder : '$(Build.ArtifactStagingDirectory)' - task : PublishBuildArtifacts@1 inputs : artifactName : drop Code Review Checklist The Go language team maintains a list of common Code Review Comments for go that form the basis for a solid checklist for a team working in Go that should be followed in addition to the ISE Code Review Checklist Does this code handle errors correctly? This includes not throwing away errors with _ assignments and returning errors, instead of in-band error values ? Does this code follow Go standards for method receiver types ? Does this code pass values when it should? Are interfaces in this code defined in the correct packages ? Do go-routines in this code have clear lifetimes ? Is parallelism in this code handled via go-routines and channels with synchronous methods ? Does this code have meaningful Doc Comments ? Does this code have meaningful Package Comments ? Does this code use Contexts correctly? 
Do unit tests fail with meaningful messages ?","title":"Go Code Reviews"},{"location":"code-reviews/recipes/go/#go-code-reviews","text":"","title":"Go Code Reviews"},{"location":"code-reviews/recipes/go/#style-guide","text":"Developers should follow the Effective Go Style Guide.","title":"Style Guide"},{"location":"code-reviews/recipes/go/#code-analysis-linting","text":"","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/go/#project-setup","text":"Below is the project setup that you would like to have in your VS Code.","title":"Project Setup"},{"location":"code-reviews/recipes/go/#vscode-go-extension","text":"Using the Go extension for Visual Studio Code, you get language features like IntelliSense, code navigation, symbol search, bracket matching, snippets, etc. This extension includes rich language support for go in VS Code.","title":"VSCode go Extension"},{"location":"code-reviews/recipes/go/#go-vet","text":"go vet is a static analysis tool that checks for common go errors, such as incorrect use of range loop variables or misaligned printf arguments. Go code should be able to build with no go vet errors. This will be part of vscode-go extension.","title":"go vet"},{"location":"code-reviews/recipes/go/#golint","text":"Note: The golint library is deprecated and archived. The linter revive (below) might be a suitable replacement. golint can be an effective tool for finding many issues, but it errors on the side of false positives. It is best used by developers when working on code, not as part of an automated build process. This is the default linter which is set up as part of the vscode-go extension.","title":"golint"},{"location":"code-reviews/recipes/go/#revive","text":"Revive is a linter for go, it provides a framework for development of custom rules, and lets you define a strict preset for enhancing your development & code review processes.","title":"revive"},{"location":"code-reviews/recipes/go/#automatic-code-formatting","text":"","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/go/#gofmt","text":"gofmt is the automated code format style guide for Go. This is part of the vs-code extension, and it is enabled by default to run on save of every file.","title":"gofmt"},{"location":"code-reviews/recipes/go/#aggregator","text":"","title":"Aggregator"},{"location":"code-reviews/recipes/go/#golangci-lint","text":"golangci-lint is the replacement for the now deprecated gometalinter . It is 2-7x faster than gometalinter along with a host of other benefits . golangci-lint is a powerful, customizable aggregator of linters. By default, several are enabled but not all. A full list of linters and their usages can be found here . It will allow you to configure each linter and choose which ones you would like to enable in your project. One awesome feature of golangci-lint is that is can be easily introduced to an existing large codebase using the --new-from-rev COMMITID . With this setting only newly introduced issues are flagged, allowing a team to improve new code without having to fix all historic issues in a large codebase. This provides a great path to improving code-reviews on existing solutions. golangci-lint can also be setup as the default linter in VS Code. Installation options for golangci-lint are present at golangci-lint . 
To use golangci-lint with VS Code, use the below recommended settings: \"go.lintTool\" : \"golangci-lint\" , \"go.lintFlags\" : [ \"--fast\" ]","title":"golangci-lint"},{"location":"code-reviews/recipes/go/#pre-commit-hooks","text":"All developers should run gofmt in a pre-commit hook to ensure standard formatting.","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/go/#step-1-install-pre-commit","text":"Run pip install pre-commit to install pre-commit. Alternatively you can run brew install pre-commit if you are using homebrew.","title":"Step 1- Install pre-commit"},{"location":"code-reviews/recipes/go/#step-2-add-go-fmt-in-pre-commit","text":"Add .pre-commit-config.yaml file to root of the go project. Run go-fmt on pre-commit by adding it to .pre-commit-config.yaml file like below. - repo : git://github.com/dnephin/pre-commit-golang rev : master hooks : - id : go-fmt","title":"Step 2- Add go-fmt in pre-commit"},{"location":"code-reviews/recipes/go/#step-3","text":"Run $ pre-commit install to set up the git hook scripts","title":"Step 3"},{"location":"code-reviews/recipes/go/#build-validation","text":"gofmt should be run as a part of every build to enforce the common standard. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will format any scripts in the ./scripts/ folder. - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" govet should be run as a part of every build to check code linting. To automate this process in Azure DevOps you can add the following snippet to your azure-pipelines.yaml file. This will check linting of any scripts in the ./scripts/ folder. - script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\" Alternatively you can use golangci-lint as a step in the pipeline to do multiple enabled validations(including go vet and go fmt) of golangci-lint. - script : golangci-lint run --enable gofmt --fix workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code linting\"","title":"Build Validation"},{"location":"code-reviews/recipes/go/#sample-build-validation-pipeline-in-azure-devops","text":"trigger : master pool : vmImage : 'ubuntu-latest' steps : - task : GoTool@0 inputs : version : '1.13.5' - task : Go@0 inputs : command : 'get' arguments : '-d' workingDirectory : '$(System.DefaultWorkingDirectory)/scripts' - script : go fmt workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : \"Run code formatting\" - script : go vet workingDirectory : $(System.DefaultWorkingDirectory)/scripts displayName : 'Run go vet' - task : Go@0 inputs : command : 'build' workingDirectory : '$(System.DefaultWorkingDirectory)' - task : CopyFiles@2 inputs : TargetFolder : '$(Build.ArtifactStagingDirectory)' - task : PublishBuildArtifacts@1 inputs : artifactName : drop","title":"Sample Build Validation Pipeline in Azure DevOps"},{"location":"code-reviews/recipes/go/#code-review-checklist","text":"The Go language team maintains a list of common Code Review Comments for go that form the basis for a solid checklist for a team working in Go that should be followed in addition to the ISE Code Review Checklist Does this code handle errors correctly? This includes not throwing away errors with _ assignments and returning errors, instead of in-band error values ? Does this code follow Go standards for method receiver types ? Does this code pass values when it should? 
Are interfaces in this code defined in the correct packages ? Do go-routines in this code have clear lifetimes ? Is parallelism in this code handled via go-routines and channels with synchronous methods ? Does this code have meaningful Doc Comments ? Does this code have meaningful Package Comments ? Does this code use Contexts correctly? Do unit tests fail with meaningful messages ?","title":"Code Review Checklist"},{"location":"code-reviews/recipes/java/","text":"Java Code Reviews Java Style Guide Developers should follow the Google Java Style Guide . Code Analysis / Linting We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers to enforce consistency and style rules. We make use of Checkstyle using the same configuration used in the Azure Java SDK . FindBugs and PMD are also commonly used. Automatic Code Formatting Eclipse, and other Java IDEs, support automatic code formatting. If using Maven, some developers also make use of the formatter-maven-plugin . Build Validation It's important to enforce your code style and rules in the CI to avoid any team members merging code that does not comply with standards into your git repo. If building using Azure DevOps, Azure DevOps support Maven and Gradle build tasks using PMD , Checkstyle , and FindBugs code analysis tools as part of every build. Here is an example yaml for a Maven build task with all three analysis tools enabled: - task : Maven@3 displayName : 'Maven pom.xml' inputs : mavenPomFile : '$(Parameters.mavenPOMFile)' checkStyleRunAnalysis : true pmdRunAnalysis : true findBugsRunAnalysis : true Here is an example yaml for a Gradle build task with all three analysis tools enabled: - task : Gradle@2 displayName : 'gradlew build' inputs : checkStyleRunAnalysis : true findBugsRunAnalysis : true pmdRunAnalysis : true Code Review Checklist In addition to the Code Review Checklist you should also look for these Java specific code review items Does the project use Lambda to make code cleaner? Is dependency injection (DI) used? Is it setup correctly? If the code uses Spring Boot, are you using @Inject instead of @Autowire? Does the code handle exceptions correctly? Is the Azul Zulu OpenJDK being used? Is a build automation and package management tool (Gradle or Maven) being used?","title":"Java Code Reviews"},{"location":"code-reviews/recipes/java/#java-code-reviews","text":"","title":"Java Code Reviews"},{"location":"code-reviews/recipes/java/#java-style-guide","text":"Developers should follow the Google Java Style Guide .","title":"Java Style Guide"},{"location":"code-reviews/recipes/java/#code-analysis-linting","text":"We strongly believe that consistent style increases readability and maintainability of a code base. Hence, we are recommending analyzers to enforce consistency and style rules. We make use of Checkstyle using the same configuration used in the Azure Java SDK . FindBugs and PMD are also commonly used.","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/java/#automatic-code-formatting","text":"Eclipse, and other Java IDEs, support automatic code formatting. If using Maven, some developers also make use of the formatter-maven-plugin .","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/java/#build-validation","text":"It's important to enforce your code style and rules in the CI to avoid any team members merging code that does not comply with standards into your git repo. 
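For the formatter-maven-plugin mentioned above, a sketch of the pom.xml wiring is shown below; the plugin version is an assumption and should be pinned to whatever your team has validated. The Azure DevOps build tasks that run the analyzers follow.

```xml
<!-- Example pom.xml fragment for formatter-maven-plugin; the version shown is illustrative. -->
<build>
  <plugins>
    <plugin>
      <groupId>net.revelc.code.formatter</groupId>
      <artifactId>formatter-maven-plugin</artifactId>
      <version>2.23.0</version>
      <executions>
        <execution>
          <goals>
            <!-- "validate" fails the build on unformatted code; "format" rewrites files locally. -->
            <goal>validate</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```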
If building using Azure DevOps, Azure DevOps support Maven and Gradle build tasks using PMD , Checkstyle , and FindBugs code analysis tools as part of every build. Here is an example yaml for a Maven build task with all three analysis tools enabled: - task : Maven@3 displayName : 'Maven pom.xml' inputs : mavenPomFile : '$(Parameters.mavenPOMFile)' checkStyleRunAnalysis : true pmdRunAnalysis : true findBugsRunAnalysis : true Here is an example yaml for a Gradle build task with all three analysis tools enabled: - task : Gradle@2 displayName : 'gradlew build' inputs : checkStyleRunAnalysis : true findBugsRunAnalysis : true pmdRunAnalysis : true","title":"Build Validation"},{"location":"code-reviews/recipes/java/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these Java specific code review items Does the project use Lambda to make code cleaner? Is dependency injection (DI) used? Is it setup correctly? If the code uses Spring Boot, are you using @Inject instead of @Autowire? Does the code handle exceptions correctly? Is the Azul Zulu OpenJDK being used? Is a build automation and package management tool (Gradle or Maven) being used?","title":"Code Review Checklist"},{"location":"code-reviews/recipes/javascript-and-typescript/","text":"JavaScript/TypeScript Code Reviews Style Guide Developers should use prettier to do code formatting for JavaScript. Using an automated code formatting tool like Prettier enforces a well accepted style guide that was collaboratively built by a wide range of companies including Microsoft, Facebook, and AirBnB. For higher level style guidance not covered by prettier, we follow the AirBnB Style Guide . Code Analysis / Linting eslint Per guidance outlined in Palantir's 2019 TSLint road map , TypeScript code should be linted with ESLint . See the typescript-eslint documentation for more information around linting TypeScript code with ESLint. To install and configure linting with ESLint , install the following packages as dev-dependencies: npm install -D eslint @typescript-eslint/parser @typescript-eslint/eslint-plugin Add a .eslintrc.js to the root of your project: module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , ], }; Add the following to the scripts of your package.json : \"scripts\" : { \"lint\" : \"eslint . --ext .js,.jsx,.ts,.tsx --ignore-path .gitignore\" } This will lint all .js , .jsx , .ts , .tsx files in your project and omit any files or directories specified in your .gitignore . You can run linting with: npm run lint Setting up Prettier Prettier is an opinionated code formatter. Getting started guide . Install with npm as a dev-dependency: npm install -D prettier eslint-config-prettier eslint-plugin-prettier Add prettier to your .eslintrc.js : module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , 'prettier/@typescript-eslint' , 'plugin:prettier/recommended' , ], }; This will apply the prettier rule set when linting with ESLint. Auto Formatting with VSCode VS Code can be configured to automatically perform eslint --fix on save. 
Create a .vscode folder in the root of your project and add the following to your .vscode/settings.json : { \"editor.codeActionsOnSave\" : { \"source.fixAll.eslint\" : true }, } By default, we use the following overrides, which should be added to the VS Code configuration to standardize on single quotes, a four-space indent, and ESLint integration: { \"prettier.singleQuote\" : true , \"prettier.eslintIntegration\" : true , \"prettier.tabWidth\" : 4 } Setting Up Testing Playwright is highly recommended to be set up within a project. It's an open-source testing suite created by Microsoft. To install it, use this command: npm install playwright Since Playwright runs the tests in a browser, you have to choose which browser you want it to use, unless you are using Chrome, which is the default. You can do this by specifying the browser in the Playwright configuration. Build Validation To automate this process in Azure DevOps you can add the following snippet to your pipeline definition yaml file. This will lint any scripts in the ./scripts/ folder. - task : Npm@1 displayName : 'Lint' inputs : command : 'custom' customCommand : 'run lint' workingDir : './scripts/' Pre-Commit Hooks All developers should run eslint in a pre-commit hook to ensure standard formatting. We highly recommend using an editor integration like vscode-eslint to provide real-time feedback. Under .git/hooks rename pre-commit.sample to pre-commit Remove the existing sample code in that file There are many examples of scripts for this on gist, like pre-commit-eslint Modify accordingly to include TypeScript files (include ts extension and make sure typescript-eslint is set up) Make the file executable: chmod +x .git/hooks/pre-commit As an alternative husky can be considered to simplify pre-commit hooks. Code Review Checklist In addition to the Code Review Checklist you should also look for these JavaScript and TypeScript specific code review items. Javascript / Typescript Checklist Does the code stick to our formatting and code standards? Does running prettier and ESLint over the code yield no warnings or errors, respectively? Does the change re-implement code that would be better served by pulling in a well-known module from the ecosystem? Is \"use strict\"; used to reduce errors with undeclared variables? Are unit tests used where possible, also for APIs? Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? Are best practices for error handling followed, as well as try catch finally statements? Are the doWork().then(doSomething).then(checkSomething) patterns properly followed for async calls, including expect , done ? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why it is there.
If the number is repetitive, is there a constant/enum or equivalent? Is there a proper /* */ in the various classes and methods? Are heavy operations implemented in the backend, leaving the controller as thin as possible? Is event handling on the html efficiently done?","title":"JavaScript/TypeScript Code Reviews"},{"location":"code-reviews/recipes/javascript-and-typescript/#javascripttypescript-code-reviews","text":"","title":"JavaScript/TypeScript Code Reviews"},{"location":"code-reviews/recipes/javascript-and-typescript/#style-guide","text":"Developers should use prettier to do code formatting for JavaScript. Using an automated code formatting tool like Prettier enforces a well accepted style guide that was collaboratively built by a wide range of companies including Microsoft, Facebook, and AirBnB. For higher level style guidance not covered by prettier, we follow the AirBnB Style Guide .","title":"Style Guide"},{"location":"code-reviews/recipes/javascript-and-typescript/#code-analysis-linting","text":"","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/javascript-and-typescript/#eslint","text":"Per guidance outlined in Palantir's 2019 TSLint road map , TypeScript code should be linted with ESLint . See the typescript-eslint documentation for more information around linting TypeScript code with ESLint. To install and configure linting with ESLint , install the following packages as dev-dependencies: npm install -D eslint @typescript-eslint/parser @typescript-eslint/eslint-plugin Add a .eslintrc.js to the root of your project: module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , ], }; Add the following to the scripts of your package.json : \"scripts\" : { \"lint\" : \"eslint . --ext .js,.jsx,.ts,.tsx --ignore-path .gitignore\" } This will lint all .js , .jsx , .ts , .tsx files in your project and omit any files or directories specified in your .gitignore . You can run linting with: npm run lint","title":"eslint"},{"location":"code-reviews/recipes/javascript-and-typescript/#setting-up-prettier","text":"Prettier is an opinionated code formatter. Getting started guide . Install with npm as a dev-dependency: npm install -D prettier eslint-config-prettier eslint-plugin-prettier Add prettier to your .eslintrc.js : module . exports = { root : true , parser : '@typescript-eslint/parser' , plugins : [ '@typescript-eslint' , ], extends : [ 'eslint:recommended' , 'plugin:@typescript-eslint/eslint-recommended' , 'plugin:@typescript-eslint/recommended' , 'prettier/@typescript-eslint' , 'plugin:prettier/recommended' , ], }; This will apply the prettier rule set when linting with ESLint.","title":"Setting up Prettier"},{"location":"code-reviews/recipes/javascript-and-typescript/#auto-formatting-with-vscode","text":"VS Code can be configured to automatically perform eslint --fix on save. 
Create a .vscode folder in the root of your project and add the following to your .vscode/settings.json : { \"editor.codeActionsOnSave\" : { \"source.fixAll.eslint\" : true }, } By default, we use the following overrides, which should be added to the VS Code configuration to standardize on single quotes, a four-space indent, and ESLint integration: { \"prettier.singleQuote\" : true , \"prettier.eslintIntegration\" : true , \"prettier.tabWidth\" : 4 }","title":"Auto Formatting with VSCode"},{"location":"code-reviews/recipes/javascript-and-typescript/#setting-up-testing","text":"Playwright is highly recommended to be set up within a project. It's an open-source testing suite created by Microsoft. To install it, use this command: npm install playwright Since Playwright runs the tests in a browser, you have to choose which browser you want it to use, unless you are using Chrome, which is the default. You can do this by specifying the browser in the Playwright configuration.","title":"Setting Up Testing"},{"location":"code-reviews/recipes/javascript-and-typescript/#build-validation","text":"To automate this process in Azure DevOps you can add the following snippet to your pipeline definition yaml file. This will lint any scripts in the ./scripts/ folder. - task : Npm@1 displayName : 'Lint' inputs : command : 'custom' customCommand : 'run lint' workingDir : './scripts/'","title":"Build Validation"},{"location":"code-reviews/recipes/javascript-and-typescript/#pre-commit-hooks","text":"All developers should run eslint in a pre-commit hook to ensure standard formatting. We highly recommend using an editor integration like vscode-eslint to provide real-time feedback. Under .git/hooks rename pre-commit.sample to pre-commit Remove the existing sample code in that file There are many examples of scripts for this on gist, like pre-commit-eslint Modify accordingly to include TypeScript files (include ts extension and make sure typescript-eslint is set up) Make the file executable: chmod +x .git/hooks/pre-commit As an alternative husky can be considered to simplify pre-commit hooks.","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/javascript-and-typescript/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these JavaScript and TypeScript specific code review items.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/javascript-and-typescript/#javascript-typescript-checklist","text":"Does the code stick to our formatting and code standards? Does running prettier and ESLint over the code yield no warnings or errors, respectively? Does the change re-implement code that would be better served by pulling in a well-known module from the ecosystem? Is \"use strict\"; used to reduce errors with undeclared variables? Are unit tests used where possible, also for APIs? Are tests arranged correctly with the Arrange/Act/Assert pattern and properly documented in this way? Are best practices for error handling followed, as well as try catch finally statements? Are the doWork().then(doSomething).then(checkSomething) patterns properly followed for async calls, including expect , done ? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why it is there. If the number is repetitive, is there a constant/enum or equivalent? If there is an asynchronous method, does the name of the method end with the Async suffix?
Is a minimum level of logging in place? Are the logging levels used sensible? Is document fragment manipulation limited to when you need to manipulate multiple sub elements? Does TypeScript code compile without raising linting errors? Instead of using raw strings, are constants used in the main class? Or if these strings are used across files/classes, is there a static class for the constants? Are magic numbers explained? There should be no number in the code without at least a comment of why it is there. If the number is repetitive, is there a constant/enum or equivalent? Is there a proper /* */ in the various classes and methods? Are heavy operations implemented in the backend, leaving the controller as thin as possible? Is event handling on the html efficiently done?","title":"Javascript / Typescript Checklist"},{"location":"code-reviews/recipes/markdown/","text":"Markdown Code Reviews Style Guide Developers should treat documentation like other source code and follow the same rules and checklists when reviewing documentation as code. Documentation should both use good Markdown syntax to ensure it's properly parsed, and follow good writing style guidelines to ensure the document is easy to read and understand. Markdown Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world\u2019s most popular markup languages. Using Markdown is different from using a WYSIWYG editor. In an application like Microsoft Word, you click buttons to format words and phrases, and the changes are visible immediately. Markdown isn\u2019t like that. When you create a Markdown-formatted file, you add Markdown syntax to the text to indicate which words and phrases should look different. You can find more information and full documentation here . Linters Markdown has specific way of being formatted. It is important to respect this formatting, otherwise some interpreters which are strict won't properly display the document. Linters are often used to help developers properly create documents by both verifying proper Markdown syntax, grammar and proper English language. A good setup includes a markdown linter used during editing and PR build verification, and a grammar linter used while editing the document. The following are a list of linters that could be used in this setup. markdownlint markdownlint is a linter for markdown that verifies Markdown syntax, and also enforces rules that make the text more readable. Markdownlint-cli is an easy-to-use CLI based on Markdownlint. It's available as a ruby gem , an npm package , a Node.js CLI and a VS Code extension . The VS Code extension Prettier also catches all markdownlint errors. Installing the Node.js CLI npm install -g markdownlint-cli Running markdownlint on a Node.js project markdownlint **/*.md --ignore node_modules Fixing errors automatically markdownlint **/*.md --ignore node_modules --fix A comprehensive list of markdownlint rules is available here . write-good write-good is a linter for English text that helps writing better documentation. npm install -g write-good Run write-good write-good *.md Run write-good without installing it npx write-good *.md Write Good is also available as an extension for VS Code VSCode Extensions Write Good Linter The Write Good Linter Extension integrates with VS Code to give grammar and language advice while editing the document. 
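To customize which rules the markdownlint tooling enforces, a small .markdownlint.json at the repository root might look like the following sketch; the rule choices are examples only, and both the CLI and the VS Code extension described next pick the file up.

```json
{
  "default": true,
  "MD013": false,
  "MD033": { "allowed_elements": ["br"] }
}
```

Here MD013 (line length) is disabled and MD033 (inline HTML) is relaxed to allow explicit line breaks; adjust the selection to your team's conventions.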
markdownlint Extension The markdownlint extension examines the Markdown documents, showing warnings for rule violations while editing. Build Validation Linting To automate linting with markdownlint for PR validation in GitHub actions, you can either use linters aggregator as we do with MegaLinter in this repository or use the following YAML. name : Markdownlint on : push : paths : - \"**/*.md\" pull_request : paths : - \"**/*.md\" jobs : lint : runs-on : ubuntu-latest steps : - uses : actions/checkout@v2 - name : Use Node.js uses : actions/setup-node@v1 with : node-version : 12.x - name : Run Markdownlint run : | npm i -g markdownlint-cli markdownlint \"**/*.md\" --ignore node_modules Checking Links To automate link check in your markdown files add markdown-link-check action to your validation pipeline: markdown-link-check : runs-on : ubuntu-latest steps : - uses : actions/checkout@master - uses : gaurav-nelson/github-action-markdown-link-check@v1 More information about markdown-link-check action options can be found at markdown-link-check home page Code Review Checklist In addition to the Code Review Checklist you should also look for these documentation specific code review items Is the document easy to read and understand and does it follow good writing guidelines ? Is there a single source of truth or is content repeated in more than one document? Is the documentation up to date with the code? Is the documentation technically, and ethically correct? Writing Style Guidelines The following are some examples of writing style guidelines. Agree in your team which guidelines you should apply to your project documentation. Save your guidelines together with your documentation, so they are easy to refer back to. Wording Use inclusive language, and avoid jargon and uncommon words. The docs should be easy to understand Be clear and concise, stick to the goal of the document Use active voice Spell check and grammar check the text Always follow chronological order Visit Plain English for tips on how to write documentation that is easy to understand. Document Organization Organize documents by topic rather than type, this makes it easier to find the documentation Each folder should have a top-level README.md and any other documents within that folder should link directly or indirectly from that README.md Document names with more than one word should use underscores instead of spaces, for example machine_learning_pipeline_design.md . The same applies to images Headings Start with a H1 (single # in markdown) and respect the order H1 > H2 > H3 etc Follow each heading with text before proceeding with the next heading Avoid putting numbers in headings. Numbers shift, and can create outdated titles Avoid using symbols and special characters in headers, this causes problems with anchor links Avoid links in headers Resources Avoid duplication of content, instead link to the single source of truth Link but don't summarize. Summarizing content on another page leads to the content living in two places Use meaningful anchor texts, e.g. instead of writing Follow the instructions [here](../recipes/markdown.md) write Follow the [Markdown guidelines](../recipes/markdown.md) Make sure links to Microsoft docs do not contain the language marker /en-us/ or /fr-fr/ , as this is automatically determined by the site itself. Lists List items should start with capital letters if possible Use ordered lists when the items describe a sequence to follow, otherwise use unordered lists For ordered lists, prefix each item with 1. 
When rendered, the list items will appear with sequential numbering. This avoids number-gaps in list Do not add commas , or semicolons ; to the end of list items, and avoid periods . unless the list item represents a complete sentence Images Place images in a separate directory named img Name images appropriately, avoiding generic names like screenshot.png Avoid adding large images or videos to source control, link to an external location instead Emphasis and Special Sections Use bold or italic to emphasize For sections that everyone reading this document needs to be aware of, use blocks Use backticks for code, a single backtick for inline code like pip install flake8 and 3 backticks for code blocks followed by the language for syntax highlighting def add ( num1 : int , num2 : int ): return num1 + num2 Use check boxes for task lists Item 1 Item 2 Item 3 Add a References section to the end of the document with links to external references Prefer tables to lists for comparisons and reports to make research and results more readable Option Pros Cons Option 1 Some pros Some cons Option 2 Some pros Some cons General Always use Markdown syntax, don't mix with HTML Make sure the extension of the files is .md - if the extension is missing, a linter might ignore the files","title":"Markdown Code Reviews"},{"location":"code-reviews/recipes/markdown/#markdown-code-reviews","text":"","title":"Markdown Code Reviews"},{"location":"code-reviews/recipes/markdown/#style-guide","text":"Developers should treat documentation like other source code and follow the same rules and checklists when reviewing documentation as code. Documentation should both use good Markdown syntax to ensure it's properly parsed, and follow good writing style guidelines to ensure the document is easy to read and understand.","title":"Style Guide"},{"location":"code-reviews/recipes/markdown/#markdown","text":"Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world\u2019s most popular markup languages. Using Markdown is different from using a WYSIWYG editor. In an application like Microsoft Word, you click buttons to format words and phrases, and the changes are visible immediately. Markdown isn\u2019t like that. When you create a Markdown-formatted file, you add Markdown syntax to the text to indicate which words and phrases should look different. You can find more information and full documentation here .","title":"Markdown"},{"location":"code-reviews/recipes/markdown/#linters","text":"Markdown has specific way of being formatted. It is important to respect this formatting, otherwise some interpreters which are strict won't properly display the document. Linters are often used to help developers properly create documents by both verifying proper Markdown syntax, grammar and proper English language. A good setup includes a markdown linter used during editing and PR build verification, and a grammar linter used while editing the document. The following are a list of linters that could be used in this setup.","title":"Linters"},{"location":"code-reviews/recipes/markdown/#markdownlint","text":"markdownlint is a linter for markdown that verifies Markdown syntax, and also enforces rules that make the text more readable. Markdownlint-cli is an easy-to-use CLI based on Markdownlint. It's available as a ruby gem , an npm package , a Node.js CLI and a VS Code extension . 
The VS Code extension Prettier also catches all markdownlint errors. Installing the Node.js CLI npm install -g markdownlint-cli Running markdownlint on a Node.js project markdownlint **/*.md --ignore node_modules Fixing errors automatically markdownlint **/*.md --ignore node_modules --fix A comprehensive list of markdownlint rules is available here .","title":"markdownlint"},{"location":"code-reviews/recipes/markdown/#write-good","text":"write-good is a linter for English text that helps writing better documentation. npm install -g write-good Run write-good write-good *.md Run write-good without installing it npx write-good *.md Write Good is also available as an extension for VS Code","title":"write-good"},{"location":"code-reviews/recipes/markdown/#vscode-extensions","text":"","title":"VSCode Extensions"},{"location":"code-reviews/recipes/markdown/#write-good-linter","text":"The Write Good Linter Extension integrates with VS Code to give grammar and language advice while editing the document.","title":"Write Good Linter"},{"location":"code-reviews/recipes/markdown/#markdownlint-extension","text":"The markdownlint extension examines the Markdown documents, showing warnings for rule violations while editing.","title":"markdownlint Extension"},{"location":"code-reviews/recipes/markdown/#build-validation","text":"","title":"Build Validation"},{"location":"code-reviews/recipes/markdown/#linting","text":"To automate linting with markdownlint for PR validation in GitHub actions, you can either use linters aggregator as we do with MegaLinter in this repository or use the following YAML. name : Markdownlint on : push : paths : - \"**/*.md\" pull_request : paths : - \"**/*.md\" jobs : lint : runs-on : ubuntu-latest steps : - uses : actions/checkout@v2 - name : Use Node.js uses : actions/setup-node@v1 with : node-version : 12.x - name : Run Markdownlint run : | npm i -g markdownlint-cli markdownlint \"**/*.md\" --ignore node_modules","title":"Linting"},{"location":"code-reviews/recipes/markdown/#checking-links","text":"To automate link check in your markdown files add markdown-link-check action to your validation pipeline: markdown-link-check : runs-on : ubuntu-latest steps : - uses : actions/checkout@master - uses : gaurav-nelson/github-action-markdown-link-check@v1 More information about markdown-link-check action options can be found at markdown-link-check home page","title":"Checking Links"},{"location":"code-reviews/recipes/markdown/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these documentation specific code review items Is the document easy to read and understand and does it follow good writing guidelines ? Is there a single source of truth or is content repeated in more than one document? Is the documentation up to date with the code? Is the documentation technically, and ethically correct?","title":"Code Review Checklist"},{"location":"code-reviews/recipes/markdown/#writing-style-guidelines","text":"The following are some examples of writing style guidelines. Agree in your team which guidelines you should apply to your project documentation. Save your guidelines together with your documentation, so they are easy to refer back to.","title":"Writing Style Guidelines"},{"location":"code-reviews/recipes/markdown/#wording","text":"Use inclusive language, and avoid jargon and uncommon words. 
The docs should be easy to understand Be clear and concise, stick to the goal of the document Use active voice Spell check and grammar check the text Always follow chronological order Visit Plain English for tips on how to write documentation that is easy to understand.","title":"Wording"},{"location":"code-reviews/recipes/markdown/#document-organization","text":"Organize documents by topic rather than type, this makes it easier to find the documentation Each folder should have a top-level README.md and any other documents within that folder should link directly or indirectly from that README.md Document names with more than one word should use underscores instead of spaces, for example machine_learning_pipeline_design.md . The same applies to images","title":"Document Organization"},{"location":"code-reviews/recipes/markdown/#headings","text":"Start with a H1 (single # in markdown) and respect the order H1 > H2 > H3 etc Follow each heading with text before proceeding with the next heading Avoid putting numbers in headings. Numbers shift, and can create outdated titles Avoid using symbols and special characters in headers, this causes problems with anchor links Avoid links in headers","title":"Headings"},{"location":"code-reviews/recipes/markdown/#resources","text":"Avoid duplication of content, instead link to the single source of truth Link but don't summarize. Summarizing content on another page leads to the content living in two places Use meaningful anchor texts, e.g. instead of writing Follow the instructions [here](../recipes/markdown.md) write Follow the [Markdown guidelines](../recipes/markdown.md) Make sure links to Microsoft docs do not contain the language marker /en-us/ or /fr-fr/ , as this is automatically determined by the site itself.","title":"Resources"},{"location":"code-reviews/recipes/markdown/#lists","text":"List items should start with capital letters if possible Use ordered lists when the items describe a sequence to follow, otherwise use unordered lists For ordered lists, prefix each item with 1. When rendered, the list items will appear with sequential numbering. This avoids number-gaps in list Do not add commas , or semicolons ; to the end of list items, and avoid periods . 
unless the list item represents a complete sentence","title":"Lists"},{"location":"code-reviews/recipes/markdown/#images","text":"Place images in a separate directory named img Name images appropriately, avoiding generic names like screenshot.png Avoid adding large images or videos to source control, link to an external location instead","title":"Images"},{"location":"code-reviews/recipes/markdown/#emphasis-and-special-sections","text":"Use bold or italic to emphasize For sections that everyone reading this document needs to be aware of, use blocks Use backticks for code, a single backtick for inline code like pip install flake8 and 3 backticks for code blocks followed by the language for syntax highlighting def add ( num1 : int , num2 : int ): return num1 + num2 Use check boxes for task lists Item 1 Item 2 Item 3 Add a References section to the end of the document with links to external references Prefer tables to lists for comparisons and reports to make research and results more readable Option Pros Cons Option 1 Some pros Some cons Option 2 Some pros Some cons","title":"Emphasis and Special Sections"},{"location":"code-reviews/recipes/markdown/#general","text":"Always use Markdown syntax, don't mix with HTML Make sure the extension of the files is .md - if the extension is missing, a linter might ignore the files","title":"General"},{"location":"code-reviews/recipes/python/","text":"Python Code Reviews Style Guide Developers should follow the PEP8 style guide with type hints . The use of type hints throughout paired with linting and type hint checking avoids common errors that are tricky to debug. Projects should check Python code with automated tools. Linting should be added to build validation, and both linting and code formatting can be added to your pre-commit hooks and VS Code. Code Analysis / Linting The 2 most popular python linters are Pylint and Flake8 . Both check adherence to PEP8 but vary a bit in what other rules they check. In general Pylint tends to be a bit more stringent and give more false positives but both are good options for linting python code. Both Pylint and Flake8 can be configured in VS Code using the VS Code python extension . Flake8 Flake8 is a simple and fast wrapper around Pyflakes (for detecting coding errors) and pycodestyle (for pep8). Install Flake8 pip install flake8 Add an extension for the pydocstyle (for doc strings ) tool to flake8. pip install flake8-docstrings Add an extension for pep8-naming (for naming conventions in pep8) tool to flake8. pip install pep8-naming Run Flake8 flake8 . # lint the whole project Pylint Install Pylint pip install pylint Run Pylint pylint src # lint the source directory Automatic Code Formatting Black Black is an unapologetic code formatting tool. It removes all need from pycodestyle nagging about formatting, so the team can focus on content vs style. It's not possible to configure black for your own style needs. pip install black Format python code black [ file/folder ] autopep8 Autopep8 is more lenient and allows more configuration if you want less stringent formatting. pip install autopep8 Format python code autopep8 [ file/folder ] --in-place yapf yapf Yet Another Python Formatter is a python formatter from Google based on ideas from gofmt. This is also more configurable, and a good option for automatic code formatting. 
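As a small illustration of that guidance, PEP8-style naming combined with type hints might look like this (the function and its names are invented for the example):

```python
from typing import Optional


def normalize_name(raw_name: str, default: Optional[str] = None) -> str:
    """Return a stripped, lower-cased name, falling back to a default."""
    cleaned = raw_name.strip().lower()
    return cleaned or (default or "unknown")
```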
pip install yapf Format python code yapf [ file/folder ] --in-place Bandit Bandit is a tool designed by the Python Code Quality Authority (PyCQA) to perform static analysis of Python code, specifically targeting security issues. It scans for common security issues in Python codebase. Installation : Add Bandit to your development environment with: pip install bandit VSCode Extensions Python The Python language extension is the base extension you should have installed for python development with VS Code. It enables intellisense, debugging, linting (with the above linters), testing with pytest or unittest, and code formatting with the formatters mentioned above. Pyright The Pyright extension augments VS Code with static type checking when you use type hints def add ( first_value : int , second_value : int ) -> int : return first_value + second_value Build Validation To automate linting with flake8 and testing with pytest in Azure Devops you can add the following snippet to you azure-pipelines.yaml file. trigger : branches : include : - develop - master paths : include : - src/* pool : vmImage : 'ubuntu-latest' jobs : - job : LintAndTest displayName : Lint and Test steps : - checkout : self lfs : true - task : UsePythonVersion@0 displayName : 'Set Python version to 3.6' inputs : versionSpec : '3.6' - script : pip3 install --user -r requirements.txt displayName : 'Install dependencies' - script : | # Install Flake8 pip3 install --user flake8 # Install PyTest pip3 install --user pytest displayName : 'Install Flake8 and PyTest' - script : | python3 -m flake8 displayName : 'Run Flake8 linter' - script : | # Run PyTest tester python3 -m pytest --junitxml=./test-results.xml displayName : 'Run PyTest Tester' - task : PublishTestResults@2 displayName : 'Publish PyTest results' condition : succeededOrFailed() inputs : testResultsFiles : '**/test-*.xml' testRunTitle : 'Publish test results for Python $(python.version)' To perform a PR validation on GitHub you can use a similar YAML configuration with GitHub Actions Pre-Commit Hooks Pre-commit hooks allow you to format and lint code locally before submitting the pull request. Adding pre-commit hooks for your python repository is easy using the pre-commit package Install pre-commit and add to the requirements.txt pip install pre-commit Add a .pre-commit-config.yaml file in the root of the repository, with the desired pre-commit actions repos : - repo : https://github.com/ambv/black rev : stable hooks : - id : black language_version : python3.6 - repo : https://github.com/pre-commit/pre-commit-hooks rev : v1.2.3 hooks : - id : flake8 Each individual developer that wants to set up pre-commit hooks can then run pre-commit install At the next attempted commit any lint failures will block the commit. Note: Installing pre-commit hooks is voluntary and done by each developer individually. Thus, it's not a replacement for build validation on the server Code Review Checklist In addition to the Code Review Checklist you should also look for these python specific code review items Are all new packages used included in requirements.txt Does the code pass all lint checks? Do functions use type hints, and are there any type hint errors? Is the code readable and using pythonic constructs wherever possible.","title":"Python Code Reviews"},{"location":"code-reviews/recipes/python/#python-code-reviews","text":"","title":"Python Code Reviews"},{"location":"code-reviews/recipes/python/#style-guide","text":"Developers should follow the PEP8 style guide with type hints . 
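As a purely illustrative sketch (the function name and behaviour below are invented for this example, not taken from the style guide itself), this is the kind of small, PEP8-styled, fully type-hinted function the guidance points toward:

```python
from typing import Optional


def parse_port(value: str, default: int = 8080) -> Optional[int]:
    """Return the port number encoded in value, or default when value is empty."""
    if not value:
        return default
    if not value.isdigit():
        # Not a number: signal "no port" instead of raising here.
        return None
    return int(value)
```

With hints like these in place, a type checker can flag a call such as parse_port(8080) (an int passed where a str is expected) before the code ever runs.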
The use of type hints throughout paired with linting and type hint checking avoids common errors that are tricky to debug. Projects should check Python code with automated tools. Linting should be added to build validation, and both linting and code formatting can be added to your pre-commit hooks and VS Code.","title":"Style Guide"},{"location":"code-reviews/recipes/python/#code-analysis-linting","text":"The 2 most popular python linters are Pylint and Flake8 . Both check adherence to PEP8 but vary a bit in what other rules they check. In general Pylint tends to be a bit more stringent and give more false positives but both are good options for linting python code. Both Pylint and Flake8 can be configured in VS Code using the VS Code python extension .","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/python/#flake8","text":"Flake8 is a simple and fast wrapper around Pyflakes (for detecting coding errors) and pycodestyle (for pep8). Install Flake8 pip install flake8 Add an extension for the pydocstyle (for doc strings ) tool to flake8. pip install flake8-docstrings Add an extension for pep8-naming (for naming conventions in pep8) tool to flake8. pip install pep8-naming Run Flake8 flake8 . # lint the whole project","title":"Flake8"},{"location":"code-reviews/recipes/python/#pylint","text":"Install Pylint pip install pylint Run Pylint pylint src # lint the source directory","title":"Pylint"},{"location":"code-reviews/recipes/python/#automatic-code-formatting","text":"","title":"Automatic Code Formatting"},{"location":"code-reviews/recipes/python/#black","text":"Black is an unapologetic code formatting tool. It removes all need from pycodestyle nagging about formatting, so the team can focus on content vs style. It's not possible to configure black for your own style needs. pip install black Format python code black [ file/folder ]","title":"Black"},{"location":"code-reviews/recipes/python/#autopep8","text":"Autopep8 is more lenient and allows more configuration if you want less stringent formatting. pip install autopep8 Format python code autopep8 [ file/folder ] --in-place","title":"autopep8"},{"location":"code-reviews/recipes/python/#yapf","text":"yapf Yet Another Python Formatter is a python formatter from Google based on ideas from gofmt. This is also more configurable, and a good option for automatic code formatting. pip install yapf Format python code yapf [ file/folder ] --in-place","title":"yapf"},{"location":"code-reviews/recipes/python/#bandit","text":"Bandit is a tool designed by the Python Code Quality Authority (PyCQA) to perform static analysis of Python code, specifically targeting security issues. It scans for common security issues in Python codebase. Installation : Add Bandit to your development environment with: pip install bandit","title":"Bandit"},{"location":"code-reviews/recipes/python/#vscode-extensions","text":"","title":"VSCode Extensions"},{"location":"code-reviews/recipes/python/#python","text":"The Python language extension is the base extension you should have installed for python development with VS Code. 
It enables intellisense, debugging, linting (with the above linters), testing with pytest or unittest, and code formatting with the formatters mentioned above.","title":"Python"},{"location":"code-reviews/recipes/python/#pyright","text":"The Pyright extension augments VS Code with static type checking when you use type hints def add ( first_value : int , second_value : int ) -> int : return first_value + second_value","title":"Pyright"},{"location":"code-reviews/recipes/python/#build-validation","text":"To automate linting with flake8 and testing with pytest in Azure Devops you can add the following snippet to you azure-pipelines.yaml file. trigger : branches : include : - develop - master paths : include : - src/* pool : vmImage : 'ubuntu-latest' jobs : - job : LintAndTest displayName : Lint and Test steps : - checkout : self lfs : true - task : UsePythonVersion@0 displayName : 'Set Python version to 3.6' inputs : versionSpec : '3.6' - script : pip3 install --user -r requirements.txt displayName : 'Install dependencies' - script : | # Install Flake8 pip3 install --user flake8 # Install PyTest pip3 install --user pytest displayName : 'Install Flake8 and PyTest' - script : | python3 -m flake8 displayName : 'Run Flake8 linter' - script : | # Run PyTest tester python3 -m pytest --junitxml=./test-results.xml displayName : 'Run PyTest Tester' - task : PublishTestResults@2 displayName : 'Publish PyTest results' condition : succeededOrFailed() inputs : testResultsFiles : '**/test-*.xml' testRunTitle : 'Publish test results for Python $(python.version)' To perform a PR validation on GitHub you can use a similar YAML configuration with GitHub Actions","title":"Build Validation"},{"location":"code-reviews/recipes/python/#pre-commit-hooks","text":"Pre-commit hooks allow you to format and lint code locally before submitting the pull request. Adding pre-commit hooks for your python repository is easy using the pre-commit package Install pre-commit and add to the requirements.txt pip install pre-commit Add a .pre-commit-config.yaml file in the root of the repository, with the desired pre-commit actions repos : - repo : https://github.com/ambv/black rev : stable hooks : - id : black language_version : python3.6 - repo : https://github.com/pre-commit/pre-commit-hooks rev : v1.2.3 hooks : - id : flake8 Each individual developer that wants to set up pre-commit hooks can then run pre-commit install At the next attempted commit any lint failures will block the commit. Note: Installing pre-commit hooks is voluntary and done by each developer individually. Thus, it's not a replacement for build validation on the server","title":"Pre-Commit Hooks"},{"location":"code-reviews/recipes/python/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these python specific code review items Are all new packages used included in requirements.txt Does the code pass all lint checks? Do functions use type hints, and are there any type hint errors? Is the code readable and using pythonic constructs wherever possible.","title":"Code Review Checklist"},{"location":"code-reviews/recipes/terraform/","text":"Terraform Code Reviews Style Guide Developers should follow the terraform style guide . Projects should check Terraform scripts with automated tools. Code Analysis / Linting TFLint TFLint is a Terraform linter focused on possible errors, best practices, etc. Once TFLint installed in the environment, it can be invoked using the VS Code terraform extension . 
VSCode Extensions The following VS Code extensions are widely used. Terraform extension This extension provides syntax highlighting, linting, formatting and validation capabilities. Azure Terraform extension This extension provides Terraform command support, resource graph visualization and CloudShell integration inside VS Code. Build Validation Ensure you enforce the style guides during build. The following example script can be used to install terraform, and a linter that then checks for formatting and common errors. #! /bin/bash set -e SCRIPT_DIR = $( dirname \" $BASH_SOURCE \" ) cd \" $SCRIPT_DIR \" TF_VERSION = 0 .12.4 TF_LINT_VERSION = 0 .9.1 echo -e \"\\n\\n>>> Installing Terraform 0.12\" # Install terraform tooling for linting terraform wget -q https://releases.hashicorp.com/terraform/ ${ TF_VERSION } /terraform_ ${ TF_VERSION } _linux_amd64.zip -O /tmp/terraform.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/terraform.zip echo \"\" echo -e \"\\n\\n>>> Install tflint (3rd party)\" wget -q https://github.com/wata727/tflint/releases/download/v ${ TF_LINT_VERSION } /tflint_linux_amd64.zip -O /tmp/tflint.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/tflint.zip echo -e \"\\n\\n>>> Terraform version\" terraform -version echo -e \"\\n\\n>>> Terraform Format (if this fails use 'terraform fmt -recursive' command to resolve\" terraform fmt -recursive -diff -check echo -e \"\\n\\n>>> tflint\" tflint echo -e \"\\n\\n>>> Terraform init\" terraform init echo -e \"\\n\\n>>> Terraform validate\" terraform validate Code Review Checklist In addition to the Code Review Checklist you should also look for these Terraform specific code review items Providers Are all providers used in the terraform scripts versioned to prevent breaking changes in the future? Repository Organization The code split into reusable modules? Modules are split into separate .tf files where appropriate? The repository contains a README.md describing the architecture provisioned? If Terraform code is mixed with application source code, the Terraform code isolated into a dedicated folder? Terraform State The Terraform project configured using Azure Storage as remote state backend? The remote state backend storage account key stored a secure location (e.g. Azure Key Vault)? The project is configured to use state files based on the environment, and the deployment pipeline is configured to supply the state file name dynamically? Variables If the infrastructure will be different depending on the environment (e.g. Dev, UAT, Production), the environment specific parameters are supplied via a .tfvars file? All variables have type information. E.g. a list(string) or string . All variables have a description stating the purpose of the variable and its usage. default values are not supplied for variables which must be supplied by a user. Testing Unit and integration tests covering the Terraform code exist (e.g. Terratest , terratest-abstraction )? Naming and Code Structure Resource definitions and data sources are used correctly in the Terraform scripts? resource: Indicates to Terraform that the current configuration is in charge of managing the life cycle of the object data: Indicates to Terraform that you only want to get a reference to the existing object, but don\u2019t want to manage it as part of this configuration The resource names start with their containing provider's name followed by an underscore? e.g. resource from the provider postgresql might be named as postgresql_database ? 
The try function is only used with simple attribute references and type conversion functions? Overuse of the try function to suppress errors will lead to a configuration that is hard to understand and maintain. Explicit type conversion functions used to normalize types are only returned in module outputs? Explicit type conversions are rarely necessary in Terraform because it will convert types automatically where required. The Sensitive property on schema set to true for the fields that contains sensitive information? This will prevent the field's values from showing up in CLI output. General Recommendations Try avoiding nesting sub configuration within resources. Create a separate resource section for resources even though they can be declared as sub-element of a resource. For example, declaring subnets within virtual network vs declaring subnets as a separate resources compared to virtual network on Azure. Never hard-code any value in configuration. Declare them in locals section if a variable is needed multiple times as a static value and are internal to the configuration. The name s of the resources created on Azure should not be hard-coded or static. These names should be dynamic and user-provided using variable block. This is helpful especially in unit testing when multiple tests are running in parallel trying to create resources on Azure but need different names (few resources in Azure need to be named uniquely e.g. storage accounts). It is a good practice to output the ID of resources created on Azure from configuration. This is especially helpful when adding dynamic blocks for sub-elements/child elements to the parent resource. Use the required_providers block for establishing the dependency for providers along with pre-determined version. Use the terraform block to declare the provider dependency with exact version and also the terraform CLI version needed for the configuration. Validate the variable values supplied based on usage and type of variable. The validation can be done to variables by adding validation block. Validate that the component SKUs are the right ones, e.g. standard vs premium.","title":"Terraform Code Reviews"},{"location":"code-reviews/recipes/terraform/#terraform-code-reviews","text":"","title":"Terraform Code Reviews"},{"location":"code-reviews/recipes/terraform/#style-guide","text":"Developers should follow the terraform style guide . Projects should check Terraform scripts with automated tools.","title":"Style Guide"},{"location":"code-reviews/recipes/terraform/#code-analysis-linting","text":"","title":"Code Analysis / Linting"},{"location":"code-reviews/recipes/terraform/#tflint","text":"TFLint is a Terraform linter focused on possible errors, best practices, etc. 
Once TFLint installed in the environment, it can be invoked using the VS Code terraform extension .","title":"TFLint"},{"location":"code-reviews/recipes/terraform/#vscode-extensions","text":"The following VS Code extensions are widely used.","title":"VSCode Extensions"},{"location":"code-reviews/recipes/terraform/#terraform-extension","text":"This extension provides syntax highlighting, linting, formatting and validation capabilities.","title":"Terraform extension"},{"location":"code-reviews/recipes/terraform/#azure-terraform-extension","text":"This extension provides Terraform command support, resource graph visualization and CloudShell integration inside VS Code.","title":"Azure Terraform extension"},{"location":"code-reviews/recipes/terraform/#build-validation","text":"Ensure you enforce the style guides during build. The following example script can be used to install terraform, and a linter that then checks for formatting and common errors. #! /bin/bash set -e SCRIPT_DIR = $( dirname \" $BASH_SOURCE \" ) cd \" $SCRIPT_DIR \" TF_VERSION = 0 .12.4 TF_LINT_VERSION = 0 .9.1 echo -e \"\\n\\n>>> Installing Terraform 0.12\" # Install terraform tooling for linting terraform wget -q https://releases.hashicorp.com/terraform/ ${ TF_VERSION } /terraform_ ${ TF_VERSION } _linux_amd64.zip -O /tmp/terraform.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/terraform.zip echo \"\" echo -e \"\\n\\n>>> Install tflint (3rd party)\" wget -q https://github.com/wata727/tflint/releases/download/v ${ TF_LINT_VERSION } /tflint_linux_amd64.zip -O /tmp/tflint.zip sudo unzip -q -o -d /usr/local/bin/ /tmp/tflint.zip echo -e \"\\n\\n>>> Terraform version\" terraform -version echo -e \"\\n\\n>>> Terraform Format (if this fails use 'terraform fmt -recursive' command to resolve\" terraform fmt -recursive -diff -check echo -e \"\\n\\n>>> tflint\" tflint echo -e \"\\n\\n>>> Terraform init\" terraform init echo -e \"\\n\\n>>> Terraform validate\" terraform validate","title":"Build Validation"},{"location":"code-reviews/recipes/terraform/#code-review-checklist","text":"In addition to the Code Review Checklist you should also look for these Terraform specific code review items","title":"Code Review Checklist"},{"location":"code-reviews/recipes/terraform/#providers","text":"Are all providers used in the terraform scripts versioned to prevent breaking changes in the future?","title":"Providers"},{"location":"code-reviews/recipes/terraform/#repository-organization","text":"The code split into reusable modules? Modules are split into separate .tf files where appropriate? The repository contains a README.md describing the architecture provisioned? If Terraform code is mixed with application source code, the Terraform code isolated into a dedicated folder?","title":"Repository Organization"},{"location":"code-reviews/recipes/terraform/#terraform-state","text":"The Terraform project configured using Azure Storage as remote state backend? The remote state backend storage account key stored a secure location (e.g. Azure Key Vault)? The project is configured to use state files based on the environment, and the deployment pipeline is configured to supply the state file name dynamically?","title":"Terraform State"},{"location":"code-reviews/recipes/terraform/#variables","text":"If the infrastructure will be different depending on the environment (e.g. Dev, UAT, Production), the environment specific parameters are supplied via a .tfvars file? All variables have type information. E.g. a list(string) or string . 
All variables have a description stating the purpose of the variable and its usage. default values are not supplied for variables which must be supplied by a user.","title":"Variables"},{"location":"code-reviews/recipes/terraform/#testing","text":"Unit and integration tests covering the Terraform code exist (e.g. Terratest , terratest-abstraction )?","title":"Testing"},{"location":"code-reviews/recipes/terraform/#naming-and-code-structure","text":"Resource definitions and data sources are used correctly in the Terraform scripts? resource: Indicates to Terraform that the current configuration is in charge of managing the life cycle of the object data: Indicates to Terraform that you only want to get a reference to the existing object, but don\u2019t want to manage it as part of this configuration The resource names start with their containing provider's name followed by an underscore? e.g. resource from the provider postgresql might be named as postgresql_database ? The try function is only used with simple attribute references and type conversion functions? Overuse of the try function to suppress errors will lead to a configuration that is hard to understand and maintain. Explicit type conversion functions used to normalize types are only returned in module outputs? Explicit type conversions are rarely necessary in Terraform because it will convert types automatically where required. The Sensitive property on schema set to true for the fields that contains sensitive information? This will prevent the field's values from showing up in CLI output.","title":"Naming and Code Structure"},{"location":"code-reviews/recipes/terraform/#general-recommendations","text":"Try avoiding nesting sub configuration within resources. Create a separate resource section for resources even though they can be declared as sub-element of a resource. For example, declaring subnets within virtual network vs declaring subnets as a separate resources compared to virtual network on Azure. Never hard-code any value in configuration. Declare them in locals section if a variable is needed multiple times as a static value and are internal to the configuration. The name s of the resources created on Azure should not be hard-coded or static. These names should be dynamic and user-provided using variable block. This is helpful especially in unit testing when multiple tests are running in parallel trying to create resources on Azure but need different names (few resources in Azure need to be named uniquely e.g. storage accounts). It is a good practice to output the ID of resources created on Azure from configuration. This is especially helpful when adding dynamic blocks for sub-elements/child elements to the parent resource. Use the required_providers block for establishing the dependency for providers along with pre-determined version. Use the terraform block to declare the provider dependency with exact version and also the terraform CLI version needed for the configuration. Validate the variable values supplied based on usage and type of variable. The validation can be done to variables by adding validation block. Validate that the component SKUs are the right ones, e.g. standard vs premium.","title":"General Recommendations"},{"location":"design/exception-handling/","text":"Exception Handling Exception Constructs Almost all language platforms offer a construct of exception or equivalent to handle error scenarios. 
The underlying platform, used libraries or the authored code can \"throw\" exceptions to initiate an error flow. Some of the advantages of using exceptions are - Abstract different kinds of errors Breaks the control flow from different code structures Navigate the call stack until the right catch block is identified Automatic collection of the call stack Define different error handling flows through multiple catch blocks Define a finally block to clean up resources Here is some guidance on exception handling in .Net C# Exception fundamentals Handling exceptions in .Net Custom Exceptions Although the platform offers numerous types of exceptions, we often need custom-defined exceptions to arrive at an optimal low-level design for error handling. The advantages of using custom exceptions are - Define exceptions specific to the business domain of the requirement. E.g. InvalidCustomerException Wrap system/platform exceptions to define a more generic system exception so that the overall code base is more tech stack agnostic. E.g. DatabaseWriteException, which wraps MongoWriteException. Enrich the exception with more information about the code flow of the error. Enrich the exception with more information about the data context of the error. E.g. a RecordId property in DatabaseWriteException which carries the Id of the record that failed to update. Define a custom error message that is more business user friendly or support team friendly. Custom Exception Hierarchy The diagram below shows a sample hierarchy of custom exceptions. It defines a BaseException class which derives from the System.Exception class and is the parent of all custom exceptions. BaseException also has additional properties for ActionCode and ResultCode. ActionCode represents the \"flow\" in which the error happened. ResultCode represents the exact error that happened. These additional properties help in defining different error handling flows in the catch blocks. Defines a number of System exceptions which derive from the SystemException class. They address all the errors generated by the technical aspects of the code, like connectivity, reads, writes, buffer overflows etc. Defines a number of Business exceptions which derive from the BusinessException class. They address all the errors generated by the business aspects of the code, like data validations and duplicate rows. Error Details in API Response When an error occurs in an API, it has to be rendered as a response with all the necessary fields. A custom response schema can be drafted for these purposes. But one of the popular formats is the problem details structure - Problem details There is a built-in problem details middleware library in ASP.Net core. For further details refer to the link below Problem details service in ASP.Net core","title":"Exception Handling"},{"location":"design/exception-handling/#exception-handling","text":"","title":"Exception Handling"},{"location":"design/exception-handling/#exception-constructs","text":"Almost all language platforms offer a construct of exception or equivalent to handle error scenarios. The underlying platform, used libraries or the authored code can \"throw\" exceptions to initiate an error flow.
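For illustration only - the guidance linked above is for .Net C#, but the constructs are the same in most languages - here is a minimal Python sketch of the throw/catch/finally flow described in this section (the DatabaseWriteError class and write_record function are invented for the example):

```python
import io


class DatabaseWriteError(Exception):
    """Illustrative domain-level exception wrapping a lower-level error."""


def write_record(store: dict, key: str, value: str) -> None:
    audit_log = io.StringIO()  # stand-in for a real resource such as a connection
    try:
        if key in store:
            raise KeyError(f"duplicate key: {key}")  # "throw" starts the error flow
        store[key] = value
        audit_log.write(f"wrote {key}\n")
    except KeyError as err:
        # Wrap the low-level error in a domain-specific exception and re-throw,
        # letting a catch block further up the call stack decide how to handle it.
        raise DatabaseWriteError(f"failed to write {key!r}") from err
    finally:
        # The finally block always runs, so the resource is released
        # whether or not an exception was thrown.
        audit_log.close()
```

A caller higher in the stack can then catch DatabaseWriteError without knowing anything about the underlying KeyError.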
Some of the advantages of using exceptions are - Abstract different kinds of errors Breaks the control flow from different code structures Navigate the call stack until the right catch block is identified Automatic collection of the call stack Define different error handling flows through multiple catch blocks Define a finally block to clean up resources Here is some guidance on exception handling in .Net C# Exception fundamentals Handling exceptions in .Net","title":"Exception Constructs"},{"location":"design/exception-handling/#custom-exceptions","text":"Although the platform offers numerous types of exceptions, we often need custom-defined exceptions to arrive at an optimal low-level design for error handling. The advantages of using custom exceptions are - Define exceptions specific to the business domain of the requirement. E.g. InvalidCustomerException Wrap system/platform exceptions to define a more generic system exception so that the overall code base is more tech stack agnostic. E.g. DatabaseWriteException, which wraps MongoWriteException. Enrich the exception with more information about the code flow of the error. Enrich the exception with more information about the data context of the error. E.g. a RecordId property in DatabaseWriteException which carries the Id of the record that failed to update. Define a custom error message that is more business user friendly or support team friendly.","title":"Custom Exceptions"},{"location":"design/exception-handling/#custom-exception-hierarchy","text":"The diagram below shows a sample hierarchy of custom exceptions. It defines a BaseException class which derives from the System.Exception class and is the parent of all custom exceptions. BaseException also has additional properties for ActionCode and ResultCode. ActionCode represents the \"flow\" in which the error happened. ResultCode represents the exact error that happened. These additional properties help in defining different error handling flows in the catch blocks. Defines a number of System exceptions which derive from the SystemException class. They address all the errors generated by the technical aspects of the code, like connectivity, reads, writes, buffer overflows etc. Defines a number of Business exceptions which derive from the BusinessException class. They address all the errors generated by the business aspects of the code, like data validations and duplicate rows.","title":"Custom Exception Hierarchy"},{"location":"design/exception-handling/#error-details-in-api-response","text":"When an error occurs in an API, it has to be rendered as a response with all the necessary fields. A custom response schema can be drafted for these purposes. But one of the popular formats is the problem details structure - Problem details There is a built-in problem details middleware library in ASP.Net core. For further details refer to the link below Problem details service in ASP.Net core","title":"Error Details in API Response"},{"location":"design/readme/","text":"Design Designing software well is hard. ISE has collected a number of practices that we find helpful in the design process. This covers not only technical design of software, but also architecture design and non-functional requirements gathering for new projects. Goals Provide recommendations for how to design software for maintainability, ease of extension, adherence to best practices, and sustainability. Reference or define process or checklists to help ensure well-designed software.
Collate and point to reference sources (guides, repos, articles) that can help shortcut the learning process. Code Examples Folder Structure Folder Structure For Python Repository Project Templates Rust Actix Web, Diesel ORM, Test Containers, Onion Architecture Python Flask, SQLAlchemy ORM, Test Containers, Onion Architecture","title":"Design"},{"location":"design/readme/#design","text":"Designing software well is hard. ISE has collected a number of practices that we find helpful in the design process. This covers not only technical design of software, but also architecture design and non-functional requirements gathering for new projects.","title":"Design"},{"location":"design/readme/#goals","text":"Provide recommendations for how to design software for maintainability, ease of extension, adherence to best practices, and sustainability. Reference or define process or checklists to help ensure well-designed software. Collate and point to reference sources (guides, repos, articles) that can help shortcut the learning process.","title":"Goals"},{"location":"design/readme/#code-examples","text":"Folder Structure Folder Structure For Python Repository Project Templates Rust Actix Web, Diesel ORM, Test Containers, Onion Architecture Python Flask, SQLAlchemy ORM, Test Containers, Onion Architecture","title":"Code Examples"},{"location":"design/design-patterns/","text":"Design Patterns The design patterns section recommends patterns of software and architecture design. This section provides a curated list of commonly used patterns from trusted sources. Rather than duplicate or replace the cited sources, this section aims to complement them with suggestions, guidance, and learnings based on firsthand experiences.","title":"Design Patterns"},{"location":"design/design-patterns/#design-patterns","text":"The design patterns section recommends patterns of software and architecture design. This section provides a curated list of commonly used patterns from trusted sources. Rather than duplicate or replace the cited sources, this section aims to complement them with suggestions, guidance, and learnings based on firsthand experiences.","title":"Design Patterns"},{"location":"design/design-patterns/cloud-resource-design-guidance/","text":"Cloud Resource Design Guidance As cloud usage scales, considerations for subscription design, management groups, and resource naming/tagging conventions have an impact on governance, operations management, and adoption patterns. Note: Always work with the relevant stakeholders to ensure that introducing new patterns provides the intended value.
When working in an existing cloud environment, it is important to understand any current patterns and how they are used before making a change to them.","title":"Cloud Resource Design Guidance"},{"location":"design/design-patterns/cloud-resource-design-guidance/#resources","text":"The following references can be used to understand the latest best practices in organizing cloud resources: Organizing Subscriptions Resource Tagging Decision Guide Resource Naming Conventions Recommended Azure Resource Abbreviations Organizing Dev/Test/Production Workloads","title":"Resources"},{"location":"design/design-patterns/cloud-resource-design-guidance/#tooling","text":"Azure Resource Naming Tool","title":"Tooling"},{"location":"design/design-patterns/data-heavy-design-guidance/","text":"Data and DataOps Fundamentals Most projects involve some type of data storage, data processing and data ops. For these projects, as with all projects, we follow the general guidelines laid out in other sections around security, testing, observability, CI/CD etc. Goal The goal of this section is to briefly describe how to apply the fundamentals to data heavy projects or portions of the project. Isolation Please be cautious of which isolation levels you are using. Even with a database that offers serializability, it is possible that within a transaction or connection you are leveraging a lower isolation level than the database offers. In particular, read uncommitted (or eventual consistency), can have a lot of unpredictable side effects and introduce bugs that are difficult to reason about. Eventually consistent systems should be treated as a last resort for achieving your scalability requirements; batching, sharding, and caching are all recommended solutions to increase your scalability. If none of these options are tenable, consider evaluating the \"New SQL\" databases like CockroachDB or TiDB, before leveraging an option that relies on eventual consistency. There are other levels of isolation, outside the isolation levels mentioned in the link above. Some of these have nuances different from the 4 main levels, and can be difficult to compare. Snapshot Isolation, strict serializability, \"read your own writes\", monotonic reads, bounded staleness, causal consistency, and linearizability are all other terms you can look into to learn more on the subject. Concurrency Control Your systems should (almost) always leverage some form of concurrency control, to ensure correctness amongst competing requests and to prevent data races. The 2 forms of concurrency control are pessimistic and optimistic . A pessimistic transaction involves a first request to \"lock the data\", and a second request to write the data. In between these requests, no other requests touching that data will succeed. See 2 Phase Locking (also often known as 2 Phase Commit) for more info. The (more) recommended approach is optimistic concurrency, where a user can read the object at a specific version, and update the object if and only if it hasn't changed. This is typically done via the Etag Header . A simple way to accomplish this on the database side is to increment a version number on each update. This can be done in a single executed statement as: WARNING: the below will not work when using an isolation level at or lower than read uncommitted (eventual consistency). -- Please treat this as pseudo code, and adjust as necessary. 
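-- $version below is the version the caller read before making its change: if another
-- writer has already incremented it, the WHERE clause matches zero rows, nothing is
-- updated, and the caller should re-read the record and retry.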
UPDATE < table_name > SET field1 = value1 , ..., fieldN = valueN , version = $ new_version WHERE ID = $ id AND version = $ version Data Tiering (Data Quality) Develop a common understanding of the quality of your datasets so that everyone understands the quality of the data, and expected use cases and limitations. A common data quality model is Bronze , Silver , Gold Bronze: This is a landing area for your raw datasets with none or minimal data transformations applied, and therefore are optimized for writes / ingestion. Treat these datasets as an immutable, append only store. Silver: These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by data scientists. Gold: These are highly processed, highly read-optimized datasets primarily for consumption of business users. Typically, these are structured in your standard fact and dimension tables. Divide your data lake into three major areas containing your Bronze, Silver and Gold datasets. Note: Additional storage areas for malformed data, intermediate (sandbox) data, and libraries/packages/binaries are also useful when designing your storage organization. Data Validation Validate data early in your pipeline Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all datasets conform to a specific schema and known data invariants. This can also potentially prevent data pipeline failures in case of unexpected changes to the input data. Data that does not pass this validation stage can be rerouted to a record store dedicated for malformed data for diagnostic purposes. It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close of a copy of the source system data. This can be used to replay the data pipeline for both testing (i.e. testing data validation logic) and data recovery purposes (i.e. data corruption is introduced due to a bug in the data transformation code and thus the pipeline needs to be replayed). Idempotent Data Pipelines Make your data pipelines re-playable and idempotent Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. By making data pipelines re-playable and idempotent, you can recover from this state through deployment of code fixes, and re-playing the data pipelines. Idempotency also ensures data-duplication is mitigated when replaying your data pipelines. Testing Ensure data transformation code is testable Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages. While it is possible to run tests against notebooks, by extracting the code into packages, you increase the developer productivity by increasing the speed of the feedback cycle. CI/CD, Source Control and Code Reviews All artifacts needed to build the data pipeline from scratch should be in source control. This included infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures etc.), reference/application data, data pipeline definitions and data validation and transformation logic. 
Any new artifacts (code) introduced to the repository should be code reviewed, both automatically (linting, credential scanning etc.) and peer reviewed. There should be a safe, repeatable process (CI/CD) to move the changes through dev, test and finally production. Security and Configuration Maintain a central, secure location for sensitive configuration such as database connection strings that can be accessed by the appropriate services within the specific environment. On Azure this is typically solved through securing secrets in a Key Vault per environment, then having the relevant services query KeyVault for the configuration Observability Monitor infrastructure, pipelines and data A proper monitoring solution should be in-place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data should also be monitored. A common area that should have data monitoring is the malformed record store. End to End and Azure Technology Samples The DataOps for the Modern Data Warehouse repo contains both end-to-end and technology specific samples on how to implement DataOps on Azure. Image: CI/CD for Data pipelines on Azure - from DataOps for the Modern Data Warehouse repo","title":"Data and DataOps Fundamentals"},{"location":"design/design-patterns/data-heavy-design-guidance/#data-and-dataops-fundamentals","text":"Most projects involve some type of data storage, data processing and data ops. For these projects, as with all projects, we follow the general guidelines laid out in other sections around security, testing, observability, CI/CD etc.","title":"Data and DataOps Fundamentals"},{"location":"design/design-patterns/data-heavy-design-guidance/#goal","text":"The goal of this section is to briefly describe how to apply the fundamentals to data heavy projects or portions of the project.","title":"Goal"},{"location":"design/design-patterns/data-heavy-design-guidance/#isolation","text":"Please be cautious of which isolation levels you are using. Even with a database that offers serializability, it is possible that within a transaction or connection you are leveraging a lower isolation level than the database offers. In particular, read uncommitted (or eventual consistency), can have a lot of unpredictable side effects and introduce bugs that are difficult to reason about. Eventually consistent systems should be treated as a last resort for achieving your scalability requirements; batching, sharding, and caching are all recommended solutions to increase your scalability. If none of these options are tenable, consider evaluating the \"New SQL\" databases like CockroachDB or TiDB, before leveraging an option that relies on eventual consistency. There are other levels of isolation, outside the isolation levels mentioned in the link above. Some of these have nuances different from the 4 main levels, and can be difficult to compare. Snapshot Isolation, strict serializability, \"read your own writes\", monotonic reads, bounded staleness, causal consistency, and linearizability are all other terms you can look into to learn more on the subject.","title":"Isolation"},{"location":"design/design-patterns/data-heavy-design-guidance/#concurrency-control","text":"Your systems should (almost) always leverage some form of concurrency control, to ensure correctness amongst competing requests and to prevent data races. The 2 forms of concurrency control are pessimistic and optimistic . 
A pessimistic transaction involves a first request to \"lock the data\", and a second request to write the data. In between these requests, no other requests touching that data will succeed. See 2 Phase Locking (also often known as 2 Phase Commit) for more info. The (more) recommended approach is optimistic concurrency, where a user can read the object at a specific version, and update the object if and only if it hasn't changed. This is typically done via the Etag Header . A simple way to accomplish this on the database side is to increment a version number on each update. This can be done in a single executed statement as: WARNING: the below will not work when using an isolation level at or lower than read uncommitted (eventual consistency). -- Please treat this as pseudo code, and adjust as necessary. UPDATE < table_name > SET field1 = value1 , ..., fieldN = valueN , version = $ new_version WHERE ID = $ id AND version = $ version","title":"Concurrency Control"},{"location":"design/design-patterns/data-heavy-design-guidance/#data-tiering-data-quality","text":"Develop a common understanding of the quality of your datasets so that everyone understands the quality of the data, and expected use cases and limitations. A common data quality model is Bronze , Silver , Gold Bronze: This is a landing area for your raw datasets with none or minimal data transformations applied, and therefore are optimized for writes / ingestion. Treat these datasets as an immutable, append only store. Silver: These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by data scientists. Gold: These are highly processed, highly read-optimized datasets primarily for consumption of business users. Typically, these are structured in your standard fact and dimension tables. Divide your data lake into three major areas containing your Bronze, Silver and Gold datasets. Note: Additional storage areas for malformed data, intermediate (sandbox) data, and libraries/packages/binaries are also useful when designing your storage organization.","title":"Data Tiering (Data Quality)"},{"location":"design/design-patterns/data-heavy-design-guidance/#data-validation","text":"Validate data early in your pipeline Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all datasets conform to a specific schema and known data invariants. This can also potentially prevent data pipeline failures in case of unexpected changes to the input data. Data that does not pass this validation stage can be rerouted to a record store dedicated for malformed data for diagnostic purposes. It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close of a copy of the source system data. This can be used to replay the data pipeline for both testing (i.e. testing data validation logic) and data recovery purposes (i.e. data corruption is introduced due to a bug in the data transformation code and thus the pipeline needs to be replayed).","title":"Data Validation"},{"location":"design/design-patterns/data-heavy-design-guidance/#idempotent-data-pipelines","text":"Make your data pipelines re-playable and idempotent Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. 
By making data pipelines re-playable and idempotent, you can recover from this state through deployment of code fixes, and re-playing the data pipelines. Idempotency also ensures data-duplication is mitigated when replaying your data pipelines.","title":"Idempotent Data Pipelines"},{"location":"design/design-patterns/data-heavy-design-guidance/#testing","text":"Ensure data transformation code is testable Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages. While it is possible to run tests against notebooks, by extracting the code into packages, you increase the developer productivity by increasing the speed of the feedback cycle.","title":"Testing"},{"location":"design/design-patterns/data-heavy-design-guidance/#cicd-source-control-and-code-reviews","text":"All artifacts needed to build the data pipeline from scratch should be in source control. This included infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures etc.), reference/application data, data pipeline definitions and data validation and transformation logic. Any new artifacts (code) introduced to the repository should be code reviewed, both automatically (linting, credential scanning etc.) and peer reviewed. There should be a safe, repeatable process (CI/CD) to move the changes through dev, test and finally production.","title":"CI/CD, Source Control and Code Reviews"},{"location":"design/design-patterns/data-heavy-design-guidance/#security-and-configuration","text":"Maintain a central, secure location for sensitive configuration such as database connection strings that can be accessed by the appropriate services within the specific environment. On Azure this is typically solved through securing secrets in a Key Vault per environment, then having the relevant services query KeyVault for the configuration","title":"Security and Configuration"},{"location":"design/design-patterns/data-heavy-design-guidance/#observability","text":"Monitor infrastructure, pipelines and data A proper monitoring solution should be in-place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data should also be monitored. A common area that should have data monitoring is the malformed record store.","title":"Observability"},{"location":"design/design-patterns/data-heavy-design-guidance/#end-to-end-and-azure-technology-samples","text":"The DataOps for the Modern Data Warehouse repo contains both end-to-end and technology specific samples on how to implement DataOps on Azure. Image: CI/CD for Data pipelines on Azure - from DataOps for the Modern Data Warehouse repo","title":"End to End and Azure Technology Samples"},{"location":"design/design-patterns/distributed-system-design-reference/","text":"Distributed System Design Reference Distributed systems introduce new and interesting problems that need addressing. Software engineering as a field has dealt with these problems for years, and there are phenomenal resources available for reference when creating a new distributed system. 
Some that we recommend are as follows: Martin Fowler's Patterns of Distributed Systems microservices.io Azure's Cloud Design Patterns","title":"Distributed System Design Reference"},{"location":"design/design-patterns/distributed-system-design-reference/#distributed-system-design-reference","text":"Distributed systems introduce new and interesting problems that need addressing. Software engineering as a field has dealt with these problems for years, and there are phenomenal resources available for reference when creating a new distributed system. Some that we recommend are as follows: Martin Fowler's Patterns of Distributed Systems microservices.io Azure's Cloud Design Patterns","title":"Distributed System Design Reference"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/","text":"Network Architecture Guidance for Azure The following are some best practices when setting up and working with network resources in Azure Cloud environments. Note: When working in an existing cloud environment, it is important to understand any current patterns, and how they are used, before making a change to them. You should also work with the relevant stakeholders to make sure that any new patterns you introduce provide enough value to make the change. Networking and VNet Setup Hub-and-Spoke Topology A hub-and-spoke network topology is a common architecture pattern used in Azure for organizing and managing network resources. It is based on the concept of a central hub that connects to various spoke networks. This model is particularly useful for organizing resources, maintaining security, and simplifying network management. The hub-and-spoke model is implemented using Azure Virtual Networks (VNet) and VNet peering. The hub: The central VNet acts as a hub, providing shared services such as network security, monitoring, and connectivity to on-premises or other cloud environments. Common components in the hub include Network Virtual Appliances (NVAs), Azure Firewall, VPN Gateway, and ExpressRoute Gateway. The spokes: The spoke VNets represent separate units or applications within an organization, each with its own set of resources and services. They connect to the hub through VNet peering, which allows for communication between the hub and spoke VNets. Implementing a hub-and-spoke model in Azure offers several benefits: Isolation and segmentation: By dividing resources into separate spoke VNets, you can isolate and segment workloads, preventing any potential issues or security risks from affecting other parts of the network. Centralized management: The hub VNet acts as a single point of management for shared services, making it easier to maintain, monitor, and enforce policies across the network. Simplified connectivity: VNet peering enables seamless communication between the hub and spoke VNets without the need for complex routing or additional gateways, reducing latency and management overhead. Scalability: The hub-and-spoke model can easily scale to accommodate additional spokes as the organization grows or as new applications and services are introduced. Cost savings: By centralizing shared services in the hub, organizations can reduce the costs associated with deploying and managing multiple instances of the same services across different VNets. Read more about hub-and-spoke topology When deploying hub/spoke, it is recommended that you do so in connection with landing zones . 
This ensures consistency across all environments as well as guardrails to ensure a high level of security while giving developers freedom within development environments. Firewall and Security When using a hub-and-spoke topology it is possible to deploy a centralized firewall in the Hub that all outgoing traffic or traffic to/from certain VNets, this allows for centralized threat protection while minimizing costs. DNS The best practices for handling DNS in Azure, and in cloud environments in general, include using managed DNS services. Some of the benefits of using managed DNS services is that the resources are designed to be secure, easy to deploy and configure. DNS forwarding: Set up DNS forwarding between your on-premises DNS servers and Azure DNS servers for name resolution across environments. Use Azure Private DNS zones for Azure resources: Configure Azure Private DNS zones for your Azure resources to ensure name resolution is kept within the virtual network. Read more about Hybrid/Multi-Cloud DNS infrastructure and Azure DNS infrastructure IP Allocation When allocating IP address spaces to Azure Virtual Networks (VNets), it's essential to follow best practices for proper management, and scalability. Here are some recommendations for IP allocation to VNets: Reserve IP addresses: Reserve IP addresses in your address space for critical resources or services. Public IP allocation: Minimize the use of public IP addresses and use Azure Private Link when possible to access services over a private connection. IP address management (IPAM): Use IPAM solutions to manage and track IP address allocation across your hybrid environment. Plan your address space: Choose an appropriate private address space (from RFC 1918) for your VNets that is large enough to accommodate future growth. Avoid overlapping with on-premises or other cloud networks. Use CIDR notation: Use Classless Inter-Domain Routing (CIDR) notation to define the VNet address space, which allows more efficient allocation and prevents wasting IP addresses. Use subnets: Divide your VNets into smaller subnets based on security, application, or environment requirements. This allows for better network management and security. Consider leaving a buffer between VNets: While it's not strictly necessary, leaving a buffer between VNets can be beneficial in some cases, especially when you anticipate future growth or when you might need to merge VNets. This can help avoid re-addressing conflicts when expanding or merging networks. Reserve IP addresses: Reserve a range of IP addresses within your VNet address space for critical resources or services. This ensures that they have a static IP address, which is essential for specific services or applications. Plan for hybrid scenarios: If you're working in a hybrid environment with on-premises or multi-cloud networks, ensure that you plan for IP address allocation across all environments. This includes avoiding overlapping address spaces and reserving IP addresses for specific resources like VPN gateways or ExpressRoute circuits. Read more at azure-best-practices/plan-for-ip-addressing Resource Allocation For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Network Architecture Guidance for Azure"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#network-architecture-guidance-for-azure","text":"The following are some best practices when setting up and working with network resources in Azure Cloud environments. 
Note: When working in an existing cloud environment, it is important to understand any current patterns, and how they are used, before making a change to them. You should also work with the relevant stakeholders to make sure that any new patterns you introduce provide enough value to make the change.","title":"Network Architecture Guidance for Azure"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#networking-and-vnet-setup","text":"","title":"Networking and VNet Setup"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#hub-and-spoke-topology","text":"A hub-and-spoke network topology is a common architecture pattern used in Azure for organizing and managing network resources. It is based on the concept of a central hub that connects to various spoke networks. This model is particularly useful for organizing resources, maintaining security, and simplifying network management. The hub-and-spoke model is implemented using Azure Virtual Networks (VNet) and VNet peering. The hub: The central VNet acts as a hub, providing shared services such as network security, monitoring, and connectivity to on-premises or other cloud environments. Common components in the hub include Network Virtual Appliances (NVAs), Azure Firewall, VPN Gateway, and ExpressRoute Gateway. The spokes: The spoke VNets represent separate units or applications within an organization, each with its own set of resources and services. They connect to the hub through VNet peering, which allows for communication between the hub and spoke VNets. Implementing a hub-and-spoke model in Azure offers several benefits: Isolation and segmentation: By dividing resources into separate spoke VNets, you can isolate and segment workloads, preventing any potential issues or security risks from affecting other parts of the network. Centralized management: The hub VNet acts as a single point of management for shared services, making it easier to maintain, monitor, and enforce policies across the network. Simplified connectivity: VNet peering enables seamless communication between the hub and spoke VNets without the need for complex routing or additional gateways, reducing latency and management overhead. Scalability: The hub-and-spoke model can easily scale to accommodate additional spokes as the organization grows or as new applications and services are introduced. Cost savings: By centralizing shared services in the hub, organizations can reduce the costs associated with deploying and managing multiple instances of the same services across different VNets. Read more about hub-and-spoke topology When deploying hub/spoke, it is recommended that you do so in connection with landing zones . This ensures consistency across all environments as well as guardrails to ensure a high level of security while giving developers freedom within development environments.","title":"Hub-and-Spoke Topology"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#firewall-and-security","text":"When using a hub-and-spoke topology it is possible to deploy a centralized firewall in the Hub through which all outgoing traffic, or traffic to/from certain VNets, is routed. This allows for centralized threat protection while minimizing costs.","title":"Firewall and Security"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#dns","text":"The best practices for handling DNS in Azure, and in cloud environments in general, include using managed DNS services.
Some of the benefits of using managed DNS services are that the resources are designed to be secure and easy to deploy and configure. DNS forwarding: Set up DNS forwarding between your on-premises DNS servers and Azure DNS servers for name resolution across environments. Use Azure Private DNS zones for Azure resources: Configure Azure Private DNS zones for your Azure resources to ensure name resolution is kept within the virtual network. Read more about Hybrid/Multi-Cloud DNS infrastructure and Azure DNS infrastructure","title":"DNS"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#ip-allocation","text":"When allocating IP address spaces to Azure Virtual Networks (VNets), it's essential to follow best practices for proper management and scalability. Here are some recommendations for IP allocation to VNets: Reserve IP addresses: Reserve IP addresses in your address space for critical resources or services. Public IP allocation: Minimize the use of public IP addresses and use Azure Private Link when possible to access services over a private connection. IP address management (IPAM): Use IPAM solutions to manage and track IP address allocation across your hybrid environment. Plan your address space: Choose an appropriate private address space (from RFC 1918) for your VNets that is large enough to accommodate future growth. Avoid overlapping with on-premises or other cloud networks. Use CIDR notation: Use Classless Inter-Domain Routing (CIDR) notation to define the VNet address space, which allows more efficient allocation and prevents wasting IP addresses. Use subnets: Divide your VNets into smaller subnets based on security, application, or environment requirements. This allows for better network management and security. Consider leaving a buffer between VNets: While it's not strictly necessary, leaving a buffer between VNets can be beneficial in some cases, especially when you anticipate future growth or when you might need to merge VNets. This can help avoid re-addressing conflicts when expanding or merging networks. Reserve IP addresses: Reserve a range of IP addresses within your VNet address space for critical resources or services. This ensures that they have a static IP address, which is essential for specific services or applications. Plan for hybrid scenarios: If you're working in a hybrid environment with on-premises or multi-cloud networks, ensure that you plan for IP address allocation across all environments. This includes avoiding overlapping address spaces and reserving IP addresses for specific resources like VPN gateways or ExpressRoute circuits. Read more at azure-best-practices/plan-for-ip-addressing","title":"IP Allocation"},{"location":"design/design-patterns/network-architecture-guidance-for-azure/#resource-allocation","text":"For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Resource Allocation"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/","text":"Network Architecture Guidance for Hybrid The following are best practices around how to design and configure resources used for Hybrid and Multi-Cloud environments. Note: When working in an existing hybrid environment, it is important to understand any current patterns, and how they are used before making any changes.
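To make the CIDR planning guidance above concrete, here is a minimal sketch using only Python's standard library ipaddress module; the address ranges, prefix lengths, and workload split are hypothetical examples, not recommendations for any specific environment.

```python
import ipaddress

# Hypothetical VNet address space chosen from the RFC 1918 private ranges.
vnet = ipaddress.ip_network("10.100.0.0/16")

# Carve the VNet into /24 subnets for different workloads (app, data, AKS, ...).
subnets = list(vnet.subnets(new_prefix=24))
print(f"{vnet} yields {len(subnets)} /24 subnets, e.g. {subnets[0]} and {subnets[1]}")

# Check that a proposed new VNet does not overlap already-allocated address spaces.
existing = [ipaddress.ip_network("10.100.0.0/16"), ipaddress.ip_network("172.16.0.0/12")]
proposed = ipaddress.ip_network("10.101.0.0/16")
if any(proposed.overlaps(net) for net in existing):
    print(f"{proposed} overlaps an existing range - choose a different address space")
else:
    print(f"{proposed} is safe to allocate")
```

Running a check like this before provisioning helps catch sizing mistakes and overlaps while they are still cheap to fix.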
Hub-and-Spoke Topology The hub-and-spoke topology doesn't change much when using cloud/hybrid if configured correctly. The main difference is that the hub VNet peers with the on-prem network via an ExpressRoute connection and that all traffic from Azure might exit via the ExpressRoute and the on-prem internet connection. The generalized best practices are in Network Architecture Guidance for Azure#Hub and Spoke topology IP Allocation When working with hybrid deployments, take extra care when planning IP allocation as there is a much greater risk of overlapping network ranges. The general best practices are available in the Network Architecture Guidance for Azure#ip-allocation Read more about this in Azure Best Practices Plan for IP Addressing ExpressRoute Environments using ExpressRoute often tunnel all traffic from Azure via a private link (ExpressRoute) to an on-prem location. This introduces a few problems when working with PaaS offerings, as not all of them connect via their respective private endpoint and instead use an external IP for outgoing connections, or some PaaS-to-PaaS traffic occurs internally in Azure and won't function with public networks disabled. Two notable examples here are data plane copies of storage accounts and the many services that do not support private endpoints. Choose the right ExpressRoute circuit: Select an appropriate SKU (Standard or Premium) and bandwidth based on your organization's requirements. Redundancy: Ensure redundancy by provisioning two ExpressRoute circuits in different peering locations. Monitoring: Use Azure Monitor and Network Performance Monitor (NPM) to monitor the health and performance of your ExpressRoute circuits. DNS General best practices are available in Network Architecture Guidance for Azure#dns When using Azure DNS in a hybrid or multi-cloud environment it is important to ensure a consistent DNS and forwarding configuration which ensures that records are automatically updated and that all DNS servers are aware of each other and know which server is authoritative for the different records. Read more about Hybrid/Multi-Cloud DNS infrastructure Resource Allocation For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Network Architecture Guidance for Hybrid"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#network-architecture-guidance-for-hybrid","text":"The following are best practices around how to design and configure resources used for Hybrid and Multi-Cloud environments. Note: When working in an existing hybrid environment, it is important to understand any current patterns, and how they are used before making any changes.","title":"Network Architecture Guidance for Hybrid"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#hub-and-spoke-topology","text":"The hub-and-spoke topology doesn't change much when using cloud/hybrid if configured correctly. The main difference is that the hub VNet peers with the on-prem network via an ExpressRoute connection and that all traffic from Azure might exit via the ExpressRoute and the on-prem internet connection. The generalized best practices are in Network Architecture Guidance for Azure#Hub and Spoke topology","title":"Hub-and-Spoke Topology"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#ip-allocation","text":"When working with hybrid deployments, take extra care when planning IP allocation as there is a much greater risk of overlapping network ranges.
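As a quick sanity check against that overlap risk, a small sketch (again using only the standard library ipaddress module; the environment names and CIDR ranges are hypothetical) can flag conflicting ranges across on-prem and cloud address plans before any peering or VPN is configured:

```python
import ipaddress
from itertools import combinations

# Hypothetical address plan spanning on-prem and cloud environments.
address_plan = {
    "onprem-datacenter": "10.0.0.0/16",
    "azure-hub-vnet": "10.1.0.0/16",
    "azure-spoke-vnet": "10.2.0.0/16",
    "other-cloud-vpc": "10.1.128.0/17",  # deliberately overlaps the hub VNet
}
networks = {name: ipaddress.ip_network(cidr) for name, cidr in address_plan.items()}

# Report every pair of ranges that overlap.
for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
    if net_a.overlaps(net_b):
        print(f"Conflict: {name_a} ({net_a}) overlaps {name_b} ({net_b})")
```

The same check can be rerun whenever the address plan changes.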
The general best practices are available in the Network Architecture Guidance for Azure#ip-allocation Read more about this in Azure Best Practices Plan for IP Addressing","title":"IP Allocation"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#expressroute","text":"Environments using ExpressRoute often tunnel all traffic from Azure via a private link (ExpressRoute) to an on-prem location. This introduces a few problems when working with PaaS offerings, as not all of them connect via their respective private endpoint and instead use an external IP for outgoing connections, or some PaaS-to-PaaS traffic occurs internally in Azure and won't function with public networks disabled. Two notable examples here are data plane copies of storage accounts and the many services that do not support private endpoints. Choose the right ExpressRoute circuit: Select an appropriate SKU (Standard or Premium) and bandwidth based on your organization's requirements. Redundancy: Ensure redundancy by provisioning two ExpressRoute circuits in different peering locations. Monitoring: Use Azure Monitor and Network Performance Monitor (NPM) to monitor the health and performance of your ExpressRoute circuits.","title":"ExpressRoute"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#dns","text":"General best practices are available in Network Architecture Guidance for Azure#dns When using Azure DNS in a hybrid or multi-cloud environment it is important to ensure a consistent DNS and forwarding configuration which ensures that records are automatically updated and that all DNS servers are aware of each other and know which server is authoritative for the different records. Read more about Hybrid/Multi-Cloud DNS infrastructure","title":"DNS"},{"location":"design/design-patterns/network-architecture-guidance-for-hybrid/#resource-allocation","text":"For resource allocation the best practices from Cloud Resource Design Guidance should be followed.","title":"Resource Allocation"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/","text":"Non-Functional Requirements Capture Goals In software engineering projects, non-functional requirements, also known as quality attributes, are specifications that define the operational attributes of a system rather than its specific behaviors. Unlike functional requirements, which outline what a system should do, non-functional requirements describe how the system performs certain functions under specific conditions. Non-functional requirements generally increase the cost as they require special efforts during the implementation, but by defining these requirements in detail early in the engagement, they can be properly evaluated when the cost of their impact on subsequent design decisions is comparatively low. Documenting Non-Functional Requirements - Best Practices Be specific: Avoid ambiguity and make sure the requirement is quantitative, measurable and testable. Relate requirements with business objectives and understand the real impact of the system's behavior. Break it down: Try to define requirements at the component or process scope instead of the whole solution. Understand trade-off: Non-functional requirements may be in conflict with each other and it can be difficult to balance them and prioritize which one to implement. Template This template can serve as a structured framework for capturing and documenting non-functional requirements effectively.
Adjustments can be made to tailor it to the specific needs and preferences of the project team. Requirement name: name or title Description: brief description. Describe the importance and impact of this requirement to the business. Priority: High/Medium/Low or Must-have/Nice-to-have, etc Measurement/Metric: metric or measurement criteria Verification Method: Automated test, benchmark, simulation, prototyping, etc. Constraints: Budget, Time, Resources, Infrastructure, etc. Owner/Responsible Party Dependencies: technical dependencies, data dependencies, regulatory dependencies, etc. Examples To support the process of capturing a project's comprehensive non-functional requirements, this document offers a taxonomy for non-functional requirements and provides a framework for their identification, exploration, assignment of customer stakeholders, and eventual codification into formal engineering requirements as input to subsequent solution design. Operational Requirements Quality Attribute Description Common Metrics Availability System's uptime and accessibility to users. - Uptime: Uptime measures the percentage of time that a system is operational and available for use. It is typically expressed as a percentage of total time (e.g., 99.9% uptime means the system is available 99.9% of the time). Common thresholds for uptime include: 99% uptime: The system is available 99% of the time, allowing for approximately 3.65 days of downtime per year. 99.9% uptime (three nines): The system is available 99.9% of the time, allowing for approximately 8.76 hours of downtime per year. 99.99% uptime (four nines): The system is available 99.99% of the time, allowing for approximately 52.56 minutes of downtime per year. 99.999% uptime (five nines): The system is available 99.999% of the time, allowing for approximately 5.26 minutes of downtime per year. Data Integrity Accuracy and consistency of data throughout its lifecycle. - Error Rate: The proportion of data entries that contain errors or inaccuracies. (\\text{Error Rate} = \\left( \\frac{\\text{Number of Errors}}{\\text{Total Number of Entries}} \\right) \\times 100) - Accuracy Rate: The percentage of data entries that are correct and match the source of truth. (\\text{Accuracy Rate} = \\left( \\frac{\\text{Number of Accurate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) - Duplicate Record Rate: The percentage of data entries that are duplicates. (\\text{Duplicate Record Rate} = \\left( \\frac{\\text{Number of Duplicate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) Disaster recovery and business continuity Determine the system's requirements for disaster recovery and business continuity, including backup and recovery procedures and disaster recovery testing. - Backup and Recovery: The application must have a Backup and Recovery plan in place that includes regular backups of all data and configurations, and a process for restoring data and functionality in the event of a disaster or disruption. - Redundancy: The application must have Redundancy built into its infrastructure, such as redundant servers, network devices, and power supplies, to ensure high availability and minimize downtime in the event of a failure. - Failover and high availability: The application must be designed to support Failover and high availability, such as by using load balancers or Failover clusters, to ensure that it can continue to operate in the event of a system failure or disruption. 
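The uptime thresholds listed above translate directly into downtime budgets; a short sketch of that arithmetic (figures are approximate and match the commonly quoted values):

```python
# Convert an uptime SLA percentage into the downtime it allows per year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

for uptime_pct in (99.0, 99.9, 99.99, 99.999):
    downtime_hours = HOURS_PER_YEAR * (1 - uptime_pct / 100)
    if downtime_hours >= 24:
        budget = f"~{downtime_hours / 24:.2f} days"
    elif downtime_hours >= 1:
        budget = f"~{downtime_hours:.2f} hours"
    else:
        budget = f"~{downtime_hours * 60:.2f} minutes"
    print(f"{uptime_pct}% uptime allows {budget} of downtime per year")
```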
- Disaster Recovery plan: The application must have a comprehensive disaster Recovery plan that includes procedures for restoring data and functionality in the event of a major disaster, such as a natural disaster, cyber attack, or other catastrophic event. - Testing and Maintenance: The application must be regularly tested and maintained to ensure that it can withstand a disaster or disruption, and that all systems, processes, and data can be quickly restored and recovered. Reliability System's ability to maintain functionality under varying conditions and failure scenarios. - Mean Time Between Failures (MTBF): The system should achieve an MTBF of at least 1000 hours, indicating a high level of reliability with infrequent failures. - Mean Time to Recover (MTTR): The system should aim for an MTTR of less than 1 hour, ensuring quick recovery and minimal disruption in the event of a failure. - Redundancy Levels: The system should include redundancy mechanisms to achieve a redundancy level of N+1, ensuring high availability and fault tolerance. Performance Requirements Quality Attribute Description Common Metrics Capacity Maximum load or volume that the system can handle within specified performance criteria. - Maximum Load Capacity: The system should be capable of handling peak loads without exceeding predefined performance degradation thresholds. Maximum load capacity may be expressed in terms of concurrent users, transactions per second, or data volume. - Resource Utilization: Measures the percentage of system resources (CPU, memory, disk I/O, network bandwidth) consumed under normal operation. - Concurrency: Measures the number of simultaneous users or transactions the system can handle without degradation in performance. - Throughput: Measures the rate at which the system processes transactions, requests, or data. Thresholds may be defined in terms of transactions per second, requests per minute, or data throughput in bytes per second. Performance Define the expected response times, throughput, and resource usage of the solution. - Response time: The application must load and respond to user interactions within 500 ms for button clicks. - Throughput: The application must be able to handle 100 concurrent users or 500 transactions per second. - Resource utilization: The application must use less than 80% of CPU and 1 GB of memory. - Error rates: The application must have an error rate less than 1% of all requests, and be able to handle and recover from errors gracefully, without impacting user experience or data integrity. Scalability Determine how the system will handle increased user loads or larger datasets over time. - Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. - Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. - Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. 
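Targets like the 500 ms response time or the sub-1% error rate above only become verifiable once the measurement method is defined; here is a minimal sketch (the latency samples, thresholds, and percentile choice are hypothetical) of evaluating recorded latencies against such targets:

```python
import statistics

# Hypothetical latency samples (milliseconds) and outcomes from a load test run;
# a real run would use far more samples than this.
latencies_ms = [120, 180, 210, 340, 95, 410, 505, 260, 150, 330, 220, 480]
failed_requests = 1
total_requests = len(latencies_ms)

# statistics.quantiles(..., n=100) returns the 1st..99th percentile cut points.
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]
error_rate_pct = failed_requests / total_requests * 100

print(f"p95 latency: {p95_ms:.0f} ms -> {'PASS' if p95_ms <= 500 else 'FAIL'} (target <= 500 ms)")
print(f"error rate: {error_rate_pct:.1f}% -> {'PASS' if error_rate_pct < 1 else 'FAIL'} (target < 1%)")
```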
- Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability. Security and Compliance Requirements Quality Attribute Description Common Metrics Compliance Adherence to legal, regulatory, and industry standards and requirements. See Microsoft Purview Compliance Manager Privacy Protection of sensitive information and compliance with privacy regulations. - Compliance with Privacy Regulations: Achieve full compliance with GDPR, CCPA and HIPAA. - Data Anonymization: Implement anonymization techniques in protecting individual privacy while still allowing for data analysis. - Data Encryption: Ensure that sensitive data is encrypted according to encryption standards and best practices. - User Privacy Preferences: The ability to respect and accommodate user privacy preferences regarding data collection, processing, and sharing. Security Establish the security requirements of the system, such as authentication, authorization, encryption, and compliance with industry or legal regulations. See Threat Modeling Tool Sustainability Ability to operate over an extended period while minimizing environmental impact and resource consumption. - Energy Efficiency: Kilowatt-hours/Transaction. - Carbon Footprint: Tons of CO2 emissions per year. System Maintainability Requirements Quality Attribute Description Common Metrics Interoperability Ability to interact and exchange data with other systems or components. - Data Format Compatibility: The system must be interoperable with various Electronic Health Records (HER) systems to exchange patient data securely. - Protocol Compatibility: The system should import and export banking information from the ERP using REST protocol. - API Compatibility: The solution must adhere to API standards, ensuring backward compatibility with previous API versions, and providing comprehensive documentation for developers. Maintainability Ease of modifying, updating, and extending the software over time. - Code Complexity: The level of complexity in the system's codebase, measured using metrics such as cyclomatic complexity or lines of code per function. Lower code complexity makes maintenance tasks easier and reduces the likelihood of introducing defects. A cyclomatic complexity score of less than 10 or a lines of code per function metric below 50 is often desirable. - Code Coverage: The percentage of code covered by automated tests. Higher code coverage indicates better testability and facilitates easier maintenance by enabling faster detection of defects. A code coverage threshold of 80% or higher is commonly targeted. - Documentation Quality: The comprehensiveness and clarity of documentation accompanying the system, including design documents, technical specifications, and user manuals. Well-written documentation reduces the time and effort required for maintenance tasks. Documentation should cover at least 80% of system functionality with clear explanations and examples. - Dependency Management: The management of external dependencies and libraries used in the system. Proper dependency management reduces the risk of compatibility issues and simplifies maintenance tasks such as updates and patches. - Code Churn: The frequency of code changes within a software system. High code churn may indicate instability or frequent updates, making maintenance more challenging. A code churn rate of less than 20% is generally considered acceptable. 
Observability The ability to measure a system's internal state and performance based on the outputs it generates, such as logs, metrics, and traces. -System Metrics: CPU usage, memory usage, disk I/O, network I/O, and other resource utilization metrics. - Application Metrics: Response times, request rates, error rates, and throughput. - Custom Metrics: Application-specific metrics, such as user sign-ups, or specific business logic indicators. Portability Ability to run the software on different platforms, environments, and devices. - Platform Compatibility: The ability of the software to run on different operating systems (e.g., Windows, macOS, Linux) or platforms (e.g., desktop, mobile, web). Portability requires the software to be compatible with multiple platforms, with a goal of supporting at least three major platforms. - Hardware Compatibility: The ability of the software to run on different hardware configurations, such as varying processor architectures (e.g., x86, ARM) or memory sizes. Portability involves ensuring compatibility with a wide range of hardware configurations, with a goal of supporting common hardware architectures. - File System Independence: The software's ability to operate independently of the underlying file system, ensuring compatibility with different file systems (e.g., NTFS, ext4, APFS). Portability involves using file system abstraction layers or APIs to abstract file system operations and ensure consistency across platforms. - Data Format Compatibility: The software's ability to read and write data in different formats, ensuring compatibility with common data interchange formats (e.g., JSON, XML, CSV). Portability involves supporting standard data formats and providing mechanisms for data conversion and interoperability. User Experience Requirements Quality Attribute Description Common Metrics Accessibility The solution must be usable by people with disabilities. Compliance with accessibility standards. Support for assistive technologies - Alternative Text for Images: All images and non-text content must have alternative text descriptions that can be read by screen readers. - Color contrast: The application must use color schemes that meet the recommended contrast ratio between foreground and background colors to ensure visibility for users with low vision. - Focus indicators: The application must provide visible focus indicators to highlight the currently focused element, which is especially important for users who rely on keyboard navigation. - Captions and Transcripts: All audio and video content must have captions and transcripts, to ensure that users with hearing impairments can access the content. - Language identification: The application must correctly identify the language of the content, to ensure that screen readers and other assistive technologies can read the content properly. Internationalization and Localization Adaptation of the software for use in different languages and cultures. Tailoring the software to meet the specific needs of different regions or locales. - Language and Locale Support: The software's support for different languages, character sets, and locales. Portability requires internationalization and localization efforts to ensure that the software can be used effectively in different regions and cultures, with support for at least five major languages. - Multi currency: The system's support for multiple currencies, allowing different symbols and conversion rates. 
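Custom application metrics like the sign-up counter mentioned under Observability above are typically emitted through an instrumentation API; here is a minimal sketch assuming the OpenTelemetry Python API (the meter name, metric names, and attributes are hypothetical, and a metrics SDK/exporter must be configured separately for the data to actually be collected):

```python
from opentelemetry import metrics

# Obtain a meter for this component; names and attributes here are illustrative.
meter = metrics.get_meter("checkout-service")

# Custom business metric: completed user sign-ups, tagged by plan type.
signup_counter = meter.create_counter(
    "user_signups", unit="1", description="Completed user sign-ups"
)

# Application metric: request duration histogram for latency percentiles.
request_duration = meter.create_histogram(
    "http.server.request.duration", unit="ms", description="Request duration"
)

signup_counter.add(1, {"plan": "free"})
request_duration.record(182.5, {"route": "/signup"})
```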
Usability Intuitiveness, ease of learning, and user satisfaction with the software interface. - Task Completion Time: The average time it takes for users to complete specific tasks. A user must be able to complete an account settings in less than 2 minutes. - Ease of Navigation: The ease with which users can navigate through the system and find the information they need. This can be measured by observing user interactions or conducting usability tests. - User Satisfaction: User satisfaction can be measured using surveys, feedback forms, or satisfaction ratings. A satisfaction score of 70% or higher is typically considered satisfactory. - Learnability: The ease with which new users can learn to use the system. This can be measured by the time it takes for users to perform basic tasks or by conducting usability tests with novice users.","title":"Non-Functional Requirements Capture"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#non-functional-requirements-capture","text":"","title":"Non-Functional Requirements Capture"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#goals","text":"In software engineering projects, non-functional requirements, also known as quality attributes, are specifications that define the operational attributes of a system rather than its specific behaviors. Unlike functional requirements, which outline what a system should do, non-functional requirements describe how the system performs certain functions under specific conditions. Non-functional requirements generally increase the cost as they require special efforts during the implementation, but by defining these requirements in detail early in the engagement, they can be properly evaluated when the cost of their impact on subsequent design decisions is comparatively low.","title":"Goals"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#documenting-non-functional-requirements-best-practices","text":"Be specific: Avoid ambiguity and make sure the requirement is quantitative, measurable and testable. Relate requirements with business objectives and understand the real impact of the system's behavior. Break it down: Try to define requirements at the component or process scope instead of the whole solution. Understand trade-off: Non-functional requirements may be in conflict with each other and it can be difficult to balance them and prioritize which one to implement.","title":"Documenting Non-Functional Requirements - Best Practices"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#template","text":"This template can serve as a structured framework for capturing and documenting non-functional requirements effectively. Adjustments can be made to tailor it to the specific needs and preferences of the project team. Requirement name: name or title Description: brief description. Describe the importance and impact of this requirement to the business. Priority: High/Medium/Low or Must-have/Nice-to-have, etc Measurement/Metric: metric or measurement criteria Verification Method: Automated test, benchmark, simulation, prototyping, etc. Constraints: Budget, Time, Resources, Infrastructure, etc. 
Owner/Responsible Party Dependencies: technical dependencies, data dependencies, regulatory dependencies, etc.","title":"Template"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#examples","text":"To support the process of capturing a project's comprehensive non-functional requirements, this document offers a taxonomy for non-functional requirements and provides a framework for their identification, exploration, assignment of customer stakeholders, and eventual codification into formal engineering requirements as input to subsequent solution design.","title":"Examples"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#operational-requirements","text":"Quality Attribute Description Common Metrics Availability System's uptime and accessibility to users. - Uptime: Uptime measures the percentage of time that a system is operational and available for use. It is typically expressed as a percentage of total time (e.g., 99.9% uptime means the system is available 99.9% of the time). Common thresholds for uptime include: 99% uptime: The system is available 99% of the time, allowing for approximately 3.65 days of downtime per year. 99.9% uptime (three nines): The system is available 99.9% of the time, allowing for approximately 8.76 hours of downtime per year. 99.99% uptime (four nines): The system is available 99.99% of the time, allowing for approximately 52.56 minutes of downtime per year. 99.999% uptime (five nines): The system is available 99.999% of the time, allowing for approximately 5.26 minutes of downtime per year. Data Integrity Accuracy and consistency of data throughout its lifecycle. - Error Rate: The proportion of data entries that contain errors or inaccuracies. (\\text{Error Rate} = \\left( \\frac{\\text{Number of Errors}}{\\text{Total Number of Entries}} \\right) \\times 100) - Accuracy Rate: The percentage of data entries that are correct and match the source of truth. (\\text{Accuracy Rate} = \\left( \\frac{\\text{Number of Accurate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) - Duplicate Record Rate: The percentage of data entries that are duplicates. (\\text{Duplicate Record Rate} = \\left( \\frac{\\text{Number of Duplicate Entries}}{\\text{Total Number of Entries}} \\right) \\times 100) Disaster recovery and business continuity Determine the system's requirements for disaster recovery and business continuity, including backup and recovery procedures and disaster recovery testing. - Backup and Recovery: The application must have a Backup and Recovery plan in place that includes regular backups of all data and configurations, and a process for restoring data and functionality in the event of a disaster or disruption. - Redundancy: The application must have Redundancy built into its infrastructure, such as redundant servers, network devices, and power supplies, to ensure high availability and minimize downtime in the event of a failure. - Failover and high availability: The application must be designed to support Failover and high availability, such as by using load balancers or Failover clusters, to ensure that it can continue to operate in the event of a system failure or disruption. - Disaster Recovery plan: The application must have a comprehensive disaster Recovery plan that includes procedures for restoring data and functionality in the event of a major disaster, such as a natural disaster, cyber attack, or other catastrophic event. 
- Testing and Maintenance: The application must be regularly tested and maintained to ensure that it can withstand a disaster or disruption, and that all systems, processes, and data can be quickly restored and recovered. Reliability System's ability to maintain functionality under varying conditions and failure scenarios. - Mean Time Between Failures (MTBF): The system should achieve an MTBF of at least 1000 hours, indicating a high level of reliability with infrequent failures. - Mean Time to Recover (MTTR): The system should aim for an MTTR of less than 1 hour, ensuring quick recovery and minimal disruption in the event of a failure. - Redundancy Levels: The system should include redundancy mechanisms to achieve a redundancy level of N+1, ensuring high availability and fault tolerance.","title":"Operational Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#performance-requirements","text":"Quality Attribute Description Common Metrics Capacity Maximum load or volume that the system can handle within specified performance criteria. - Maximum Load Capacity: The system should be capable of handling peak loads without exceeding predefined performance degradation thresholds. Maximum load capacity may be expressed in terms of concurrent users, transactions per second, or data volume. - Resource Utilization: Measures the percentage of system resources (CPU, memory, disk I/O, network bandwidth) consumed under normal operation. - Concurrency: Measures the number of simultaneous users or transactions the system can handle without degradation in performance. - Throughput: Measures the rate at which the system processes transactions, requests, or data. Thresholds may be defined in terms of transactions per second, requests per minute, or data throughput in bytes per second. Performance Define the expected response times, throughput, and resource usage of the solution. - Response time: The application must load and respond to user interactions within 500 ms for button clicks. - Throughput: The application must be able to handle 100 concurrent users or 500 transactions per second. - Resource utilization: The application must use less than 80% of CPU and 1 GB of memory. - Error rates: The application must have an error rate less than 1% of all requests, and be able to handle and recover from errors gracefully, without impacting user experience or data integrity. Scalability Determine how the system will handle increased user loads or larger datasets over time. - Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. - Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. - Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. 
- Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability.","title":"Performance Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#security-and-compliance-requirements","text":"Quality Attribute Description Common Metrics Compliance Adherence to legal, regulatory, and industry standards and requirements. See Microsoft Purview Compliance Manager Privacy Protection of sensitive information and compliance with privacy regulations. - Compliance with Privacy Regulations: Achieve full compliance with GDPR, CCPA and HIPAA. - Data Anonymization: Implement anonymization techniques in protecting individual privacy while still allowing for data analysis. - Data Encryption: Ensure that sensitive data is encrypted according to encryption standards and best practices. - User Privacy Preferences: The ability to respect and accommodate user privacy preferences regarding data collection, processing, and sharing. Security Establish the security requirements of the system, such as authentication, authorization, encryption, and compliance with industry or legal regulations. See Threat Modeling Tool Sustainability Ability to operate over an extended period while minimizing environmental impact and resource consumption. - Energy Efficiency: Kilowatt-hours/Transaction. - Carbon Footprint: Tons of CO2 emissions per year.","title":"Security and Compliance Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#system-maintainability-requirements","text":"Quality Attribute Description Common Metrics Interoperability Ability to interact and exchange data with other systems or components. - Data Format Compatibility: The system must be interoperable with various Electronic Health Records (HER) systems to exchange patient data securely. - Protocol Compatibility: The system should import and export banking information from the ERP using REST protocol. - API Compatibility: The solution must adhere to API standards, ensuring backward compatibility with previous API versions, and providing comprehensive documentation for developers. Maintainability Ease of modifying, updating, and extending the software over time. - Code Complexity: The level of complexity in the system's codebase, measured using metrics such as cyclomatic complexity or lines of code per function. Lower code complexity makes maintenance tasks easier and reduces the likelihood of introducing defects. A cyclomatic complexity score of less than 10 or a lines of code per function metric below 50 is often desirable. - Code Coverage: The percentage of code covered by automated tests. Higher code coverage indicates better testability and facilitates easier maintenance by enabling faster detection of defects. A code coverage threshold of 80% or higher is commonly targeted. - Documentation Quality: The comprehensiveness and clarity of documentation accompanying the system, including design documents, technical specifications, and user manuals. Well-written documentation reduces the time and effort required for maintenance tasks. Documentation should cover at least 80% of system functionality with clear explanations and examples. - Dependency Management: The management of external dependencies and libraries used in the system. Proper dependency management reduces the risk of compatibility issues and simplifies maintenance tasks such as updates and patches. 
- Code Churn: The frequency of code changes within a software system. High code churn may indicate instability or frequent updates, making maintenance more challenging. A code churn rate of less than 20% is generally considered acceptable. Observability The ability to measure a system's internal state and performance based on the outputs it generates, such as logs, metrics, and traces. -System Metrics: CPU usage, memory usage, disk I/O, network I/O, and other resource utilization metrics. - Application Metrics: Response times, request rates, error rates, and throughput. - Custom Metrics: Application-specific metrics, such as user sign-ups, or specific business logic indicators. Portability Ability to run the software on different platforms, environments, and devices. - Platform Compatibility: The ability of the software to run on different operating systems (e.g., Windows, macOS, Linux) or platforms (e.g., desktop, mobile, web). Portability requires the software to be compatible with multiple platforms, with a goal of supporting at least three major platforms. - Hardware Compatibility: The ability of the software to run on different hardware configurations, such as varying processor architectures (e.g., x86, ARM) or memory sizes. Portability involves ensuring compatibility with a wide range of hardware configurations, with a goal of supporting common hardware architectures. - File System Independence: The software's ability to operate independently of the underlying file system, ensuring compatibility with different file systems (e.g., NTFS, ext4, APFS). Portability involves using file system abstraction layers or APIs to abstract file system operations and ensure consistency across platforms. - Data Format Compatibility: The software's ability to read and write data in different formats, ensuring compatibility with common data interchange formats (e.g., JSON, XML, CSV). Portability involves supporting standard data formats and providing mechanisms for data conversion and interoperability.","title":"System Maintainability Requirements"},{"location":"design/design-patterns/non-functional-requirements-capture-guide/#user-experience-requirements","text":"Quality Attribute Description Common Metrics Accessibility The solution must be usable by people with disabilities. Compliance with accessibility standards. Support for assistive technologies - Alternative Text for Images: All images and non-text content must have alternative text descriptions that can be read by screen readers. - Color contrast: The application must use color schemes that meet the recommended contrast ratio between foreground and background colors to ensure visibility for users with low vision. - Focus indicators: The application must provide visible focus indicators to highlight the currently focused element, which is especially important for users who rely on keyboard navigation. - Captions and Transcripts: All audio and video content must have captions and transcripts, to ensure that users with hearing impairments can access the content. - Language identification: The application must correctly identify the language of the content, to ensure that screen readers and other assistive technologies can read the content properly. Internationalization and Localization Adaptation of the software for use in different languages and cultures. Tailoring the software to meet the specific needs of different regions or locales. - Language and Locale Support: The software's support for different languages, character sets, and locales. 
Portability requires internationalization and localization efforts to ensure that the software can be used effectively in different regions and cultures, with support for at least five major languages. - Multi currency: The system's support for multiple currencies, allowing different symbols and conversion rates. Usability Intuitiveness, ease of learning, and user satisfaction with the software interface. - Task Completion Time: The average time it takes for users to complete specific tasks. A user must be able to complete an account settings in less than 2 minutes. - Ease of Navigation: The ease with which users can navigate through the system and find the information they need. This can be measured by observing user interactions or conducting usability tests. - User Satisfaction: User satisfaction can be measured using surveys, feedback forms, or satisfaction ratings. A satisfaction score of 70% or higher is typically considered satisfactory. - Learnability: The ease with which new users can learn to use the system. This can be measured by the time it takes for users to perform basic tasks or by conducting usability tests with novice users.","title":"User Experience Requirements"},{"location":"design/design-patterns/object-oriented-design-reference/","text":"Object-Oriented Design Reference When writing software for large projects, the hardest part is often communication and maintenance. Following proven design patterns can optimize for maintenance, readability, and ease of extension. In particular, object-oriented design patterns are well-established in the industry. Please refer to the following resources to create strong object-oriented designs: Design Patterns Wikipedia Object Oriented Design Website","title":"Object-Oriented Design Reference"},{"location":"design/design-patterns/object-oriented-design-reference/#object-oriented-design-reference","text":"When writing software for large projects, the hardest part is often communication and maintenance. Following proven design patterns can optimize for maintenance, readability, and ease of extension. In particular, object-oriented design patterns are well-established in the industry. Please refer to the following resources to create strong object-oriented designs: Design Patterns Wikipedia Object Oriented Design Website","title":"Object-Oriented Design Reference"},{"location":"design/design-patterns/rest-api-design-guidance/","text":"REST API Design Guidance Goals Elevate Microsoft's published REST API design guidelines . Highlight common design decisions and factors to consider when designing. Provide additional resources to inform API design in areas not directly addressed by the Microsoft guidelines. Common API Design Decisions The Microsoft REST API guidelines provide design guidance covering a multitude of use-cases. The following sections are a good place to start as they are likely required considerations for any REST API design: URL Structure HTTP Methods HTTP Status Codes Collections JSON Standardizations Versioning Naming Creating API Contracts As different development teams expose APIs to access various REST based services, it's important to have an API contract to share between the producer and consumers of APIs. The Open API format is one of the most popular API description formats. This Open API document can be produced in two ways: Design-First - Team starts developing APIs by first describing API designs as an Open API document and later generates server side boilerplate code with the help of this document.
Code-First - Team starts writing the server side API interface code e.g. controllers, DTOs etc. and later generates an Open API document from it. Design-First Approach A Design-First approach means that APIs are treated as \"first-class citizens\" and everything about a project revolves around the idea that at the end these APIs will be consumed by clients. So, based on the business requirements, the API development team first starts describing API designs as an Open API document and collaborates with the stakeholders to gather feedback. This approach is quite useful if a project is about developing an externally exposed set of APIs which will be consumed by partners. In this approach, we first agree upon an API contract (Open API document), creating clear expectations on both API producer and consumer sides so both teams can begin work in parallel as per the pre-agreed API design. Key Benefits of this approach: Early API design feedback. Clearly established expectations for both consumer & producer as both have agreed upon an API contract. Development teams can work in parallel. The testing team can use API contracts to write early tests even before business logic is in place. By looking at different models, paths, attributes and other aspects of the API, the testing team can provide input which can be very valuable. During an agile development cycle, API definitions are not impacted by incremental dev changes. API design is not influenced by actual implementation limitations & code structure. Server side boilerplate code e.g. controllers, DTOs etc. can be auto generated from API contracts. May improve collaboration between API producer & consumer teams. Planning a Design-First Development: Identify use cases & key services which the API should offer. Identify key stakeholders of the API and try to include them during the API design phase to get continuous feedback. Write API contract definitions. Maintain consistent style for API status codes, versioning, error responses etc. Encourage peer reviews via pull requests. Generate server side boilerplate code & client SDKs from API contract definitions. Important Points to consider: If API requirements change often during the initial development phase, then a Design-First approach may not be a good fit as this will introduce additional overhead, requiring repeated updates & maintenance to the API contract definitions. It might be worthwhile to first try out your platform specific code generator and evaluate how much additional work will be required in order to meet your project requirements and coding guidelines because it is possible that a particular platform specific code generator might not be able to generate a flexible & maintainable implementation of actual code. For instance, if your web framework requires annotations to be present on your controller classes (e.g. for API versioning or authentication), make sure that the code generation tool you use fully supports them. Microsoft TypeSpec is a valuable tool for developers who are working on complex APIs. By providing reusable patterns it can streamline API development and promote best practices. We have put together some samples about how to enforce an API design-first approach in a GitHub CI/CD pipeline to help accelerate its adoption in a Design-First Development. Code-First Approach A Code-First approach means that development teams first implement server side API interface code, e.g. controllers, DTOs etc., and then generate API contract definitions from it.
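As an illustration of that Code-First flow, here is a minimal sketch assuming FastAPI (one of many frameworks that can emit an Open API document from annotated code; the API title, model, and route are hypothetical), including writing the generated definition to a file so it can be committed to version control and diffed in CI:

```python
import json

from fastapi import FastAPI
from pydantic import BaseModel


class Order(BaseModel):
    id: int
    description: str


app = FastAPI(title="Orders API", version="1.0.0")


@app.get("/orders/{order_id}", response_model=Order)
def get_order(order_id: int) -> Order:
    # The route's path, parameters, and response model drive the generated contract.
    return Order(id=order_id, description="sample")


if __name__ == "__main__":
    # Emit the Open API document generated from the code above so it can be
    # committed to version control and compared against the previous version in CI.
    with open("openapi.json", "w") as spec_file:
        json.dump(app.openapi(), spec_file, indent=2)
```

Generating the document as a build step, rather than only at runtime, also makes it easier to spot unintended contract changes in a pull request.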
This approach is currently more popular within the developer community than the Design-First approach. This approach has the advantages of allowing the team to quickly implement APIs and also providing the flexibility to react very quickly to any unexpected API requirement changes. Key Benefits of this approach: Rapid development of APIs, as the development team can start implementing APIs much faster, directly after understanding the key requirements & use cases. The development team has better control & flexibility to implement server side API interfaces in a way which is best suited to the project structure. More popular among development teams, so it's easier to get consensus on a related topic, and there are more ready-to-use code examples available on various blogs or developer forums regarding how to generate Open API definitions from actual code. During the initial phase of development, where both API producer & consumer requirements might change often, this approach is better as it provides the flexibility to react quickly to such changes. Important Points to consider: A generated Open API definition can become outdated, so it's important to have automated checks to avoid this; otherwise, generated client SDKs will be out of sync and may cause issues for API consumers. With Agile development, it is hard to ensure that definitions embedded in runtime code remain stable, especially across rounds of refactoring and when serving multiple concurrent API versions. It might be useful to regularly generate the Open API definition and store it in a version control system; otherwise, generating the OpenAPI definition at runtime might make things more complex in scenarios where that definition is required at development/CI time. How to Interpret and Apply the Guidelines The API guidelines document includes a section on how to apply the guidelines depending on whether the API is new or existing. In particular, when working in an existing API ecosystem, be sure to align with stakeholders on a definition of what constitutes a breaking change to understand the impact of implementing certain best practices. We do not recommend making a breaking change to a service that predates these guidelines simply for the sake of compliance. Resources Microsoft's Recommended Reading List for REST APIs Documentation - Guidance - REST APIs Detailed HTTP status code definitions Semantic Versioning Other Public API Guidelines Microsoft TypeSpec Microsoft TypeSpec GitHub Workflow samples","title":"REST API Design Guidance"},{"location":"design/design-patterns/rest-api-design-guidance/#rest-api-design-guidance","text":"","title":"REST API Design Guidance"},{"location":"design/design-patterns/rest-api-design-guidance/#goals","text":"Elevate Microsoft's published REST API design guidelines . Highlight common design decisions and factors to consider when designing. Provide additional resources to inform API design in areas not directly addressed by the Microsoft guidelines.","title":"Goals"},{"location":"design/design-patterns/rest-api-design-guidance/#common-api-design-decisions","text":"The Microsoft REST API guidelines provide design guidance covering a multitude of use-cases.
The following sections are a good place to start as they are likely required considerations for any REST API design: URL Structure HTTP Methods HTTP Status Codes Collections JSON Standardizations Versioning Naming","title":"Common API Design Decisions"},{"location":"design/design-patterns/rest-api-design-guidance/#creating-api-contracts","text":"As different development teams expose APIs to access various REST based services, it's important to have an API contract to share between the producer and consumers of APIs. The Open API format is one of the most popular API description formats. This Open API document can be produced in two ways: Design-First - Team starts developing APIs by first describing API designs as an Open API document and later generates server side boilerplate code with the help of this document. Code-First - Team starts writing the server side API interface code e.g. controllers, DTOs etc. and later generates an Open API document from it.","title":"Creating API Contracts"},{"location":"design/design-patterns/rest-api-design-guidance/#design-first-approach","text":"A Design-First approach means that APIs are treated as \"first-class citizens\" and everything about a project revolves around the idea that at the end these APIs will be consumed by clients. So, based on the business requirements, the API development team first starts describing API designs as an Open API document and collaborates with the stakeholders to gather feedback. This approach is quite useful if a project is about developing an externally exposed set of APIs which will be consumed by partners. In this approach, we first agree upon an API contract (Open API document), creating clear expectations on both API producer and consumer sides so both teams can begin work in parallel as per the pre-agreed API design. Key Benefits of this approach: Early API design feedback. Clearly established expectations for both consumer & producer as both have agreed upon an API contract. Development teams can work in parallel. The testing team can use API contracts to write early tests even before business logic is in place. By looking at different models, paths, attributes and other aspects of the API, the testing team can provide input which can be very valuable. During an agile development cycle, API definitions are not impacted by incremental dev changes. API design is not influenced by actual implementation limitations & code structure. Server side boilerplate code e.g. controllers, DTOs etc. can be auto generated from API contracts. May improve collaboration between API producer & consumer teams. Planning a Design-First Development: Identify use cases & key services which the API should offer. Identify key stakeholders of the API and try to include them during the API design phase to get continuous feedback. Write API contract definitions. Maintain consistent style for API status codes, versioning, error responses etc. Encourage peer reviews via pull requests. Generate server side boilerplate code & client SDKs from API contract definitions. Important Points to consider: If API requirements change often during the initial development phase, then a Design-First approach may not be a good fit as this will introduce additional overhead, requiring repeated updates & maintenance to the API contract definitions.
It might be worthwhile to first try out your platform specific code generator and evaluate how much additional work will be required in order to meet your project requirements and coding guidelines, because it is possible that a particular platform specific code generator might not be able to generate a flexible & maintainable implementation of the actual code. For instance, if your web framework requires annotations to be present on your controller classes (e.g. for API versioning or authentication), make sure that the code generation tool you use fully supports them. Microsoft TypeSpec is a valuable tool for developers who are working on complex APIs. By providing reusable patterns, it can streamline API development and promote best practices. We have put together some samples about how to enforce an API design-first approach in a GitHub CI/CD pipeline to help accelerate its adoption in Design-First development.","title":"Design-First Approach"},{"location":"design/design-patterns/rest-api-design-guidance/#code-first-approach","text":"A Code-First approach means that the development team first implements the server side API interface code, e.g. controllers, DTOs etc., and then generates API contract definitions out of it. In current times this approach is more widely popular within the developer community than the Design-First approach. This approach has the advantages of allowing the team to quickly implement APIs and also providing the flexibility to react very quickly to any unexpected API requirement changes. Key Benefits of this approach: Rapid development of APIs, as the development team can start implementing APIs much faster, directly after understanding the key requirements & use cases. The development team has better control & flexibility to implement server side API interfaces in a way which is best suited for the project structure. More popular among development teams, so it's easier to get consensus on a related topic, and more ready-to-use code examples are available on various blogs or developer forums regarding how to generate Open API definitions out of actual code. During the initial phase of development, where both API producer & consumer requirements might change often, this approach is better as it provides the flexibility to react quickly to such changes. Important Points to consider: A generated Open API definition can become outdated, so it's important to have automated checks to avoid this; otherwise generated client SDKs will be out of sync and may cause issues for API consumers. With Agile development, it is hard to ensure that definitions embedded in runtime code remain stable, especially across rounds of refactoring and when serving multiple concurrent API versions. It might be useful to regularly generate the Open API definition and store it in a version control system; otherwise generating the OpenAPI definition at runtime might make it more complex in scenarios where that definition is required at development/CI time.","title":"Code-First Approach"},{"location":"design/design-patterns/rest-api-design-guidance/#how-to-interpret-and-apply-the-guidelines","text":"The API guidelines document includes a section on how to apply the guidelines depending on whether the API is new or existing. In particular, when working in an existing API ecosystem, be sure to align with stakeholders on a definition of what constitutes a breaking change to understand the impact of implementing certain best practices. 
We do not recommend making a breaking change to a service that predates these guidelines simply for the sake of compliance.","title":"How to Interpret and Apply the Guidelines"},{"location":"design/design-patterns/rest-api-design-guidance/#resources","text":"Microsoft's Recommended Reading List for REST APIs Documentation - Guidance - REST APIs Detailed HTTP status code definitions Semantic Versioning Other Public API Guidelines Microsoft TypeSpec Microsoft TypeSpec GitHub Workflow samples","title":"Resources"},{"location":"design/design-reviews/","text":"Design Reviews Goals Reduce technical debt for our customers Continue to iterate on design after Game Plan review Generate useful technical artifacts that can be referenced by Microsoft and customers Measures Cost of Change When incorporating design reviews as part of the engineering process, decisions are front-loaded before implementation begins. Making a decision of using Azure Kubernetes Service instead of App Services at the design phase likely only requires updating documentation. However, making this pivot after implementation has started or after a solution is in use is much more costly. Are these changes occurring before or after implementation? How large of effort are they typically? Reviewer Participation How many individuals participate across the designs created? Cumulatively if this is a larger number this would indicate a wider contribution of ideas and perspectives. A lower number (i.e. same 2 individuals only on every review) might indicate a limited set of perspectives. Is anyone participating from outside the core development team? Time To Potential Solutions How long does it typically take to go from requirements to solution options (multiple)? There is a healthy balancing act between spending too much or too little time evaluating different potential solutions. Too little time puts higher risk of costly changes required after implementation. Too much time delays target value from being delivered; as well as subsequent features in queue. However, the faster the team can identify the most critical information necessary to make an informed decision , the faster value can be provided with lower risk of costly changes down the road. Time to Decisions How long does it take to make a decision on which solution to implement? There is also a healthy balancing act in supporting a healthy debate while not hindering the team's delivery. The ideal case is for a team to quickly digest the solution options presented, ask questions, and debate before finally reaching quorum on a particular approach. In cases where no quorum can be reached, the person with the most context on the problem (typically story owner) should make the final decision. Prioritize delivering value and learning. Disagree and commit! Impact Solutions can be quickly be operated into customer's production environment Easier for other dev crews to leverage your teams work Easier for engineers to ramp up on projects Increase team velocity by front-loading changes and decisions when they cost the least Increased team engagement and transparency by soliciting wide reviewer participation Participation Dev Crew The dev crew should always participate in all design review sessions Domain Experts Domain experts should participate in design review sessions as needed ISE Tech Domains Customer subject-matter experts (SME) Senior Leadership Facilitation Guidance Recipes Please see our Design Review Recipes for guidance on design process. 
Sync Design Reviews via In-Person / Virtual Meetings Joint meetings with dev crew, subject-matter experts (SMEs) and customer engineers Async Design Reviews via Pull-Requests See the async design review recipe for guidance on facilitating async design reviews. This can be useful for teams that are geographically distributed across different time-zones. Technical Spike A technical spike is most often used for evaluating the impact new technology has on the current implementation. Please read more here . Design Documentation Document and update the architecture design in the project design documentation Track and document design decisions in a decision log Document decision process in trade studies when multiple solutions exist for the given problem Early on in engagements, the team must decide where to land artifacts generated from design reviews. Typically, we meet the customer where they are at (for example, using their Confluence instance to land documentation if that is their preferred process). However, similar to storing decision logs, trade studies, etc. in the development repo, there are also large benefits to maintaining design review artifacts in the repo as well. Usually these artifacts can be further added to root level documentation directory or even at the root of the corresponding project if the repo is monolithic. In adding them to the project repo, these artifacts must similarly be reviewed in Pull Requests (typically preceding but sometimes accompanying implementation) which allows async review/discussion. Furthermore, artifacts can then easily link to other sections of the repo and source code files (via markdown links ).","title":"Design Reviews"},{"location":"design/design-reviews/#design-reviews","text":"","title":"Design Reviews"},{"location":"design/design-reviews/#goals","text":"Reduce technical debt for our customers Continue to iterate on design after Game Plan review Generate useful technical artifacts that can be referenced by Microsoft and customers","title":"Goals"},{"location":"design/design-reviews/#measures","text":"","title":"Measures"},{"location":"design/design-reviews/#cost-of-change","text":"When incorporating design reviews as part of the engineering process, decisions are front-loaded before implementation begins. Making a decision of using Azure Kubernetes Service instead of App Services at the design phase likely only requires updating documentation. However, making this pivot after implementation has started or after a solution is in use is much more costly. Are these changes occurring before or after implementation? How large of effort are they typically?","title":"Cost of Change"},{"location":"design/design-reviews/#reviewer-participation","text":"How many individuals participate across the designs created? Cumulatively if this is a larger number this would indicate a wider contribution of ideas and perspectives. A lower number (i.e. same 2 individuals only on every review) might indicate a limited set of perspectives. Is anyone participating from outside the core development team?","title":"Reviewer Participation"},{"location":"design/design-reviews/#time-to-potential-solutions","text":"How long does it typically take to go from requirements to solution options (multiple)? There is a healthy balancing act between spending too much or too little time evaluating different potential solutions. Too little time puts higher risk of costly changes required after implementation. 
Too much time delays target value from being delivered; as well as subsequent features in queue. However, the faster the team can identify the most critical information necessary to make an informed decision , the faster value can be provided with lower risk of costly changes down the road.","title":"Time To Potential Solutions"},{"location":"design/design-reviews/#time-to-decisions","text":"How long does it take to make a decision on which solution to implement? There is also a healthy balancing act in supporting a healthy debate while not hindering the team's delivery. The ideal case is for a team to quickly digest the solution options presented, ask questions, and debate before finally reaching quorum on a particular approach. In cases where no quorum can be reached, the person with the most context on the problem (typically story owner) should make the final decision. Prioritize delivering value and learning. Disagree and commit!","title":"Time to Decisions"},{"location":"design/design-reviews/#impact","text":"Solutions can be quickly be operated into customer's production environment Easier for other dev crews to leverage your teams work Easier for engineers to ramp up on projects Increase team velocity by front-loading changes and decisions when they cost the least Increased team engagement and transparency by soliciting wide reviewer participation","title":"Impact"},{"location":"design/design-reviews/#participation","text":"","title":"Participation"},{"location":"design/design-reviews/#dev-crew","text":"The dev crew should always participate in all design review sessions","title":"Dev Crew"},{"location":"design/design-reviews/#domain-experts","text":"Domain experts should participate in design review sessions as needed ISE Tech Domains Customer subject-matter experts (SME) Senior Leadership","title":"Domain Experts"},{"location":"design/design-reviews/#facilitation-guidance","text":"","title":"Facilitation Guidance"},{"location":"design/design-reviews/#recipes","text":"Please see our Design Review Recipes for guidance on design process.","title":"Recipes"},{"location":"design/design-reviews/#sync-design-reviews-via-in-person-virtual-meetings","text":"Joint meetings with dev crew, subject-matter experts (SMEs) and customer engineers","title":"Sync Design Reviews via In-Person / Virtual Meetings"},{"location":"design/design-reviews/#async-design-reviews-via-pull-requests","text":"See the async design review recipe for guidance on facilitating async design reviews. This can be useful for teams that are geographically distributed across different time-zones.","title":"Async Design Reviews via Pull-Requests"},{"location":"design/design-reviews/#technical-spike","text":"A technical spike is most often used for evaluating the impact new technology has on the current implementation. Please read more here .","title":"Technical Spike"},{"location":"design/design-reviews/#design-documentation","text":"Document and update the architecture design in the project design documentation Track and document design decisions in a decision log Document decision process in trade studies when multiple solutions exist for the given problem Early on in engagements, the team must decide where to land artifacts generated from design reviews. Typically, we meet the customer where they are at (for example, using their Confluence instance to land documentation if that is their preferred process). However, similar to storing decision logs, trade studies, etc. 
in the development repo, there are also large benefits to maintaining design review artifacts in the repo as well. Usually these artifacts can be further added to root level documentation directory or even at the root of the corresponding project if the repo is monolithic. In adding them to the project repo, these artifacts must similarly be reviewed in Pull Requests (typically preceding but sometimes accompanying implementation) which allows async review/discussion. Furthermore, artifacts can then easily link to other sections of the repo and source code files (via markdown links ).","title":"Design Documentation"},{"location":"design/design-reviews/decision-log/","text":"Design Decision Log Not all requirements can be captured in the beginning of an agile project during one or more design sessions. The initial architecture design can evolve or change during the project, especially if there are multiple possible technology choices that can be made. Tracking these changes within a large document is in most cases not ideal, as one can lose oversight over the design changes made at which point in time. Having to scan through a large document to find a specific content takes time, and in many cases the consequences of a decision is not documented. Why is it Important to Track Design Decisions Tracking an architecture design decision can have many advantages: Developers and project stakeholders can see the decision log and track the changes, even as the team composition changes over time. The log is kept up-to-date. The context of a decision including the consequences for the team are documented with the decision. It is easier to find the design decision in a log than having to read a large document. What is a Recommended Format for Tracking Decisions In addition to incorporating a design decision as an update of the overall design documentation of the project, the decisions could be tracked as Architecture Decision Records as Michael Nygard proposed in his blog. The effort invested in design reviews and discussions can be different throughout the course of a project. Sometimes decisions are made quickly without having to go into a detailed comparison of competing technologies. In some cases, it is necessary to have a more elaborate study of advantages and disadvantages, as is described in the documentation of Trade Studies . In other cases, it can be helpful to conduct Engineering Feasibility Spikes . An ADR can incorporate each of these different approaches. Architecture Decision Record (ADR) An architecture decision record has the structure [Ascending number]. [Title of decision] The title should give the reader the information on what was decided upon. Example: 001. App level logging with Serilog and Application Insights Hint: When several developers regularly start ADRs in parallel, it becomes difficult to deal with conflicting ascending numbers. An easy way to overcome this is to give ADRs the ID of the work item they relate to. Date: The date the decision was made. Status: [Proposed/Accepted/Deprecated/Superseded] A proposed design can be reviewed by the development team prior to accepting it. A previous decision can be superseded by a new one, or the ADR record marked as deprecated in case it is not valid anymore. Context: The text should provide the reader an understanding of the problem, or as Michael Nygard puts it, a value-neutral [an objective] description of the forces at play. 
Example: Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics. If the development team had a data-driven approach to back the decision, i.e., a study that evaluates the potential choices against a set of objective criteria by following the guidance in Trade Studies , the study should be referred to in this section. Decision: The decision made, it should begin with 'We will...' or 'We have agreed to ... Example: We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis. Consequences: The resulting context, after having applied the decision. Example: Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling. Where to Store ADRs ADRs can be stored and tracked in any version control system such as git. As a recommended practice, ADRs can be added as pull request in the proposed status to be discussed by the team until it is updated to accepted to be merged with the main branch. They are usually stored in a folder structure doc/adr or doc/arch . Additionally, it can be useful to track ADRs in a decision-log.md to provide useful metadata in an obvious format. Decision Logs A decision log is a Markdown file containing a table which provides executive summaries of the decisions contained in ADRs, as well as some other metadata. You can see a template table at doc/decision-log.md . When to Track ADRs Architecture design decisions are usually tracked whenever significant decisions are made that affect the structure and characteristics of the solution or framework we are building. ADRs can also be used to document results of spikes when evaluating different technology choices. Examples of ADRs The first ADR could be the decision to use ADRs to track design decisions, 0001-record-architecture-decisions.md , followed by actual decisions in the engagement as in the example used above, 0002-app-level-logging.md .","title":"Design Decision Log"},{"location":"design/design-reviews/decision-log/#design-decision-log","text":"Not all requirements can be captured in the beginning of an agile project during one or more design sessions. The initial architecture design can evolve or change during the project, especially if there are multiple possible technology choices that can be made. Tracking these changes within a large document is in most cases not ideal, as one can lose oversight over the design changes made at which point in time. 
Having to scan through a large document to find a specific content takes time, and in many cases the consequences of a decision is not documented.","title":"Design Decision Log"},{"location":"design/design-reviews/decision-log/#why-is-it-important-to-track-design-decisions","text":"Tracking an architecture design decision can have many advantages: Developers and project stakeholders can see the decision log and track the changes, even as the team composition changes over time. The log is kept up-to-date. The context of a decision including the consequences for the team are documented with the decision. It is easier to find the design decision in a log than having to read a large document.","title":"Why is it Important to Track Design Decisions"},{"location":"design/design-reviews/decision-log/#what-is-a-recommended-format-for-tracking-decisions","text":"In addition to incorporating a design decision as an update of the overall design documentation of the project, the decisions could be tracked as Architecture Decision Records as Michael Nygard proposed in his blog. The effort invested in design reviews and discussions can be different throughout the course of a project. Sometimes decisions are made quickly without having to go into a detailed comparison of competing technologies. In some cases, it is necessary to have a more elaborate study of advantages and disadvantages, as is described in the documentation of Trade Studies . In other cases, it can be helpful to conduct Engineering Feasibility Spikes . An ADR can incorporate each of these different approaches.","title":"What is a Recommended Format for Tracking Decisions"},{"location":"design/design-reviews/decision-log/#architecture-decision-record-adr","text":"An architecture decision record has the structure [Ascending number]. [Title of decision] The title should give the reader the information on what was decided upon. Example: 001. App level logging with Serilog and Application Insights Hint: When several developers regularly start ADRs in parallel, it becomes difficult to deal with conflicting ascending numbers. An easy way to overcome this is to give ADRs the ID of the work item they relate to. Date: The date the decision was made. Status: [Proposed/Accepted/Deprecated/Superseded] A proposed design can be reviewed by the development team prior to accepting it. A previous decision can be superseded by a new one, or the ADR record marked as deprecated in case it is not valid anymore. Context: The text should provide the reader an understanding of the problem, or as Michael Nygard puts it, a value-neutral [an objective] description of the forces at play. Example: Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics. If the development team had a data-driven approach to back the decision, i.e., a study that evaluates the potential choices against a set of objective criteria by following the guidance in Trade Studies , the study should be referred to in this section. Decision: The decision made, it should begin with 'We will...' or 'We have agreed to ... 
Example: We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis. Consequences: The resulting context, after having applied the decision. Example: Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling.","title":"Architecture Decision Record (ADR)"},{"location":"design/design-reviews/decision-log/#where-to-store-adrs","text":"ADRs can be stored and tracked in any version control system such as git. As a recommended practice, ADRs can be added as pull request in the proposed status to be discussed by the team until it is updated to accepted to be merged with the main branch. They are usually stored in a folder structure doc/adr or doc/arch . Additionally, it can be useful to track ADRs in a decision-log.md to provide useful metadata in an obvious format.","title":"Where to Store ADRs"},{"location":"design/design-reviews/decision-log/#decision-logs","text":"A decision log is a Markdown file containing a table which provides executive summaries of the decisions contained in ADRs, as well as some other metadata. You can see a template table at doc/decision-log.md .","title":"Decision Logs"},{"location":"design/design-reviews/decision-log/#when-to-track-adrs","text":"Architecture design decisions are usually tracked whenever significant decisions are made that affect the structure and characteristics of the solution or framework we are building. ADRs can also be used to document results of spikes when evaluating different technology choices.","title":"When to Track ADRs"},{"location":"design/design-reviews/decision-log/#examples-of-adrs","text":"The first ADR could be the decision to use ADRs to track design decisions, 0001-record-architecture-decisions.md , followed by actual decisions in the engagement as in the example used above, 0002-app-level-logging.md .","title":"Examples of ADRs"},{"location":"design/design-reviews/decision-log/doc/decision-log/","text":"Decision Log This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required A one-sentence summary of the decision made. Date the decision was made. A list of the other approaches considered. A two to three sentence summary of why the decision was made. A link to the ADR with the format [Title] DR. Who made this decision? A link to the work item for the linked ADR.","title":"Decision Log"},{"location":"design/design-reviews/decision-log/doc/decision-log/#decision-log","text":"This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required A one-sentence summary of the decision made. Date the decision was made. A list of the other approaches considered. A two to three sentence summary of why the decision was made. A link to the ADR with the format [Title] DR. Who made this decision? 
A link to the work item for the linked ADR.","title":"Decision Log"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/","text":"1. Record architecture decisions Date: 2020-03-20 Status Accepted Context We need to record the architectural decisions made on this project. Decision We will use Architecture Decision Records, as described by Michael Nygard . Consequences See Michael Nygard's article, linked above. For a lightweight ADR tool set, see Nat Pryce's adr-tools .","title":"1. Record architecture decisions"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#1-record-architecture-decisions","text":"Date: 2020-03-20","title":"1. Record architecture decisions"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#status","text":"Accepted","title":"Status"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#context","text":"We need to record the architectural decisions made on this project.","title":"Context"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#decision","text":"We will use Architecture Decision Records, as described by Michael Nygard .","title":"Decision"},{"location":"design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/#consequences","text":"See Michael Nygard's article, linked above. For a lightweight ADR tool set, see Nat Pryce's adr-tools .","title":"Consequences"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/","text":"2. App-level Logging with Serilog and Application Insights Date: 2020-04-08 Status Accepted Context Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics. Decision We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis. Consequences Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling.","title":"2. App-level Logging with Serilog and Application Insights"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#2-app-level-logging-with-serilog-and-application-insights","text":"Date: 2020-04-08","title":"2. App-level Logging with Serilog and Application Insights"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#status","text":"Accepted","title":"Status"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#context","text":"Due to the microservices design of the platform, we need to ensure consistency of logging throughout each service so tracking of usage, performance, errors etc. can be performed end-to-end. 
A single logging/monitoring framework should be used where possible to achieve this, whilst allowing the flexibility for integration/export into other tools at a later stage. The developers should be equipped with a simple interface to log messages and metrics.","title":"Context"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#decision","text":"We have agreed to utilize Serilog as the Dotnet Logging framework of choice at the application level, with integration into Log Analytics and Application Insights for analysis.","title":"Decision"},{"location":"design/design-reviews/decision-log/doc/adr/0002-app-level-logging/#consequences","text":"Sampling will need to be configured in Application Insights so that it does not become overly-expensive when ingesting millions of messages, but also does not prevent capture of essential information. The team will need to only log what is agreed to be essential for monitoring as part of design reviews, to reduce noise and unnecessary levels of sampling.","title":"Consequences"},{"location":"design/design-reviews/decision-log/examples/memory/","text":"Memory These examples were taken from the Memory project, an internal tool for tracking an individual's accomplishments. The main example here is the Decision Log . Since this log was used from the start, the decisions are mostly based on technology choices made in the start of the project. All line items have a link out to the trade studies done for each technology choice.","title":"Memory"},{"location":"design/design-reviews/decision-log/examples/memory/#memory","text":"These examples were taken from the Memory project, an internal tool for tracking an individual's accomplishments. The main example here is the Decision Log . Since this log was used from the start, the decisions are mostly based on technology choices made in the start of the project. All line items have a link out to the trade studies done for each technology choice.","title":"Memory"},{"location":"design/design-reviews/decision-log/examples/memory/decision-log/","text":"Decision Log This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required Use Architecture Decision Records 01/25/2021 Standard Design Docs An easy and low cost solution of tracking architecture decisions over the lifetime of a project Record Architecture Decisions Dev Team #21654 Use ArgoCD 01/26/2021 FluxCD ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. So we have decided at this point to go with ArgoCD GitOps Trade Study Dev Team #21672 Use Helm 01/28/2021 Kustomize, Kubes, Gitkube, Draft Platform maturity, templating, ArgoCD support K8s Package Manager Trade Study Dev Team #21674 Use CosmosDB 01/29/2021 Blob Storage, CosmosDB, SQL Server, Neo4j, JanusGraph, ArangoDB CosmosDB has better Azure integration, managed identity, and the Gremlin API is powerful. 
Graph Storage Trade Study and Decision Dev Team #21650 Use Azure Traffic Manager 02/02/2021 Azure Front Door A lightweight solution to route traffic between multiple k8s regional clusters Routing Trade Study Dev Team #21673 Use Linkerd + Contour 02/02/2021 Istio, Consul, Ambassador, Traefik A CNCF backed cloud native k8s stack to deliver service mesh, API gateway and ingress Routing Trade Study Dev Team #21673 Use ARM Templates 02/02/2021 Terraform, Pulumi, Az CLI Azure Native, Az Monitoring and incremental updates support Automated Deployment Trade Study Dev Team #21651 Use 99designs/gqlgen 02/04/2021 graphql, graphql-go, thunder Type safety, auto-registration and code generation GraphQL Golang Trade Study Dev Team #21775 Create normalized role data model 03/25/2021 Career Stage Profiles (CSP), Microsoft Role Library Requires a data model that support the data requirements of both role systems Role Data Model Schema Dev Team #22035 Design for edges and vertices 03/25/2021 N/A N/A Data Model Dev Team #21976 Use grammes 03/29/2021 Gremlin, gremgo, gremcos Balance of documentation and maturity Gremlin API library Trade Study Dev Team #21870 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Expose 1:1 data model from API to DB 04/02/2021 Exposing a minified version of data model contract Team decided that there were no pieces of data that we can rule out as being useful. Will update if data model becomes too complex API README Dev Team #21658 Deprecate SonarCloud 04/05/2021 Checkstyle, PMD, FindBugs Requires paid plan to use in a private repo Code Quality & Security Dev Team #22090 Adopted Stable Tagging Strategy 04/08/2021 N/A Team aligned on the proposed docker container tagging strategy Tagging Strategy Dev Team #22005","title":"Decision Log"},{"location":"design/design-reviews/decision-log/examples/memory/decision-log/#decision-log","text":"This document is used to track key decisions that are made during the course of the project. This can be used at a later stage to understand why decisions were made and by whom. Decision Date Alternatives Considered Reasoning Detailed doc Made By Work Required Use Architecture Decision Records 01/25/2021 Standard Design Docs An easy and low cost solution of tracking architecture decisions over the lifetime of a project Record Architecture Decisions Dev Team #21654 Use ArgoCD 01/26/2021 FluxCD ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. So we have decided at this point to go with ArgoCD GitOps Trade Study Dev Team #21672 Use Helm 01/28/2021 Kustomize, Kubes, Gitkube, Draft Platform maturity, templating, ArgoCD support K8s Package Manager Trade Study Dev Team #21674 Use CosmosDB 01/29/2021 Blob Storage, CosmosDB, SQL Server, Neo4j, JanusGraph, ArangoDB CosmosDB has better Azure integration, managed identity, and the Gremlin API is powerful. 
Graph Storage Trade Study and Decision Dev Team #21650 Use Azure Traffic Manager 02/02/2021 Azure Front Door A lightweight solution to route traffic between multiple k8s regional clusters Routing Trade Study Dev Team #21673 Use Linkerd + Contour 02/02/2021 Istio, Consul, Ambassador, Traefik A CNCF backed cloud native k8s stack to deliver service mesh, API gateway and ingress Routing Trade Study Dev Team #21673 Use ARM Templates 02/02/2021 Terraform, Pulumi, Az CLI Azure Native, Az Monitoring and incremental updates support Automated Deployment Trade Study Dev Team #21651 Use 99designs/gqlgen 02/04/2021 graphql, graphql-go, thunder Type safety, auto-registration and code generation GraphQL Golang Trade Study Dev Team #21775 Create normalized role data model 03/25/2021 Career Stage Profiles (CSP), Microsoft Role Library Requires a data model that support the data requirements of both role systems Role Data Model Schema Dev Team #22035 Design for edges and vertices 03/25/2021 N/A N/A Data Model Dev Team #21976 Use grammes 03/29/2021 Gremlin, gremgo, gremcos Balance of documentation and maturity Gremlin API library Trade Study Dev Team #21870 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Design for Gremlin implementation 04/02/2021 N/A N/A Gremlin Dev Team #21980 Expose 1:1 data model from API to DB 04/02/2021 Exposing a minified version of data model contract Team decided that there were no pieces of data that we can rule out as being useful. Will update if data model becomes too complex API README Dev Team #21658 Deprecate SonarCloud 04/05/2021 Checkstyle, PMD, FindBugs Requires paid plan to use in a private repo Code Quality & Security Dev Team #22090 Adopted Stable Tagging Strategy 04/08/2021 N/A Team aligned on the proposed docker container tagging strategy Tagging Strategy Dev Team #22005","title":"Decision Log"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/","text":"Graph Model Graph Vertices and Edges The set of vertices (entities) and edges (relationships) of the graph model Vertex (Source) Edge Type Relationship Type Vertex (Target) Notes Required Profession Applies 1:many Discipline Top most level of categorization * Discipline Defines 1:many Role Groups of related roles within a profession * AppliedBy 1:1 Profession 1 Role Requires 1:many Responsibility Individual role mapped to an employee 1+ Requires 1:many Competency 1+ RequiredBy 1:1 Discipline 1 Succeeds 1:1 Role Supports career progression between roles 1 Precedes 1:1 Role Supports career progression between roles 1 AssignedTo 1:many User Profile * Responsibility Expects 1:many Key Result A group of expected outcomes and key results for employees within a role 1+ ExpectedBy 1:1 Role 1 Competency Describes 1:many Behavior A set of behaviors that contribute to success 1+ DescribedBy 1:1 Role 1 Key Result ExpectedBy 1:1 Responsibility The expected outcome of performing a responsibility 1 Behavior ContributesTo 1:1 Competency The way in which one acts or conducts oneself 1 User Profile Fulfills many:1 Role 1+ Authors 1:many Entry * Reads many:many Entry * Entry SharedWith many:many User Profile Business logic should add manager to this list by default. These users should only have read access. 
* Demonstrates many:many Competency * Demonstrates many:many Behavior * Demonstrates many:many Responsibility * Demonstrates many:many Result * AuthoredBy many:1 UserProfile 1+ DiscussedBy 1:many Commentary * References many:many Artifact * Competency DemonstratedBy many:many Entry * Behavior DemonstratedBy many:many Entry * Responsibility DemonstratedBy many:many Entry * Result DemonstratedBy many:many Entry * Commentary Discusses many:1 Entry * Artifact ReferencedBy many:many Entry 1+ Graph Properties The full set of data properties available on each vertex and edge Vertex/Edge Property Data Type Notes Required (Any) ID guid 1 Profession Title String 1 Description String 0 Discipline Title String 1 Description String 0 Role Title String 1 Description String 0 Level Band String SDE, SDE II, Senior, etc 1 Responsibility Title String 1 Description String 0 Competency Title String 1 Description String 0 Key Result Description String 1 Behavior Description String 1 User Profile Theme selection string there are only 2: dark, light 1 PersonaId guid[] there are only 2: User, Admin 1+ UserId guid Points to AAD object 1 DeploymentRing string[] Is used to deploy new versions 1 Project string[] list of user created projects * Entry Title string 1 DateCreated date 1 ReadyToShare boolean false if draft 1 AreaOfImpact string[] 3 options: self, contribute to others, leverage others * Commentary Data string 1 DateCreated date 1 Artifact Data string 1 DateCreated date 1 ArtifactType string describes the artifact type: markdown, blob link 1 Vertex Descriptions Profession Top most level of categorization { \"title\" : \"Software Engineering\" , \"description\" : \"Description of profession\" , \"disciplines\" : [] } Discipline Groups of related roles within a profession { \"title\" : \"Site Reliability Engineering\" , \"description\" : \"Description of discipline\" , \"roles\" : [] } Role Individual role mapped to an employee { \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [], \"competencies\" : [] } Responsibility A group of expected outcomes and key results for employees within a role { \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [] } Competency A set of behaviors that contribute to success { \"title\" : \"Adaptability\" , \"behaviors\" : [] } Key Result The expected outcome of performing a responsibility { \"description\" : \"Develops a foundational understanding of distributed systems design...\" } Behavior The way in which one acts or conducts oneself { \"description\" : \"Actively seeks information and tests assumptions.\" } User The user object refers to whom a person is. We do not store our own rather use Azure OIDs. User Profile The user profile contains any user settings and edges specific to Memory. Persona A user may hold multiple personas. Entry The same entry object can hold many kinds of data, and at this stage of the project we decide that we will not store external data, so it's up to the user to provide a link to the data for a reader to click into and get redirected to a new tab to open. Note: This means that in the web app, we will need to ensure links are opened in new tabs. Project Projects are just string fields to represent what a user wants to group their entries under. Area of Impact This refers to the 3 areas of impact in the venn-style diagram in the HR tool. The options are: self, contributing to impact of others, building on others' work. 
Commentary A comment is essentially a piece of text. However, anyone that an entry is shared with can add commentary on an entry. Artifact The artifact object contains the relevant data as markdown, or a link to the relevant data. Full Role JSON Example { \"id\" : \"abc123\" , \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [ { \"id\" : \"abc123\" , \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [ { \"description\" : \"Develops a foundational understanding of distributed systems design...\" }, { \"description\" : \"Develops an understanding of the code, features, and operations of specific products...\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Contributions to Development and Design\" , \"results\" : [ { \"description\" : \"Develops and tests basic changes to optimize code...\" }, { \"description\" : \"Supports ongoing engagements with product engineering teams...\" } ] } ], \"competencies\" : [ { \"id\" : \"abc123\" , \"title\" : \"Adaptability\" , \"behaviors\" : [ { \"description\" : \"Actively seeks information and tests assumptions.\" }, { \"description\" : \"Shifts his or her approach in response to the demands of a changing situation.\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Collaboration\" , \"behaviors\" : [ { \"description\" : \"Removes barriers by working with others around a shared need or customer benefit.\" }, { \"description\" : \" Incorporates diverse perspectives to thoroughly address complex business issues.\" } ] } ] } API Data Model Because there is no internal edges or vertices that need to be hidden from API consumers, the API will expose a 1:1 mapping of the current data model for consumption. This is subject to change if our data model becomes too complex for downstream users.","title":"Graph Model"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#graph-model","text":"","title":"Graph Model"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#graph-vertices-and-edges","text":"The set of vertices (entities) and edges (relationships) of the graph model Vertex (Source) Edge Type Relationship Type Vertex (Target) Notes Required Profession Applies 1:many Discipline Top most level of categorization * Discipline Defines 1:many Role Groups of related roles within a profession * AppliedBy 1:1 Profession 1 Role Requires 1:many Responsibility Individual role mapped to an employee 1+ Requires 1:many Competency 1+ RequiredBy 1:1 Discipline 1 Succeeds 1:1 Role Supports career progression between roles 1 Precedes 1:1 Role Supports career progression between roles 1 AssignedTo 1:many User Profile * Responsibility Expects 1:many Key Result A group of expected outcomes and key results for employees within a role 1+ ExpectedBy 1:1 Role 1 Competency Describes 1:many Behavior A set of behaviors that contribute to success 1+ DescribedBy 1:1 Role 1 Key Result ExpectedBy 1:1 Responsibility The expected outcome of performing a responsibility 1 Behavior ContributesTo 1:1 Competency The way in which one acts or conducts oneself 1 User Profile Fulfills many:1 Role 1+ Authors 1:many Entry * Reads many:many Entry * Entry SharedWith many:many User Profile Business logic should add manager to this list by default. These users should only have read access. 
* Demonstrates many:many Competency * Demonstrates many:many Behavior * Demonstrates many:many Responsibility * Demonstrates many:many Result * AuthoredBy many:1 UserProfile 1+ DiscussedBy 1:many Commentary * References many:many Artifact * Competency DemonstratedBy many:many Entry * Behavior DemonstratedBy many:many Entry * Responsibility DemonstratedBy many:many Entry * Result DemonstratedBy many:many Entry * Commentary Discusses many:1 Entry * Artifact ReferencedBy many:many Entry 1+","title":"Graph Vertices and Edges"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#graph-properties","text":"The full set of data properties available on each vertex and edge Vertex/Edge Property Data Type Notes Required (Any) ID guid 1 Profession Title String 1 Description String 0 Discipline Title String 1 Description String 0 Role Title String 1 Description String 0 Level Band String SDE, SDE II, Senior, etc 1 Responsibility Title String 1 Description String 0 Competency Title String 1 Description String 0 Key Result Description String 1 Behavior Description String 1 User Profile Theme selection string there are only 2: dark, light 1 PersonaId guid[] there are only 2: User, Admin 1+ UserId guid Points to AAD object 1 DeploymentRing string[] Is used to deploy new versions 1 Project string[] list of user created projects * Entry Title string 1 DateCreated date 1 ReadyToShare boolean false if draft 1 AreaOfImpact string[] 3 options: self, contribute to others, leverage others * Commentary Data string 1 DateCreated date 1 Artifact Data string 1 DateCreated date 1 ArtifactType string describes the artifact type: markdown, blob link 1","title":"Graph Properties"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#vertex-descriptions","text":"","title":"Vertex Descriptions"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#profession","text":"Top most level of categorization { \"title\" : \"Software Engineering\" , \"description\" : \"Description of profession\" , \"disciplines\" : [] }","title":"Profession"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#discipline","text":"Groups of related roles within a profession { \"title\" : \"Site Reliability Engineering\" , \"description\" : \"Description of discipline\" , \"roles\" : [] }","title":"Discipline"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#role","text":"Individual role mapped to an employee { \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [], \"competencies\" : [] }","title":"Role"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#responsibility","text":"A group of expected outcomes and key results for employees within a role { \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [] }","title":"Responsibility"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#competency","text":"A set of behaviors that contribute to success { \"title\" : \"Adaptability\" , \"behaviors\" : [] }","title":"Competency"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#key-result","text":"The expected outcome of performing a responsibility { \"description\" : \"Develops a foundational understanding of distributed systems design...\" 
}","title":"Key Result"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#behavior","text":"The way in which one acts or conducts oneself { \"description\" : \"Actively seeks information and tests assumptions.\" }","title":"Behavior"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#user","text":"The user object refers to whom a person is. We do not store our own rather use Azure OIDs.","title":"User"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#user-profile","text":"The user profile contains any user settings and edges specific to Memory.","title":"User Profile"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#persona","text":"A user may hold multiple personas.","title":"Persona"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#entry","text":"The same entry object can hold many kinds of data, and at this stage of the project we decide that we will not store external data, so it's up to the user to provide a link to the data for a reader to click into and get redirected to a new tab to open. Note: This means that in the web app, we will need to ensure links are opened in new tabs.","title":"Entry"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#project","text":"Projects are just string fields to represent what a user wants to group their entries under.","title":"Project"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#area-of-impact","text":"This refers to the 3 areas of impact in the venn-style diagram in the HR tool. The options are: self, contributing to impact of others, building on others' work.","title":"Area of Impact"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#commentary","text":"A comment is essentially a piece of text. 
However, anyone that an entry is shared with can add commentary on an entry.","title":"Commentary"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#artifact","text":"The artifact object contains the relevant data as markdown, or a link to the relevant data.","title":"Artifact"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#full-role-json-example","text":"{ \"id\" : \"abc123\" , \"title\" : \"Site Reliability Engineering IC2\" , \"description\" : \"Detailed description of role\" , \"responsibilities\" : [ { \"id\" : \"abc123\" , \"title\" : \"Technical Knowledge and Domain Specific Expertise\" , \"results\" : [ { \"description\" : \"Develops a foundational understanding of distributed systems design...\" }, { \"description\" : \"Develops an understanding of the code, features, and operations of specific products...\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Contributions to Development and Design\" , \"results\" : [ { \"description\" : \"Develops and tests basic changes to optimize code...\" }, { \"description\" : \"Supports ongoing engagements with product engineering teams...\" } ] } ], \"competencies\" : [ { \"id\" : \"abc123\" , \"title\" : \"Adaptability\" , \"behaviors\" : [ { \"description\" : \"Actively seeks information and tests assumptions.\" }, { \"description\" : \"Shifts his or her approach in response to the demands of a changing situation.\" } ] }, { \"id\" : \"abc123\" , \"title\" : \"Collaboration\" , \"behaviors\" : [ { \"description\" : \"Removes barriers by working with others around a shared need or customer benefit.\" }, { \"description\" : \" Incorporates diverse perspectives to thoroughly address complex business issues.\" } ] } ] }","title":"Full Role JSON Example"},{"location":"design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/#api-data-model","text":"Because there is no internal edges or vertices that need to be hidden from API consumers, the API will expose a 1:1 mapping of the current data model for consumption. This is subject to change if our data model becomes too complex for downstream users.","title":"API Data Model"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/","text":"Application Deployment The Memory application leverages Azure DevOps for work item tracking as well as continuous integration (CI) and continuous deployment (CD). Environments The Memory project uses multiple environments to isolate and test changes before promoting releases to the global user base. New environment rollouts are automatically triggered based upon a successful deployment of the previous stage /environment. The development , staging and production environments leverage slot deployment during an environment rollout. After a new release is deployed to a staging slot, it is validated through a series of functional integration tests. Upon a 100% pass rate of all tests the staging & production slots are swapped effectively making updates to the environment available. Any errors or failed tests halt the deployment in the current stage and prevent changes to further environments. Each deployed environment is completely isolated and does not share any components. They each have unique resource instances of Azure Traffic Manager, Cosmos DB, etc. 
Deployment Dependencies Development Staging Production CI Quality Gates Development Staging Manual Approval Local The local environment is used by individual software engineers during the development of new features and components. Engineers leverage some components from the deployed development environment that are not available on certain platforms or are unable to run locally. CosmosDB (Emulator only exists for Windows) The local environment also does not use Azure Traffic Manager. The frontend web app directly communicates to the backend REST API typically running on a separate localhost port mapping. Development The development environment is used as the first quality gate. All code that is checked into the main branch is automatically deployed to this environment after all CI quality gates have passed. Dev Regions West US (westus) Staging The staging environment is used to validate new features, components and other changes prior to production rollout. This environment is primarily used by developers, QA and other company stakeholders. Staging Regions West US (westus) East US (eastus) Production The production environment is used by the worldwide user base. Changes to this environment are gated by manual approval by your product's leadership team in addition to other automatic quality gates. Production Regions West US (westus) Central US (centralus) East US (eastus) Environment Variable Group Infrastructure Setup (memory-common) appName businessUnit serviceConnection subscriptionId Development Setup (memory-dev) environmentName (placeholder) Staging Setup (memory-staging) environmentName (placeholder) Production Setup (memory-prod) environmentName (placeholder)","title":"Application Deployment"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#application-deployment","text":"The Memory application leverages Azure DevOps for work item tracking as well as continuous integration (CI) and continuous deployment (CD).","title":"Application Deployment"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#environments","text":"The Memory project uses multiple environments to isolate and test changes before promoting releases to the global user base. New environment rollouts are automatically triggered based upon a successful deployment of the previous stage /environment. The development , staging and production environments leverage slot deployment during an environment rollout. After a new release is deployed to a staging slot, it is validated through a series of functional integration tests. Upon a 100% pass rate of all tests the staging & production slots are swapped effectively making updates to the environment available. Any errors or failed tests halt the deployment in the current stage and prevent changes to further environments. Each deployed environment is completely isolated and does not share any components. They each have unique resource instances of Azure Traffic Manager, Cosmos DB, etc.","title":"Environments"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#deployment-dependencies","text":"Development Staging Production CI Quality Gates Development Staging Manual Approval","title":"Deployment Dependencies"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#local","text":"The local environment is used by individual software engineers during the development of new features and components. 
Engineers leverage some components from the deployed development environment that are not available on certain platforms or are unable to run locally. CosmosDB (Emulator only exists for Windows) The local environment also does not use Azure Traffic Manager. The frontend web app directly communicates to the backend REST API typically running on a separate localhost port mapping.","title":"Local"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#development","text":"The development environment is used as the first quality gate. All code that is checked into the main branch is automatically deployed to this environment after all CI quality gates have passed.","title":"Development"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#dev-regions","text":"West US (westus)","title":"Dev Regions"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#staging","text":"The staging environment is used to validate new features, components and other changes prior to production rollout. This environment is primarily used by developers, QA and other company stakeholders.","title":"Staging"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#staging-regions","text":"West US (westus) East US (eastus)","title":"Staging Regions"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#production","text":"The production environment is used by the worldwide user base. Changes to this environment are gated by manual approval by your product's leadership team in addition to other automatic quality gates.","title":"Production"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#production-regions","text":"West US (westus) Central US (centralus) East US (eastus)","title":"Production Regions"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#environment-variable-group","text":"","title":"Environment Variable Group"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#infrastructure-setup-memory-common","text":"appName businessUnit serviceConnection subscriptionId","title":"Infrastructure Setup (memory-common)"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#development-setup-memory-dev","text":"environmentName (placeholder)","title":"Development Setup (memory-dev)"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#staging-setup-memory-staging","text":"environmentName (placeholder)","title":"Staging Setup (memory-staging)"},{"location":"design/design-reviews/decision-log/examples/memory/Deployment/Environments/#production-setup-memory-prod","text":"environmentName (placeholder)","title":"Production Setup (memory-prod)"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/","text":"Trade Study: GitOps Conducted by: Tess and Jeff Backlog Work Item: #21672 Decision Makers: Wallace, whole team Overview For Memory, we will be creating a cloud native application with infrastructure as code. We will use GitOps for Continuous Deployment through pull requests infrastructure changes to be reflected. Overall, between our two options, one is more simple and targeted in a way that we believe would meet the requirements for this project. 
The other does the same, with additional features that may or may not be worth the extra configuration and setup. Evaluation Criteria Repo style: mono versus multi Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Documentation availability Maintainability Maturity User Interface Solutions Flux Flux is a tool created by Weaveworks; it is built on top of Kubernetes' API extension system, supports multi-tenancy, and integrates seamlessly with popular tools like Prometheus. Flux Acceptance Criteria Evaluation Repo style: mono versus multi Flux supports both as of v2 Policy Enforcement Azure Policy is in Preview Deployment Methods Define a Helm release using Helm Controllers Kustomization describes deployments Deployment Monitoring Flux works with Prometheus for deployment monitoring as well as Grafana dashboards Admission Control Flux uses RBAC from Kubernetes to lock down sync permissions. Uses the service account to access image pull secrets Azure Documentation availability Great, better when using Helm Operators Maintainability Manage via YAML files in git repo Maturity v2 is published under Apache license in GitHub , it works with Helm v3, and has PR commits from as recently as today; 945 stars, 94 forks User Interface CLI, the simplest lightweight option Other features to call out (see more on website) Flux only supports pull-based deployments, which means it must be paired with an operator Flux can send notifications and receive webhooks for syncing Health assessments Dependency management Automatic deployment Garbage collection Deploy on commit Variations Controllers Both Controller options are optional. The Helm Controller additionally fetches Helm artifacts to publish; see the diagram below. The Kustomize Controller manages state and continuous deployment. We will not decide which controller to use here, as that's a separate trade study; however, we will note that Helm is more widely documented within the Flux documentation. Flux v1 Flux v1 is only in maintenance mode and should not be used anymore. So this section does not consider the v1 option a valid option. GitOps Toolkit Flux v2 is built on top of the GitOps Toolkit , however we do not evaluate using the GitOps Toolkit alone as that is for when you want to make your own CD system, which is not what we want. ArgoCD with Helm Charts ArgoCD is a declarative, GitOps-based Continuous Delivery (CD) tool for Kubernetes. ArgoCD with Helm Acceptance Criteria Evaluation Repo style: mono versus multi ArgoCD supports both Policy Enforcement Azure Policy is in Preview Deployment Methods Deploy with Helm Chart Use Kustomize to apply some post-rendering to the Helm release templates Deployment Monitoring Argo CD exposes two sets of Prometheus metrics (application metrics and API server metrics) for deployment monitoring. Admission Control ArgoCD uses its RBAC feature. RBAC requires SSO configuration or one or more local users to be set up. Once SSO or local users are configured, additional RBAC roles can be defined. Argo CD does not have its own user management system and has only one built-in user, admin. 
The admin user is a superuser, and it has unrestricted access to the system Authorization is handled via JWT tokens and checking group claims in them Azure Documentation availability Argo has documentation on Azure AD Maturity Has PR commits from as recently as today 5,000 stars, 1,100 forks Maintainability Can use GitOps to manage it User Interface ArgoCD has a GUI and can be used across clusters Other features to call out (see more on website) ArgoCD support both pull model and push model for continuous delivery Argo can send notifications, but you need a separate tool for it Argo can receive webhooks Health assessments Potentially much more useful multi-tenancy tools. Manages multiple projects, maps them to teams, etc. SSO Integration Garbage collection Results This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Repo style Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Doc Maintainability Maturity UI Flux mono, multi Azure Policy, preview Helm, Kustomize Prometheus, Grafana RBAC Yes on Azure YAML in git repo 945 stars, 94 forks, currently maintained CLI ArgoCD mono, multi Azure Policy, preview Helm, Kustomize, KSonnet, ... Prometheus, Grafana RBAC Only in their own docs manifests in git repo 5,000 stars, 1,100 forks GUI, multiple clusters in same GUI Decision ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. So we have decided at this point to go with ArgoCD. Resources GitOps Enforcement Monitoring Policies Deployment Push with ArgoCD in Azure DevOps","title":"Trade Study: GitOps"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#trade-study-gitops","text":"Conducted by: Tess and Jeff Backlog Work Item: #21672 Decision Makers: Wallace, whole team","title":"Trade Study: GitOps"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#overview","text":"For Memory, we will be creating a cloud native application with infrastructure as code. We will use GitOps for Continuous Deployment through pull requests infrastructure changes to be reflected. Overall, between our two options, one is more simple and targeted in a way that we believe would meet the requirements for this project. 
The other does the same, with additional features that may or may not be worth the extra configuration and setup.","title":"Overview"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#evaluation-criteria","text":"Repo style: mono versus multi Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Documentation availability Maintainability Maturity User Interface","title":"Evaluation Criteria"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#solutions","text":"","title":"Solutions"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#flux","text":"Flux is a tool created by Waveworks and is built on top of Kubernetes' API extension system, supports multi-tenancy, and integrates seamlessly with popular tools like Prometheus.","title":"Flux"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#flux-acceptance-criteria-evaluation","text":"Repo style: mono versus multi Flux supports both as of v2 Policy Enforcement Azure Policy is in Preview Deployment Methods Define a Helm release using Helm Controllers Kustomization describes deployments Deployment Monitoring Flux works with Prometheus for deployment monitoring as well as Grafana dashboards Admission Control Flux uses RBAC from Kubernetes to lock down sync permissions. Uses the service account to access image pull secrets Azure Documentation availability Great, better when using Helm Operators Maintainability Manage via YAML files in git repo Maturity v2 is published under Apache license in GitHub , it works with Helm v3, and has PR commits from as recently as today 945 stars, 94 forks User Interface CLI, the simplest lightweight option Other features to call out (see more on website) Flux only supports Pull-based deployments which means it must be paired with an operator Flux can send notifications and receive webhooks for syncing Health assessments Dependency management Automatic deployment Garbage collection Deploy on commit","title":"Flux Acceptance Criteria Evaluation"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#variations","text":"","title":"Variations"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#controllers","text":"Both Controller options are optional. The Helm Controller additionally fetches helm artifacts to publish, see below diagram. The Kustomize Controller manages state and continuous deployment. We will not decide between the controller to use here, as that's a separate trade study, however we will note that Helm is more widely documented within Flux documentation.","title":"Controllers"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#flux-v1","text":"Flux v1 is only in maintenance mode and should not be used anymore. 
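To make the Flux v2 deployment methods above more concrete, here is a minimal, illustrative sketch of the resources involved: a GitRepository source plus either a HelmRelease (Helm Controller) or a Kustomization (Kustomize Controller). Object names, the repository URL, paths, and intervals are placeholders, and the exact apiVersions depend on the Flux v2 release in use.

```yaml
# Flux v2 sketch: one git source and the two deployment methods evaluated above.
# Names, URLs, paths, and intervals are placeholders; apiVersions vary by Flux release.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: memory-gitops
  namespace: flux-system
spec:
  interval: 1m
  url: https://dev.azure.com/contoso/memory/_git/gitops   # placeholder repository
  ref:
    branch: main
---
# Option 1: the Helm Controller reconciles a Helm release defined in the repo
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: memory-web
  namespace: flux-system
spec:
  interval: 5m
  chart:
    spec:
      chart: ./charts/memory-web
      sourceRef:
        kind: GitRepository
        name: memory-gitops
---
# Option 2: the Kustomize Controller applies plain or kustomized manifests
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: memory-manifests
  namespace: flux-system
spec:
  interval: 5m
  path: ./manifests
  prune: true          # garbage collection of removed resources
  sourceRef:
    kind: GitRepository
    name: memory-gitops
```

Either controller continuously reconciles the cluster against what is committed to the git repository, which is the pull-based model noted above.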
So this section does not consider the v1 option a valid option.","title":"Flux v1"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#gitops-toolkit","text":"Flux v2 is built on top of the GitOps Toolkit , however we do not evaluate using the GitOps Toolkit alone as that is for when you want to make your own CD system, which is not what we want.","title":"GitOps Toolkit"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#argocd-with-helm-charts","text":"ArgoCD is a declarative, GitOps-based Continuous Delivery (CD) tool for Kubernetes.","title":"ArgoCD with Helm Charts"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#argocd-with-helm-acceptance-criteria-evaluation","text":"Repo style: mono versus multi ArgoCD supports both Policy Enforcement Azure Policy is in Preview Deployment Methods Deploy with Helm Chart Use Kustomize to apply some post-rendering to the Helm release templates Deployment Monitoring Argo CD expose two sets of Prometheus metrics (application metrics and API server metrics) for deployment monitoring. Admission Control ArgoCD use RBAC feature. RBAC requires SSO configuration or one or more local users setup. Once SSO or local users are configured, additional RBAC roles can be defined Argo CD does not have its own user management system and has only one built-in user admin. The admin user is a superuser, and it has unrestricted access to the system Authorization is handled via JWT tokens and checking group claims in them Azure Documentation availability Argo has documentation on Azure AD Maturity Has PR commits from as recently as today 5,000 stars, 1,100 forks Maintainability Can use GitOps to manage it User Interface ArgoCD has a GUI and can be used across clusters Other features to call out (see more on website) ArgoCD support both pull model and push model for continuous delivery Argo can send notifications, but you need a separate tool for it Argo can receive webhooks Health assessments Potentially much more useful multi-tenancy tools. Manages multiple projects, maps them to teams, etc. SSO Integration Garbage collection","title":"ArgoCD with Helm Acceptance Criteria Evaluation"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#results","text":"This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Repo style Policy Enforcement Deployment Methods Deployment Monitoring Admission Control Azure Doc Maintainability Maturity UI Flux mono, multi Azure Policy, preview Helm, Kustomize Prometheus, Grafana RBAC Yes on Azure YAML in git repo 945 stars, 94 forks, currently maintained CLI ArgoCD mono, multi Azure Policy, preview Helm, Kustomize, KSonnet, ... Prometheus, Grafana RBAC Only in their own docs manifests in git repo 5,000 stars, 1,100 forks GUI, multiple clusters in same GUI","title":"Results"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#decision","text":"ArgoCD is more feature rich, will support more scenarios, and will be a better tool to put in our tool belts. 
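For illustration, a minimal ArgoCD Application of the kind the team would keep in the GitOps repository might look like the sketch below; the repository URL, chart path, and namespaces are placeholders rather than the actual Memory configuration.

```yaml
# Illustrative ArgoCD Application deploying a Helm chart from a git repo (names are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: memory-web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://dev.azure.com/contoso/memory/_git/gitops   # placeholder repository
    targetRevision: main
    path: charts/memory-web
    helm:
      valueFiles:
        - values-staging.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: memory
  syncPolicy:
    automated:
      prune: true      # garbage collection of removed resources
      selfHeal: true
```

The automated sync policy with pruning corresponds to the automatic deployment and garbage collection capabilities called out in the evaluation above.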
So we have decided at this point to go with ArgoCD.","title":"Decision"},{"location":"design/design-reviews/decision-log/examples/memory/trade-studies/gitops/#resources","text":"GitOps Enforcement Monitoring Policies Deployment Push with ArgoCD in Azure DevOps","title":"Resources"},{"location":"design/design-reviews/recipes/","text":"Design Review Recipes Design reviews come in all shapes and sizes. There are also different items to consider when creating a design at different stages during an engagement Design Review Process Incorporate design reviews throughout the lifetime of an engagement Design Review Templates Game Plan The same template already in use today High level architecture and design Includes technologies, languages & products to complete engagement objective Milestone / Epic Design Review Should be considered when an engagement contains multiple milestones or epics Design should be more detailed than game plan May require unique deployment, security and/or privacy characteristics from other milestones Feature / Story Design Review Design for complex features or stories Will reuse deployment, security and other characteristics defined within game plan or milestone May require new libraries, OSS or patterns to accomplish goals Task Design Review Highly detailed design for a complex tasks with many unknowns Will integrate into higher level feature/component designs","title":"Design Review Recipes"},{"location":"design/design-reviews/recipes/#design-review-recipes","text":"Design reviews come in all shapes and sizes. There are also different items to consider when creating a design at different stages during an engagement","title":"Design Review Recipes"},{"location":"design/design-reviews/recipes/#design-review-process","text":"Incorporate design reviews throughout the lifetime of an engagement","title":"Design Review Process"},{"location":"design/design-reviews/recipes/#design-review-templates","text":"","title":"Design Review Templates"},{"location":"design/design-reviews/recipes/#game-plan","text":"The same template already in use today High level architecture and design Includes technologies, languages & products to complete engagement objective","title":"Game Plan"},{"location":"design/design-reviews/recipes/#milestone-epic-design-review","text":"Should be considered when an engagement contains multiple milestones or epics Design should be more detailed than game plan May require unique deployment, security and/or privacy characteristics from other milestones","title":"Milestone / Epic Design Review"},{"location":"design/design-reviews/recipes/#feature-story-design-review","text":"Design for complex features or stories Will reuse deployment, security and other characteristics defined within game plan or milestone May require new libraries, OSS or patterns to accomplish goals","title":"Feature / Story Design Review"},{"location":"design/design-reviews/recipes/#task-design-review","text":"Highly detailed design for a complex tasks with many unknowns Will integrate into higher level feature/component designs","title":"Task Design Review"},{"location":"design/design-reviews/recipes/async-design-reviews/","text":"Async Design Reviews Goals Allow team members to review designs as their work schedule allows. Impact This in turn results in the following benefits: Higher Participation & Accessibility . They do not need to be online and available at the same time as others to review. Reduced Time Constraint . 
Reviewers can spend longer than the duration of a single meeting to think through the approach and provide feedback. Measures The metrics and/or KPIs used for design reviews overall would still apply. See design reviews for measures guidance. Participation The participation should be same as any design review. See design reviews for participation guidance. Facilitation Guidance The concept is to have the design follow the same workflow as any code changes to implement story or task. Rather than code however, the artifacts being added or changed are Markdown documents as well as any other supporting artifacts (prototypes, code samples, diagrams, etc). Prerequisites Source Controlled Design Docs Design documentation must live in a source control repository that supports pull requests (i.e. git). The following guidelines can be used to determine what repository houses the docs Keeping docs in the same repo as the affected code allows for the docs to be updated atomically alongside code within the same pull request. If the documentation represents code that lives in many different repositories, it may make more sense to keep the docs in their own repository. Place the docs so that they do not trigger CI builds for the affected code (assuming the documentation was the only change). This can be done by placing them in an isolated directory should they live alongside the code they represent. See directory structure example below. -root --src --docs <-- exclude from ci build trigger --design Workflow The designer branches the repo with the documentation. The designer works on adding or updating documentation relevant to the design. The designer submits pull request and requests specific team members to review. Reviewers provide feedback to Designer who incorporates the feedback. (OPTIONAL) Design review meeting might be held to give deeper explanation of design to reviewers. Design is approved/accepted and merged to main branch. Tips for Faster Review Cycles To make sure a design is reviewed in a timely manner, it's important to directly request reviews from team members. If team members are assigned without asking, or if no one is assigned it's likely the design will sit for longer without review. Try the following actions: Make it the designer's responsibility to find reviewers for their design The designer should ask a team member directly (face-to-face conversation, async messaging, etc) if they are available to review. Only if they agree, then assign them as a reviewer. Indicate if the design is ready to be merged once approved. Indicate Design Completeness It helps the reviewer to understand if the design is ready to be accepted or if its still a work-in-progress. The level and type of feedback the reviewer provides will likely be different depending on its state. Try the following actions to indicate the design state Mark the PR as a Draft. Some ALM tools support opening a pull request as a Draft such as Azure DevOps. Prefix the title with \"DRAFT\", \"WIP\", or \"work-in-progress\". Set the pull request to automatically merge after approvals and checks have passed. This can indicate to the reviewer the design is complete from the designer's perspective. Practice Inclusive Behaviors The designated reviewers are not the only team members that can provide feedback on the design. If other team members voluntarily committed time to providing feedback or asking questions, be sure to respond. 
Utilize face-to-face conversation (in person or virtual) to resolve feedback or questions from others as needed. This aids in building team cohesiveness in ensuring everyone understands and is willing to commit to a given design. This practice demonstrates inclusive behavior, which will promote trust and respect within the team. Respond to all PR comments objectively and respectfully, irrespective of the author's level, position, or title. After two round trips of question/response, resort to synchronous communication for resolution (i.e. virtual or physical face-to-face conversation).","title":"Async Design Reviews"},{"location":"design/design-reviews/recipes/async-design-reviews/#async-design-reviews","text":"","title":"Async Design Reviews"},{"location":"design/design-reviews/recipes/async-design-reviews/#goals","text":"Allow team members to review designs as their work schedule allows.","title":"Goals"},{"location":"design/design-reviews/recipes/async-design-reviews/#impact","text":"This in turn results in the following benefits: Higher Participation & Accessibility . They do not need to be online and available at the same time as others to review. Reduced Time Constraint . Reviewers can spend longer than the duration of a single meeting to think through the approach and provide feedback.","title":"Impact"},{"location":"design/design-reviews/recipes/async-design-reviews/#measures","text":"The metrics and/or KPIs used for design reviews overall would still apply. See design reviews for measures guidance.","title":"Measures"},{"location":"design/design-reviews/recipes/async-design-reviews/#participation","text":"Participation should be the same as for any design review. See design reviews for participation guidance.","title":"Participation"},{"location":"design/design-reviews/recipes/async-design-reviews/#facilitation-guidance","text":"The concept is to have the design follow the same workflow as any code change that implements a story or task. Rather than code, however, the artifacts being added or changed are Markdown documents as well as any other supporting artifacts (prototypes, code samples, diagrams, etc.).","title":"Facilitation Guidance"},{"location":"design/design-reviews/recipes/async-design-reviews/#prerequisites","text":"","title":"Prerequisites"},{"location":"design/design-reviews/recipes/async-design-reviews/#source-controlled-design-docs","text":"Design documentation must live in a source control repository that supports pull requests (i.e. git). The following guidelines can be used to determine which repository houses the docs: Keeping docs in the same repo as the affected code allows for the docs to be updated atomically alongside code within the same pull request. If the documentation represents code that lives in many different repositories, it may make more sense to keep the docs in their own repository. Place the docs so that they do not trigger CI builds for the affected code (assuming the documentation was the only change). This can be done by placing them in an isolated directory should they live alongside the code they represent. See the directory structure example below. -root --src --docs <-- exclude from ci build trigger --design","title":"Source Controlled Design Docs"},{"location":"design/design-reviews/recipes/async-design-reviews/#workflow","text":"The designer branches the repo with the documentation. The designer works on adding or updating documentation relevant to the design. The designer submits a pull request and requests specific team members to review. 
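One way to honor the "exclude from ci build trigger" note in the directory structure above is a path filter on the code pipeline's trigger. The snippet below is a sketch assuming an Azure DevOps YAML pipeline; other CI systems offer equivalent path filters.

```yaml
# Code CI pipeline: ignore docs-only changes so design doc PRs do not trigger code builds
trigger:
  branches:
    include:
      - main
  paths:
    exclude:
      - docs/*
```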
Reviewers provide feedback to Designer who incorporates the feedback. (OPTIONAL) Design review meeting might be held to give deeper explanation of design to reviewers. Design is approved/accepted and merged to main branch.","title":"Workflow"},{"location":"design/design-reviews/recipes/async-design-reviews/#tips-for-faster-review-cycles","text":"To make sure a design is reviewed in a timely manner, it's important to directly request reviews from team members. If team members are assigned without asking, or if no one is assigned it's likely the design will sit for longer without review. Try the following actions: Make it the designer's responsibility to find reviewers for their design The designer should ask a team member directly (face-to-face conversation, async messaging, etc) if they are available to review. Only if they agree, then assign them as a reviewer. Indicate if the design is ready to be merged once approved.","title":"Tips for Faster Review Cycles"},{"location":"design/design-reviews/recipes/async-design-reviews/#indicate-design-completeness","text":"It helps the reviewer to understand if the design is ready to be accepted or if its still a work-in-progress. The level and type of feedback the reviewer provides will likely be different depending on its state. Try the following actions to indicate the design state Mark the PR as a Draft. Some ALM tools support opening a pull request as a Draft such as Azure DevOps. Prefix the title with \"DRAFT\", \"WIP\", or \"work-in-progress\". Set the pull request to automatically merge after approvals and checks have passed. This can indicate to the reviewer the design is complete from the designer's perspective.","title":"Indicate Design Completeness"},{"location":"design/design-reviews/recipes/async-design-reviews/#practice-inclusive-behaviors","text":"The designated reviewers are not the only team members that can provide feedback on the design. If other team members voluntarily committed time to providing feedback or asking questions, be sure to respond. Utilize face-to-face conversation (in person or virtual) to resolve feedback or questions from others as needed. This aids in building team cohesiveness in ensuring everyone understands and is willing to commit to a given design. This practice demonstrates inclusive behavior ; which will promote trust and respect within the team. Respond to all PR comments objectively and respectively irrespective of the authors level, position, or title. After two round trips of question/response, resort to synchronous communication for resolution (i.e. virtual or physical face-to-face conversation).","title":"Practice Inclusive Behaviors"},{"location":"design/design-reviews/recipes/engagement-process/","text":"Incorporating Design Reviews into an Engagement Introduction Design reviews should not feel like a burden. Design reviews can be easily incorporated into the dev crew process with minimal overhead. Only create design reviews when needed. Not every story or task requires a complete design review. Leverage this guidance to make changes that best fit in with the team. Every team works differently. Leverage Microsoft subject-matter experts (SME) as needed during design reviews. Not every story needs SME or leadership sign-off. Most design reviews can be fully executed within a dev crew. Use diagrams to visualize concepts and architecture. The following guidelines outline how Microsoft and the customer together can incorporate design reviews into their day-to-day agile processes. 
Envisioning / Architecture Design Session (ADS) Early in an engagement Microsoft works with customers to understand their unique goals and objectives and establish a definition of done. Microsoft dives deep into existing customer infrastructure and architecture to understand potential constraints. Additionally, we seek to understand and uncover specific non-functional requirements that influence the solution. During this time the team uncovers many unknowns, leveraging all new-found information, in order to help generate an impactful design that meets customer goals. After ADS it can be helpful to conduct Engineering Feasibility Spikes to further de-risk technologies being considered for the engagement. Tip : All unknowns have not been addressed at this point. Sprint Planning In many engagements Microsoft works with customers using a SCRUM agile development process which begins with sprint planning. Sprint planning is a great opportunity to dive deep into the next set of high priority work. Some key points to address are the following: Identify stories that require design reviews Separate design from implementation for complex stories Assign an owner to each design story Stories that will benefit from design reviews have one or more of the following in common: There are many unknown or unclear requirements There is a wide distribution of anticipated workload, or story pointing, across the dev crew The developer cannot clearly illustrate all tasks required for the story Tip: After sprint planning is complete the team should consider hosting an initial design review discussion to dive deep in the design requirement of the stories that were identified. This will provide more clarity so that the team can move forward with a design review, synchronously or asynchronously, and complete tasks. Sprint Backlog Refinement If your team is not already hosting a Sprint Backlog Refinement session at least once per week you should consider it. It is a great opportunity to: Keep the backlog clean Re-prioritize work based on shifting business priorities Fill in missing descriptions and acceptance criteria Identify stories that require design reviews The team can follow the same steps from sprint planning to help identify which stories require design reviews. This can often save much time during the actual sprint planning meetings to focus on the task at hand. Sprint Retrospectives Sprint retrospectives are a great time to check in with the dev team, identify what is working or not working, and propose changes to keep improving. It is also a great time to check in on design reviews Did any of the designs change from last sprint? How have design changes impacted the engagement? Have previous design artifacts been updated to reflect new changes? All design artifacts should be treated as a living document. As requirements change or uncover more unknowns the dev crew should retroactively update all design artifacts. Missing this critical step may cause the customer to incur future technical debt. Artifacts that are not up to date are bugs in the design. Tip: Keep your artifacts up to date by adding it to your teams definition of done for all user stories. Sync Design Reviews It is often helpful to schedule 1-2 design sessions per sprint as part of the normal aforementioned meeting cadence. Throughout the sprint, folks can add design topics to the meeting agenda and if there is nothing to discuss for a particular meeting occurrence, it can simply be cancelled. 
While these sessions may not always be used, they help project members align on timing and purpose early on and establish precedence, often encouraging participation so design topics don't slip through the cracks. Oftentimes, it is helpful for those project members intending to present their design to the wider group to distribute documentation on their design prior to the session so that other participants can come prepared with context heading into the session. It should be noted that the necessity of these sessions certainly evolves over the course of the engagement. Early on, or in other times of more ambiguity, these meetings are typically used more often and more fully. Lastly, while it is suggested that sync design reviews are scheduled during the normal sprint cadence, scheduling ad-hoc sessions should not be discouraged - even if these reviews are limited to the participants of a specific workstream. Wrap-up Sprints Wrap-up sprints are a great time to tie up loose ends with the customer and hand-off solution. Customer hand-off becomes a lot easier when there are design artifacts to reference and deliver alongside the completed solution. During your wrap-up sprints the dev crew should consider the following: Are the design artifacts up to date? Are the design artifacts stored in an accessible location?","title":"Incorporating Design Reviews into an Engagement"},{"location":"design/design-reviews/recipes/engagement-process/#incorporating-design-reviews-into-an-engagement","text":"","title":"Incorporating Design Reviews into an Engagement"},{"location":"design/design-reviews/recipes/engagement-process/#introduction","text":"Design reviews should not feel like a burden. Design reviews can be easily incorporated into the dev crew process with minimal overhead. Only create design reviews when needed. Not every story or task requires a complete design review. Leverage this guidance to make changes that best fit in with the team. Every team works differently. Leverage Microsoft subject-matter experts (SME) as needed during design reviews. Not every story needs SME or leadership sign-off. Most design reviews can be fully executed within a dev crew. Use diagrams to visualize concepts and architecture. The following guidelines outline how Microsoft and the customer together can incorporate design reviews into their day-to-day agile processes.","title":"Introduction"},{"location":"design/design-reviews/recipes/engagement-process/#envisioning-architecture-design-session-ads","text":"Early in an engagement Microsoft works with customers to understand their unique goals and objectives and establish a definition of done. Microsoft dives deep into existing customer infrastructure and architecture to understand potential constraints. Additionally, we seek to understand and uncover specific non-functional requirements that influence the solution. During this time the team uncovers many unknowns, leveraging all new-found information, in order to help generate an impactful design that meets customer goals. After ADS it can be helpful to conduct Engineering Feasibility Spikes to further de-risk technologies being considered for the engagement. Tip : All unknowns have not been addressed at this point.","title":"Envisioning / Architecture Design Session (ADS)"},{"location":"design/design-reviews/recipes/engagement-process/#sprint-planning","text":"In many engagements Microsoft works with customers using a SCRUM agile development process which begins with sprint planning. 
Sprint planning is a great opportunity to dive deep into the next set of high priority work. Some key points to address are the following: Identify stories that require design reviews Separate design from implementation for complex stories Assign an owner to each design story Stories that will benefit from design reviews have one or more of the following in common: There are many unknown or unclear requirements There is a wide distribution of anticipated workload, or story pointing, across the dev crew The developer cannot clearly illustrate all tasks required for the story Tip: After sprint planning is complete the team should consider hosting an initial design review discussion to dive deep in the design requirement of the stories that were identified. This will provide more clarity so that the team can move forward with a design review, synchronously or asynchronously, and complete tasks.","title":"Sprint Planning"},{"location":"design/design-reviews/recipes/engagement-process/#sprint-backlog-refinement","text":"If your team is not already hosting a Sprint Backlog Refinement session at least once per week you should consider it. It is a great opportunity to: Keep the backlog clean Re-prioritize work based on shifting business priorities Fill in missing descriptions and acceptance criteria Identify stories that require design reviews The team can follow the same steps from sprint planning to help identify which stories require design reviews. This can often save much time during the actual sprint planning meetings to focus on the task at hand.","title":"Sprint Backlog Refinement"},{"location":"design/design-reviews/recipes/engagement-process/#sprint-retrospectives","text":"Sprint retrospectives are a great time to check in with the dev team, identify what is working or not working, and propose changes to keep improving. It is also a great time to check in on design reviews Did any of the designs change from last sprint? How have design changes impacted the engagement? Have previous design artifacts been updated to reflect new changes? All design artifacts should be treated as a living document. As requirements change or uncover more unknowns the dev crew should retroactively update all design artifacts. Missing this critical step may cause the customer to incur future technical debt. Artifacts that are not up to date are bugs in the design. Tip: Keep your artifacts up to date by adding it to your teams definition of done for all user stories.","title":"Sprint Retrospectives"},{"location":"design/design-reviews/recipes/engagement-process/#sync-design-reviews","text":"It is often helpful to schedule 1-2 design sessions per sprint as part of the normal aforementioned meeting cadence. Throughout the sprint, folks can add design topics to the meeting agenda and if there is nothing to discuss for a particular meeting occurrence, it can simply be cancelled. While these sessions may not always be used, they help project members align on timing and purpose early on and establish precedence, often encouraging participation so design topics don't slip through the cracks. Oftentimes, it is helpful for those project members intending to present their design to the wider group to distribute documentation on their design prior to the session so that other participants can come prepared with context heading into the session. It should be noted that the necessity of these sessions certainly evolves over the course of the engagement. 
Early on, or in other times of more ambiguity, these meetings are typically used more often and more fully. Lastly, while it is suggested that sync design reviews are scheduled during the normal sprint cadence, scheduling ad-hoc sessions should not be discouraged - even if these reviews are limited to the participants of a specific workstream.","title":"Sync Design Reviews"},{"location":"design/design-reviews/recipes/engagement-process/#wrap-up-sprints","text":"Wrap-up sprints are a great time to tie up loose ends with the customer and hand-off solution. Customer hand-off becomes a lot easier when there are design artifacts to reference and deliver alongside the completed solution. During your wrap-up sprints the dev crew should consider the following: Are the design artifacts up to date? Are the design artifacts stored in an accessible location?","title":"Wrap-up Sprints"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/","text":"Engineering Feasibility Spikes: Identifying and Mitigating Risk Introduction Some engagements require more de-risking than others. Even after Architectural Design Sessions (ADS) an engagement may still have substantial technical unknowns. These types of engagements warrant an exploratory/validation phase where Engineering Feasibility Spikes can be conducted immediately after envisioning/ADS and before engineering sprints. Engineering Feasibility Spikes Are regimented yet collaborative time-boxed investigatory activities conducted in a feedback loop to capitalize on individual learnings to inform the team. Increase the team\u2019s knowledge and understanding while minimizing engagement risks. The following guidelines outline how Microsoft and the customer can incorporate engineering feasibility spikes into the day-to-day agile processes. Pre-Mortem A good way to gauge what engineering spikes to conduct is to do a pre-mortem. What is a Pre-Mortem? A 90-minute meeting after envisioning/ADS that includes the entire team (and can also include the customer) which answers \"Imagine the project has failed. What problems and challenges caused this failure?\" Allows the entire team to initially raise concerns and risks early in the engagement. This input is used to decide which risks to pursue as engineering spikes. Sharing Learnings & Current Progress Feedback Loop The key element from conducting the engineering feasibility spikes is sharing the outcomes in-flight. The team gets together and shares learning on a weekly basis (or more frequently if needed). The sharing is done via a 30-minute call. Everyone on the Dev Crew joins the call (even if not everyone is assigned an engineering spike story or even if the spike work was underway and not fully completed). The feedback loop is significantly tighter/shorter than in sprint-based agile process. Instead of using the Sprint as the forcing function to adjust/pivot/re-prioritize, the interim sharing sessions were the trigger. Re-Prioritizing the Next Spikes After the team shares current progress, another round of planning is done. This allows the team to Establish a very tight feedback loop. Re-prioritize the next spike(s) because of the outcome from the current engineering feasibility spikes. Adjusting Based on Context During the sharing call, and when the team believes it has enough information, the team sometimes comes to the realization that the original spike acceptance criteria is no longer valid. The team pivots into another area that provides more value. 
A decision log can be used to track outcomes. Engineering Feasibility Sprints Diagram The process is depicted in the diagram below. Benefits Creating Code Samples to Prove Out Ideas It is important to be intentional about the spikes: they are not aiming to produce production-level code. The team sometimes must write code to arrive at the technical learning. The team must be cognizant that the code written for the spikes is not going to serve as the code for the final solution. The code written is just enough to drive the investigation forward with greater confidence. For example, suppose the team was exploring the API choreography of creating a Graph client with various Azure Active Directory (AAD) authentication flows and permissions. The code to demonstrate this is implemented in a console app, but it could have been done via an Express server, etc. The fact that it was a console app was not important; rather, the ability of the Graph client to perform operations against the Graph API endpoint with the minimal number of permissions was the main learning goal. Targeted Conversations By sharing the progress of the spike, the team\u2019s collective knowledge increases. The spikes allow the team to drive succinct conversations with various Product Groups (PGs) and other subject matter experts (SMEs). Rather than speaking at a hypothetical level, the team plays back project/architecture concerns and concretely points out why something is a showstopper or not a viable way forward. Increased Customer Trust This process leads to increased customer trust. Using this process, the team Brings the customer along in the decision-making process and guides them on how to go forward. Provides answers with confidence and suggests sound architectural designs. Conducting engineering feasibility spikes sets the team and the customer up for success, especially if it highlights technology learnings that help the customer fully understand the feasibility/viability of an engineering solution. Summary of Key Points A pre-mortem can involve the whole team in surfacing business and technical risks. The key purpose of the engineering feasibility spike is learning. Learning comes from both conducting and sharing insights from spikes. Use new spike-infused learnings to revise, refine, re-prioritize, or create the next set of spikes. When spikes are completed, look for new weekly rhythms like adding a \u2018risk\u2019 column to the retro board or raising topics at daily standup to identify emerging risks.","title":"Engineering Feasibility Spikes: Identifying and Mitigating Risk"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#engineering-feasibility-spikes-identifying-and-mitigating-risk","text":"","title":"Engineering Feasibility Spikes: Identifying and Mitigating Risk"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#introduction","text":"Some engagements require more de-risking than others. Even after Architectural Design Sessions (ADS) an engagement may still have substantial technical unknowns. 
These types of engagements warrant an exploratory/validation phase where Engineering Feasibility Spikes can be conducted immediately after envisioning/ADS and before engineering sprints.","title":"Introduction"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#engineering-feasibility-spikes","text":"Are regimented yet collaborative time-boxed investigatory activities conducted in a feedback loop to capitalize on individual learnings to inform the team. Increase the team\u2019s knowledge and understanding while minimizing engagement risks. The following guidelines outline how Microsoft and the customer can incorporate engineering feasibility spikes into the day-to-day agile processes.","title":"Engineering Feasibility Spikes"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#pre-mortem","text":"A good way to gauge what engineering spikes to conduct is to do a pre-mortem.","title":"Pre-Mortem"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#what-is-a-pre-mortem","text":"A 90-minute meeting after envisioning/ADS that includes the entire team (and can also include the customer) which answers \"Imagine the project has failed. What problems and challenges caused this failure?\" Allows the entire team to initially raise concerns and risks early in the engagement. This input is used to decide which risks to pursue as engineering spikes.","title":"What is a Pre-Mortem?"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#sharing-learnings-current-progress","text":"","title":"Sharing Learnings & Current Progress"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#feedback-loop","text":"The key element from conducting the engineering feasibility spikes is sharing the outcomes in-flight. The team gets together and shares learning on a weekly basis (or more frequently if needed). The sharing is done via a 30-minute call. Everyone on the Dev Crew joins the call (even if not everyone is assigned an engineering spike story or even if the spike work was underway and not fully completed). The feedback loop is significantly tighter/shorter than in sprint-based agile process. Instead of using the Sprint as the forcing function to adjust/pivot/re-prioritize, the interim sharing sessions were the trigger.","title":"Feedback Loop"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#re-prioritizing-the-next-spikes","text":"After the team shares current progress, another round of planning is done. This allows the team to Establish a very tight feedback loop. Re-prioritize the next spike(s) because of the outcome from the current engineering feasibility spikes.","title":"Re-Prioritizing the Next Spikes"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#adjusting-based-on-context","text":"During the sharing call, and when the team believes it has enough information, the team sometimes comes to the realization that the original spike acceptance criteria is no longer valid. The team pivots into another area that provides more value. 
A decision log can be used to track outcomes.","title":"Adjusting Based on Context"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#engineering-feasibility-sprints-diagram","text":"The process is depicted in the diagram below.","title":"Engineering Feasibility Sprints Diagram"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#benefits","text":"","title":"Benefits"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#creating-code-samples-to-prove-out-ideas","text":"It is important to note to be intentional about the spikes not aiming to produce production-level code. The team sometimes must write code to arrive at the technical learning. The team must be cognizant that the code written for the spikes is not going to serve as the code for the final solution. The code written is just enough to drive the investigation forward with greater confidence. For example, supposed the team was exploring the API choreography of creating a Graph client with various Azure Active Directory (AAD) authentication flows and permissions. The code to demonstrate this is implemented in a console app, but it could have been done via an Express server, etc. The fact that it was a console app was not important, but rather the ability of the Graph client to be able to do operations against the Graph API endpoint with the minimal number of permissions is the main learning goal.","title":"Creating Code Samples to Prove Out Ideas"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#targeted-conversations","text":"By sharing the progress of the spike, the team\u2019s collective knowledge increases. The spikes allow the team to drive succinct conversations with various Product Groups (PGs) and other subject matter experts (SMEs). Rather than speaking at a hypothetical level, the team playbacks project/architecture concerns and concretely points out why something is a showstopper or not a viable way forward.","title":"Targeted Conversations"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#increased-customer-trust","text":"This process leads to increased customer trust. Using this process, the team Brings the customer along in the decision-making process and guides them how to go forward. Provides answers with confidence and suggests sound architectural designs. Conducting engineering feasibility spikes sets the team and the customer up for success, especially if it highlights technology learnings that help the customer fully understand the feasibility/viability of an engineering solution.","title":"Increased Customer Trust"},{"location":"design/design-reviews/recipes/engineering-feasibility-spikes/#summary-of-key-points","text":"A pre-mortem can involve the whole team in surfacing business and technical risks. The key purpose of the engineering feasibility spike is learning. Learning comes from both conducting and sharing insights from spikes. Use new spike infused learnings to revise, refine, re-prioritize, or create the next set of spikes. When spikes are completed, look for new weekly rhythms like adding a \u2018risk\u2019 column to the retro board or raising topics at daily standup to identify emerging risks.","title":"Summary of Key Points"},{"location":"design/design-reviews/recipes/high-level-design-recipe/","text":"High Level / Game Plan Design Recipe Why is this Valuable? 
Design at macroscopic level shows the interactions between systems and services that will be used to accomplish the project. It is intended to ensure there is high level understanding of the plan for what to build, which off-the-shelf components will be used, and which external components will need to interact with the deliverable. Things to Keep in Mind As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Attempt to illustrate different personas involved in the use cases and how/which boxes are their entry points. Prefer pictures over paragraphs. The diagrams aren't intended to generate code, so they should be fairly high level. Artifacts should indicate the direction of calls (are they outbound, inbound, or bidirectional?) and call out system boundaries where ports might need to be opened or additional infrastructure work may be needed to allow calls to be made. Sequence diagrams are helpful to show the flow of calls among components + systems. Generic box diagrams depicting data flow or call origination/destination are useful. However, the title should clearly define what the arrows show indicate. In most cases, a diagram will show either data flow or call directions but not both. Visualize the contrasting aspects of the system/diagram for ease of communication. e.g. differing technologies employed, modified vs. untouched components, or internet vs. local cloud components. Colors, grouping boxes, and iconography can be used for differentiating. Prefer ease-of-understanding for communicating ideas over strict UML correctness. Design reviews should be lightweight and should not feel like an additional process overhead. Examples","title":"High Level / Game Plan Design Recipe"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#high-level-game-plan-design-recipe","text":"","title":"High Level / Game Plan Design Recipe"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#why-is-this-valuable","text":"Design at macroscopic level shows the interactions between systems and services that will be used to accomplish the project. It is intended to ensure there is high level understanding of the plan for what to build, which off-the-shelf components will be used, and which external components will need to interact with the deliverable.","title":"Why is this Valuable?"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#things-to-keep-in-mind","text":"As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Attempt to illustrate different personas involved in the use cases and how/which boxes are their entry points. Prefer pictures over paragraphs. The diagrams aren't intended to generate code, so they should be fairly high level. Artifacts should indicate the direction of calls (are they outbound, inbound, or bidirectional?) and call out system boundaries where ports might need to be opened or additional infrastructure work may be needed to allow calls to be made. 
Sequence diagrams are helpful to show the flow of calls among components + systems. Generic box diagrams depicting data flow or call origination/destination are useful. However, the title should clearly define what the arrows indicate. In most cases, a diagram will show either data flow or call directions but not both. Visualize the contrasting aspects of the system/diagram for ease of communication, e.g. differing technologies employed, modified vs. untouched components, or internet vs. local cloud components. Colors, grouping boxes, and iconography can be used for differentiating. Prefer ease-of-understanding for communicating ideas over strict UML correctness. Design reviews should be lightweight and should not feel like an additional process overhead.","title":"Things to Keep in Mind"},{"location":"design/design-reviews/recipes/high-level-design-recipe/#examples","text":"","title":"Examples"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/","text":"Milestone / Epic Design Review Recipe Why is this Valuable? Design at the epic/milestone level can help the team make better decisions about prioritization by summarizing the value, effort, complexity, risks, and dependencies. This brief document can help the team align on the selected approach and briefly explain the rationale for other teams, subject-matter experts, project advisors, and new team members. Things to Keep in Mind As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Design reviews should be lightweight and should not feel like an additional process overhead. The Dev Lead can usually provide guidance on whether a given epic/milestone needs a design review and can help other team members in preparation. This is not a strict template that must be followed, and teams should not be bogged down with polished \"design presentations\". Think of the recipe below as a \"menu of options\" for potential questions to think through in designing this epic. Not all sections are required for every epic. Focus on sections and questions that are most relevant for making the decision and rationalizing the trade-offs. Milestone/epic design is considered high-level design and is usually more detailed than the design included in the Game Plan; it will likely re-use some technologies, non-functional requirements, and constraints mentioned in the Game Plan. As the team learns more about the project and further refines the scope of the epic, they may specifically call out notable changes to the overall approach and, in particular, highlight any unique deployment, security, privacy, scalability, etc. characteristics of this milestone. 
Template You can download the Milestone/Epic Design Review Template , copy it into your project, and use it as described in the async design review recipe .","title":"Milestone / Epic Design Review Recipe"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#milestone-epic-design-review-recipe","text":"","title":"Milestone / Epic Design Review Recipe"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#why-is-this-valuable","text":"Design at epic/milestone level can help the team make better decisions about prioritization by summarizing the value, effort, complexity, risks, and dependencies. This brief document can help the team align on the selected approach and briefly explain the rationale for other teams, subject-matter experts, project advisors, and new team members.","title":"Why is this Valuable?"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#things-to-keep-in-mind","text":"As with all other aspects of the project, design reviews must provide a friendly and safe environment so that any team member feels comfortable proposing a design for review and can use the opportunity to grow and learn from the constructive / non-judgemental feedback from peers and subject-matter experts (see Team Agreements ). Design reviews should be lightweight and should not feel like an additional process overhead. The Dev Lead can usually provide guidance on whether a given epic/milestone needs a design review and can help other team members in preparation. This is not a strict template that must be followed and teams should not be bogged down with polished \"design presentations\". Think of the recipe below as a \"menu of options\" for potential questions to think through in designing this epic. Not all sections are required for every epic. Focus on sections and questions that are most relevant for making the decision and rationalizing the trade-offs. Milestone/epic design is considered high-level design but is usually more detailed than the design included in the Game Plan, and will likely re-use some technologies, non-functional requirements, and constraints mentioned in the Game Plan. As the team learns more about the project and further refines the scope of the epic, they may specifically call out notable changes to the overall approach and, in particular, highlight any unique deployment, security, privacy, scalability, etc. characteristics of this milestone.","title":"Things to Keep in Mind"},{"location":"design/design-reviews/recipes/milestone-epic-design-review-recipe/#template","text":"You can download the Milestone/Epic Design Review Template , copy it into your project, and use it as described in the async design review recipe .","title":"Template"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/","text":"Preferred Diagram Tooling At each stage in the engagement process, diagrams are a key part of the design review. The preferred tooling for creating and maintaining diagrams is to choose one of the following: Microsoft Visio Microsoft PowerPoint The .drawio.png (or .drawio ) format from diagrams.net (formerly draw.io ) In all cases, we recommend storing the exported PNG images from these diagrams in the repo along with the source files so they can easily be referenced in documentation and more easily reviewed during PRs. The .drawio.png format stores both at once. 
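To make the recommendation above about storing exported PNGs easier to follow, a small helper script can flag diagram sources that are missing an exported image. This is only a minimal sketch, assuming plain .drawio sources sit next to exported .png files with the same base name; the find_unexported_diagrams function is a hypothetical helper, not part of any existing tooling, and .drawio.png files need no such check because they are both source and image.

```python
from pathlib import Path


def find_unexported_diagrams(repo_root: str) -> list[Path]:
    """Return .drawio sources that lack a sibling exported .png.

    .drawio.png files are skipped implicitly: they already embed the
    diagram source in the PNG metadata, so nothing needs exporting.
    """
    missing = []
    for source in Path(repo_root).rglob("*.drawio"):
        exported = source.with_suffix(".png")  # e.g. architecture.drawio -> architecture.png
        if not exported.exists():
            missing.append(source)
    return missing


if __name__ == "__main__":
    for path in find_unexported_diagrams("."):
        print(f"Missing exported PNG for: {path}")
```

A check like this could be wired into CI so that reviewers always have an up-to-date image to open directly in the pull request.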
Microsoft Visio It contains a lot of shapes out of the box, including Azure icons, the desktop app exists on PC, and there's a great Web app. Most diagrams in the Azure Architecture Center are Visio diagrams. Microsoft PowerPoint Diagrams can be easily reused in presentations, a PowerPoint license is pretty common, the desktop app exists on PC and on the Mac, and there's a great Web app. .drawio.png There are different desktop, web apps and VS Code extensions. This tooling can be used like Visio or LucidChart, without the licensing/remote storage concerns. Furthermore, Diagrams.net has a collection of Azure/Office/Microsoft icons, as well as other well-known tech, so it is not only useful for swimlanes and flow diagrams, but also for architecture diagrams. .drawio.png should be preferred over the .drawio format. The .drawio.png format uses the metadata layer within the PNG file-format to hide SVG vector graphics representation, then renders the .png when saving. This clever use of both the meta layer and image layer allows anyone to further edit the PNG file. It also renders like a normal PNG in browsers and other viewers, making it easy to transfer and embed. Furthermore, it can be edited within VSCode very easily using the Draw.io Integration VSCode Extension .","title":"Preferred Diagram Tooling"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#preferred-diagram-tooling","text":"At each stage in the engagement process, diagrams are a key part of the design review. The preferred tooling for creating and maintaining diagrams is to choose one of the following: Microsoft Visio Microsoft PowerPoint The .drawio.png (or .drawio ) format from diagrams.net (formerly draw.io ) In all cases, we recommend storing the exported PNG images from these diagrams in the repo along with the source files so they can easily be referenced in documentation and more easily reviewed during PRs. The .drawio.png format stores both at once.","title":"Preferred Diagram Tooling"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#microsoft-visio","text":"It contains a lot of shapes out of the box, including Azure icons, the desktop app exists on PC, and there's a great Web app. Most diagrams in the Azure Architecture Center are Visio diagrams.","title":"Microsoft Visio"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#microsoft-powerpoint","text":"Diagrams can be easily reused in presentations, a PowerPoint license is pretty common, the desktop app exists on PC and on the Mac, and there's a great Web app.","title":"Microsoft PowerPoint"},{"location":"design/design-reviews/recipes/preferred-diagram-tooling/#drawiopng","text":"There are different desktop, web apps and VS Code extensions. This tooling can be used like Visio or LucidChart, without the licensing/remote storage concerns. Furthermore, Diagrams.net has a collection of Azure/Office/Microsoft icons, as well as other well-known tech, so it is not only useful for swimlanes and flow diagrams, but also for architecture diagrams. .drawio.png should be preferred over the .drawio format. The .drawio.png format uses the metadata layer within the PNG file-format to hide SVG vector graphics representation, then renders the .png when saving. This clever use of both the meta layer and image layer allows anyone to further edit the PNG file. It also renders like a normal PNG in browsers and other viewers, making it easy to transfer and embed. 
Furthermore, it can be edited within VSCode very easily using the Draw.io Integration VSCode Extension .","title":".drawio.png"},{"location":"design/design-reviews/recipes/technical-spike/","text":"Technical Spike From Wikipedia ... A spike in a sprint can be used in a number of ways: As a way to familiarize the team with new hardware or software To analyze a problem thoroughly and assist in properly dividing work among separate team members. Spike tests can also be used to mitigate future risk, and may uncover additional issues that have escaped notice. A distinction can be made between technical spikes and functional spikes. The technical spike is used more often for evaluating the impact new technology has on the current implementation. A functional spike is used to determine the interaction with a new feature or implementation. Engineering feasibility spikes can also be conducted to de-risk an engagement and increase the team's understanding. Deliverable Generally the deliverable from a Technical Spike should be a document detailing what was evaluated and the outcome of that evaluation. The specifics contained in the document will vary, but there are some general principles that might be helpful. Problem Statement/Goals: Be sure to include a section that clearly details why an evaluation is being done and what the outcome of this evaluation should be. This is helpful to ensure that the technical spike was productive and advanced the overall project in some way. Make sure it is repeatable: Detail the components used, installation instructions, configuration, etc. required to build the environment that was used for evaluation and testing. If any testing is performed, make sure to include the scripts, links to the applications, configuration options, etc. so that testing could be performed again. There are many reasons that the evaluation environment may need to be rebuilt. For example: Another scenario needs to be tested. A new version of the technology has been released. The technology needs to be tested on a new platform. Fact-Finding: The goal of a spike should be fact-finding, not decision-making or recommendation. Ideally, the technology spike digs into a number of technical questions and gets answers so that the broader project team can then come back together and agree on an appropriate course forward. Evidence: Generally you will use sections to summarize the results of testing which do not include the potentially hundreds of detailed results, however, you should include all detailed testing results in an appendix or an attachment. Having full results detailed somewhere will help the team trust the results. In addition, data can be interpreted lots of different ways, and it may be necessary to go back to the original data for a new interpretation. Organization: The technical documentation can be lengthy. It is generally a good idea to organize sections with headers and include a table of contents. Generally sections towards the beginning of the document should summarize data and use one or more appendices for more details.","title":"Technical Spike"},{"location":"design/design-reviews/recipes/technical-spike/#technical-spike","text":"From Wikipedia ... A spike in a sprint can be used in a number of ways: As a way to familiarize the team with new hardware or software To analyze a problem thoroughly and assist in properly dividing work among separate team members. Spike tests can also be used to mitigate future risk, and may uncover additional issues that have escaped notice. 
A distinction can be made between technical spikes and functional spikes. The technical spike is used more often for evaluating the impact new technology has on the current implementation. A functional spike is used to determine the interaction with a new feature or implementation. Engineering feasibility spikes can also be conducted to de-risk an engagement and increase the team's understanding.","title":"Technical Spike"},{"location":"design/design-reviews/recipes/technical-spike/#deliverable","text":"Generally the deliverable from a Technical Spike should be a document detailing what was evaluated and the outcome of that evaluation. The specifics contained in the document will vary, but there are some general principles that might be helpful. Problem Statement/Goals: Be sure to include a section that clearly details why an evaluation is being done and what the outcome of this evaluation should be. This is helpful to ensure that the technical spike was productive and advanced the overall project in some way. Make sure it is repeatable: Detail the components used, installation instructions, configuration, etc. required to build the environment that was used for evaluation and testing. If any testing is performed, make sure to include the scripts, links to the applications, configuration options, etc. so that testing could be performed again. There are many reasons that the evaluation environment may need to be rebuilt. For example: Another scenario needs to be tested. A new version of the technology has been released. The technology needs to be tested on a new platform. Fact-Finding: The goal of a spike should be fact-finding, not decision-making or recommendation. Ideally, the technology spike digs into a number of technical questions and gets answers so that the broader project team can then come back together and agree on an appropriate course forward. Evidence: Generally you will use sections to summarize the results of testing which do not include the potentially hundreds of detailed results, however, you should include all detailed testing results in an appendix or an attachment. Having full results detailed somewhere will help the team trust the results. In addition, data can be interpreted lots of different ways, and it may be necessary to go back to the original data for a new interpretation. Organization: The technical documentation can be lengthy. It is generally a good idea to organize sections with headers and include a table of contents. Generally sections towards the beginning of the document should summarize data and use one or more appendices for more details.","title":"Deliverable"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/","text":"Template: Feature / Story Design Review [DRAFT/WIP] [Feature or Story Design Title] Does the feature re-use or extend existing patterns / interfaces that have already been established for the project? Does the feature expose new patterns or interfaces that will establish a new standard for new future development? Feature/Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.] Overview/Problem Statement It can also be a link to the work item . Describe the feature/story with a high-level summary. Consider additional background and justification, for posterity and historical context. List any assumptions that were made for this design. Goals/In-Scope List the goals that the feature/story will help us achieve that are most relevant for the design review discussion. 
This should include acceptance criteria required to meet definition of done . Non-Goals / Out-of-Scope List the non-goals for the feature/story. This contains work that is beyond the scope of what the feature/component/service is intended for. Proposed Design Briefly describe the high-level architecture for the feature/story. Relevant diagrams (e.g. sequence, component, context, deployment) should be included here. Technology Describe the relevant OS, Web server, presentation layer, persistence layer, caching, eventing/messaging/jobs, etc. \u2013 whatever is applicable to the overall technology solution and how are they going to be used. Describe the usage of any libraries of OSS components. Briefly list the languages(s) and platform(s) that comprise the stack. Non-Functional Requirements What are the primary performance and scalability concerns for this feature/story? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user) Dependencies Does this feature/story need to be sequenced after another feature/story assigned to the same team and why? Is the feature/story dependent on another team completing other work? Will the team need to wait for that work to be completed or could the work proceed in parallel? Risks & Mitigation Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information and secrets treated in a safe and secure manner? Open Questions List any open questions/concerns here. Resources List any additional resources here including links to backlog items, work items or other documents.","title":"Template: Feature / Story Design Review"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#template-feature-story-design-review","text":"","title":"Template: Feature / Story Design Review"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#draftwip-feature-or-story-design-title","text":"Does the feature re-use or extend existing patterns / interfaces that have already been established for the project? Does the feature expose new patterns or interfaces that will establish a new standard for new future development? Feature/Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.]","title":"[DRAFT/WIP] [Feature or Story Design Title]"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#overviewproblem-statement","text":"It can also be a link to the work item . Describe the feature/story with a high-level summary. Consider additional background and justification, for posterity and historical context. List any assumptions that were made for this design.","title":"Overview/Problem Statement"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#goalsin-scope","text":"List the goals that the feature/story will help us achieve that are most relevant for the design review discussion. 
This should include acceptance criteria required to meet definition of done .","title":"Goals/In-Scope"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#non-goals-out-of-scope","text":"List the non-goals for the feature/story. This contains work that is beyond the scope of what the feature/component/service is intended for.","title":"Non-Goals / Out-of-Scope"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#proposed-design","text":"Briefly describe the high-level architecture for the feature/story. Relevant diagrams (e.g. sequence, component, context, deployment) should be included here.","title":"Proposed Design"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#technology","text":"Describe the relevant OS, Web server, presentation layer, persistence layer, caching, eventing/messaging/jobs, etc. \u2013 whatever is applicable to the overall technology solution and how are they going to be used. Describe the usage of any libraries of OSS components. Briefly list the languages(s) and platform(s) that comprise the stack.","title":"Technology"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#non-functional-requirements","text":"What are the primary performance and scalability concerns for this feature/story? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user)","title":"Non-Functional Requirements"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#dependencies","text":"Does this feature/story need to be sequenced after another feature/story assigned to the same team and why? Is the feature/story dependent on another team completing other work? Will the team need to wait for that work to be completed or could the work proceed in parallel?","title":"Dependencies"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#risks-mitigation","text":"Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information and secrets treated in a safe and secure manner?","title":"Risks & Mitigation"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#open-questions","text":"List any open questions/concerns here.","title":"Open Questions"},{"location":"design/design-reviews/recipes/templates/feature-story-design-review/#resources","text":"List any additional resources here including links to backlog items, work items or other documents.","title":"Resources"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/","text":"Template: Milestone / Epic Design Review [DRAFT/WIP] [Milestone/Epic Design Title] Please refer to the milestone/epic design review recipe for things to keep in mind when using this template. Milestone / Epic: Name Project / Engagement: [Project Engagement] Authors: [Author1, Author2, etc.] Overview / Problem Statement Describe the milestone/epic with a high-level summary and a problem statement. Consider including or linking to any additional background (e.g. 
Game Plan or Checkpoint docs) if it is useful for historical context. Goals / In-Scope List a few bullet points of goals that this milestone/epic will achieve and that are most relevant for the design review discussion. You may include acceptable criteria required to meet the Definition of Done . Non-goals / Out-of-Scope List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this milestone/epic. Proposed Design / Suggested Approach To optimize the time investment, this should be brief since it is likely that details will change as the epic/milestone is further decomposed into features and stories. The goal being to convey the vision and complexity in something that can be understood in a few minutes and can help guide a discussion (either asynchronously via comments or in a meeting). A paragraph to describe the proposed design / suggested approach for this milestone/epic. A diagram (e.g. architecture, sequence, component, deployment, etc.) or pseudo-code snippet to make it easier to talk through the approach. List a few of the alternative approaches that were considered and include the brief key Pros and Cons used to help rationalize the decision. For example: Pros Cons Simple to implement Creates secondary identity system Repeatable pattern/code artifact Deployment requires admin credentials Technology Briefly list the languages(s) and platform(s) that comprise the stack. This may include anything that is needed to understand the overall solution: OS, web server, presentation layer, persistence layer, caching, eventing, etc. Non-Functional Requirements What are the primary performance and scalability concerns for this milestone/epic? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user) Operationalization Are there any specific considerations for the CI/CD setup of milestone/epic? Is there a process (manual or automated) to promote builds from lower environments to higher ones? Does this milestone/epic require zero-downtime deployments, and if so, how are they achieved? Are there mechanisms in place to rollback a deployment? What is the process for monitoring the functionality provided by this milestone/epic? Dependencies Does this milestone/epic need to be sequenced after another epic assigned to the same team and why? Is the milestone/epic dependent on another team completing other work? Will the team need to wait for that work to be completed or could the work proceed in parallel? Risks & Mitigations Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information and secrets treated in a safe and secure manner? Open Questions Include any open questions and concerns. 
Resources Include any additional resources including links to work items or other documents.","title":"Template: Milestone / Epic Design Review"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#template-milestone-epic-design-review","text":"","title":"Template: Milestone / Epic Design Review"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#draftwip-milestoneepic-design-title","text":"Please refer to the milestone/epic design review recipe for things to keep in mind when using this template. Milestone / Epic: Name Project / Engagement: [Project Engagement] Authors: [Author1, Author2, etc.]","title":"[DRAFT/WIP] [Milestone/Epic Design Title]"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#overview-problem-statement","text":"Describe the milestone/epic with a high-level summary and a problem statement. Consider including or linking to any additional background (e.g. Game Plan or Checkpoint docs) if it is useful for historical context.","title":"Overview / Problem Statement"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#goals-in-scope","text":"List a few bullet points of goals that this milestone/epic will achieve and that are most relevant for the design review discussion. You may include acceptable criteria required to meet the Definition of Done .","title":"Goals / In-Scope"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#non-goals-out-of-scope","text":"List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this milestone/epic.","title":"Non-goals / Out-of-Scope"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#proposed-design-suggested-approach","text":"To optimize the time investment, this should be brief since it is likely that details will change as the epic/milestone is further decomposed into features and stories. The goal being to convey the vision and complexity in something that can be understood in a few minutes and can help guide a discussion (either asynchronously via comments or in a meeting). A paragraph to describe the proposed design / suggested approach for this milestone/epic. A diagram (e.g. architecture, sequence, component, deployment, etc.) or pseudo-code snippet to make it easier to talk through the approach. List a few of the alternative approaches that were considered and include the brief key Pros and Cons used to help rationalize the decision. For example: Pros Cons Simple to implement Creates secondary identity system Repeatable pattern/code artifact Deployment requires admin credentials","title":"Proposed Design / Suggested Approach"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#technology","text":"Briefly list the languages(s) and platform(s) that comprise the stack. This may include anything that is needed to understand the overall solution: OS, web server, presentation layer, persistence layer, caching, eventing, etc.","title":"Technology"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#non-functional-requirements","text":"What are the primary performance and scalability concerns for this milestone/epic? Are there specific latency, availability, and RTO/RPO objectives that must be met? Are there specific bottlenecks or potential problem areas? For example, are operations CPU or I/O (network, disk) bound? 
How large are the data sets and how fast do they grow? What is the expected usage pattern of the service? For example, will there be peaks and valleys of intense concurrent usage? Are there specific cost constraints? (e.g. $ per transaction/device/user)","title":"Non-Functional Requirements"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#operationalization","text":"Are there any specific considerations for the CI/CD setup of milestone/epic? Is there a process (manual or automated) to promote builds from lower environments to higher ones? Does this milestone/epic require zero-downtime deployments, and if so, how are they achieved? Are there mechanisms in place to rollback a deployment? What is the process for monitoring the functionality provided by this milestone/epic?","title":"Operationalization"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#dependencies","text":"Does this milestone/epic need to be sequenced after another epic assigned to the same team and why? Is the milestone/epic dependent on another team completing other work? Will the team need to wait for that work to be completed or could the work proceed in parallel?","title":"Dependencies"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#risks-mitigations","text":"Does the team need assistance from subject-matter experts? What security and privacy concerns does this milestone/epic have? Is all sensitive information and secrets treated in a safe and secure manner?","title":"Risks & Mitigations"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#open-questions","text":"Include any open questions and concerns.","title":"Open Questions"},{"location":"design/design-reviews/recipes/templates/milestone-epic-design-review/#resources","text":"Include any additional resources including links to work items or other documents.","title":"Resources"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/","text":"Template: Task Design Review [DRAFT/WIP] [Task Design Title] When developing a design document for a new task, it should contain a detailed design proposal demonstrating how it will solve the goals outlined below. Not all tasks require a design review, but when they do it is likely that there many unknowns, or the solution may be more complex. The design should include diagrams, pseudocode, interface contracts as needed to provide a detailed understanding of the proposal. Task Name Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.] Overview/Problem Statement It can also be a link to the work item . Describe the task with a high-level summary. Consider additional background and justification, for posterity and historical context. Goals/In-Scope List a few bullet points of what this task will achieve and that are most relevant for the design review discussion. This should include acceptance criteria required to meet the definition of done . Non-goals / Out-of-Scope List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this task. Proposed Options Describe the detailed design to accomplish the proposed task. What patterns & practices will be used and why were they chosen. Were any alternate proposals considered? What new components are required to be developed? Are there any existing components that require updates? Relevant diagrams (e.g. 
sequence, component, context, deployment) should be included here. Technology Choices Describe any libraries and OSS components that will be used to complete the task. Briefly list the languages(s) and platform(s) that comprise the stack. Open Questions List any open questions/concerns here. Resources List any additional resources here including links to backlog items, work items or other documents.","title":"Template: Task Design Review"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#template-task-design-review","text":"","title":"Template: Task Design Review"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#draftwip-task-design-title","text":"When developing a design document for a new task, it should contain a detailed design proposal demonstrating how it will solve the goals outlined below. Not all tasks require a design review, but when they do it is likely that there many unknowns, or the solution may be more complex. The design should include diagrams, pseudocode, interface contracts as needed to provide a detailed understanding of the proposal. Task Name Story Name Engagement: [Engagement] Customer: [Customer] Authors: [Author1, Author2, etc.]","title":"[DRAFT/WIP] [Task Design Title]"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#overviewproblem-statement","text":"It can also be a link to the work item . Describe the task with a high-level summary. Consider additional background and justification, for posterity and historical context.","title":"Overview/Problem Statement"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#goalsin-scope","text":"List a few bullet points of what this task will achieve and that are most relevant for the design review discussion. This should include acceptance criteria required to meet the definition of done .","title":"Goals/In-Scope"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#non-goals-out-of-scope","text":"List a few bullet points of non-goals to clarify the work that is beyond the scope of the design review for this task.","title":"Non-goals / Out-of-Scope"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#proposed-options","text":"Describe the detailed design to accomplish the proposed task. What patterns & practices will be used and why were they chosen. Were any alternate proposals considered? What new components are required to be developed? Are there any existing components that require updates? Relevant diagrams (e.g. sequence, component, context, deployment) should be included here.","title":"Proposed Options"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#technology-choices","text":"Describe any libraries and OSS components that will be used to complete the task. 
Briefly list the languages(s) and platform(s) that comprise the stack.","title":"Technology Choices"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#open-questions","text":"List any open questions/concerns here.","title":"Open Questions"},{"location":"design/design-reviews/recipes/templates/template-task-design-review/#resources","text":"List any additional resources here including links to backlog items, work items or other documents.","title":"Resources"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/","text":"Template: Technical Spike Spike: [Spike Name] Conducted by: {Names and at least one email address for follow-up questions} Backlog Work Item: {Link to the work item to provide more context} Sprint : {Which sprint did the study take place. Include sprint start date} Goal Describe what question(s) the spike intends to answer and why. Method Describe how the team will uncover the answer to the question(s) the spike intends to answer. For example: Build prototype to test. Research existing documents and samples. Discuss with subject matter experts. Evidence Document the evidence collected that informed the conclusions below. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provided the desired capabilities Conclusions What was the answer to the question(s) outlined at the start of the spike? Capture what was learned that will inform future work. Next Steps What work is expected as an outcome of the learning within this spike. Was there work that was blocked or dependent on the learning within this spike?","title":"Template: Technical Spike"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#template-technical-spike","text":"","title":"Template: Technical Spike"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#spike-spike-name","text":"Conducted by: {Names and at least one email address for follow-up questions} Backlog Work Item: {Link to the work item to provide more context} Sprint : {Which sprint did the study take place. Include sprint start date}","title":"Spike: [Spike Name]"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#goal","text":"Describe what question(s) the spike intends to answer and why.","title":"Goal"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#method","text":"Describe how the team will uncover the answer to the question(s) the spike intends to answer. For example: Build prototype to test. Research existing documents and samples. Discuss with subject matter experts.","title":"Method"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#evidence","text":"Document the evidence collected that informed the conclusions below. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provided the desired capabilities","title":"Evidence"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#conclusions","text":"What was the answer to the question(s) outlined at the start of the spike? 
Capture what was learned that will inform future work.","title":"Conclusions"},{"location":"design/design-reviews/recipes/templates/template-technical-spike/#next-steps","text":"What work is expected as an outcome of the learning within this spike. Was there work that was blocked or dependent on the learning within this spike?","title":"Next Steps"},{"location":"design/design-reviews/trade-studies/","text":"Trade Studies Trade studies are a tool for selecting the best option out of several possible options for a given problem (for example: compute, storage). They evaluate potential choices against a set of objective criteria/requirements to clearly lay out the benefits and limitations of each solution. Trade studies are a concept from systems engineering that we adapted for software projects. Trade studies have proved to be a critical tool to drive alignment with the stakeholders, earn credibility while doing so and ensure our decisions were backed by data and not bias. When to Use Trade studies go hand in hand with high level architecture design. This usually occurs as project requirements are solidifying, before coding begins. Trade studies continue to be useful throughout the project any time there are multiple options that need to be selected from. New decision point could occur from changing requirements, getting results of a research spike, or identifying challenges that were not originally seen. Trade studies should be avoided if there is a clear solution choice. Because they require each solution to be fully thought out, they have the potential to take a lot of time to complete. When there is a clear design, the trade study should be omitted, and an entry should be made in the Decision Log documenting the decision. Why Trade Studies Trade studies are a way of formalizing the design process and leaving a documentation record for why the decision was made. This gives a few advantages: The trade study template guides a user through the design process. This provides structure to the design stage. Having a uniform design process aids splitting work amongst team members. We have had success with engineers pairing to define requirements, evaluation criteria, and brainstorming possible solutions. Then they can each split to review solutions in parallel, before rejoining to make the final decision. The completed trade study document helps drive alignment across the team and decision makers. For presenting results of the study, the document itself can be used to highlight the main points. Alternatively, we have extracted requirements, diagrams for each solution, and the results table into a slide deck to give high level overviews of the results. The completed trade study gets checked into the code repository, providing documentation of the decision process. This leaves a history of the requirements at the time that lead to each decision. Also, the results table gives a quick reference for how the decision would be impacted if requirements change as the project proceeds. Flow of a Trade Study Trade studies can vary widely in scope; however, they follow the common pattern below: Solidify the requirements \u2013 Work with the stakeholders to agree on the requirements for the functionality that you are trying to build. Create evaluation criteria \u2013 This is a set of qualitative and quantitative assessment points that represent the requirements. Taken together, they become an easy to measure stand-in for the potentially abstract requirements. 
Brainstorm solutions \u2013 Gather a list of possible solutions to the problem. Then, use your best judgement to pick the 2-4 solutions that seem most promising. For assistance narrowing solutions, remember to reach out to subject-matter experts and other teams who may have gone through a similar decision. Evaluate shortlisted solutions \u2013 Dive deep into each solution and measure it against the evaluation criteria. In this stage, time box your research to avoid overly investing in any given area. Compare results and choose solution - Align the decision with the team. If you are unable to decide, then a clear list of action items and owners to drive the final decision must be produced. Template See template.md for an example of how to structure the above information. This template was created to guide a user through conducting a trade study. Once the decision has been made we recommend adding an entry to the Decision Log that has references back to the full text of the trade study.","title":"Trade Studies"},{"location":"design/design-reviews/trade-studies/#trade-studies","text":"Trade studies are a tool for selecting the best option out of several possible options for a given problem (for example: compute, storage). They evaluate potential choices against a set of objective criteria/requirements to clearly lay out the benefits and limitations of each solution. Trade studies are a concept from systems engineering that we adapted for software projects. Trade studies have proved to be a critical tool to drive alignment with the stakeholders, earn credibility while doing so and ensure our decisions were backed by data and not bias.","title":"Trade Studies"},{"location":"design/design-reviews/trade-studies/#when-to-use","text":"Trade studies go hand in hand with high level architecture design. This usually occurs as project requirements are solidifying, before coding begins. Trade studies continue to be useful throughout the project any time there are multiple options that need to be selected from. New decision point could occur from changing requirements, getting results of a research spike, or identifying challenges that were not originally seen. Trade studies should be avoided if there is a clear solution choice. Because they require each solution to be fully thought out, they have the potential to take a lot of time to complete. When there is a clear design, the trade study should be omitted, and an entry should be made in the Decision Log documenting the decision.","title":"When to Use"},{"location":"design/design-reviews/trade-studies/#why-trade-studies","text":"Trade studies are a way of formalizing the design process and leaving a documentation record for why the decision was made. This gives a few advantages: The trade study template guides a user through the design process. This provides structure to the design stage. Having a uniform design process aids splitting work amongst team members. We have had success with engineers pairing to define requirements, evaluation criteria, and brainstorming possible solutions. Then they can each split to review solutions in parallel, before rejoining to make the final decision. The completed trade study document helps drive alignment across the team and decision makers. For presenting results of the study, the document itself can be used to highlight the main points. Alternatively, we have extracted requirements, diagrams for each solution, and the results table into a slide deck to give high level overviews of the results. 
The completed trade study gets checked into the code repository, providing documentation of the decision process. This leaves a history of the requirements at the time that lead to each decision. Also, the results table gives a quick reference for how the decision would be impacted if requirements change as the project proceeds.","title":"Why Trade Studies"},{"location":"design/design-reviews/trade-studies/#flow-of-a-trade-study","text":"Trade studies can vary widely in scope; however, they follow the common pattern below: Solidify the requirements \u2013 Work with the stakeholders to agree on the requirements for the functionality that you are trying to build. Create evaluation criteria \u2013 This is a set of qualitative and quantitative assessment points that represent the requirements. Taken together, they become an easy to measure stand-in for the potentially abstract requirements. Brainstorm solutions \u2013 Gather a list of possible solutions to the problem. Then, use your best judgement to pick the 2-4 solutions that seem most promising. For assistance narrowing solutions, remember to reach out to subject-matter experts and other teams who may have gone through a similar decision. Evaluate shortlisted solutions \u2013 Dive deep into each solution and measure it against the evaluation criteria. In this stage, time box your research to avoid overly investing in any given area. Compare results and choose solution - Align the decision with the team. If you are unable to decide, then a clear list of action items and owners to drive the final decision must be produced.","title":"Flow of a Trade Study"},{"location":"design/design-reviews/trade-studies/#template","text":"See template.md for an example of how to structure the above information. This template was created to guide a user through conducting a trade study. Once the decision has been made we recommend adding an entry to the Decision Log that has references back to the full text of the trade study.","title":"Template"},{"location":"design/design-reviews/trade-studies/template/","text":"Trade Study Template This generic template can be used for any situation where we have a set of requirements that can be satisfied by multiple solutions. They can range in scope from choice of which open source package to use, to full architecture designs. Trade Study/Design: [Trade Study Name] Conducted by: {Names of those that can answer follow-up questions and at least one email address} Backlog Work Item: {Link to the work item to provide more context} Sprint: {Which sprint did the study take place? Include sprint start date} Decision: {Solution chosen to proceed with} Decision Makers: IMPORTANT Designs should be completed within a sprint. Most designs will benefit from brevity. To accomplish this: Narrow the scope of the design. Narrow evaluation to 2 to 3 solutions. Design experiments to collect evidence as fast as possible. Overview Description of the problem we are solving. This should include: Assumptions about the rest of the system Constraints that apply to the system, both business and technical Requirements for the functionality that needs to be implemented, including possible inputs and outputs (optional) A diagram showing the different pieces Desired Outcomes The following section should establish the desired capabilities of the solution for it to be successful. This can be done by answering the following questions either directly or via link to related artifact (i.e. PBI or Feature description). 
Acceptance: What capabilities should be demonstrable for a stakeholder to accept the solution? Justification: How does this contribute to the broader project objectives? IMPORTANT This is not intended to define outcomes for the design activity itself. It is intended to define the outcomes for the solution being designed. As mentioned in the User Interface section, if the trade study is analyzing an application development solution, make use of the persona stories to derive desired outcomes. For example, if a persona story exemplifies a certain accessibility requirement, the parallel desired outcome may be \"The application must be accessible for people with vision-based disabilities\". Evaluation Criteria The former should be condensed down to a set of \"evaluation criteria\" that we can rate any potential solutions against. Examples of evaluation criteria: Runs on Windows and Linux - Binary response Compute Usage - Could be categories that effectively rank different options: High, Medium, Low Cost of the solution \u2013 An estimated numeric field The results section contains a table evaluating each solution against the evaluation criteria. Key Metrics (Optional) If available, describe any measurable metrics that are important to the success of the solution. Examples include, but are not limited to: Performance & Scale targets such as, Requests/Second, Latency, and Response time (at a given percentile). Azure consumption cost budget. For example, given certain usage, solution expected to cost X dollars per month. Availability uptime of XX% over X time period. Consistency. Writes available for read within X milliseconds. Recovery point objective (RPO) & Recovery time objective (RTO). Constraints (Optional) If applicable, describe the boundaries from which we have to design the solution. This could be thought of as the \"box\" the team has to work within. This box may be defined as: Technologies, services, and languages an organization is comfortable operating/managing. Devices, operating systems, and/or browsers that must be supported. Backward Compatibility. For example, public interfaces consumed by client or third party apps cannot introduce breaking changes. Integrations or dependencies with other systems. For example, push notifications to client apps must be done via existing websockets channel. Accessibility Accessibility is never optional . Microsoft has made a public commitment to always produce accessible applications. For more information visit the official Microsoft accessibility site and read the Accessibility page. Consider the following prompts when determining application accessibility requirements: Does the application meet industry accessibility standards? Are training, support, and documentation resources accessible? Is the application designed to be inclusive for people will a broad range of abilities, languages, and cultures? Solution Hypotheses Enumerate the solutions that are believed to deliver the outcomes defined above. Note: Limiting the evaluated solutions to 2 or 3 potential candidates can help manage the time spent on the evaluation. If there are more than 3 candidates, prioritize what the team feels are the top 3. If appropriate, the eliminated candidates can be mentioned to capture why they were eliminated. Additionally, there should be at least two options compared, otherwise you didn't need a trade study. [Solution 1] Add a brief description of the solution and how its expected to produce the desired outcomes. 
If appropriate, illustrations/diagrams can be used to reduce the amount of text explanation required to describe the solution. NOTE: Using present tense language to describe the solution can help avoid confusion between current state and future state. For example, use \"This solution works by doing...\" vs. \"This solution would work by doing...\". Each solution section should contain the following: Description of the solution (optional) A diagram to quickly reference the solution Possible variations - things that are small variations on the main solution can be grouped together Evaluation of the idea based on the evaluation criteria above The depth, detail, and contents of these sections will vary based on the complexity of the functionality being developed. Experiment(s) Describe how the solution will be evaluated to prove or dis-prove that it will produce the desired outcomes. This could take many forms such as building a prototype and researching existing documentation and sample solutions. Additionally, document any assumptions made as part of the experiment. NOTE: Time boxing these experiments can be beneficial to make sure the team is making the best use of the time by focusing on collecting key evidence in the simplest/fastest way possible. Evidence Present the evidence collected during experimentation that supports the hypothesis that this solution will meet the desired outcomes. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provide the desired capabilities NOTE: Evidence is not required for every capability, metric, or constraint for the design to be considered done. Instead, focus on presenting evidence that is most relevant and impactful towards supporting or eliminating the hypothesis. [Solution 2] ... [Solution N] ... Results This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Evaluation Criteria 1 Evaluation Criteria 2 ... Evaluation Criteria N Solution 1 Solution 2 ... Solution M Note: The formatting of the table can change. In the past, we have had success with qualitative descriptions in the table entries and color coding the cells to represent good, fair, bad. Decision The chosen solution, or a list of questions that need to be answered before the decision can be made. In the latter case, each question needs an action item and an assigned person for answering the question. Once those questions are answered, the document must be updated to reflect the answers, and the final decision. In the first case, describe which solution was chosen and why. Summarize what evidence informed the decision and how that evidence mapped to the desired outcomes. Note: Decisions should be made with the understanding that they can change as the team learns more. It's a starting point, not a contract. Next Steps What work is expected once a decision has been reached? Examples include but are not limited to: Creating new PBI's or modifying existing ones Follow up spikes Creating specification for public interfaces and integrations between other work streams. Decision Log Entry","title":"Trade Study Template"},{"location":"design/design-reviews/trade-studies/template/#trade-study-template","text":"This generic template can be used for any situation where we have a set of requirements that can be satisfied by multiple solutions. 
They can range in scope from choice of which open source package to use, to full architecture designs.","title":"Trade Study Template"},{"location":"design/design-reviews/trade-studies/template/#trade-studydesign-trade-study-name","text":"Conducted by: {Names of those that can answer follow-up questions and at least one email address} Backlog Work Item: {Link to the work item to provide more context} Sprint: {Which sprint did the study take place? Include sprint start date} Decision: {Solution chosen to proceed with} Decision Makers: IMPORTANT Designs should be completed within a sprint. Most designs will benefit from brevity. To accomplish this: Narrow the scope of the design. Narrow evaluation to 2 to 3 solutions. Design experiments to collect evidence as fast as possible.","title":"Trade Study/Design: [Trade Study Name]"},{"location":"design/design-reviews/trade-studies/template/#overview","text":"Description of the problem we are solving. This should include: Assumptions about the rest of the system Constraints that apply to the system, both business and technical Requirements for the functionality that needs to be implemented, including possible inputs and outputs (optional) A diagram showing the different pieces","title":"Overview"},{"location":"design/design-reviews/trade-studies/template/#desired-outcomes","text":"The following section should establish the desired capabilities of the solution for it to be successful. This can be done by answering the following questions either directly or via link to related artifact (i.e. PBI or Feature description). Acceptance: What capabilities should be demonstrable for a stakeholder to accept the solution? Justification: How does this contribute to the broader project objectives? IMPORTANT This is not intended to define outcomes for the design activity itself. It is intended to define the outcomes for the solution being designed. As mentioned in the User Interface section, if the trade study is analyzing an application development solution, make use of the persona stories to derive desired outcomes. For example, if a persona story exemplifies a certain accessibility requirement, the parallel desired outcome may be \"The application must be accessible for people with vision-based disabilities\".","title":"Desired Outcomes"},{"location":"design/design-reviews/trade-studies/template/#evaluation-criteria","text":"The former should be condensed down to a set of \"evaluation criteria\" that we can rate any potential solutions against. Examples of evaluation criteria: Runs on Windows and Linux - Binary response Compute Usage - Could be categories that effectively rank different options: High, Medium, Low Cost of the solution \u2013 An estimated numeric field The results section contains a table evaluating each solution against the evaluation criteria.","title":"Evaluation Criteria"},{"location":"design/design-reviews/trade-studies/template/#key-metrics-optional","text":"If available, describe any measurable metrics that are important to the success of the solution. Examples include, but are not limited to: Performance & Scale targets such as, Requests/Second, Latency, and Response time (at a given percentile). Azure consumption cost budget. For example, given certain usage, solution expected to cost X dollars per month. Availability uptime of XX% over X time period. Consistency. Writes available for read within X milliseconds. 
Recovery point objective (RPO) & Recovery time objective (RTO).","title":"Key Metrics (Optional)"},{"location":"design/design-reviews/trade-studies/template/#constraints-optional","text":"If applicable, describe the boundaries from which we have to design the solution. This could be thought of as the \"box\" the team has to work within. This box may be defined as: Technologies, services, and languages an organization is comfortable operating/managing. Devices, operating systems, and/or browsers that must be supported. Backward Compatibility. For example, public interfaces consumed by client or third party apps cannot introduce breaking changes. Integrations or dependencies with other systems. For example, push notifications to client apps must be done via existing websockets channel.","title":"Constraints (Optional)"},{"location":"design/design-reviews/trade-studies/template/#accessibility","text":"Accessibility is never optional . Microsoft has made a public commitment to always produce accessible applications. For more information visit the official Microsoft accessibility site and read the Accessibility page. Consider the following prompts when determining application accessibility requirements: Does the application meet industry accessibility standards? Are training, support, and documentation resources accessible? Is the application designed to be inclusive for people with a broad range of abilities, languages, and cultures?","title":"Accessibility"},{"location":"design/design-reviews/trade-studies/template/#solution-hypotheses","text":"Enumerate the solutions that are believed to deliver the outcomes defined above. Note: Limiting the evaluated solutions to 2 or 3 potential candidates can help manage the time spent on the evaluation. If there are more than 3 candidates, prioritize what the team feels are the top 3. If appropriate, the eliminated candidates can be mentioned to capture why they were eliminated. Additionally, there should be at least two options compared, otherwise you didn't need a trade study.","title":"Solution Hypotheses"},{"location":"design/design-reviews/trade-studies/template/#solution-1","text":"Add a brief description of the solution and how it's expected to produce the desired outcomes. If appropriate, illustrations/diagrams can be used to reduce the amount of text explanation required to describe the solution. NOTE: Using present tense language to describe the solution can help avoid confusion between current state and future state. For example, use \"This solution works by doing...\" vs. \"This solution would work by doing...\". Each solution section should contain the following: Description of the solution (optional) A diagram to quickly reference the solution Possible variations - things that are small variations on the main solution can be grouped together Evaluation of the idea based on the evaluation criteria above The depth, detail, and contents of these sections will vary based on the complexity of the functionality being developed.","title":"[Solution 1]"},{"location":"design/design-reviews/trade-studies/template/#experiments","text":"Describe how the solution will be evaluated to prove or disprove that it will produce the desired outcomes. This could take many forms such as building a prototype and researching existing documentation and sample solutions. Additionally, document any assumptions made as part of the experiment.
NOTE: Time boxing these experiments can be beneficial to make sure the team is making the best use of the time by focusing on collecting key evidence in the simplest/fastest way possible.","title":"Experiment(s)"},{"location":"design/design-reviews/trade-studies/template/#evidence","text":"Present the evidence collected during experimentation that supports the hypothesis that this solution will meet the desired outcomes. Examples may include: Recorded or live demos of a prototype providing the desired capabilities Metrics collected while testing the prototype Documentation that indicates the solution can provide the desired capabilities NOTE: Evidence is not required for every capability, metric, or constraint for the design to be considered done. Instead, focus on presenting evidence that is most relevant and impactful towards supporting or eliminating the hypothesis.","title":"Evidence"},{"location":"design/design-reviews/trade-studies/template/#solution-2","text":"...","title":"[Solution 2]"},{"location":"design/design-reviews/trade-studies/template/#solution-n","text":"...","title":"[Solution N]"},{"location":"design/design-reviews/trade-studies/template/#results","text":"This section should contain a table that has each solution rated against each of the evaluation criteria: Solution Evaluation Criteria 1 Evaluation Criteria 2 ... Evaluation Criteria N Solution 1 Solution 2 ... Solution M Note: The formatting of the table can change. In the past, we have had success with qualitative descriptions in the table entries and color coding the cells to represent good, fair, bad.","title":"Results"},{"location":"design/design-reviews/trade-studies/template/#decision","text":"The chosen solution, or a list of questions that need to be answered before the decision can be made. In the latter case, each question needs an action item and an assigned person for answering the question. Once those questions are answered, the document must be updated to reflect the answers, and the final decision. In the first case, describe which solution was chosen and why. Summarize what evidence informed the decision and how that evidence mapped to the desired outcomes. Note: Decisions should be made with the understanding that they can change as the team learns more. It's a starting point, not a contract.","title":"Decision"},{"location":"design/design-reviews/trade-studies/template/#next-steps","text":"What work is expected once a decision has been reached? Examples include but are not limited to: Creating new PBI's or modifying existing ones Follow up spikes Creating specification for public interfaces and integrations between other work streams. Decision Log Entry","title":"Next Steps"},{"location":"design/diagram-types/","text":"Diagram Types Creating and maintaining diagrams is a challenge for any team. Common reasons across these challenges include: Not leveraging tools to assist in generating diagrams Uncertainty on what to include in a diagram and when to create one Overcoming these challenges and effectively using design diagrams can amplify a team's ability to execute throughout the entire Software Development Lifecycle, from the design phase when proposing various designs to leveraging it as documentation as part of the maintenance phase. This section will share sample tools for diagram generation, provide a high level overview of the different types of diagrams and provide examples of some of these types. 
There are two primary classes of diagrams: Structural Behavior Within each of these classes, there are many types of diagrams, each intended to convey specific types of information. When different types of diagrams are effectively used in a solution, system, or repository, one can deliver a cohesive and incrementally detailed design. Sample Design Diagrams This section contains educational material and examples for the following design diagrams: Class Diagrams - Useful to document the structural design of a codebase's relationship between classes, and their corresponding methods Component Diagrams - Useful to document a high level structural overview of all the components and their direct \"touch points\" with other Components Sequence Diagrams - Useful to document a behavior overview of the system, capturing the various \"use cases\" or \"actions\" that triggers the system to perform some business logic Deployment Diagram - Useful in order to document the networking and hosting environments where the system will operate in Supplemental Resources Each of the above types of diagrams will provide specific resources related to its type. Below are the generic resources: Visual Paradigm UML Structural vs Behavior Diagrams PlantUML - requires a generator from code to PlantUML syntax to generate diagrams C# to PlantUML Drawing manually","title":"Diagram Types"},{"location":"design/diagram-types/#diagram-types","text":"Creating and maintaining diagrams is a challenge for any team. Common reasons across these challenges include: Not leveraging tools to assist in generating diagrams Uncertainty on what to include in a diagram and when to create one Overcoming these challenges and effectively using design diagrams can amplify a team's ability to execute throughout the entire Software Development Lifecycle, from the design phase when proposing various designs to leveraging it as documentation as part of the maintenance phase. This section will share sample tools for diagram generation, provide a high level overview of the different types of diagrams and provide examples of some of these types. There are two primary classes of diagrams: Structural Behavior Within each of these classes, there are many types of diagrams, each intended to convey specific types of information. When different types of diagrams are effectively used in a solution, system, or repository, one can deliver a cohesive and incrementally detailed design.","title":"Diagram Types"},{"location":"design/diagram-types/#sample-design-diagrams","text":"This section contains educational material and examples for the following design diagrams: Class Diagrams - Useful to document the structural design of a codebase's relationship between classes, and their corresponding methods Component Diagrams - Useful to document a high level structural overview of all the components and their direct \"touch points\" with other Components Sequence Diagrams - Useful to document a behavior overview of the system, capturing the various \"use cases\" or \"actions\" that triggers the system to perform some business logic Deployment Diagram - Useful in order to document the networking and hosting environments where the system will operate in","title":"Sample Design Diagrams"},{"location":"design/diagram-types/#supplemental-resources","text":"Each of the above types of diagrams will provide specific resources related to its type. 
Below are the generic resources: Visual Paradigm UML Structural vs Behavior Diagrams PlantUML - requires a generator from code to PlantUML syntax to generate diagrams C# to PlantUML Drawing manually","title":"Supplemental Resources"},{"location":"design/diagram-types/class-diagrams/","text":"Class Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Class Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to automate as much as possible when generating Class Diagrams through VSCode. Wikipedia defines UML Class Diagrams as: a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among objects. The key terms to make a note of here are: static structure showing the system's classes, attributes, operations, and relationships Class Diagrams are a type of static structure because they focus on the properties and relationships of classes. They are not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. Essential Takeaways Each \"Component\" (Stand alone piece of software - think datastores, microservices, serverless functions, user interfaces, etc...) of a Product or System will have its own Class Diagram. Class Diagrams should tell a \"story\", where each Diagram will require Engineers to really think about: The responsibility / operations of each class. What can (should) the class perform? The class' attributes and properties. What can be set by an implementor of this class? What are all (if any) universally static properties? The visibility or accessibility that a class' operation may have to other classes The relationship between each class or the various instances When to Create? Because Class Diagrams represent one of the more granular depictions of what a \"product\" or \"system\" is composed of, it is recommended to begin the creation of these diagrams at the beginning and throughout the engineering portions of an engagement. This does mean that any code change (new feature, enhancement, code refactor) might involve updating one or many Class Diagrams. Although this might seem like a downside of Class Diagrams, it actually can become a very strong benefit. Because Class Diagrams tell a \"story\" for each Component of a product (see the previous section), they require a substantial amount of upfront thought and design considerations. This amount of upfront thought ultimately results in making more effective code changes, and may even minimize the level of refactors in future stages of the engagement. Class Diagrams also provide quick \"alert indicators\" when a refactor might be necessary. Reasons could include a particular class doing too much, having too many dependencies, or the codebase producing a very \"messy\" or \"chaotic\" Class Diagram. If the Class Diagram is unreadable, the code will probably be unreadable. Examples One can find many examples online such as at UML Diagrams . Below are some basic examples: Versioning Because Class Diagrams will be changing rapidly, essentially anytime a class is changed in the code, and because it might be very large in size, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds.
The below approach can be used to assist the team on how often to update the published version of the diagram: Wait until the engagement progresses (maybe 10-20% completion) before publishing a Class Diagram. It is not worth publishing a Class Diagram from the beginning as it will be changing daily Once the most crucial classes are developed, update the published diagram periodically. Ideally whenever a large refactor or net new class is introduced. If the team uses an IDE plugin to automatically generate the diagram from their development environment, this becomes more of a documentation task rather than a necessity As the engagement approaches its end (90-100% completion), update the published diagram whenever a change to an existing class as part of a feature or story acceptance criteria Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches. Resources Wikipedia Visual Paradigm VS Code Plugins: C#, Visual Basic, C++ using Class Designer Component TypeScript classdiagram-ts PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax C# to PlantUML Drawing manually","title":"Class Diagrams"},{"location":"design/diagram-types/class-diagrams/#class-diagrams","text":"","title":"Class Diagrams"},{"location":"design/diagram-types/class-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Class Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to automate as much as possible when generating Class Diagrams through VSCode. Wikipedia defines UML Class Diagrams as: a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among objects. The key terms to make a note of here are: static structure showing the system's classes, attributes, operations, and relationships Class Diagrams are a type of a static structure because it focuses on the properties, and relationships of classes. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics.","title":"Purpose"},{"location":"design/diagram-types/class-diagrams/#essential-takeaways","text":"Each \"Component\" (Stand alone piece of software - think datastores, microservices, serverless functions, user interfaces, etc...) of a Product or System will have it's own Class Diagram. Class Diagrams should tell a \"story\", where each Diagram will require Engineers to really think about: The responsibility / operations of each class. What can (should) the class perform? The class' attributes and properties. What can be set by an implementor of this class? What are all (if any) universally static properties? 
The visibility or accessibility that a class' operation may have to other classes The relationship between each class or the various instances","title":"Essential Takeaways"},{"location":"design/diagram-types/class-diagrams/#when-to-create","text":"Because Class Diagrams represent one of the more granular depiction of what a \"product\" or \"system\" is composed of, it is recommended to begin the creation of these diagrams at the beginning and throughout the engineering portions of an engagement. This does mean that any code change (new feature, enhancement, code refactor) might involve updating one or many Class Diagrams. Although this might seem like a downside of Class Diagrams, it actually can become a very strong benefit. Because Class Diagrams tell a \"story\" for each Component of a product (see the previous section), it requires a substantial amount of upfront thought and design considerations. This amount of upfront thought ultimately results in making more effective code changes, and may even minimize the level of refactors in future stages of the engagement. Class Diagrams also provides quick \"alert indicators\" when a refactor might be necessary. Reasons could be due to seeing that a particular class might be doing too much, have too many dependencies, or when the codebase might produce a very \"messy\" or \"chaotic\" Class Diagram. If the Class Diagram is unreadable, the code will probably be unreadable","title":"When to Create?"},{"location":"design/diagram-types/class-diagrams/#examples","text":"One can find many examples online such as at UML Diagrams . Below are some basic examples:","title":"Examples"},{"location":"design/diagram-types/class-diagrams/#versioning","text":"Because Class Diagrams will be changing rapidly, essentially anytime a class is changed in the code, and because it might be very large in size, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: Wait until the engagement progresses (maybe 10-20% completion) before publishing a Class Diagram. It is not worth publishing a Class Diagram from the beginning as it will be changing daily Once the most crucial classes are developed, update the published diagram periodically. Ideally whenever a large refactor or net new class is introduced. If the team uses an IDE plugin to automatically generate the diagram from their development environment, this becomes more of a documentation task rather than a necessity As the engagement approaches its end (90-100% completion), update the published diagram whenever a change to an existing class as part of a feature or story acceptance criteria Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. 
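Since the basic examples referenced above are published as images, a minimal PlantUML sketch is included here as a textual equivalent, using the diagram-as-code option listed in the Resources. The Order, OrderItem, and repository names are purely illustrative assumptions, not taken from any real codebase; the sketch only demonstrates the attributes, operations, visibility, and relationships that the Essential Takeaways call out.

```plantuml
@startuml
' Illustrative classes only - the names are hypothetical
class Order {
  +id : string
  -items : List<OrderItem>
  +addItem(item : OrderItem) : void
  +total() : decimal
}
class OrderItem {
  +sku : string
  +price : decimal
}
interface IOrderRepository {
  +save(order : Order) : void
}
class SqlOrderRepository
' Relationships: composition, interface realization, dependency
Order "1" *-- "many" OrderItem : contains
IOrderRepository <|.. SqlOrderRepository
SqlOrderRepository ..> Order : persists
@enduml
```

A sketch like this can live next to the code and be regenerated with the code-to-PlantUML generators referenced in the Resources section, which keeps the published image close to the actual class structure.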
The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches.","title":"Versioning"},{"location":"design/diagram-types/class-diagrams/#resources","text":"Wikipedia Visual Paradigm VS Code Plugins: C#, Visual Basic, C++ using Class Designer Component TypeScript classdiagram-ts PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax C# to PlantUML Drawing manually","title":"Resources"},{"location":"design/diagram-types/component-diagrams/","text":"Component Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Component Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Component Diagrams through VSCode. Wikipedia defines UML Component Diagrams as: a component diagram depicts how components are wired together to form larger components or software systems. Component Diagrams are a type of a static structure because it focuses on the responsibility and relationships between components as part of the overall system or solution. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. ...Hold on a second... what is a Component? A Component is a runnable solution that performs a set of operations and can possibly be interfaced through a particular API. One can think of Components as a \"stand alone\" piece of software - think datastores, microservices, serverless functions, user interfaces, etc... Essential Takeaways The primary two takeaways from a Component Diagram should be: A quick view of all the various components (User Interface, Service, Data Storage) involved in the system The immediate \"touch points\" that a particular Component has with other Components, including how that \"touch point\" is accomplished (HTTP, FTP, etc...) Depending on the complexity of the system, a team might decide to create several Component Diagrams. Where there is one diagram per Component (depicting all it's immediate \"touch points\" with other Components). Or if a system is simple, the team might decide to create a single Component Diagram capturing all Components in the diagram. When to Create? Because Component Diagrams represent a high level overview of the entire system from a Component focus, it is recommended to begin the creation of this diagram from the beginning of an engagement, and update it as the various Components are identified, developed, and introduced into the system. Otherwise, if this is left till later, then there is risk that: the team won't be able to identify areas of improvement the team or other necessary stakeholders won't have a full understanding on how the system works as it is being developed Because of the inherent granularity of the system, the Component Diagrams won't have to be updated as often as Class Diagrams . 
Things that might merit updating a Component Diagram could be: A deletion or addition of a new Component into the system A change to a system Component's interaction APIs A change to a system Component's immediate \"touch points\" with other Components Because Component Diagrams focuses on informing the various \"touch points\" between Components, it requires some upfront thought in order to determine what Components are needed and what interaction mechanisms are most effective per the system requirements. This amount of upfront thought should be approached in a pragmatic manner - as the design may evolve over time, and that is perfectly fine, as long as changes are influenced based on functional requirements and non-functional requirements. Examples Below are some basic examples: Versioning Because Component Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the published diagram periodically. Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches. Resources Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Component Diagrams"},{"location":"design/diagram-types/component-diagrams/#component-diagrams","text":"","title":"Component Diagrams"},{"location":"design/diagram-types/component-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Component Diagrams as part of your engagement. Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Component Diagrams through VSCode. Wikipedia defines UML Component Diagrams as: a component diagram depicts how components are wired together to form larger components or software systems. Component Diagrams are a type of a static structure because it focuses on the responsibility and relationships between components as part of the overall system or solution. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. ...Hold on a second... what is a Component? A Component is a runnable solution that performs a set of operations and can possibly be interfaced through a particular API. 
One can think of Components as a \"stand alone\" piece of software - think datastores, microservices, serverless functions, user interfaces, etc...","title":"Purpose"},{"location":"design/diagram-types/component-diagrams/#essential-takeaways","text":"The primary two takeaways from a Component Diagram should be: A quick view of all the various components (User Interface, Service, Data Storage) involved in the system The immediate \"touch points\" that a particular Component has with other Components, including how that \"touch point\" is accomplished (HTTP, FTP, etc...) Depending on the complexity of the system, a team might decide to create several Component Diagrams. Where there is one diagram per Component (depicting all it's immediate \"touch points\" with other Components). Or if a system is simple, the team might decide to create a single Component Diagram capturing all Components in the diagram.","title":"Essential Takeaways"},{"location":"design/diagram-types/component-diagrams/#when-to-create","text":"Because Component Diagrams represent a high level overview of the entire system from a Component focus, it is recommended to begin the creation of this diagram from the beginning of an engagement, and update it as the various Components are identified, developed, and introduced into the system. Otherwise, if this is left till later, then there is risk that: the team won't be able to identify areas of improvement the team or other necessary stakeholders won't have a full understanding on how the system works as it is being developed Because of the inherent granularity of the system, the Component Diagrams won't have to be updated as often as Class Diagrams . Things that might merit updating a Component Diagram could be: A deletion or addition of a new Component into the system A change to a system Component's interaction APIs A change to a system Component's immediate \"touch points\" with other Components Because Component Diagrams focuses on informing the various \"touch points\" between Components, it requires some upfront thought in order to determine what Components are needed and what interaction mechanisms are most effective per the system requirements. This amount of upfront thought should be approached in a pragmatic manner - as the design may evolve over time, and that is perfectly fine, as long as changes are influenced based on functional requirements and non-functional requirements.","title":"When to Create?"},{"location":"design/diagram-types/component-diagrams/#examples","text":"Below are some basic examples:","title":"Examples"},{"location":"design/diagram-types/component-diagrams/#versioning","text":"Because Component Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the published diagram periodically. Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. 
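As a textual counterpart to the image examples above, a minimal PlantUML sketch of a Component Diagram is shown below. The component names (Web UI, Order Service, Notification Service, Order Store) are hypothetical; the intent is only to show each Component and its immediate "touch points", including how each touch point is accomplished.

```plantuml
@startuml
' Hypothetical components - only the touch points and protocols matter
package "Order System" {
  [Web UI]
  [Order Service]
  [Notification Service]
  database "Order Store" as DB
}
[Web UI] --> [Order Service] : HTTPS / REST
[Order Service] --> DB : SQL
[Order Service] --> [Notification Service] : AMQP
@enduml
```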
If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches.","title":"Versioning"},{"location":"design/diagram-types/component-diagrams/#resources","text":"Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Resources"},{"location":"design/diagram-types/deployment-diagrams/","text":"Deployment Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Deployment Diagrams as part of your engagement. Wikipedia defines UML Deployment Diagrams as: models the physical deployment of artifacts on nodes Deployment Diagrams are a type of a static structure because it focuses on the infrastructure and hosting where all aspects of the system reside in. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics. Essential Takeaways The Deployment diagram should contain all Components identified in the Component Diagram(s) , but captured alongside the following elements: Firewalls VNETs and subnets Virtual machines Cloud Services Data Stores Servers (Web, proxy) Load Balancers This diagram should inform the audience: where things are hosted / running in what network boundaries are involved in the system When to Create? Because Deployment Diagrams represent the final \"hosting\" architecture, it's recommended to create the \"final envisioned\" diagram from the beginning of an engagement. This allows the team to have a shared idea on what the team is working towards. Keep in mind that this might change if any non-functional requirement was not considered at the start of the engagement. This is okay, but requires creating the necessary Backlog Items and updating the Deployment diagram in order to capture these changes. It's also worthwhile to create and maintain a Deployment Diagram depicting the \"current\" state of the system. At times, it may be beneficial for there to be a Deployment Diagram per each environment (Dev, QA, Staging, Prod, etc...). However, this adds to the amount of maintenance required and should only be performed if there are substantial differences across environments. The \"current\" Deployment diagram should be updated when: A new element has been introduced or removed in the system (see the \"Essential Takeaways\" section for a list of possible elements) Examples Below are some basic examples: Versioning Because Deployment Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the \"actual / current\" diagram (state represented from the \"main\" branch) periodically. Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components. 
Resources Wikipedia Visual Paradigm PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Deployment Diagrams"},{"location":"design/diagram-types/deployment-diagrams/#deployment-diagrams","text":"","title":"Deployment Diagrams"},{"location":"design/diagram-types/deployment-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Deployment Diagrams as part of your engagement. Wikipedia defines UML Deployment Diagrams as: models the physical deployment of artifacts on nodes Deployment Diagrams are a type of a static structure because it focuses on the infrastructure and hosting where all aspects of the system reside in. It is not supposed to inform about the data flow, the caller or callee responsibilities, the request flows, nor any other \"behavior\" related characteristics.","title":"Purpose"},{"location":"design/diagram-types/deployment-diagrams/#essential-takeaways","text":"The Deployment diagram should contain all Components identified in the Component Diagram(s) , but captured alongside the following elements: Firewalls VNETs and subnets Virtual machines Cloud Services Data Stores Servers (Web, proxy) Load Balancers This diagram should inform the audience: where things are hosted / running in what network boundaries are involved in the system","title":"Essential Takeaways"},{"location":"design/diagram-types/deployment-diagrams/#when-to-create","text":"Because Deployment Diagrams represent the final \"hosting\" architecture, it's recommended to create the \"final envisioned\" diagram from the beginning of an engagement. This allows the team to have a shared idea on what the team is working towards. Keep in mind that this might change if any non-functional requirement was not considered at the start of the engagement. This is okay, but requires creating the necessary Backlog Items and updating the Deployment diagram in order to capture these changes. It's also worthwhile to create and maintain a Deployment Diagram depicting the \"current\" state of the system. At times, it may be beneficial for there to be a Deployment Diagram per each environment (Dev, QA, Staging, Prod, etc...). However, this adds to the amount of maintenance required and should only be performed if there are substantial differences across environments. The \"current\" Deployment diagram should be updated when: A new element has been introduced or removed in the system (see the \"Essential Takeaways\" section for a list of possible elements)","title":"When to Create?"},{"location":"design/diagram-types/deployment-diagrams/#examples","text":"Below are some basic examples:","title":"Examples"},{"location":"design/diagram-types/deployment-diagrams/#versioning","text":"Because Deployment Diagrams will be changing periodically, it's recommended to \"publish\" an image of the generated diagram periodically. The frequency might vary as the engagement proceeds. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Component Diagram will provide a common visual to all engineers when working on the different parts of the solution Throughout the engagement, update the \"actual / current\" diagram (state represented from the \"main\" branch) periodically. 
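A minimal PlantUML sketch of a Deployment Diagram is included below as a textual counterpart to the image examples. The region, VNET, subnets, and services are placeholder assumptions; the sketch only demonstrates capturing hosting locations and network boundaries alongside the Components identified in the Component Diagram(s).

```plantuml
@startuml
' Placeholder infrastructure - addresses, subnets, and services are illustrative
actor User
node "Load Balancer" as LB
node "VNET 10.0.0.0/16" {
  node "App Subnet" {
    node "App Service" {
      [Order Service]
    }
  }
  node "Data Subnet" {
    database "SQL Database" as SQL
  }
}
User --> LB : HTTPS (443)
LB --> [Order Service] : HTTPS
[Order Service] --> SQL : TCP 1433
@enduml
```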
Ideally whenever a new Component is introduced into the system, or whenever a new \"touch point\" occurs between Components.","title":"Versioning"},{"location":"design/diagram-types/deployment-diagrams/#resources","text":"Wikipedia Visual Paradigm PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Resources"},{"location":"design/diagram-types/sequence-diagrams/","text":"Sequence Diagrams Purpose This document is intended to provide a baseline understanding for what, why, and how to incorporate Sequence Diagrams as part of an engagement. Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Sequence Diagrams through VSCode. Wikipedia defines UML Sequence Diagrams responsible to: depict the objects involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario What is a scenario ? It can be: an actual user persona performing an action a system specific trigger (time based, condition based) that results in an action to occur What is a message in this context? It can be: a synchronous or asynchronous request a transfer of any form of data between any objects What is an object in this context? It can be: any specific user persona any service any data store a system (black box composed of unknown services, data stores or other components) an abstract sub-scenario (in order to minimize high complexity of a scenario) Essential Takeaways A Sequence Diagram should: start with a scenario indicate which object or \"actor\" initiated that scenario have the scenario clearly indicate what the \"end\" state is, even if it doesn't necessarily end back with the object that initiated the scenario It is okay for a single Sequence Diagram to have many different scenarios if they have some related context that merits them being grouped. Another important thing to keep in mind, is that the objects involved in a Sequence Diagram should refer to existing Components from a Component Diagram . There are 2 areas where complexity can result in an overly \"crowded\" Sequence Diagram, making it costly to maintain. They are: Large number of objects / components involved in a particular scenario Capturing all the possible \"failure\" situations that a scenario may encounter Large Number of Objects A Sequence Diagram typically starts with an end user persona performing an action, and then shows all the various components and request/data transfers that are involved in that scenario. However, more often than not, the complete end-to-end flow for that scenario may be too complex in order to capture within a single Sequence Diagram. When this level of complexity occurs, consider creating separate sub-scenario Sequence Diagrams , and using it as an object in a particular Sequence Diagram. Examples for this are \"Authentication\" or \"Authorization\". Almost all user persona scenarios will have several objects/components involved in either of these sub-scenarios, but it is not necessary to include them in every Sequence Diagram once the sub-scenarios have a stand-alone Sequence Diagram created. Be sure that when using this approach of sub-scenarios to give it a name that encapsulates what the sub-scenarios is performing, and to determine the appropriate \"actor\" and \"action\" that initiates the sub-scenarios. 
The combination and story telling between these end user Sequence Diagrams and the sub-scenarios Sequence Diagrams can greatly improve readability by distributing the level of complexity across multiple diagrams and take advantage of reusability of common sub-scenarios. Handling Large Number of Failure Situations Another factor of high complexity is the possible failure situations that a particular scenario may encounter. Each object / component involved in the scenario could have several different \"failure\" situations, which could result in a very crowded and messy Sequence Diagram. In order to make it realistic to manage all these scenarios, try to: Identify the most common failure situations that an \"actor\" may face as part of a scenario. Capturing these in a sequence diagram and documenting the other scenarios without having to manage them in a diagram will accomplish the goal of awareness \"Bubble up\" and \"abstract\" all the vast number of failure situations that can occur downstream in the system, and depict how the object / component closest to the \"actor\" handles all these failures and informs the \"actor\" of them When to Create? Because Sequence Diagrams represent a detailed overview of the behavior of the system, outlining the various messages/requests sent within the system, it is recommended to begin the creation of these diagrams from the beginning of an engagement. While updating it as the various communications between Components are introduced into the system. The risks of not creating Sequence Diagrams early on are that: the team will not create any because of it being perceived more as a \"chore\" instead of adding value the team will be unable to gain insights in time, from visualizing the various messages and requests sent between Components, in order to perform any potential refactoring the team or other necessary stakeholders won't have a complete understanding of the request/message/data flow within the system Because of the inherent granularity of the system, the Sequence Diagrams won't have to be updated as often as Class Diagrams , but may require more maintenance than Component Diagrams . Things that might merit updating a Sequence Diagram could be: A new request/message/data being sent across Components involved in a scenario A change to one or several Components involved in a Sequence Diagram. 
Such as splitting a component into multiple ones, or consolidating many Components into a single one The introduction of a new Use Case or scenario that the system now supports Examples Place Order Scenario: A \"Member\" user persona places an order, which can be composed of many \"order items\" The \"Member\" user persona can be either of type \"VIP\" or \"Ordinary\" Depending on the \"Member type\", each \"order item\" will be shipped using either a Courier or via Mail If the \"Member\" user persona selected the option to be informed once all \"order items\" have been shipped, then the system will send a notification Facebook User Authentication Scenario: A user persona uses a Web Browser to interact with an \"application\" which tries to access a specific \"Facebook resource\" The \"Facebook Authorization Server\" is involved in order to have the user to authenticate with Facebook The user persona then receives a \"permission form\" in order to authorize the \"application\" access to the \"Facebook resource\" If the \"application\" was not authorized, then the \"application\" returns back an error If the \"application\" was authorized, then the \"application\" retrieves an \"access token\" from the \"Facebook Authorization Server\" and uses it to securely access the \"Facebook resource\" from the \"Facebook Content Server\". Once the content is obtained, the \"application\" sends it to the Web Browser Versioning Because Sequence Diagrams are more expensive to maintain, it's recommended to \"publish\" an image of the generated diagram often, whenever a new \"use case\" or \"scenario\" is identified as part of the system behavior or requirements. The most important element to these diagrams is to ensure that the latest version is accurate . If the latest diagram shows a sequence of communication between components that are no longer valid, then the diagram causes more harm than good. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Sequence Diagram will provide a common visual to all engineers when working on the different parts of the solution (focusing on the data flow and request flow) Throughout the engagement, update the published diagram periodically. Ideally whenever a new \"use case\" or \"scenario\" is identified, or when a Component is introduced or removed in the system, or when a change in data/request flow is made in the system Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches. Resources Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Sequence Diagrams"},{"location":"design/diagram-types/sequence-diagrams/#sequence-diagrams","text":"","title":"Sequence Diagrams"},{"location":"design/diagram-types/sequence-diagrams/#purpose","text":"This document is intended to provide a baseline understanding for what, why, and how to incorporate Sequence Diagrams as part of an engagement. 
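A minimal PlantUML sketch of the "Place Order" scenario described above is shown below. The participants (Order Service, Shipping Service, Notification Service) are an assumed decomposition introduced only for illustration; the scenario itself (Member type, Courier vs. Mail shipping, optional notification) comes from the example text.

```plantuml
@startuml
' Participants are an assumed decomposition of the Place Order scenario
actor Member
participant "Order Service" as OS
participant "Shipping Service" as SS
participant "Notification Service" as NS

Member -> OS : placeOrder(orderItems)
activate OS
loop for each order item
  alt Member type is VIP
    OS -> SS : ship item via Courier
  else Member type is Ordinary
    OS -> SS : ship item via Mail
  end
end
opt Member asked to be informed
  OS -> NS : all items shipped
  NS --> Member : shipment notification
end
OS --> Member : order confirmation
deactivate OS
@enduml
```

Note how the diagram starts with the "actor" that initiates the scenario and ends back with that actor, which is one of the Essential Takeaways listed above.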
Regarding the how , the section at the bottom will provide tools and plugins to streamline as much as possible when generating Sequence Diagrams through VSCode. Wikipedia defines UML Sequence Diagrams responsible to: depict the objects involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario What is a scenario ? It can be: an actual user persona performing an action a system specific trigger (time based, condition based) that results in an action to occur What is a message in this context? It can be: a synchronous or asynchronous request a transfer of any form of data between any objects What is an object in this context? It can be: any specific user persona any service any data store a system (black box composed of unknown services, data stores or other components) an abstract sub-scenario (in order to minimize high complexity of a scenario)","title":"Purpose"},{"location":"design/diagram-types/sequence-diagrams/#essential-takeaways","text":"A Sequence Diagram should: start with a scenario indicate which object or \"actor\" initiated that scenario have the scenario clearly indicate what the \"end\" state is, even if it doesn't necessarily end back with the object that initiated the scenario It is okay for a single Sequence Diagram to have many different scenarios if they have some related context that merits them being grouped. Another important thing to keep in mind, is that the objects involved in a Sequence Diagram should refer to existing Components from a Component Diagram . There are 2 areas where complexity can result in an overly \"crowded\" Sequence Diagram, making it costly to maintain. They are: Large number of objects / components involved in a particular scenario Capturing all the possible \"failure\" situations that a scenario may encounter","title":"Essential Takeaways"},{"location":"design/diagram-types/sequence-diagrams/#large-number-of-objects","text":"A Sequence Diagram typically starts with an end user persona performing an action, and then shows all the various components and request/data transfers that are involved in that scenario. However, more often than not, the complete end-to-end flow for that scenario may be too complex in order to capture within a single Sequence Diagram. When this level of complexity occurs, consider creating separate sub-scenario Sequence Diagrams , and using it as an object in a particular Sequence Diagram. Examples for this are \"Authentication\" or \"Authorization\". Almost all user persona scenarios will have several objects/components involved in either of these sub-scenarios, but it is not necessary to include them in every Sequence Diagram once the sub-scenarios have a stand-alone Sequence Diagram created. Be sure that when using this approach of sub-scenarios to give it a name that encapsulates what the sub-scenarios is performing, and to determine the appropriate \"actor\" and \"action\" that initiates the sub-scenarios. The combination and story telling between these end user Sequence Diagrams and the sub-scenarios Sequence Diagrams can greatly improve readability by distributing the level of complexity across multiple diagrams and take advantage of reusability of common sub-scenarios.","title":"Large Number of Objects"},{"location":"design/diagram-types/sequence-diagrams/#handling-large-number-of-failure-situations","text":"Another factor of high complexity is the possible failure situations that a particular scenario may encounter. 
Each object / component involved in the scenario could have several different \"failure\" situations, which could result in a very crowded and messy Sequence Diagram. In order to make it realistic to manage all these scenarios, try to: Identify the most common failure situations that an \"actor\" may face as part of a scenario. Capturing these in a sequence diagram and documenting the other scenarios without having to manage them in a diagram will accomplish the goal of awareness \"Bubble up\" and \"abstract\" all the vast number of failure situations that can occur downstream in the system, and depict how the object / component closest to the \"actor\" handles all these failures and informs the \"actor\" of them","title":"Handling Large Number of Failure Situations"},{"location":"design/diagram-types/sequence-diagrams/#when-to-create","text":"Because Sequence Diagrams represent a detailed overview of the behavior of the system, outlining the various messages/requests sent within the system, it is recommended to begin the creation of these diagrams from the beginning of an engagement. While updating it as the various communications between Components are introduced into the system. The risks of not creating Sequence Diagrams early on are that: the team will not create any because of it being perceived more as a \"chore\" instead of adding value the team will be unable to gain insights in time, from visualizing the various messages and requests sent between Components, in order to perform any potential refactoring the team or other necessary stakeholders won't have a complete understanding of the request/message/data flow within the system Because of the inherent granularity of the system, the Sequence Diagrams won't have to be updated as often as Class Diagrams , but may require more maintenance than Component Diagrams . Things that might merit updating a Sequence Diagram could be: A new request/message/data being sent across Components involved in a scenario A change to one or several Components involved in a Sequence Diagram. Such as splitting a component into multiple ones, or consolidating many Components into a single one The introduction of a new Use Case or scenario that the system now supports","title":"When to Create?"},{"location":"design/diagram-types/sequence-diagrams/#examples","text":"Place Order Scenario: A \"Member\" user persona places an order, which can be composed of many \"order items\" The \"Member\" user persona can be either of type \"VIP\" or \"Ordinary\" Depending on the \"Member type\", each \"order item\" will be shipped using either a Courier or via Mail If the \"Member\" user persona selected the option to be informed once all \"order items\" have been shipped, then the system will send a notification Facebook User Authentication Scenario: A user persona uses a Web Browser to interact with an \"application\" which tries to access a specific \"Facebook resource\" The \"Facebook Authorization Server\" is involved in order to have the user to authenticate with Facebook The user persona then receives a \"permission form\" in order to authorize the \"application\" access to the \"Facebook resource\" If the \"application\" was not authorized, then the \"application\" returns back an error If the \"application\" was authorized, then the \"application\" retrieves an \"access token\" from the \"Facebook Authorization Server\" and uses it to securely access the \"Facebook resource\" from the \"Facebook Content Server\". 
Once the content is obtained, the \"application\" sends it to the Web Browser","title":"Examples"},{"location":"design/diagram-types/sequence-diagrams/#versioning","text":"Because Sequence Diagrams are more expensive to maintain, it's recommended to \"publish\" an image of the generated diagram often, whenever a new \"use case\" or \"scenario\" is identified as part of the system behavior or requirements. The most important element to these diagrams is to ensure that the latest version is accurate . If the latest diagram shows a sequence of communication between components that are no longer valid, then the diagram causes more harm than good. The below approach can be used to assist the team on how often to update the published version of the diagram: At the beginning of the engagement, publishing an \"envisioned\" version of the Sequence Diagram will provide a common visual to all engineers when working on the different parts of the solution (focusing on the data flow and request flow) Throughout the engagement, update the published diagram periodically. Ideally whenever a new \"use case\" or \"scenario\" is identified, or when a Component is introduced or removed in the system, or when a change in data/request flow is made in the system Depending on the tool being used, automatic versioning might be performed whenever an update to the Diagram is performed. If not, it is recommended to capture distinct versions whenever there is a particular customer need to have a snapshot of the project at a particular point in time. The hard requirement is that the latest diagram should be published and everyone should know how to access it as the customer hand-off approaches.","title":"Versioning"},{"location":"design/diagram-types/sequence-diagrams/#resources","text":"Wikipedia Visual Paradigm VS Code Plugins: PlantUML - requires a generator from code to PlantUML syntax to generate diagrams PlantUML Syntax Drawing manually","title":"Resources"},{"location":"design/sustainability/","text":"Sustainable Software Engineering The choices made throughout the engineering process regarding cloud services, software architecture design and automation can have a big impact on the carbon footprint of a solution. Some choices are always beneficial, like turning off unused resources. Other choices require a more nuanced understanding of the business case at hand and its potential carbon impact. Goal One goal of this section is to provide tangible guidance for what sustainable actions you can apply in certain situations and the tools to be able to implement those recommendations. Another goal is to highlight the many resources available to learn about the wider domain of sustainable software. Sustainable Engineering Checklist This checklist should be used to quickly identify scenarios for which common sustainable actions exist. Check the box if the scenario applies to your project, then go through the actions and tools you can use to build more sustainable software for those cases. If there are important nuances to consider, they will be linked in the Disclaimers section. For readability some considerations are blank, indicating that the action applies to the first consideration above it. \u2705 Consideration Action Principle Tools Disclaimers For any running software/services Shutdown unused resources. Electricity Consumption Identify Unassociated Resources Resize physical or virtual machines to improve utilization. 
Energy Proportionality Azure Advisor Cost Recommendations Understanding Advisor Recommendations For development and testing VMs Configure VMs to shutdown during off-hours Electricity Consumption Start/Stop VMs during off-hours For VMs with attached volumes Limit the amount of attached storage capacity to what you expect to use and expand as necessary Electricity Consumption Expanding storage of active VMs Understanding the energy cost of storage For systems using object storage (Azure Blob Storage, AWS S3, GCP Cloud Storage, etc) Compress infrequently accessed data Electricity Consumption , Embodied Carbon Compressing and extracting files in .NET Understanding the energy cost of storage Delete data when it is no longer needed Electricity Consumption Configuring a lifecycle management policy Understanding the energy cost of storage For systems running in on-premise data centers Migrate to hyperscale cloud provider Embodied Carbon , Electricity Consumption Cloud Adoption Approaches Carbon benefits of cloud computing For systems migrating to a hyperscale cloud provider Consider physically shipping data to the provider Networking Azure Data Box Understanding data shipping tradeoffs For time-flexible workloads Utilize \"Spot VMs\" for compute Demand Shaping How to use Spot VMs For services with varied utilization patterns Configure Autoscaling Energy Proportionality Autoscaling Documentation Use serverless functions Energy Proportionality Serverless Architecture Design For services with geographically co-located users (EG internal employee apps) Select a data center region that is physically close to them Networking Azure products available by region Consider running edge devices to reduce excessive data transfer Networking Azure Stack Edge Understanding edge tradeoffs For systems sending data over the network Use caching policies to keep data on the local machine Networking HTTP caching APIs , Cache Management in .NET Understanding caching tradeoffs Consider caching data close to end users with a CDN Networking Benefits of a CDN Understanding CDN tradeoffs Send only the data that will be used Networking Compress data to reduce the size Networking Compressing and extracting files in .NET When designing for the end user Consider giving users visibility and control over their energy usage Electricity Consumption Demand Shaping Designing for eco-mode Design and test your application to be compatible for a wide variety of devices, especially older devices Embodied Carbon Extending device lifespan Compatibility Testing When selecting a programming language Consider the energy efficiency of languages Electricity Consumption Reasoning about the energy consumption of programming languages , Programming Language Energy Efficiency (PDF) Making informed programming language choices Resources Principles of Green Software Engineering Green Software Foundation Microsoft Cloud for Sustainability Learning Module: Sustainable Software Engineering Tools Carbon-Aware SDK \"Awesome List\" of Green Software Emissions Impact Azure GreenAI Carbon-Intensity API Projects Sustainability through SpotVMs","title":"Sustainable Software Engineering"},{"location":"design/sustainability/#sustainable-software-engineering","text":"The choices made throughout the engineering process regarding cloud services, software architecture design and automation can have a big impact on the carbon footprint of a solution. Some choices are always beneficial, like turning off unused resources. 
Other choices require a more nuanced understanding of the business case at hand and its potential carbon impact.","title":"Sustainable Software Engineering"},{"location":"design/sustainability/#goal","text":"One goal of this section is to provide tangible guidance for what sustainable actions you can apply in certain situations and the tools to be able to implement those recommendations. Another goal is to highlight the many resources available to learn about the wider domain of sustainable software.","title":"Goal"},{"location":"design/sustainability/#sustainable-engineering-checklist","text":"This checklist should be used to quickly identify scenarios for which common sustainable actions exist. Check the box if the scenario applies to your project, then go through the actions and tools you can use to build more sustainable software for those cases. If there are important nuances to consider, they will be linked in the Disclaimers section. For readability some considerations are blank, indicating that the action applies to the first consideration above it. \u2705 Consideration Action Principle Tools Disclaimers For any running software/services Shutdown unused resources. Electricity Consumption Identify Unassociated Resources Resize physical or virtual machines to improve utilization. Energy Proportionality Azure Advisor Cost Recommendations Understanding Advisor Recommendations For development and testing VMs Configure VMs to shutdown during off-hours Electricity Consumption Start/Stop VMs during off-hours For VMs with attached volumes Limit the amount of attached storage capacity to what you expect to use and expand as necessary Electricity Consumption Expanding storage of active VMs Understanding the energy cost of storage For systems using object storage (Azure Blob Storage, AWS S3, GCP Cloud Storage, etc) Compress infrequently accessed data Electricity Consumption , Embodied Carbon Compressing and extracting files in .NET Understanding the energy cost of storage Delete data when it is no longer needed Electricity Consumption Configuring a lifecycle management policy Understanding the energy cost of storage For systems running in on-premise data centers Migrate to hyperscale cloud provider Embodied Carbon , Electricity Consumption Cloud Adoption Approaches Carbon benefits of cloud computing For systems migrating to a hyperscale cloud provider Consider physically shipping data to the provider Networking Azure Data Box Understanding data shipping tradeoffs For time-flexible workloads Utilize \"Spot VMs\" for compute Demand Shaping How to use Spot VMs For services with varied utilization patterns Configure Autoscaling Energy Proportionality Autoscaling Documentation Use serverless functions Energy Proportionality Serverless Architecture Design For services with geographically co-located users (EG internal employee apps) Select a data center region that is physically close to them Networking Azure products available by region Consider running edge devices to reduce excessive data transfer Networking Azure Stack Edge Understanding edge tradeoffs For systems sending data over the network Use caching policies to keep data on the local machine Networking HTTP caching APIs , Cache Management in .NET Understanding caching tradeoffs Consider caching data close to end users with a CDN Networking Benefits of a CDN Understanding CDN tradeoffs Send only the data that will be used Networking Compress data to reduce the size Networking Compressing and extracting files in .NET When designing for the 
end user Consider giving users visibility and control over their energy usage Electricity Consumption Demand Shaping Designing for eco-mode Design and test your application to be compatible for a wide variety of devices, especially older devices Embodied Carbon Extending device lifespan Compatibility Testing When selecting a programming language Consider the energy efficiency of languages Electricity Consumption Reasoning about the energy consumption of programming languages , Programming Language Energy Efficiency (PDF) Making informed programming language choices","title":"Sustainable Engineering Checklist"},{"location":"design/sustainability/#resources","text":"Principles of Green Software Engineering Green Software Foundation Microsoft Cloud for Sustainability Learning Module: Sustainable Software Engineering","title":"Resources"},{"location":"design/sustainability/#tools","text":"Carbon-Aware SDK \"Awesome List\" of Green Software Emissions Impact Azure GreenAI Carbon-Intensity API","title":"Tools"},{"location":"design/sustainability/#projects","text":"Sustainability through SpotVMs","title":"Projects"},{"location":"design/sustainability/sustainable-action-disclaimers/","text":"Disclaimers The following disclaimers provide more details about how to consider the impact of particular actions recommended by the Sustainable Engineering Checklist . ACTION: Resize Physical or Virtual Machines to Improve Utilization Recommendations from cost-savings tools are usually aligned with carbon-reduction, but as sustainability is not the purpose of such tools, carbon-savings are not guaranteed. How a cloud provider or data center manages unused capacity is also a factor in determining how impactful this action may be. For example: The sustainable impact of using smaller VMs in the same family is typically beneficial or neutral. When cores are no longer reserved they can be used by others instead of bringing new servers online. The sustainable impact of changing VM families can be harder to reason about because the underlying hardware and reserved cores may be changing with them. ACTION: Migrate to a Hyperscale Cloud Provider Carbon savings from hyperscale cloud providers are generally attributable to four key features: IT operational efficiency, IT equipment efficiency, data center infrastructure efficiency, and renewable electricity. Microsoft Cloud, for example, is between 22 and 93 percent more energy efficient than traditional enterprise data centers, depending on the specific comparison being made. When taking into account renewable energy purchases, the Microsoft Cloud is between 72 and 98 percent more carbon efficient. Source (PDF) ACTION: Consider Running an Edge Device Running an edge device negates many of the benefits of hyperscale compute facilities, so considering the local energy grid mix and the typical timing of the workloads is important to determine if this is beneficial overall. The larger the volume of data that needs to be transmitted, the more this solution becomes appealing. For example, sending large amounts of audio and video content for processing. ACTION: Consider Physically Shipping Data to the Provider Shipping physical items has its own carbon impact, depending on the mode of transportation, which needs to be understood before making this decision. The larger the volume of data that needs to be transmitted, the more beneficial this option may be.
ACTION: Consider the Energy Efficiency of Languages When selecting a programming language, the most energy efficient programming language may not always be the best choice for development speed, maintenance, integration with dependent systems, and other project factors. But when deciding between languages that all meet the project needs, energy efficiency can be a helpful consideration. ACTION: Use Caching Policies A cache provides temporary storage of resources that have been requested by an application. Caching can improve application performance by reducing the time required to get a requested resource. Caching can also improve sustainability by decreasing the amount of network traffic. While caching provides these benefits, it also increases the risk that the resource returned to the application is stale, meaning that it is not identical to the resource that would have been sent by the server if caching were not in use. This can create poor user experiences when data accuracy is critical. Additionally, caching may allow unauthorized users or processes to read sensitive data. An authenticated response that is cached may be retrieved from the cache without an additional authorization. Due to security concerns like this, caching is not recommended for middle tier scenarios. ACTION: Consider Caching Data Close to End Users with a CDN Including CDNs in your network architecture adds many additional servers to your software footprint, each with their own local energy grid mix. The details of CDN hardware and the impact of the power that runs it is important to determine if the carbon emissions from running them is lower than the emissions from sending the data over the wire from a more distant source. The larger the volume of data, distance it needs to travel, and frequency of requests, the more this solution becomes appealing.","title":"Disclaimers"},{"location":"design/sustainability/sustainable-action-disclaimers/#disclaimers","text":"The following disclaimers provide more details about how to consider the impact of particular actions recommended by the Sustainable Engineering Checklist .","title":"Disclaimers"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-resize-physical-or-virtual-machines-to-improve-utilization","text":"Recommendations from cost-savings tools are usually aligned with carbon-reduction, but as sustainability is not the purpose of such tools, carbon-savings are not guaranteed. How a cloud provider or data center manages unused capacity is also a factor in determining how impactful this action may be. For example: The sustainable impact of using smaller VMs in the same family are typically beneficial or neutral. When cores are no longer reserved they can be used by others instead of bringing new servers online. The sustainable impact of changing VM families can be harder to reason about because the underlying hardware and reserved cores may be changing with them.","title":"ACTION: Resize Physical or Virtual Machines to Improve Utilization"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-migrate-to-a-hyperscale-cloud-provider","text":"Carbon savings from hyperscale cloud providers are generally attributable to four key features: IT operational efficiency, IT equipment efficiency, data center infrastructure efficiency, and renewable electricity. Microsoft Cloud, for example, is between 22 and 93 percent more energy efficient than traditional enterprise data centers, depending on the specific comparison being made. 
When taking into account renewable energy purchases, the Microsoft Cloud is between 72 and 98 percent more carbon efficient. Source (PDF)","title":"ACTION: Migrate to a Hyperscale Cloud Provider"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-running-an-edge-device","text":"Running an edge device negates many of the benefits of hyperscale compute facilities, so considering the local energy grid mix and the typical timing of the workloads is important to determine if this is beneficial overall. The larger the volume of data that needs to be transmitted, the more this solution becomes appealing. For example, sending large amounts of audio and video content for processing.","title":"ACTION: Consider Running an Edge Device"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-physically-shipping-data-to-the-provider","text":"Shipping physical items has its own carbon impact, depending on the mode of transportation, which needs to be understood before making this decision. The larger the volume of data that needs to be transmitted, the more beneficial this option may be.","title":"ACTION: Consider Physically Shipping Data to the Provider"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-the-energy-efficiency-of-languages","text":"When selecting a programming language, the most energy efficient programming language may not always be the best choice for development speed, maintenance, integration with dependent systems, and other project factors. But when deciding between languages that all meet the project needs, energy efficiency can be a helpful consideration.","title":"ACTION: Consider the Energy Efficiency of Languages"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-use-caching-policies","text":"A cache provides temporary storage of resources that have been requested by an application. Caching can improve application performance by reducing the time required to get a requested resource. Caching can also improve sustainability by decreasing the amount of network traffic. While caching provides these benefits, it also increases the risk that the resource returned to the application is stale, meaning that it is not identical to the resource that would have been sent by the server if caching were not in use. This can create poor user experiences when data accuracy is critical. Additionally, caching may allow unauthorized users or processes to read sensitive data. An authenticated response that is cached may be retrieved from the cache without an additional authorization. Due to security concerns like this, caching is not recommended for middle tier scenarios.","title":"ACTION: Use Caching Policies"},{"location":"design/sustainability/sustainable-action-disclaimers/#action-consider-caching-data-close-to-end-users-with-a-cdn","text":"Including CDNs in your network architecture adds many additional servers to your software footprint, each with their own local energy grid mix. The details of CDN hardware and the impact of the power that runs it are important to determine whether the carbon emissions from running them are lower than the emissions from sending the data over the wire from a more distant source.
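To make the caching guidance above more concrete, here is a minimal sketch assuming ASP.NET Core's IMemoryCache (Microsoft.Extensions.Caching.Memory) and a hypothetical catalog lookup; it is an illustration only, not part of the tooling referenced by the checklist. A bounded expiration keeps cached data from becoming permanently stale while still avoiding repeated network calls.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

// Hypothetical repository that caches catalog data to reduce repeated network traffic.
public class CatalogCache
{
    private readonly IMemoryCache cache;

    public CatalogCache(IMemoryCache cache) => this.cache = cache;

    public async Task<string> GetCatalogAsync(string catalogId) =>
        await cache.GetOrCreateAsync($"catalog:{catalogId}", entry =>
        {
            // Bound how long a cached copy lives so stale data is eventually refreshed.
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(10);
            return FetchCatalogFromServiceAsync(catalogId);
        }) ?? string.Empty;

    // Placeholder for the real remote call (assumed for illustration).
    private Task<string> FetchCatalogFromServiceAsync(string catalogId) =>
        Task.FromResult($"catalog payload for {catalogId}");
}
```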
The larger the volume of data, distance it needs to travel, and frequency of requests, the more this solution becomes appealing.","title":"ACTION: Consider Caching Data Close to End Users with a CDN"},{"location":"design/sustainability/sustainable-engineering-principles/","text":"Sustainable Principles The following principle overviews provide the foundations supporting specific actions in the Sustainable Engineering Checklist . More details about each principle can be found by following the links in the headings or visiting the Principles of Green Software Engineering website . Electricity Consumption Most electricity is still produced through the burning of fossil fuels and is responsible for 49% of the carbon emitted into the atmosphere. Software consumes electricity in its execution. Running hardware consumes electricity even at zero percent utilization. Some of the best ways we can reduce electricity consumption and the subsequent emissions of carbon pollution is to make our applications more energy efficient when they are running and limit idle hardware. Energy Proportionality The relationship between power and utilization is not proportional. The more you utilize a computer, the more efficient it becomes at converting electricity to useful computing operations. Running your work on as few servers as possible with the highest utilization rate maximizes their energy efficiency. An idle computer, even running at zero percent utilization, still draws electricity. Embodied Carbon Embodied carbon (otherwise referred to as \"Embedded Carbon\") is the amount of carbon pollution emitted during the creation and disposal of a device. When calculating the total carbon pollution for the computers running your software, account for both the carbon pollution to run the computer and the embodied carbon of the computer. Therefore a great way to reduce embodied carbon is to prevent the need for new devices to be manufactured by extending the usefulness of existing ones. Demand Shaping Demand shaping is a strategy of shaping our demand for resources so it matches the existing supply. If supply is high, increase the demand by doing more in your applications. If the supply is low, decrease demand. This means doing less in your applications or delaying work until supply is higher. Networking A network is a series of switches, routers, and servers. All the computers and network equipment in a network consume electricity and have embedded carbon . The internet is a global network of devices typically run off the standard local grid energy mix. When you send data across the internet, you are sending that data through many devices in the network, each one of those devices consuming electricity. As a result, any data you send or receive over the internet emits carbon. The amount of carbon emitted to send data depends on many factors including: Distance the data travels Number of hops between network devices Energy efficiency of the network devices Carbon intensity of energy used by each device at the time the data is transmitted. Network protocol used to coordinate data transmission - e.g. multiplex, header compression, TLS/Quic Recent networking studies - Cloud Carbon Footprint","title":"Sustainable Principles"},{"location":"design/sustainability/sustainable-engineering-principles/#sustainable-principles","text":"The following principle overviews provide the foundations supporting specific actions in the Sustainable Engineering Checklist . 
More details about each principle can be found by following the links in the headings or visiting the Principles of Green Software Engineering website .","title":"Sustainable Principles"},{"location":"design/sustainability/sustainable-engineering-principles/#electricity-consumption","text":"Most electricity is still produced through the burning of fossil fuels and is responsible for 49% of the carbon emitted into the atmosphere. Software consumes electricity in its execution. Running hardware consumes electricity even at zero percent utilization. Some of the best ways we can reduce electricity consumption and the subsequent emissions of carbon pollution is to make our applications more energy efficient when they are running and limit idle hardware.","title":"Electricity Consumption"},{"location":"design/sustainability/sustainable-engineering-principles/#energy-proportionality","text":"The relationship between power and utilization is not proportional. The more you utilize a computer, the more efficient it becomes at converting electricity to useful computing operations. Running your work on as few servers as possible with the highest utilization rate maximizes their energy efficiency. An idle computer, even running at zero percent utilization, still draws electricity.","title":"Energy Proportionality"},{"location":"design/sustainability/sustainable-engineering-principles/#embodied-carbon","text":"Embodied carbon (otherwise referred to as \"Embedded Carbon\") is the amount of carbon pollution emitted during the creation and disposal of a device. When calculating the total carbon pollution for the computers running your software, account for both the carbon pollution to run the computer and the embodied carbon of the computer. Therefore a great way to reduce embodied carbon is to prevent the need for new devices to be manufactured by extending the usefulness of existing ones.","title":"Embodied Carbon"},{"location":"design/sustainability/sustainable-engineering-principles/#demand-shaping","text":"Demand shaping is a strategy of shaping our demand for resources so it matches the existing supply. If supply is high, increase the demand by doing more in your applications. If the supply is low, decrease demand. This means doing less in your applications or delaying work until supply is higher.","title":"Demand Shaping"},{"location":"design/sustainability/sustainable-engineering-principles/#networking","text":"A network is a series of switches, routers, and servers. All the computers and network equipment in a network consume electricity and have embedded carbon . The internet is a global network of devices typically run off the standard local grid energy mix. When you send data across the internet, you are sending that data through many devices in the network, each one of those devices consuming electricity. As a result, any data you send or receive over the internet emits carbon. The amount of carbon emitted to send data depends on many factors including: Distance the data travels Number of hops between network devices Energy efficiency of the network devices Carbon intensity of energy used by each device at the time the data is transmitted. Network protocol used to coordinate data transmission - e.g. 
multiplex, header compression, TLS/Quic Recent networking studies - Cloud Carbon Footprint","title":"Networking"},{"location":"developer-experience/","text":"Developer Experience (DevEx) Developer experience refers to how easy or difficult it is for a developer to perform essential tasks needed to implement a change. A positive developer experience would mean these tasks are relatively easy for the team (see measures below). The essential tasks are identified below. Build - Verify that changes are free of syntax error and compile. Test - Verify that all automated tests pass. Start - Launch end-to-end to simulate execution in a deployed environment. Debug - Attach debugger to started solution, set breakpoints, step through code, and inspect variables. If effort is invested to make these activities as easy as possible, the returns on that effort will increase the longer the project runs, and the larger the team is . Defining End-to-End This document makes several references to running a solution end-to-end (aka E2E). End-to-end for the purposes of this document is scoped to the software that is owned, built, and shipped by the team. Systems owned by other teams or third-party vendors is not within the E2E scope for the purposes of this document. Goals Maximize the amount of time engineers spend on writing code that fulfills story acceptance and done-done criteria. Minimize the amount of time spent manual setup and configuration of tooling Minimize regressions and new defects by making end-to-end testing easy Impact Developer experience can have a significant impact on the efficiency of the day-to-day execution of the team. A positive experience can pay dividends throughout the lifetime of the project; especially as new developers join the team. Increased Velocity - Team spends less time on non-value-add activities such as dev/local environment setup, waiting on remote environments to test, and rework (fixing defects). Improved Quality - When it's easy to debug and test, developers will do more of it. This will translate to fewer defects being introduced. Easier Onboarding & Adoption - When dev essential tasks are automated, there is less documentation to write and, subsequently, less to read to get started! Most importantly, the customer will continue to accrue these benefits long after the code-with engagement. Measures Time to First E2E Result (aka F5 Contract) Assuming a laptop/pc that has never run the solution, how long does it take to set up and run the whole system end-to-end and see a result. Time To First Commit How long does it take to make a change that can be verified/tested locally. A locally verified/tested change is one that passes test cases without introducing regression or breaking changes. Participation Providing a positive developer experience is a team effort. However, certain members can take ownership of different areas to help hold the entire team accountable. Dev Lead - Set the Bar The following are examples of how the Dev Lead might set the bar for dev experience Determines development environment (suggested IDE, hosting, etc) Determines source control environment and number of repos required Given development environment and repo structure, sets expectations for team to meet in terms of steps to perform the essential dev tasks Nominates the DevEx Champion IDE choice is NOT intended to mandate that all team members must use the same IDE. However, this choice will direct where tight-integration investment will be prioritized. 
For example, if Visual Studio Code is the suggested IDE then, the team would focus on integrating VS code tasks and launch configurations over similar integrations for other IDEs. Team members should still feel free to use their preferred IDE as long as it does not negatively impact the team. DevEx Champion - Identify Iterative Improvements The DevEx champion takes ownership in holding the team accountable for providing a positive developer experience. The following outline responsibilities for the DevEx champion. Actively seek opportunities for improving the solution developer experience Work with the Dev Lead to iteratively improve team expectations for developer experience Curate a backlog actionable stories that identify areas for improvement and prioritize with respect to project delivery goals by engaging directly with the Product Owner and Customer. Serve as subject-matter expert for the rest of the team. Help the team determine how to implement DevEx expectations and identify deviations. Team Members - Assert Expectations The team members of the team can also help hold each other accountable for providing a positive developer experience. The following are examples of areas team members can help identify where the team's DevEx expectations are not being met. Pull requests. Try the changes locally to see if they are adhering to the team's DevEx expectations. Design Reviews. Look for proposals that may negatively affect the solution's DevEx. These might include Introduction of new tech whose testability is limited to manual steps in a deployed environment. Addition of new repository New Team Members - Identify Iterative Improvements New team members are uniquely positioned to identify instances of undocumented Collective Wisdom . The following outlines responsibilities of new team members as it relates to DevEx: If you come across missing, incomplete or incorrect documentation while onboarding, you should record the issue as a new defect(s) and assign it to the product owner to triage. If no onboarding documentation exists, note the steps you took in a new user story. Assign the new story to the product owner to triage. Facilitation Guidance The following outline examples of several strategies that can be adopted to promote a positive developer experience. It is expected that each team should define what a positive dev experience means within the context of their project. Additionally, refine that over time via feedback mechanisms such as sprint and project retrospectives. Establish Hotkeys Assign hotkeys to each of the essential tasks. Task Windows Build CTRL+SHIFT+B Test CTRL+R,T Start With Debugging F5 The F5 Contract The F5 contract aims for the ability to run the end-to-end solution with the following steps. Clone - git clone [ my-repo-url-here ] Configure - set any configuration values that need to be unique to the individual (i.e. update a .env file) Press F5 - launch the solution with debugging attached. Most IDEs have some form of a task runner that can be used to automate the build, execute, and attach steps. Try to leverage these such that the steps can all be run with as few manual steps as possible. DevEx Champion Actively Seek Improvements The DevEx champion should actively seek areas where the team has opportunity to improve. For example, do they need to deploy their changes to an environment off their laptop before they can validate if what they did worked. Rather than debugging locally, do they have to do this repetitively to get to a working solution? 
Does this take several minutes each iteration? Does this block other developers due to the contention on the environment? The following are ceremonies that the DevEx champion can use to find potential opportunities Retrospectives. Is feedback being raised that relates to the essential tasks being difficult or unwieldy? Standup Blockers. Are individuals getting blocked or stumbling on the essential tasks? As opportunities are identified, the DevEx champion can translate these into actionable stories for the product backlog. Make Tasks Cross Platform For essential tasks being standardized during the engagement, ensure that different platforms are accounted for. Team members may have different operating systems and ensuring the tasks are cross-platform will provide an additional opportunity to improve the experience. See the making tasks cross platform recipe for guidance on how tasks can be configured to include different platforms. Create an Onboarding Guide When welcoming new team members to the engagement, there are many areas for them to get adjusted to and bring them up to speed including codebase, coding standards, team agreements, and team culture. By adopting a strong onboarding practice such as an onboarding guide in a centralized location that explains the scope of the project, processes, setup details, and software required, new members can have all the necessary resources for them to be efficient, successful and a valuable team member from the start. See the onboarding guide recipe for guidance on what an onboarding guide may look like. Standardize Essential Tasks Apply a common strategy across solution components for performing the essential tasks Standardize the configuration for solution components Standardize the way tests are run for each component Standardize the way each component is started and stopped locally Standardize how to document the essential tasks for each component This standardization will enable the team to more easily automate these tasks across all components at the solution level. See Solution-level Essential Tasks below. Solution-level Essential Tasks Automate the ability to execute each essential task across all solution components. An example would be mapping the build action in the IDE to run the build task for each component in the solution. More importantly, configure the IDE start action to start all components within the solution. This will provide significant efficiency for the engineering team when dealing with multi-component solutions. When this is not implemented, the engineers must repeat each of the essential tasks manually for each component in the solution. In this situation, the number of steps required to perform each essential task is multiplied by the number of components in the system [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [many solution components] = TOO MANY STEPS VS. [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [1 solution] = MINIMUM NUMBER OF STEPS Observability Observability alleviates unforeseen challenges for the developer in a complex distributed system. It identifies project bottlenecks quicker and with more precision, enhancing performance as the developer seeks to deploy code changes. Adding observability improves the experience when identifying and resolving bugs or broken code. This results in fewer or less severe current and future production failures. 
There are many observability strategies a developer can use alongside best engineering practices. These resources improve the DevEx by ensuring a shared view of the complex system throughout the entire lifecycle. Observability in code via logging, exception handling and exposing of relevant application metrics for example, promotes the consistent visibility of real time performance. The observability pillars, logging , metrics , and tracing , detail when to enable each of the three specific types of observability. Minimize the Number of Repositories Splitting a solution across multiple repositories can negatively impact the above measures. This can also negatively impact other areas such as Pull Requests, Automated Testing, Continuous Integration, and Continuous Delivery. Similar to the IDE instances, the negative impact is multiplied by the number of repositories. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [many source code repositories] = TOO MANY STEPS VS. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [1 source code repository] = MINIMUM NUMBER OF STEPS Atomic Pull Requests When the solution is encapsulated within a single repository, it also allows pull requests to represent a change across multiple layers. This is especially helpful when a change requires changes to a shared contract between multiple components. For example, a story requires that an api endpoint is changed. With this strategy the api and web client could be updated with the same pull request. This avoids the main branch being broken temporarily while waiting on dependent pull requests to merge. Minimize Remote Dependencies for Local Development The fewer dependencies on components that cannot run a developer's machine translate to fewer steps required to get started. Therefore, fewer dependencies will positively impact the measures above. The following strategies can be used to reduce these dependencies Use an Emulator If available, emulators are implementations of technologies that are typically only available in cloud environments. A good example is the CosmosDB emulator . Use DI + Toggle to Mock Remote Dependencies When the solution depends on a technology that cannot be run on a developer's machine, the setup and testing of that solution can be challenging. One strategy that can be employed is to create the ability to swap that dependency for one that can run locally. Abstract the layer that has the remote dependency behind an interface owned by the solution (not the remote dependency). Create an implementation of that interface using a technology that can be run locally. Create a factory that decides which instance to use. This decision could be based on environment configuration (i.e. the toggle). Then, the original class that depends on the remote tech instead should depend on the factory to provide which instance to use. Much of this strategy can be simplified with proper dependency injection technique and/or framework. See example below that swaps Azure Service Bus implementation for RabbitMQ which can be run locally. 
interface IPublisher { send ( message : string ) : void } class RabbitMQPublisher implements IPublisher { send ( message : string ) { //todo: send the message via RabbitMQ } } class AzureServiceBusPublisher implements IPublisher { send ( message : string ) { //todo: send the message via Azure Service Bus } } interface IPublisherFactory { create () : IPublisher } class PublisherFactory { create () : IPublisher { // use env var value to determine which instance should be used if ( process . env . UseAsb ){ return new AzureServiceBusPublisher (); } else { return new RabbitMqPublisher (); } } } class MyService { //inject the factory constructor ( private readonly publisherFactory : IPublisherFactory ){ } sendAMessage ( message : string ) : void { //use the factory to determine which instance to use const publisher : IPublisher = this . publisherFactory . create (); publisher . send ( message ); } } The recipes section has a more complete discussion on DI as part of a high productivity inner dev loop","title":"Developer Experience (DevEx)"},{"location":"developer-experience/#developer-experience-devex","text":"Developer experience refers to how easy or difficult it is for a developer to perform essential tasks needed to implement a change. A positive developer experience would mean these tasks are relatively easy for the team (see measures below). The essential tasks are identified below. Build - Verify that changes are free of syntax error and compile. Test - Verify that all automated tests pass. Start - Launch end-to-end to simulate execution in a deployed environment. Debug - Attach debugger to started solution, set breakpoints, step through code, and inspect variables. If effort is invested to make these activities as easy as possible, the returns on that effort will increase the longer the project runs, and the larger the team is .","title":"Developer Experience (DevEx)"},{"location":"developer-experience/#defining-end-to-end","text":"This document makes several references to running a solution end-to-end (aka E2E). End-to-end for the purposes of this document is scoped to the software that is owned, built, and shipped by the team. Systems owned by other teams or third-party vendors is not within the E2E scope for the purposes of this document.","title":"Defining End-to-End"},{"location":"developer-experience/#goals","text":"Maximize the amount of time engineers spend on writing code that fulfills story acceptance and done-done criteria. Minimize the amount of time spent manual setup and configuration of tooling Minimize regressions and new defects by making end-to-end testing easy","title":"Goals"},{"location":"developer-experience/#impact","text":"Developer experience can have a significant impact on the efficiency of the day-to-day execution of the team. A positive experience can pay dividends throughout the lifetime of the project; especially as new developers join the team. Increased Velocity - Team spends less time on non-value-add activities such as dev/local environment setup, waiting on remote environments to test, and rework (fixing defects). Improved Quality - When it's easy to debug and test, developers will do more of it. This will translate to fewer defects being introduced. Easier Onboarding & Adoption - When dev essential tasks are automated, there is less documentation to write and, subsequently, less to read to get started! 
Most importantly, the customer will continue to accrue these benefits long after the code-with engagement.","title":"Impact"},{"location":"developer-experience/#measures","text":"","title":"Measures"},{"location":"developer-experience/#time-to-first-e2e-result-aka-f5-contract","text":"Assuming a laptop/pc that has never run the solution, how long does it take to set up and run the whole system end-to-end and see a result.","title":"Time to First E2E Result (aka F5 Contract)"},{"location":"developer-experience/#time-to-first-commit","text":"How long does it take to make a change that can be verified/tested locally. A locally verified/tested change is one that passes test cases without introducing regression or breaking changes.","title":"Time To First Commit"},{"location":"developer-experience/#participation","text":"Providing a positive developer experience is a team effort. However, certain members can take ownership of different areas to help hold the entire team accountable.","title":"Participation"},{"location":"developer-experience/#dev-lead-set-the-bar","text":"The following are examples of how the Dev Lead might set the bar for dev experience Determines development environment (suggested IDE, hosting, etc) Determines source control environment and number of repos required Given development environment and repo structure, sets expectations for team to meet in terms of steps to perform the essential dev tasks Nominates the DevEx Champion IDE choice is NOT intended to mandate that all team members must use the same IDE. However, this choice will direct where tight-integration investment will be prioritized. For example, if Visual Studio Code is the suggested IDE then, the team would focus on integrating VS code tasks and launch configurations over similar integrations for other IDEs. Team members should still feel free to use their preferred IDE as long as it does not negatively impact the team.","title":"Dev Lead - Set the Bar"},{"location":"developer-experience/#devex-champion-identify-iterative-improvements","text":"The DevEx champion takes ownership in holding the team accountable for providing a positive developer experience. The following outline responsibilities for the DevEx champion. Actively seek opportunities for improving the solution developer experience Work with the Dev Lead to iteratively improve team expectations for developer experience Curate a backlog actionable stories that identify areas for improvement and prioritize with respect to project delivery goals by engaging directly with the Product Owner and Customer. Serve as subject-matter expert for the rest of the team. Help the team determine how to implement DevEx expectations and identify deviations.","title":"DevEx Champion - Identify Iterative Improvements"},{"location":"developer-experience/#team-members-assert-expectations","text":"The team members of the team can also help hold each other accountable for providing a positive developer experience. The following are examples of areas team members can help identify where the team's DevEx expectations are not being met. Pull requests. Try the changes locally to see if they are adhering to the team's DevEx expectations. Design Reviews. Look for proposals that may negatively affect the solution's DevEx. These might include Introduction of new tech whose testability is limited to manual steps in a deployed environment. 
Addition of new repository","title":"Team Members - Assert Expectations"},{"location":"developer-experience/#new-team-members-identify-iterative-improvements","text":"New team members are uniquely positioned to identify instances of undocumented Collective Wisdom . The following outlines the responsibilities of new team members as they relate to DevEx: If you come across missing, incomplete or incorrect documentation while onboarding, you should record the issue as a new defect(s) and assign it to the product owner to triage. If no onboarding documentation exists, note the steps you took in a new user story. Assign the new story to the product owner to triage.","title":"New Team Members - Identify Iterative Improvements"},{"location":"developer-experience/#facilitation-guidance","text":"The following outlines examples of several strategies that can be adopted to promote a positive developer experience. It is expected that each team should define what a positive dev experience means within the context of their project. Additionally, refine that over time via feedback mechanisms such as sprint and project retrospectives.","title":"Facilitation Guidance"},{"location":"developer-experience/#establish-hotkeys","text":"Assign hotkeys to each of the essential tasks. Task Windows Build CTRL+SHIFT+B Test CTRL+R,T Start With Debugging F5","title":"Establish Hotkeys"},{"location":"developer-experience/#the-f5-contract","text":"The F5 contract aims for the ability to run the end-to-end solution with the following steps. Clone - git clone [ my-repo-url-here ] Configure - set any configuration values that need to be unique to the individual (i.e. update a .env file) Press F5 - launch the solution with debugging attached. Most IDEs have some form of a task runner that can be used to automate the build, execute, and attach steps. Try to leverage these such that the steps can all be run with as few manual steps as possible.","title":"The F5 Contract"},{"location":"developer-experience/#devex-champion-actively-seek-improvements","text":"The DevEx champion should actively seek areas where the team has an opportunity to improve. For example, do they need to deploy their changes to an environment off their laptop before they can validate if what they did worked? Rather than debugging locally, do they have to do this repetitively to get to a working solution? Does this take several minutes each iteration? Does this block other developers due to the contention on the environment? The following are ceremonies that the DevEx champion can use to find potential opportunities: Retrospectives. Is feedback being raised that relates to the essential tasks being difficult or unwieldy? Standup Blockers. Are individuals getting blocked or stumbling on the essential tasks? As opportunities are identified, the DevEx champion can translate these into actionable stories for the product backlog.","title":"DevEx Champion Actively Seek Improvements"},{"location":"developer-experience/#make-tasks-cross-platform","text":"For essential tasks being standardized during the engagement, ensure that different platforms are accounted for. Team members may have different operating systems and ensuring the tasks are cross-platform will provide an additional opportunity to improve the experience.
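As one illustration of accounting for different platforms (an assumption, not taken from the recipe referenced below), a small task wrapper can branch on the host operating system before invoking a platform-specific script; the script names run-tests.cmd and run-tests.sh are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// Hypothetical cross-platform wrapper around the team's "run tests" task.
public static class CrossPlatformTasks
{
    public static void RunTests()
    {
        // Pick the shell and script that exist on the current platform.
        bool isWindows = RuntimeInformation.IsOSPlatform(OSPlatform.Windows);
        var startInfo = isWindows
            ? new ProcessStartInfo("cmd.exe", "/c run-tests.cmd")
            : new ProcessStartInfo("/bin/bash", "run-tests.sh");
        startInfo.RedirectStandardOutput = true;

        using var process = Process.Start(startInfo);
        Console.WriteLine(process?.StandardOutput.ReadToEnd());
        process?.WaitForExit();
    }
}
```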
See the making tasks cross platform recipe for guidance on how tasks can be configured to include different platforms.","title":"Make Tasks Cross Platform"},{"location":"developer-experience/#create-an-onboarding-guide","text":"When welcoming new team members to the engagement, there are many areas for them to get adjusted to and bring them up to speed including codebase, coding standards, team agreements, and team culture. By adopting a strong onboarding practice such as an onboarding guide in a centralized location that explains the scope of the project, processes, setup details, and software required, new members can have all the necessary resources for them to be efficient, successful and a valuable team member from the start. See the onboarding guide recipe for guidance on what an onboarding guide may look like.","title":"Create an Onboarding Guide"},{"location":"developer-experience/#standardize-essential-tasks","text":"Apply a common strategy across solution components for performing the essential tasks Standardize the configuration for solution components Standardize the way tests are run for each component Standardize the way each component is started and stopped locally Standardize how to document the essential tasks for each component This standardization will enable the team to more easily automate these tasks across all components at the solution level. See Solution-level Essential Tasks below.","title":"Standardize Essential Tasks"},{"location":"developer-experience/#solution-level-essential-tasks","text":"Automate the ability to execute each essential task across all solution components. An example would be mapping the build action in the IDE to run the build task for each component in the solution. More importantly, configure the IDE start action to start all components within the solution. This will provide significant efficiency for the engineering team when dealing with multi-component solutions. When this is not implemented, the engineers must repeat each of the essential tasks manually for each component in the solution. In this situation, the number of steps required to perform each essential task is multiplied by the number of components in the system [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [many solution components] = TOO MANY STEPS VS. [Configuration steps + Build steps + Start/Debug steps + Stop steps + Run test steps + Documenting all of the above] * [1 solution] = MINIMUM NUMBER OF STEPS","title":"Solution-level Essential Tasks"},{"location":"developer-experience/#observability","text":"Observability alleviates unforeseen challenges for the developer in a complex distributed system. It identifies project bottlenecks quicker and with more precision, enhancing performance as the developer seeks to deploy code changes. Adding observability improves the experience when identifying and resolving bugs or broken code. This results in fewer or less severe current and future production failures. There are many observability strategies a developer can use alongside best engineering practices. These resources improve the DevEx by ensuring a shared view of the complex system throughout the entire lifecycle. Observability in code via logging, exception handling and exposing of relevant application metrics for example, promotes the consistent visibility of real time performance. 
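As a minimal sketch of what observability in code can look like (assuming Microsoft.Extensions.Logging and a hypothetical OrderService, not a prescribed implementation), structured logging combined with exception handling and a simple duration measurement might resemble the following:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Hypothetical service showing structured logging, exception handling,
// and a simple duration metric around a unit of work.
public class OrderService
{
    private readonly ILogger<OrderService> logger;

    public OrderService(ILogger<OrderService> logger) => this.logger = logger;

    public async Task ProcessAsync(string orderId)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            logger.LogInformation("Processing order {OrderId}", orderId);
            await Task.Delay(10); // placeholder for the real work
            logger.LogInformation("Processed order {OrderId} in {ElapsedMs} ms",
                orderId, stopwatch.ElapsedMilliseconds);
        }
        catch (Exception ex)
        {
            logger.LogError(ex, "Failed to process order {OrderId}", orderId);
            throw;
        }
    }
}
```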
The observability pillars, logging , metrics , and tracing , detail when to enable each of the three specific types of observability.","title":"Observability"},{"location":"developer-experience/#minimize-the-number-of-repositories","text":"Splitting a solution across multiple repositories can negatively impact the above measures. This can also negatively impact other areas such as Pull Requests, Automated Testing, Continuous Integration, and Continuous Delivery. Similar to the IDE instances, the negative impact is multiplied by the number of repositories. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [many source code repositories] = TOO MANY STEPS VS. [Clone steps + Branching steps + Commit steps + CI steps + Pull Request reviews & merges ] * [1 source code repository] = MINIMUM NUMBER OF STEPS","title":"Minimize the Number of Repositories"},{"location":"developer-experience/#atomic-pull-requests","text":"When the solution is encapsulated within a single repository, it also allows pull requests to represent a change across multiple layers. This is especially helpful when a change requires changes to a shared contract between multiple components. For example, a story requires that an api endpoint is changed. With this strategy the api and web client could be updated with the same pull request. This avoids the main branch being broken temporarily while waiting on dependent pull requests to merge.","title":"Atomic Pull Requests"},{"location":"developer-experience/#minimize-remote-dependencies-for-local-development","text":"The fewer dependencies on components that cannot run a developer's machine translate to fewer steps required to get started. Therefore, fewer dependencies will positively impact the measures above. The following strategies can be used to reduce these dependencies","title":"Minimize Remote Dependencies for Local Development"},{"location":"developer-experience/#use-an-emulator","text":"If available, emulators are implementations of technologies that are typically only available in cloud environments. A good example is the CosmosDB emulator .","title":"Use an Emulator"},{"location":"developer-experience/#use-di-toggle-to-mock-remote-dependencies","text":"When the solution depends on a technology that cannot be run on a developer's machine, the setup and testing of that solution can be challenging. One strategy that can be employed is to create the ability to swap that dependency for one that can run locally. Abstract the layer that has the remote dependency behind an interface owned by the solution (not the remote dependency). Create an implementation of that interface using a technology that can be run locally. Create a factory that decides which instance to use. This decision could be based on environment configuration (i.e. the toggle). Then, the original class that depends on the remote tech instead should depend on the factory to provide which instance to use. Much of this strategy can be simplified with proper dependency injection technique and/or framework. See example below that swaps Azure Service Bus implementation for RabbitMQ which can be run locally. 
interface IPublisher { send ( message : string ) : void } class RabbitMQPublisher implements IPublisher { send ( message : string ) { //todo: send the message via RabbitMQ } } class AzureServiceBusPublisher implements IPublisher { send ( message : string ) { //todo: send the message via Azure Service Bus } } interface IPublisherFactory { create () : IPublisher } class PublisherFactory { create () : IPublisher { // use env var value to determine which instance should be used if ( process . env . UseAsb ){ return new AzureServiceBusPublisher (); } else { return new RabbitMqPublisher (); } } } class MyService { //inject the factory constructor ( private readonly publisherFactory : IPublisherFactory ){ } sendAMessage ( message : string ) : void { //use the factory to determine which instance to use const publisher : IPublisher = this . publisherFactory . create (); publisher . send ( message ); } } The recipes section has a more complete discussion on DI as part of a high productivity inner dev loop","title":"Use DI + Toggle to Mock Remote Dependencies"},{"location":"developer-experience/client-app-inner-loop/","text":"Separating Client Apps from the Services They Consume During Development Client Apps typically rely on remote services to power their apps. However, development schedules between the client app and the services don't always fully align. For a high velocity inner dev loop, client app development must be decoupled from the backend services while still allowing the app to \"invoke\" the services for local testing. Options Several options exist to decouple client app development from the backend services. The options range from embedding mock implementation of the services into the application, others rely on simplified versions of the services. This document lists several options and discusses trade-offs. Embedded Mocks An embedded mock solution includes classes that implement the service interfaces locally. Interfaces and data classes, also called models or data transfer objects or DTOs, are often generated from the services' API specs using tools like nswag ( RicoSuter/NSwag: The Swagger/OpenAPI toolchain for .NET, ASP.NET Core and TypeScript. (github.com) ) or autorest ( Azure/autorest: OpenAPI (f.k.a Swagger) Specification code generator. Supports C#, PowerShell, Go, Java, Node.js, TypeScript, Python, Ruby (github.com) ). A simple service implementation can return a static response. For RESTful services, the JSON responses for the stubs can be stored as application resources or simply as static strings. public Task < UserProfile > GetUserAsync ( long userId , CancellationToken cancellationToken ) { PetProfile result = Newtonsoft . Json . JsonConvert . DeserializeObject < UserProfile > ( MockUserProfile . UserProfile , new Newtonsoft . Json . JsonSerializerSettings ()); return Task . FromResult ( result ); } More sophisticated can randomly return errors to test the app's resiliency code paths. Mocks can be activated via conditional compilation or dynamically via app configuration. In either case, it is recommended to ensure that mocks, service responses and externalized configurations are not included in the final release to avoid confusing behavior and inclusion of potential vulnerabilities. Sample: Registering Mocks via Dependency Injection Dependency Injection Containers like Unity ( Unity Container Introduction | Unity Container ) make it easy to switch between mock services and real service client implementations. 
Since both implement the same interface, implementations can be registered with the Unity container. public static void Bootstrap ( IUnityContainer container ) { #if DEBUG container . RegisterSingleton < IUserServiceClient , MockUserService > (); #else container . RegisterSingleton < IUserServiceClient , UserServiceClient > (); #endif } Consuming Mocks via Dependency Injection The code consuming the interfaces will not notice the difference. public class UserPageModel { private readonly IUserServiceClient userServiceClient ; public UserPageModel ( IUserServiceClient userServiceClient ) { this . userServiceClient = userServiceClient ; } // ... } Local Services The approach with Locally Running Services is to replace the call in the client from pointing to the actual endpoint (whether dev, QA, prod, etc.) to a local endpoint. This approach also enables injecting traffic capture and shaping proxies like Postman ( Postman API Platform | Sign Up for Free ) or Fiddler ( Fiddler | Web Debugging Proxy and Troubleshooting Solutions (telerik.com) ). The advantage of this approach is that the APIs are decoupled from the client and can be independently updated/modified (e.g. changing response codes, changing data) without requiring changes to the client. This helps to unlock new development scenarios and provides flexibility during the development phase. The challenge with this approach is that it does require setup, configuration, and running of the services locally. There are tools that help to simplify that process (e.g. JsonServer , Postman Mock Server ). High-Fidelity Local Services A local service stub implements the expected APIs. Just like the embedded mock, it can be generated based on existing API contracts (e.g. OpenAPI). A high-fidelity approach packages the real services together with simplified data in docker containers that can be run locally using docker-compose before the client app is started for local debugging and testing. To enable running services fully local the \"local version\" substitutes dependent cloud services with local alternatives, e.g. file storage instead of blobs, locally running SQL Server instead of SQL AzureDB. This approach also enables full fidelity integration testing without spinning up distributed deployments. Stub / Fake Services Lower fidelity approaches run stub services, that could be generated from API specs, or run fake servers like JsonServer ( JsonServer.io: A fake json server API Service for prototyping and testing. ) or Postman. All these services would respond with predetermined and configured JSON messages. How to Decide Pros Cons Example when developing for: Example When not to Use Embedded Mocks Simplifies the F5 developer experience Tightly coupled with Client More static type data scenarios Testing (e.g. unit tests, integration tests) No external dependencies to manage Hard coded data Initial integration with services Mocking via Dependency Injection can be a non-trivial effort High-Fidelity Local Services Loosely Coupled from Client Extra tooling required i.e. local infrastructure overhead URL Routes When API contract are not available Easier to independently modify response Extra setup and configuration of services Independent updates to services Can utilize HTTP traffic Easier to replace with real services at a later time Stub/Fake Services Loosely coupled from client Extra tooling required i.e. 
local infrastructure overhead Response Codes When API Contracts available Easier to independently modify response Extra setup and configuration of services Complex/variable data scenarios When API Contracts are note available Independent updates to services Might not provide full fidelity of expected API Can utilize HTTP traffic Easier to replace with real services at a later time","title":"Separating Client Apps from the Services They Consume During Development"},{"location":"developer-experience/client-app-inner-loop/#separating-client-apps-from-the-services-they-consume-during-development","text":"Client Apps typically rely on remote services to power their apps. However, development schedules between the client app and the services don't always fully align. For a high velocity inner dev loop, client app development must be decoupled from the backend services while still allowing the app to \"invoke\" the services for local testing.","title":"Separating Client Apps from the Services They Consume During Development"},{"location":"developer-experience/client-app-inner-loop/#options","text":"Several options exist to decouple client app development from the backend services. The options range from embedding mock implementation of the services into the application, others rely on simplified versions of the services. This document lists several options and discusses trade-offs.","title":"Options"},{"location":"developer-experience/client-app-inner-loop/#embedded-mocks","text":"An embedded mock solution includes classes that implement the service interfaces locally. Interfaces and data classes, also called models or data transfer objects or DTOs, are often generated from the services' API specs using tools like nswag ( RicoSuter/NSwag: The Swagger/OpenAPI toolchain for .NET, ASP.NET Core and TypeScript. (github.com) ) or autorest ( Azure/autorest: OpenAPI (f.k.a Swagger) Specification code generator. Supports C#, PowerShell, Go, Java, Node.js, TypeScript, Python, Ruby (github.com) ). A simple service implementation can return a static response. For RESTful services, the JSON responses for the stubs can be stored as application resources or simply as static strings. public Task < UserProfile > GetUserAsync ( long userId , CancellationToken cancellationToken ) { PetProfile result = Newtonsoft . Json . JsonConvert . DeserializeObject < UserProfile > ( MockUserProfile . UserProfile , new Newtonsoft . Json . JsonSerializerSettings ()); return Task . FromResult ( result ); } More sophisticated can randomly return errors to test the app's resiliency code paths. Mocks can be activated via conditional compilation or dynamically via app configuration. In either case, it is recommended to ensure that mocks, service responses and externalized configurations are not included in the final release to avoid confusing behavior and inclusion of potential vulnerabilities.","title":"Embedded Mocks"},{"location":"developer-experience/client-app-inner-loop/#sample-registering-mocks-via-dependency-injection","text":"Dependency Injection Containers like Unity ( Unity Container Introduction | Unity Container ) make it easy to switch between mock services and real service client implementations. Since both implement the same interface, implementations can be registered with the Unity container. public static void Bootstrap ( IUnityContainer container ) { #if DEBUG container . RegisterSingleton < IUserServiceClient , MockUserService > (); #else container . 
RegisterSingleton < IUserServiceClient , UserServiceClient > (); #endif }","title":"Sample: Registering Mocks via Dependency Injection"},{"location":"developer-experience/client-app-inner-loop/#consuming-mocks-via-dependency-injection","text":"The code consuming the interfaces will not notice the difference. public class UserPageModel { private readonly IUserServiceClient userServiceClient ; public UserPageModel ( IUserServiceClient userServiceClient ) { this . userServiceClient = userServiceClient ; } // ... }","title":"Consuming Mocks via Dependency Injection"},{"location":"developer-experience/client-app-inner-loop/#local-services","text":"The approach with Locally Running Services is to replace the call in the client from pointing to the actual endpoint (whether dev, QA, prod, etc.) to a local endpoint. This approach also enables injecting traffic capture and shaping proxies like Postman ( Postman API Platform | Sign Up for Free ) or Fiddler ( Fiddler | Web Debugging Proxy and Troubleshooting Solutions (telerik.com) ). The advantage of this approach is that the APIs are decoupled from the client and can be independently updated/modified (e.g. changing response codes, changing data) without requiring changes to the client. This helps to unlock new development scenarios and provides flexibility during the development phase. The challenge with this approach is that it does require setup, configuration, and running of the services locally. There are tools that help to simplify that process (e.g. JsonServer , Postman Mock Server ).","title":"Local Services"},{"location":"developer-experience/client-app-inner-loop/#high-fidelity-local-services","text":"A local service stub implements the expected APIs. Just like the embedded mock, it can be generated based on existing API contracts (e.g. OpenAPI). A high-fidelity approach packages the real services together with simplified data in docker containers that can be run locally using docker-compose before the client app is started for local debugging and testing. To enable running services fully local the \"local version\" substitutes dependent cloud services with local alternatives, e.g. file storage instead of blobs, locally running SQL Server instead of SQL AzureDB. This approach also enables full fidelity integration testing without spinning up distributed deployments.","title":"High-Fidelity Local Services"},{"location":"developer-experience/client-app-inner-loop/#stub-fake-services","text":"Lower fidelity approaches run stub services, that could be generated from API specs, or run fake servers like JsonServer ( JsonServer.io: A fake json server API Service for prototyping and testing. ) or Postman. All these services would respond with predetermined and configured JSON messages.","title":"Stub / Fake Services"},{"location":"developer-experience/client-app-inner-loop/#how-to-decide","text":"Pros Cons Example when developing for: Example When not to Use Embedded Mocks Simplifies the F5 developer experience Tightly coupled with Client More static type data scenarios Testing (e.g. unit tests, integration tests) No external dependencies to manage Hard coded data Initial integration with services Mocking via Dependency Injection can be a non-trivial effort High-Fidelity Local Services Loosely Coupled from Client Extra tooling required i.e. 
local infrastructure overhead URL Routes When API contracts are not available Easier to independently modify response Extra setup and configuration of services Independent updates to services Can utilize HTTP traffic Easier to replace with real services at a later time Stub/Fake Services Loosely coupled from client Extra tooling required i.e. local infrastructure overhead Response Codes When API Contracts are available Easier to independently modify response Extra setup and configuration of services Complex/variable data scenarios When API Contracts are not available Independent updates to services Might not provide full fidelity of expected API Can utilize HTTP traffic Easier to replace with real services at a later time","title":"How to Decide"},{"location":"developer-experience/copilots/","text":"Copilots There are a number of AI tools that can improve the developer experience. This article will discuss tooling that is available as well as advice on when it might be appropriate to use such tooling. GitHub Copilot The current version of GitHub Copilot can provide code completion in many popular IDEs. For instance, a VS Code extension can be installed from the VS Code Marketplace. It requires a GitHub account to use. For more information about what IDEs are supported, what languages are supported, cost, features, etc., please check out the information on Copilot and Copilot for Business . Some example use-cases for GitHub Copilot include: Write Documentation . For example, the above paragraph was written using Copilot. Write Unit Tests . Given that setup and assertions are often consistent across unit tests, Copilot tends to be very accurate. Unblock . It is often hard to start writing when staring at a blank page; Copilot can fill the space with something that may or may not be what you ultimately want to do, but it can help get you in the right head space. If you want Copilot to write something useful for you, try writing a comment that describes what your code is going to do - it can often take it from there. GitHub Copilot Labs Copilot has a GitHub Copilot labs extension that offers additional features that are not yet ready for prime-time. For VS Code, you can install it from the VS Code Marketplace. These features include: Explain . Copilot can explain what the code is doing in natural language. Translate . Copilot can translate code from one language to another. Brushes . You can select code that Copilot then modifies inline based on a \"brush\" you select, for example, to make the code more readable, fix bugs, improve debugging, document, etc. Generate Tests . Copilot can generate unit tests for your code. Though currently this is limited to JavaScript and TypeScript. GitHub Copilot X The next version of Copilot offers a number of new use-cases beyond code completion. These include: Chat . Rather than just providing code completion, Copilot will be able to have a conversation with you about what you want to do. It has context about the code you are working on and can provide suggestions based on that context. Beyond just writing code, consider using chat to: Build SQL Indexes . Given a query, Copilot can generate a SQL index that will improve the performance of the query. Write Regular Expressions . These are notoriously difficult to write, but Copilot can generate them for you if you give some sample input and describe what you want to extract. Improve and Validate . If you are unsure of the implications of writing code a particular way, you can ask questions about it.
For instance, you might ask if there is a way to write the code that is more performant or uses less memory. Once it gives you an opinion, you can ask it to provide documentation validating that assertion. Explain . Copilot can explain what the code is doing in natural language. Write Code . Given prompting by the developer it can write code that you can one-click deploy into existing or new files. Debug . Copilot can analyze your code and propose solutions to fix bugs. It can do most of what Labs can do with \"brushes\" as \"topics\", but whereas Labs changes the code in your file, the chat functionality just shows what it would change in the window. However, there is also an \"inline mode\" for GitHub Copilot Chat that allows you to make changes to your code inline which does not have this same limitation. ChatGPT / Bing Chat For coding, generic AI chat tools such as ChatGPT and Bing Chat are less useful, but they still have their place. GitHub Copilot will only answer \"questions about coding\" and it's interpretation of that rule can be a little restrictive. Some cases for using ChatGPT or Bing Chat include: Write Documentation . Copilot can write documentation, but using ChatGPT or Bing Chat, you can expand your documentation to include business information, use-cases, additional context, etc. Change Perspective . ChatGPT can impersonate a persona or even a system and answer questions from that perspective. For example, you can ask it to explain what a particular piece of code does from the perspective of a user. You might have ChatGPT imagine it is a database administrator and ask it to explain how to improve a particular query. When using Bing Chat, experiment with modes, sometimes changing to Creative Mode can give the results you need. Prompt Engineering Chat AI tools are only as good as the prompts you give them. The quality and appropriateness of the output can vary greatly depending on the prompt. In addition, many of these tools restrict the number of prompts you can send in a given amount of time. To learn more about prompt engineering, you might review some open source documentation here . Considerations It is important when using AI tools to understand how the data (including private or commercial code) might be used by the system. Read more about how GitHub Copilot handles your data and code here .","title":"Copilots"},{"location":"developer-experience/copilots/#copilots","text":"There are a number of AI tools that can improve the developer experience. This article will discuss tooling that is available as well as advice on when it might be appropriate to use such tooling.","title":"Copilots"},{"location":"developer-experience/copilots/#github-copilot","text":"The current version of GitHub Copilot can provide code completion in many popular IDEs. For instance, the VS Code extension that can be installed from the VS Code Marketplace. It requires a GitHub account to use. For more information about what IDEs are supported, what languages are supported, cost, features, etc., please checkout out the information on Copilot and Copilot for Business . Some example use-cases for GitHub Copilot include: Write Documentation . For example, the above paragraph was written using Copilot. Write Unit Tests . Given that setup and assertions are often consistent across unit tests, Copilot tends to be very accurate. Unblock . 
It is often hard start writing when staring at a blank page, Copilot can fill the space with something that may or may not be what you ultimately want to do, but it can help get you in the right head space. If you want Copilot to write something useful for you, try writing a comment that describes what your code is going to do - it can often take it from there.","title":"GitHub Copilot"},{"location":"developer-experience/copilots/#github-copilot-labs","text":"Copilot has a GitHub Copilot labs extension that offers additional features that are not yet ready for prime-time. For VS Code, you can install it from the VS Code Marketplace. These features include: Explain . Copilot can explain what the code is doing in natural language. Translate . Copilot can translate code from one language to another. Brushes . You can select code that Copilot then modifies inline based on a \"brush\" you select, for example, to make the code more readable, fix bugs, improve debugging, document, etc. Generate Tests . Copilot can generate unit tests for your code. Though currently this is limited to JavaScript and TypeScript.","title":"GitHub Copilot Labs"},{"location":"developer-experience/copilots/#github-copilot-x","text":"The next version of Copilot offers a number of new use-cases beyond code completion. These include: Chat . Rather than just providing code completion, Copilot will be able to have a conversation with you about what you want to do. It has context about the code you are working on and can provide suggestions based on that context. Beyond just writing code, consider using chat to: Build SQL Indexes . Given a query, Copilot can generate a SQL index that will improve the performance of the query. Write Regular Expressions . These are notoriously difficult to write, but Copilot can generate them for you if you give some sample input and describe what you want to extract. Improve and Validate . If you are unsure of the implications of writing code a particular way, you can ask questions about it. For instance, you might ask if there is a way to write the code that is more performant or uses less memory. Once it gives you an opinion, you can ask it to provide documentation validating that assertion. Explain . Copilot can explain what the code is doing in natural language. Write Code . Given prompting by the developer it can write code that you can one-click deploy into existing or new files. Debug . Copilot can analyze your code and propose solutions to fix bugs. It can do most of what Labs can do with \"brushes\" as \"topics\", but whereas Labs changes the code in your file, the chat functionality just shows what it would change in the window. However, there is also an \"inline mode\" for GitHub Copilot Chat that allows you to make changes to your code inline which does not have this same limitation.","title":"GitHub Copilot X"},{"location":"developer-experience/copilots/#chatgpt-bing-chat","text":"For coding, generic AI chat tools such as ChatGPT and Bing Chat are less useful, but they still have their place. GitHub Copilot will only answer \"questions about coding\" and it's interpretation of that rule can be a little restrictive. Some cases for using ChatGPT or Bing Chat include: Write Documentation . Copilot can write documentation, but using ChatGPT or Bing Chat, you can expand your documentation to include business information, use-cases, additional context, etc. Change Perspective . ChatGPT can impersonate a persona or even a system and answer questions from that perspective. 
For example, you can ask it to explain what a particular piece of code does from the perspective of a user. You might have ChatGPT imagine it is a database administrator and ask it to explain how to improve a particular query. When using Bing Chat, experiment with modes, sometimes changing to Creative Mode can give the results you need.","title":"ChatGPT / Bing Chat"},{"location":"developer-experience/copilots/#prompt-engineering","text":"Chat AI tools are only as good as the prompts you give them. The quality and appropriateness of the output can vary greatly depending on the prompt. In addition, many of these tools restrict the number of prompts you can send in a given amount of time. To learn more about prompt engineering, you might review some open source documentation here .","title":"Prompt Engineering"},{"location":"developer-experience/copilots/#considerations","text":"It is important when using AI tools to understand how the data (including private or commercial code) might be used by the system. Read more about how GitHub Copilot handles your data and code here .","title":"Considerations"},{"location":"developer-experience/cross-platform-tasks/","text":"Cross Platform Tasks There are several options to alleviate cross-platform compatibility issues. Running tasks in a container Using the tasks-system in VS Code which provides options to allow commands to be executed specific to an operating system. Docker or Container Based Using containers as development machines allows developers to get started with minimal setup and abstracts the development environment from the host OS by having it run in a container. DevContainers can also help in standardizing the local developer experience across the team. The following are some good resources to get started with running tasks in DevContainers Developing inside a container . Tutorial on Development in Containers For samples projects and dev container templates see VS Code Dev Containers Recipe Dev Containers Library Tasks in VSCode Running Node.js The example below offers insight into running Node.js executable as a command with tasks.json and how it can be treated differently on Windows and Linux. { \"label\" : \"Run Node\" , \"type\" : \"process\" , \"windows\" : { \"command\" : \"C:\\\\Program Files\\\\nodejs\\\\node.exe\" }, \"linux\" : { \"command\" : \"/usr/bin/node\" } } In this example, to run Node.js, there is a specific windows command, and a specific linux command. This allows for platform specific properties. When these are defined, they will be used instead of the default properties when the command is executed on the Windows operating system or on Linux. Custom Tasks Not all scripts or tasks can be auto-detected in the workspace. It may be necessary at times to defined your own custom tasks. In this example, we have a script to run in order to set up some environment correctly. The script is stored in a folder inside your workspace and named test.sh for Linux & macOS and test.cmd for Windows. With the tasks.json file, the execution of this script can be made possible with a custom task that defines what to do on different operating systems. { \"version\" : \"2.0.0\" , \"tasks\" : [ { \"label\" : \"Run tests\" , \"type\" : \"shell\" , \"command\" : \"./scripts/test.sh\" , \"windows\" : { \"command\" : \".\\\\scripts\\\\test.cmd\" }, \"group\" : \"test\" , \"presentation\" : { \"reveal\" : \"always\" , \"panel\" : \"new\" } } ] } The command here is a shell command and tells the system to run either the test.sh or test.cmd. 
By default, it will run test.sh with that given path. This example here also defines Windows specific properties and tells it execute test.cmd instead of the default. Resources VS Code Docs - operating system specific properties","title":"Cross Platform Tasks"},{"location":"developer-experience/cross-platform-tasks/#cross-platform-tasks","text":"There are several options to alleviate cross-platform compatibility issues. Running tasks in a container Using the tasks-system in VS Code which provides options to allow commands to be executed specific to an operating system.","title":"Cross Platform Tasks"},{"location":"developer-experience/cross-platform-tasks/#docker-or-container-based","text":"Using containers as development machines allows developers to get started with minimal setup and abstracts the development environment from the host OS by having it run in a container. DevContainers can also help in standardizing the local developer experience across the team. The following are some good resources to get started with running tasks in DevContainers Developing inside a container . Tutorial on Development in Containers For samples projects and dev container templates see VS Code Dev Containers Recipe Dev Containers Library","title":"Docker or Container Based"},{"location":"developer-experience/cross-platform-tasks/#tasks-in-vscode","text":"","title":"Tasks in VSCode"},{"location":"developer-experience/cross-platform-tasks/#running-nodejs","text":"The example below offers insight into running Node.js executable as a command with tasks.json and how it can be treated differently on Windows and Linux. { \"label\" : \"Run Node\" , \"type\" : \"process\" , \"windows\" : { \"command\" : \"C:\\\\Program Files\\\\nodejs\\\\node.exe\" }, \"linux\" : { \"command\" : \"/usr/bin/node\" } } In this example, to run Node.js, there is a specific windows command, and a specific linux command. This allows for platform specific properties. When these are defined, they will be used instead of the default properties when the command is executed on the Windows operating system or on Linux.","title":"Running Node.js"},{"location":"developer-experience/cross-platform-tasks/#custom-tasks","text":"Not all scripts or tasks can be auto-detected in the workspace. It may be necessary at times to defined your own custom tasks. In this example, we have a script to run in order to set up some environment correctly. The script is stored in a folder inside your workspace and named test.sh for Linux & macOS and test.cmd for Windows. With the tasks.json file, the execution of this script can be made possible with a custom task that defines what to do on different operating systems. { \"version\" : \"2.0.0\" , \"tasks\" : [ { \"label\" : \"Run tests\" , \"type\" : \"shell\" , \"command\" : \"./scripts/test.sh\" , \"windows\" : { \"command\" : \".\\\\scripts\\\\test.cmd\" }, \"group\" : \"test\" , \"presentation\" : { \"reveal\" : \"always\" , \"panel\" : \"new\" } } ] } The command here is a shell command and tells the system to run either the test.sh or test.cmd. By default, it will run test.sh with that given path. 
This example here also defines Windows specific properties and tells it execute test.cmd instead of the default.","title":"Custom Tasks"},{"location":"developer-experience/cross-platform-tasks/#resources","text":"VS Code Docs - operating system specific properties","title":"Resources"},{"location":"developer-experience/devcontainers-getting-started/","text":"Dev Containers: Getting Started If you are a developer and have experience with Visual Studio Code (VS Code) or Docker, then it's probably time you look at development containers (dev containers). This readme is intended to assist developers in the decision-making process needed to build dev containers. The guidance provided should be especially helpful if you are experiencing VS Code dev containers for the first time. Note: This guide is not about setting up a Docker file for deploying a running Python program for CI/CD. Prerequisites Experience with VS Code Experience with Docker What are Dev Containers? Development containers are a VS Code feature that allows developers to package a local development tool stack into the internals of a Docker container while also bringing the VS Code UI experience with them. Have you ever set a breakpoint inside a Docker container? Maybe not. Dev containers make that possible. This is all made possible through a VS Code extension called the Remote Development Extension Pack that works together with Docker to spin-up a VS Code Server within a Docker container. The VS Code UI component remains local, but your working files are volume mounted into the container. The diagram below, taken directly from the official VS Code docs , illustrates this: If the above diagram is not clear, a basic analogy that might help you intuitively understand dev containers is to think of them as a union between Docker's interactive mode ( docker exec -it 987654e0ff32 ), and the VS Code UI experience that you are used to. To set yourself up for the dev container experience described above, use your VS Code's Extension Marketplace to install the Remote Development Extension Pack . How can Dev Containers Improve Project Collaboration? VS Code dev containers have improved project collaboration between developers on recent team projects by addressing two very specific problems: Inconsistent local developer experiences within a team. Slow onboarding of developers joining a project. The problems listed above were addressed by configuring and then sharing a dev container definition. Dev containers are defined by their base image, and the artifacts that support that base image. The base image and the artifacts that come with it live in the .devcontainer directory. This directory is where configuration begins. A central artifact to the dev container definition is a configuration file called devcontainer.json . This file orchestrates the artifacts needed to support the base image and the dev container lifecycle. Installation of the Remote Development Extension Pack is required to enable this orchestration within a project repo. All developers on the team are expected to share and use the dev container definition (.devcontainer directory) in order to spin-up a container. This definition provides consistent tooling for locally developing an application across a team. The code snippets below demonstrate the common location of a .devcontainer directory and devcontainer.json file within a project repository. They also highlight the correct way to reference a Docker file. 
$ tree vs-code-remote-try-python # main repo directory \u2514\u2500\u2500\u2500.devcontainers \u251c\u2500\u2500\u2500Dockerfile \u251c\u2500\u2500\u2500devcontainer.json # devco nta i ner .jso n { \"name\" : \"Python 3\" , \"build\" : { \"dockerfile\" : \"Dockerfile\" , \"context\" : \"..\" , // Update 'VARIANT' to pick a Python version: 3, 3.6, 3.7, 3.8 \"args\" : { \"VARIANT\" : \"3.8\" } }, } For a list of devcontainer.json configuration properties, visit VS Code documentation on dev container properties . How do I Decide Which Dev Container is Right for my Use Case? Fortunately, VS Code has a repo gallery of platform specific folders that host dev container definitions (.devcontainer directories) to make getting started with dev containers easier. The code snippet below shows a list of gallery folders that come directly from the VS Code dev container gallery repo : $ tree vs-code-dev-containers # main repo directory \u2514\u2500\u2500\u2500containers \u251c\u2500\u2500\u2500dotnetcore | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500python-3 | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500ubuntu | \u2514\u2500\u2500\u2500.devcontainers # dev container \u2514\u2500\u2500\u2500.... Here are the final high-level steps it takes to build a dev container: Decide which platform you'd like to build a local development tool stack around. Browse the VS Code provided dev container gallery of project folders that target your platform and choose the most appropriate one. Inspect the dev container definitions (.devcontainer directory) of a project for the base image, and the artifacts that support that base image. Use what you've discovered to begin setting up the dev container as it is, extending it or building your own from scratch. Going further There are use cases where you would want to go further in configuring your Dev Container. More details here","title":"Dev Containers: Getting Started"},{"location":"developer-experience/devcontainers-getting-started/#dev-containers-getting-started","text":"If you are a developer and have experience with Visual Studio Code (VS Code) or Docker, then it's probably time you look at development containers (dev containers). This readme is intended to assist developers in the decision-making process needed to build dev containers. The guidance provided should be especially helpful if you are experiencing VS Code dev containers for the first time. Note: This guide is not about setting up a Docker file for deploying a running Python program for CI/CD.","title":"Dev Containers: Getting Started"},{"location":"developer-experience/devcontainers-getting-started/#prerequisites","text":"Experience with VS Code Experience with Docker","title":"Prerequisites"},{"location":"developer-experience/devcontainers-getting-started/#what-are-dev-containers","text":"Development containers are a VS Code feature that allows developers to package a local development tool stack into the internals of a Docker container while also bringing the VS Code UI experience with them. Have you ever set a breakpoint inside a Docker container? Maybe not. Dev containers make that possible. This is all made possible through a VS Code extension called the Remote Development Extension Pack that works together with Docker to spin-up a VS Code Server within a Docker container. The VS Code UI component remains local, but your working files are volume mounted into the container. 
The diagram below, taken directly from the official VS Code docs , illustrates this: If the above diagram is not clear, a basic analogy that might help you intuitively understand dev containers is to think of them as a union between Docker's interactive mode ( docker exec -it 987654e0ff32 ), and the VS Code UI experience that you are used to. To set yourself up for the dev container experience described above, use your VS Code's Extension Marketplace to install the Remote Development Extension Pack .","title":"What are Dev Containers?"},{"location":"developer-experience/devcontainers-getting-started/#how-can-dev-containers-improve-project-collaboration","text":"VS Code dev containers have improved project collaboration between developers on recent team projects by addressing two very specific problems: Inconsistent local developer experiences within a team. Slow onboarding of developers joining a project. The problems listed above were addressed by configuring and then sharing a dev container definition. Dev containers are defined by their base image, and the artifacts that support that base image. The base image and the artifacts that come with it live in the .devcontainer directory. This directory is where configuration begins. A central artifact to the dev container definition is a configuration file called devcontainer.json . This file orchestrates the artifacts needed to support the base image and the dev container lifecycle. Installation of the Remote Development Extension Pack is required to enable this orchestration within a project repo. All developers on the team are expected to share and use the dev container definition (.devcontainer directory) in order to spin-up a container. This definition provides consistent tooling for locally developing an application across a team. The code snippets below demonstrate the common location of a .devcontainer directory and devcontainer.json file within a project repository. They also highlight the correct way to reference a Docker file. $ tree vs-code-remote-try-python # main repo directory \u2514\u2500\u2500\u2500.devcontainers \u251c\u2500\u2500\u2500Dockerfile \u251c\u2500\u2500\u2500devcontainer.json # devco nta i ner .jso n { \"name\" : \"Python 3\" , \"build\" : { \"dockerfile\" : \"Dockerfile\" , \"context\" : \"..\" , // Update 'VARIANT' to pick a Python version: 3, 3.6, 3.7, 3.8 \"args\" : { \"VARIANT\" : \"3.8\" } }, } For a list of devcontainer.json configuration properties, visit VS Code documentation on dev container properties .","title":"How can Dev Containers Improve Project Collaboration?"},{"location":"developer-experience/devcontainers-getting-started/#how-do-i-decide-which-dev-container-is-right-for-my-use-case","text":"Fortunately, VS Code has a repo gallery of platform specific folders that host dev container definitions (.devcontainer directories) to make getting started with dev containers easier. The code snippet below shows a list of gallery folders that come directly from the VS Code dev container gallery repo : $ tree vs-code-dev-containers # main repo directory \u2514\u2500\u2500\u2500containers \u251c\u2500\u2500\u2500dotnetcore | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500python-3 | \u2514\u2500\u2500\u2500.devcontainers # dev container \u251c\u2500\u2500\u2500ubuntu | \u2514\u2500\u2500\u2500.devcontainers # dev container \u2514\u2500\u2500\u2500.... 
Here are the final high-level steps it takes to build a dev container: Decide which platform you'd like to build a local development tool stack around. Browse the VS Code provided dev container gallery of project folders that target your platform and choose the most appropriate one. Inspect the dev container definitions (.devcontainer directory) of a project for the base image, and the artifacts that support that base image. Use what you've discovered to begin setting up the dev container as it is, extending it or building your own from scratch.","title":"How do I Decide Which Dev Container is Right for my Use Case?"},{"location":"developer-experience/devcontainers-getting-started/#going-further","text":"There are use cases where you would want to go further in configuring your Dev Container. More details here","title":"Going further"},{"location":"developer-experience/devcontainers-going-further/","text":"Dev Containers: Going further Dev Containers allow developers to share a common working environment, ensuring that the runtime and all dependencies versions are consistent for all developers. Dev containers also allow us to: Leverage existing tools to enhance the Dev Containers with more features, Provide custom tools (such as scripts) for other developers. Existing tools In the development phase, you will most probably need to use tools not installed by default in your Dev Container. For instance, if your project's target is to be deployed on Azure, you will need Azure-cli and maybe Terraform for resources and application deployment. You can find such Dev Containers in the VS Code dev container gallery repo . Some other tools may be: Linters for markdown files, Linters for bash scripts, Etc... Linting files that are not the source code can ensure a common format with common rules for each developer. These checks should be also run in a Continuous Integration Pipeline , but it is a good practice to run them prior opening a Pull Request . Limitation of custom tools If you decide to include Azure-cli in your Dev Container, developers will be able to run commands against their tenant. However, to make the developers' lives easier, we could go further by letting them prefill their connection information, such as the tenant ID and the subscription ID in a secure and persistent way (do not forget that your Dev Container, being a Docker container, might get deleted, or the image could be rebuilt, hence, all customization inside will be lost). One way to achieve this is to leverage environment variables, with untracked .env file part of the solution being injected in the Dev Container. Consider the following files structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500config | \u251c\u2500\u2500\u2500.env | \u251c\u2500\u2500\u2500.env-sample The file config/.env-sample is a tracked file where anyone can find environment variables to set (with no values, obviously): TENANT_ID = SUBSCRIPTION_ID = Then, each developer who clones the repository can create the file config/.env and fills it in with the appropriate values. In order now to inject the .env file into the container, you can update the file devcontainer.json with the following: { ... \"runArgs\" : [ \"--env-file\" , \"config/.env\" ], ... } As soon as the Dev Container is started, these environment variables are sent to the container. 
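If the project inside the Dev Container is Node/TypeScript based, application or tooling code can then read the injected values straight from the process environment. The sketch below is illustrative only, assuming the TENANT_ID and SUBSCRIPTION_ID variables from the sample above; the requireEnv helper and config.ts file name are hypothetical:

```typescript
// config.ts - minimal sketch; assumes the Dev Container was started with --env-file config/.env
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    // Fail fast with a hint that points new developers to the tracked sample file
    throw new Error(`Missing environment variable ${name}. Create config/.env from config/.env-sample.`);
  }
  return value;
}

export const azureConfig = {
  tenantId: requireEnv("TENANT_ID"),
  subscriptionId: requireEnv("SUBSCRIPTION_ID"),
};
```

Failing fast with a pointer to the sample file makes it obvious to a newcomer which values they still need to fill in.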
Another approach would be to use Docker Compose, a little bit more complex, and probably too much for just environment variables. Using Docker Compose can unlock other settings such as custom dns, ports forwarding or multiple containers. To achieve this, you need to add a file .devcontainer/docker-compose.yml with the following: version : '3' services : my-workspace : env_file : ../config/.env build : context : . dockerfile : Dockerfile command : sleep infinity To use the docker-compose.yml file instead of Dockerfile , we need to adjust devcontainer.json with: { \"name\" : \"My Application\" , \"dockerComposeFile\" : [ \"docker-compose.yml\" ], \"service\" : \"my-workspace\" ... } This approach can be applied for many other tools by preparing what would be required. The idea is to simplify developers' lives and new developers joining the project. Custom tools While working on a project, any developer might end up writing a script to automate a task. This script can be in bash , python or whatever scripting language they are comfortable with. Let's say you want to ensure that all markdown files written are validated against specific rules you have set up. As we have seen above, you can include the tool markdownlint in your Dev Container . Having the tool installed does not mean developer will know how to use it! Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500scripts | \u251c\u2500\u2500\u2500check-markdown.sh \u2514\u2500\u2500\u2500.markdownlint.json The file .devcontainer/Dockerfile installs markdownlint ... RUN apt-get update \\ && export DEBIAN_FRONTEND = noninteractive \\ && apt-get install -y nodejs npm # Add NodeJS tools RUN npm install -g markdownlint-cli ... The file .markdownlint.json contains the rules you want to validate in your markdown files (please refer to the markdownlint site for details). And finally, the script scripts/check-markdown.sh contains the following code to execute markdownlint : # Get the repository root repoRoot = \" $( cd \" $( dirname \" ${ BASH_SOURCE [0] } \" ) /..\" >/dev/null 2 > & 1 && pwd ) \" # Execute markdownlint for the entire solution markdownlint -c \" ${ repoRoot } \" /.markdownlint.json When the Dev Container is loaded, any developer can now run this script in their terminal: /> ./scripts/check-markdown.sh This is a small use case, there are unlimited other possibilities to capitalize on work done by developers to save time. Other considerations Platform architecture When installing tooling, you also need to ensure that you know what host computers developers are using. All Intel based computers, whether they are running Windows, Linux or MacOs will have the same behavior. However, the latest Mac architecture (Apple M1/Silicon) being ARM64, means that the behavior is not the same when building Dev Containers. For instance, if you want to install Azure-cli in your Dev Container, you won't be able to do it the same way you do it for Intel based machines. On Intel based computers you can install the deb package. However, this package is not available on ARM architecture. The only way to install Azure-cli on Linux ARM is via the Python installer pip . 
To achieve this you need to check the architecture of the host building the Dev Container, either in the Dockerfile, or by calling an external bash script to install remaining tools not having a universal version. Here is a snippet to call from the Dockerfile: # If Intel based, then use the deb file if [[ ` dpkg --print-architecture ` == \"amd64\" ]] ; then sudo curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash ; else # arm based, install pip (and gcc) then azure-cli sudo apt-get -y install gcc python3 -m pip install --upgrade pip python3 -m pip install azure-cli fi Reuse of credentials for GitHub If you develop inside a Dev Container, you will also want to share your GitHub credentials between your host and the Dev Container. Doing so, you would avoid copying your ssh keys back and forth (if you are using ssh to access your repositories). One approach would be to mount your local ~/.ssh folder into your Dev Container. You can either use the mounts option of the devcontainer.json , or use Docker Compose Using mounts : { ... \"mounts\" : [ \"source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind\" ], ... } As you can see, ${localEnv:HOME} returns the host home folder, and it maps it to the container home folder. Using Docker Compose: version : '3' services : my-worspace : env_file : ../configs/.env build : context : . dockerfile : Dockerfile volumes : - \"~/.ssh:/home/alex/.ssh\" command : sleep infinity Please note that using Docker Compose requires to edit the devcontainer.json file as we have seen above. You can now access GitHub using the same credentials as your host machine, without worrying of persistence. Allow some customization As a final note, it is also interesting to leave developers some flexibility in their environment for customization. For instance, one might want to add aliases to their environment. However, changing the ~/.bashrc file in the Dev Container is not a good approach as the container might be destroyed. There are numerous ways to set persistence, here is one approach. Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500me | \u251c\u2500\u2500\u2500bashrc_extension The folder me is untracked in the repository, leaving developers the flexibility to add personal resources. One of these resources can be a .bashrc extension containing customization. For instance: # Sample alias alias gaa = \"git add --all\" We can now adapt our Dockerfile to load these changes when the Docker image is built (and of course, do nothing if there is no file): ... RUN echo \"[ -f PATH_TO_WORKSPACE/me/bashrc_extension ] && . PATH_TO_WORKSPACE/me/bashrc_extension\" >> ~/.bashrc ; ...","title":"Dev Containers: Going further"},{"location":"developer-experience/devcontainers-going-further/#dev-containers-going-further","text":"Dev Containers allow developers to share a common working environment, ensuring that the runtime and all dependencies versions are consistent for all developers. 
Dev containers also allow us to: Leverage existing tools to enhance the Dev Containers with more features, Provide custom tools (such as scripts) for other developers.","title":"Dev Containers: Going further"},{"location":"developer-experience/devcontainers-going-further/#existing-tools","text":"In the development phase, you will most probably need to use tools not installed by default in your Dev Container. For instance, if your project's target is to be deployed on Azure, you will need Azure-cli and maybe Terraform for resources and application deployment. You can find such Dev Containers in the VS Code dev container gallery repo . Some other tools may be: Linters for markdown files, Linters for bash scripts, Etc... Linting files that are not the source code can ensure a common format with common rules for each developer. These checks should be also run in a Continuous Integration Pipeline , but it is a good practice to run them prior opening a Pull Request .","title":"Existing tools"},{"location":"developer-experience/devcontainers-going-further/#limitation-of-custom-tools","text":"If you decide to include Azure-cli in your Dev Container, developers will be able to run commands against their tenant. However, to make the developers' lives easier, we could go further by letting them prefill their connection information, such as the tenant ID and the subscription ID in a secure and persistent way (do not forget that your Dev Container, being a Docker container, might get deleted, or the image could be rebuilt, hence, all customization inside will be lost). One way to achieve this is to leverage environment variables, with untracked .env file part of the solution being injected in the Dev Container. Consider the following files structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500config | \u251c\u2500\u2500\u2500.env | \u251c\u2500\u2500\u2500.env-sample The file config/.env-sample is a tracked file where anyone can find environment variables to set (with no values, obviously): TENANT_ID = SUBSCRIPTION_ID = Then, each developer who clones the repository can create the file config/.env and fills it in with the appropriate values. In order now to inject the .env file into the container, you can update the file devcontainer.json with the following: { ... \"runArgs\" : [ \"--env-file\" , \"config/.env\" ], ... } As soon as the Dev Container is started, these environment variables are sent to the container. Another approach would be to use Docker Compose, a little bit more complex, and probably too much for just environment variables. Using Docker Compose can unlock other settings such as custom dns, ports forwarding or multiple containers. To achieve this, you need to add a file .devcontainer/docker-compose.yml with the following: version : '3' services : my-workspace : env_file : ../config/.env build : context : . dockerfile : Dockerfile command : sleep infinity To use the docker-compose.yml file instead of Dockerfile , we need to adjust devcontainer.json with: { \"name\" : \"My Application\" , \"dockerComposeFile\" : [ \"docker-compose.yml\" ], \"service\" : \"my-workspace\" ... } This approach can be applied for many other tools by preparing what would be required. 
The idea is to simplify developers' lives and new developers joining the project.","title":"Limitation of custom tools"},{"location":"developer-experience/devcontainers-going-further/#custom-tools","text":"While working on a project, any developer might end up writing a script to automate a task. This script can be in bash , python or whatever scripting language they are comfortable with. Let's say you want to ensure that all markdown files written are validated against specific rules you have set up. As we have seen above, you can include the tool markdownlint in your Dev Container . Having the tool installed does not mean developer will know how to use it! Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500scripts | \u251c\u2500\u2500\u2500check-markdown.sh \u2514\u2500\u2500\u2500.markdownlint.json The file .devcontainer/Dockerfile installs markdownlint ... RUN apt-get update \\ && export DEBIAN_FRONTEND = noninteractive \\ && apt-get install -y nodejs npm # Add NodeJS tools RUN npm install -g markdownlint-cli ... The file .markdownlint.json contains the rules you want to validate in your markdown files (please refer to the markdownlint site for details). And finally, the script scripts/check-markdown.sh contains the following code to execute markdownlint : # Get the repository root repoRoot = \" $( cd \" $( dirname \" ${ BASH_SOURCE [0] } \" ) /..\" >/dev/null 2 > & 1 && pwd ) \" # Execute markdownlint for the entire solution markdownlint -c \" ${ repoRoot } \" /.markdownlint.json When the Dev Container is loaded, any developer can now run this script in their terminal: /> ./scripts/check-markdown.sh This is a small use case, there are unlimited other possibilities to capitalize on work done by developers to save time.","title":"Custom tools"},{"location":"developer-experience/devcontainers-going-further/#other-considerations","text":"","title":"Other considerations"},{"location":"developer-experience/devcontainers-going-further/#platform-architecture","text":"When installing tooling, you also need to ensure that you know what host computers developers are using. All Intel based computers, whether they are running Windows, Linux or MacOs will have the same behavior. However, the latest Mac architecture (Apple M1/Silicon) being ARM64, means that the behavior is not the same when building Dev Containers. For instance, if you want to install Azure-cli in your Dev Container, you won't be able to do it the same way you do it for Intel based machines. On Intel based computers you can install the deb package. However, this package is not available on ARM architecture. The only way to install Azure-cli on Linux ARM is via the Python installer pip . To achieve this you need to check the architecture of the host building the Dev Container, either in the Dockerfile, or by calling an external bash script to install remaining tools not having a universal version. 
Here is a snippet to call from the Dockerfile: # If Intel based, then use the deb file if [[ ` dpkg --print-architecture ` == \"amd64\" ]] ; then sudo curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash ; else # arm based, install pip (and gcc) then azure-cli sudo apt-get -y install gcc python3 -m pip install --upgrade pip python3 -m pip install azure-cli fi","title":"Platform architecture"},{"location":"developer-experience/devcontainers-going-further/#reuse-of-credentials-for-github","text":"If you develop inside a Dev Container, you will also want to share your GitHub credentials between your host and the Dev Container. Doing so, you would avoid copying your ssh keys back and forth (if you are using ssh to access your repositories). One approach would be to mount your local ~/.ssh folder into your Dev Container. You can either use the mounts option of the devcontainer.json , or use Docker Compose Using mounts : { ... \"mounts\" : [ \"source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind\" ], ... } As you can see, ${localEnv:HOME} returns the host home folder, and it maps it to the container home folder. Using Docker Compose: version : '3' services : my-worspace : env_file : ../configs/.env build : context : . dockerfile : Dockerfile volumes : - \"~/.ssh:/home/alex/.ssh\" command : sleep infinity Please note that using Docker Compose requires to edit the devcontainer.json file as we have seen above. You can now access GitHub using the same credentials as your host machine, without worrying of persistence.","title":"Reuse of credentials for GitHub"},{"location":"developer-experience/devcontainers-going-further/#allow-some-customization","text":"As a final note, it is also interesting to leave developers some flexibility in their environment for customization. For instance, one might want to add aliases to their environment. However, changing the ~/.bashrc file in the Dev Container is not a good approach as the container might be destroyed. There are numerous ways to set persistence, here is one approach. Consider the following solution structure: My Application # main repo directory \u2514\u2500\u2500\u2500.devcontainer | \u251c\u2500\u2500\u2500Dockerfile | \u251c\u2500\u2500\u2500docker-compose.yml | \u251c\u2500\u2500\u2500devcontainer.json \u2514\u2500\u2500\u2500me | \u251c\u2500\u2500\u2500bashrc_extension The folder me is untracked in the repository, leaving developers the flexibility to add personal resources. One of these resources can be a .bashrc extension containing customization. For instance: # Sample alias alias gaa = \"git add --all\" We can now adapt our Dockerfile to load these changes when the Docker image is built (and of course, do nothing if there is no file): ... RUN echo \"[ -f PATH_TO_WORKSPACE/me/bashrc_extension ] && . PATH_TO_WORKSPACE/me/bashrc_extension\" >> ~/.bashrc ; ...","title":"Allow some customization"},{"location":"developer-experience/execute-local-pipeline-with-docker/","text":"Executing Pipelines Locally Abstract Having the ability to execute pipeline activities locally has been identified as an opportunity to promote positive developer experience. In this document we will explore a solution which will allow us to have the local CI experience to be as similar as possible to the remote process in the CI server. Using the suggested method will allow us to: Build Lint Unit test E2E test Run Solution Be OS and environment agnostic. Enter Docker Compose Docker Compose allows you to build push or run multi-container Docker applications. 
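One practical way to keep the local and remote experience identical is to wrap the compose invocation in a single script that both developers and the CI server call. The sketch below is only an illustration, assuming a Node/TypeScript repository; the scripts/local-ci.ts file name is hypothetical, and the docker-compose command it shells out to is shown in full later on this page:

```typescript
// scripts/local-ci.ts - hypothetical wrapper so developers and CI run the exact same command
import { execSync } from "child_process";

// Optionally pass service names (e.g. "app") to run only part of the pipeline
const services = process.argv.slice(2).join(" ");

try {
  // Builds the images and starts the services defined in docker-compose.yml;
  // a failing build makes docker-compose exit non-zero, which throws here
  execSync(`docker-compose up --build -d ${services}`.trim(), { stdio: "inherit" });
} catch {
  process.exit(1); // surface the failure to the calling shell or CI step
}
```

Because the wrapper only shells out to docker-compose, it stays OS and environment agnostic, in line with the goals listed in the abstract.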
Method of Work Dockerize your application(s), including a build step if possible. Add a step in your docker file to execute unit tests. Add a step in the docker file for linting. Create a new dockerfile, possibly in a different folder, which executes end-to-end tests against the cluster. Make sure the default endpoints are configurable (This will become handy in your remote CI server, where you will be able to test against a live environment, if you choose to). Create a docker-compose file which allows you to choose which of the services to run. The default will run all applications and tests, and an optional parameter can run specific services, for example only the application without the tests. Prerequisites Docker Optional: if you clone the sample app, you need to have dotnet core installed. Step by Step with Examples For this tutorial we are going to use a sample dotnet core api application . Here is the docker file for the sample app: # https://hub.docker.com/_/microsoft-dotnet FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build WORKDIR /app # copy csproj and restore as distinct layers COPY ./ ./ RUN dotnet restore RUN dotnet test # copy everything else and build app COPY SampleApp/. ./ RUN dotnet publish -c release -o out --no-restore # final stage/image FROM mcr.microsoft.com/dotnet/aspnet:5.0 WORKDIR /app COPY --from = build /app/out . ENTRYPOINT [ \"dotnet\" , \"SampleNetApi.dll\" ] This script restores all dependencies, builds and runs tests. The dotnet app includes stylecop which fails the build in case of linting issues. Next we will also create a dockerfile to perform an end-to-end test. Usually this will look like a set of scripts, or a dedicated app which performs actual HTTP calls to a running application. For the sake of simplicity the dockerfile itself will run a simple curl command: FROM alpine:3.7 RUN apk --no-cache add curl ENTRYPOINT [ \"curl\" , \"0.0.0.0:8080/weatherforecast\" ] Now we are ready to combine both of the dockerfiles in a docker-compose script: version: '3' services: app: image: app:0.01 build: context: . ports: - \"8080:80\" e2e: image: e2e:0.01 build: context: ./E2E The docker-compose script will launch the 2 dockerfiles, and it will build them if they were not built before. The following command will run docker compose: docker-compose up --build -d Once the images are up, you can make calls to the service. The e2e image will perform the set of e2e tests. If you want to skip the tests, you can simply tell compose to run a specific service by appending the name of the service, as follows: docker-compose up --build -d app Now you have a local script which builds and tests you application. The next step would be make your CI run the docker-compose script. Here is an example of a yaml file used by Azure DevOps pipelines: trigger: - master pool: vmImage: 'ubuntu-latest' variables: solution: '**/*.sln' buildPlatform: 'Any CPU' buildConfiguration: 'Release' steps: - task: DockerCompose@0 displayName: Build, Test, E2E inputs: action: Run services dockerComposeFile: docker-compose.yml - script: dotnet restore SampleApp - script: dotnet build --configuration $( buildConfiguration ) SampleApp displayName: 'dotnet build $(buildConfiguration)' In this script the first step is docker-compose, which uses the same file we created the previous steps. The next steps, do the same using scripts, and are here for comparison. 
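If a richer end-to-end check than the single curl call is wanted, the e2e container could run a small test script instead. The sketch below is not part of the sample above: it assumes a Node 18+ base image (for the built-in fetch) and introduces a hypothetical E2E_BASE_URL variable so the endpoint stays configurable, as recommended in the method of work:

```typescript
// e2e/check.ts - hypothetical smoke test to run inside the e2e container
const baseUrl = process.env.E2E_BASE_URL ?? "http://0.0.0.0:8080";

async function main(): Promise<void> {
  const response = await fetch(`${baseUrl}/weatherforecast`);
  if (!response.ok) {
    throw new Error(`Unexpected status ${response.status} from ${baseUrl}/weatherforecast`);
  }
  const forecasts = await response.json();
  console.log(`Received ${Array.isArray(forecasts) ? forecasts.length : 0} forecast entries`);
}

main().catch((error) => {
  console.error(error);
  process.exit(1); // non-zero exit so the container (and therefore the pipeline) reports failure
});
```

Because the base URL comes from an environment variable, the same image can target the local compose stack or a live test environment from the CI server.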
By the end of this step, your CI effectively runs the same build and test commands you run locally.","title":"Executing Pipelines Locally"},{"location":"developer-experience/execute-local-pipeline-with-docker/#executing-pipelines-locally","text":"","title":"Executing Pipelines Locally"},{"location":"developer-experience/execute-local-pipeline-with-docker/#abstract","text":"Having the ability to execute pipeline activities locally has been identified as an opportunity to promote positive developer experience. In this document we will explore a solution which will allow us to have the local CI experience to be as similar as possible to the remote process in the CI server. Using the suggested method will allow us to: Build Lint Unit test E2E test Run Solution Be OS and environment agnostic.","title":"Abstract"},{"location":"developer-experience/execute-local-pipeline-with-docker/#enter-docker-compose","text":"Docker Compose allows you to build push or run multi-container Docker applications.","title":"Enter Docker Compose"},{"location":"developer-experience/execute-local-pipeline-with-docker/#method-of-work","text":"Dockerize your application(s), including a build step if possible. Add a step in your docker file to execute unit tests. Add a step in the docker file for linting. Create a new dockerfile, possibly in a different folder, which executes end-to-end tests against the cluster. Make sure the default endpoints are configurable (This will become handy in your remote CI server, where you will be able to test against a live environment, if you choose to). Create a docker-compose file which allows you to choose which of the services to run. The default will run all applications and tests, and an optional parameter can run specific services, for example only the application without the tests.","title":"Method of Work"},{"location":"developer-experience/execute-local-pipeline-with-docker/#prerequisites","text":"Docker Optional: if you clone the sample app, you need to have dotnet core installed.","title":"Prerequisites"},{"location":"developer-experience/execute-local-pipeline-with-docker/#step-by-step-with-examples","text":"For this tutorial we are going to use a sample dotnet core api application . Here is the docker file for the sample app: # https://hub.docker.com/_/microsoft-dotnet FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build WORKDIR /app # copy csproj and restore as distinct layers COPY ./ ./ RUN dotnet restore RUN dotnet test # copy everything else and build app COPY SampleApp/. ./ RUN dotnet publish -c release -o out --no-restore # final stage/image FROM mcr.microsoft.com/dotnet/aspnet:5.0 WORKDIR /app COPY --from = build /app/out . ENTRYPOINT [ \"dotnet\" , \"SampleNetApi.dll\" ] This script restores all dependencies, builds and runs tests. The dotnet app includes stylecop which fails the build in case of linting issues. Next we will also create a dockerfile to perform an end-to-end test. Usually this will look like a set of scripts, or a dedicated app which performs actual HTTP calls to a running application. For the sake of simplicity the dockerfile itself will run a simple curl command: FROM alpine:3.7 RUN apk --no-cache add curl ENTRYPOINT [ \"curl\" , \"0.0.0.0:8080/weatherforecast\" ] Now we are ready to combine both of the dockerfiles in a docker-compose script: version: '3' services: app: image: app:0.01 build: context: . 
ports: - \"8080:80\" e2e: image: e2e:0.01 build: context: ./E2E The docker-compose script will launch the 2 dockerfiles, and it will build them if they were not built before. The following command will run docker compose: docker-compose up --build -d Once the images are up, you can make calls to the service. The e2e image will perform the set of e2e tests. If you want to skip the tests, you can simply tell compose to run a specific service by appending the name of the service, as follows: docker-compose up --build -d app Now you have a local script which builds and tests you application. The next step would be make your CI run the docker-compose script. Here is an example of a yaml file used by Azure DevOps pipelines: trigger: - master pool: vmImage: 'ubuntu-latest' variables: solution: '**/*.sln' buildPlatform: 'Any CPU' buildConfiguration: 'Release' steps: - task: DockerCompose@0 displayName: Build, Test, E2E inputs: action: Run services dockerComposeFile: docker-compose.yml - script: dotnet restore SampleApp - script: dotnet build --configuration $( buildConfiguration ) SampleApp displayName: 'dotnet build $(buildConfiguration)' In this script the first step is docker-compose, which uses the same file we created the previous steps. The next steps, do the same using scripts, and are here for comparison. By the end of this step, your CI effectively runs the same build and test commands you run locally.","title":"Step by Step with Examples"},{"location":"developer-experience/fake-services-inner-loop/","text":"Fake Services Inner Dev Loop Introduction Consumers of remote services often find that their development cycle is not in sync with development of remote services, leaving developers of these consumers waiting for the remote services to \"catch up\". One approach to mitigate this issue and improve the inner dev loop is by decoupling and using Mock Services. Various Mock Service options are detailed here . This document will focus on providing an example using the Fake Services approach. API For our example API, we will work against a /User endpoint and the properties for User will be: id - int username - string firstName - string lastName - string email - string password - string phone - string userStatus - int Tooling For the Fake Service approach, we will be using Json-Server . Json-Server is a tool that provides the ability to fully fake REST APIs and run the server locally. It is designed to spin up REST APIs with CRUD functionality with minimal setup. Json-Server requires NodeJS and is installed via NPM. npm install -g json-server Setup In order to run Json-Server, it simply requires a source for data and will infer routes, etc. based on the data file. Note that additional customization can be performed for more advanced scenarios (e.g. custom routes). Details can be found here . 
For our example, we will use the following data file, db.json : { \"user\" : [ { \"id\" : 0 , \"username\" : \"user1\" , \"firstName\" : \"Kobe\" , \"lastName\" : \"Bryant\" , \"email\" : \"kobe@example.com\" , \"password\" : \"superSecure1\" , \"phone\" : \"(123) 123-1234\" , \"userStatus\" : 0 }, { \"id\" : 1 , \"username\" : \"user2\" , \"firstName\" : \"Shaquille\" , \"lastName\" : \"O'Neal\" , \"email\" : \"shaq@example.com\" , \"password\" : \"superSecure2\" , \"phone\" : \"(123) 123-1235\" , \"userStatus\" : 0 } ] } Run Running Json-Server can be performed by simply running: json-server --watch src/db.json Once running, the User endpoint can be hit on the default localhost port: http:/localhost:3000/user Note that Json-Server can be configured to use other ports using the following syntax: json-server --watch db.json --port 3004 Endpoint The endpoint can be tested by running curl against it and we can narrow down which user object to get back with the following command: curl http://localhost:3000/user/1 which, as expected, returns: { \"id\": 1, \"username\": \"user2\", \"firstName\": \"Shaquille\", \"lastName\": \"O'Neal\", \"email\": \"shaq@example.com\", \"password\": \"superSecure2\", \"phone\": \"(123) 123-1235\", \"userStatus\": 0 }","title":"Fake Services Inner Dev Loop"},{"location":"developer-experience/fake-services-inner-loop/#fake-services-inner-dev-loop","text":"","title":"Fake Services Inner Dev Loop"},{"location":"developer-experience/fake-services-inner-loop/#introduction","text":"Consumers of remote services often find that their development cycle is not in sync with development of remote services, leaving developers of these consumers waiting for the remote services to \"catch up\". One approach to mitigate this issue and improve the inner dev loop is by decoupling and using Mock Services. Various Mock Service options are detailed here . This document will focus on providing an example using the Fake Services approach.","title":"Introduction"},{"location":"developer-experience/fake-services-inner-loop/#api","text":"For our example API, we will work against a /User endpoint and the properties for User will be: id - int username - string firstName - string lastName - string email - string password - string phone - string userStatus - int","title":"API"},{"location":"developer-experience/fake-services-inner-loop/#tooling","text":"For the Fake Service approach, we will be using Json-Server . Json-Server is a tool that provides the ability to fully fake REST APIs and run the server locally. It is designed to spin up REST APIs with CRUD functionality with minimal setup. Json-Server requires NodeJS and is installed via NPM. npm install -g json-server","title":"Tooling"},{"location":"developer-experience/fake-services-inner-loop/#setup","text":"In order to run Json-Server, it simply requires a source for data and will infer routes, etc. based on the data file. Note that additional customization can be performed for more advanced scenarios (e.g. custom routes). Details can be found here . 
For our example, we will use the following data file, db.json : { \"user\" : [ { \"id\" : 0 , \"username\" : \"user1\" , \"firstName\" : \"Kobe\" , \"lastName\" : \"Bryant\" , \"email\" : \"kobe@example.com\" , \"password\" : \"superSecure1\" , \"phone\" : \"(123) 123-1234\" , \"userStatus\" : 0 }, { \"id\" : 1 , \"username\" : \"user2\" , \"firstName\" : \"Shaquille\" , \"lastName\" : \"O'Neal\" , \"email\" : \"shaq@example.com\" , \"password\" : \"superSecure2\" , \"phone\" : \"(123) 123-1235\" , \"userStatus\" : 0 } ] }","title":"Setup"},{"location":"developer-experience/fake-services-inner-loop/#run","text":"Running Json-Server can be performed by simply running: json-server --watch src/db.json Once running, the User endpoint can be hit on the default localhost port: http:/localhost:3000/user Note that Json-Server can be configured to use other ports using the following syntax: json-server --watch db.json --port 3004","title":"Run"},{"location":"developer-experience/fake-services-inner-loop/#endpoint","text":"The endpoint can be tested by running curl against it and we can narrow down which user object to get back with the following command: curl http://localhost:3000/user/1 which, as expected, returns: { \"id\": 1, \"username\": \"user2\", \"firstName\": \"Shaquille\", \"lastName\": \"O'Neal\", \"email\": \"shaq@example.com\", \"password\": \"superSecure2\", \"phone\": \"(123) 123-1235\", \"userStatus\": 0 }","title":"Endpoint"},{"location":"developer-experience/onboarding-guide-template/","text":"Onboarding Guide Template When developing an onboarding document for a team, it should contain details of engagement scope, team processes, codebase, coding standards, team agreements, software requirements and setup details. The onboarding guide can be used as an index to project specific content if it already exists elsewhere. Allowing this guide to be utilized as a foundation with the links will help keep the guide concise and effective. Overview and Goals List a few sentences explaining the high-level summary and the scope of the engagement. Consider adding any additional background and context as needed. Include the value proposition of the project, goals, what success looks like, and what the team is trying to achieve and why. Contacts List a few of the main contacts for the team and project overall such as the Dev Lead and Product Owner. Consider including the roles of these main contacts so that the team knows who to reach out to depending on the situation. Team Agreement and Code of Conduct Include the team's code of conduct or agreement that defines a set of expectation from each team member and how the team has agreed to operate. Working Agreement Template - working agreement Dev Environment Setup Consider adding steps to run the project end-to-end. This could be in form of a separate wiki page or document that can be linked here. Include any software that needs to be downloaded and specify if a specific version of the software is needed. Project Building Blocks This can include a more in depth description with different areas of the project to help increase the project understanding. It can include different sections on the various components of the project including deployment, e2e testing, repositories. 
Resources This can include any additional links to documents related to the project It may include links to backlog items, work items, wiki pages or project history.","title":"Onboarding Guide Template"},{"location":"developer-experience/onboarding-guide-template/#onboarding-guide-template","text":"When developing an onboarding document for a team, it should contain details of engagement scope, team processes, codebase, coding standards, team agreements, software requirements and setup details. The onboarding guide can be used as an index to project specific content if it already exists elsewhere. Allowing this guide to be utilized as a foundation with the links will help keep the guide concise and effective.","title":"Onboarding Guide Template"},{"location":"developer-experience/onboarding-guide-template/#overview-and-goals","text":"List a few sentences explaining the high-level summary and the scope of the engagement. Consider adding any additional background and context as needed. Include the value proposition of the project, goals, what success looks like, and what the team is trying to achieve and why.","title":"Overview and Goals"},{"location":"developer-experience/onboarding-guide-template/#contacts","text":"List a few of the main contacts for the team and project overall such as the Dev Lead and Product Owner. Consider including the roles of these main contacts so that the team knows who to reach out to depending on the situation.","title":"Contacts"},{"location":"developer-experience/onboarding-guide-template/#team-agreement-and-code-of-conduct","text":"Include the team's code of conduct or agreement that defines a set of expectation from each team member and how the team has agreed to operate. Working Agreement Template - working agreement","title":"Team Agreement and Code of Conduct"},{"location":"developer-experience/onboarding-guide-template/#dev-environment-setup","text":"Consider adding steps to run the project end-to-end. This could be in form of a separate wiki page or document that can be linked here. Include any software that needs to be downloaded and specify if a specific version of the software is needed.","title":"Dev Environment Setup"},{"location":"developer-experience/onboarding-guide-template/#project-building-blocks","text":"This can include a more in depth description with different areas of the project to help increase the project understanding. It can include different sections on the various components of the project including deployment, e2e testing, repositories.","title":"Project Building Blocks"},{"location":"developer-experience/onboarding-guide-template/#resources","text":"This can include any additional links to documents related to the project It may include links to backlog items, work items, wiki pages or project history.","title":"Resources"},{"location":"developer-experience/toggle-vnet-dev-environment/","text":"Toggle VNet On and Off for Production and Development Environment Problem Statement When deploying resources on Azure in a secure environment, resources are usually created behind a Private Network (VNet), without public access and with private endpoints to consume resources. This is the recommended approach for pre-production or production environments. 
Accessing protected resources from a local machine implies one of the following options: Use a VPN Use a jump box With SSH activated (less secure) With Bastion (recommended approach) However, a developer may want to deploy a test environment (in a non-production subscription) for their tests during the development phase, without the complexity of networking. In addition, infrastructure code should not be duplicated: it has to be the same whether resources are deployed in a production-like environment or in a development environment. Option The idea is to offer, via a single boolean variable , the option to deploy resources behind a VNet or not, using one infrastructure code base. Securing resources behind a VNet usually implies that public accesses are disabled and private endpoints are created. This is something to keep in mind because, as a developer, public access must be activated in order to connect to this environment. The deployment pipeline will set these resources behind a VNet and will secure them by removing public accesses. Developers will be able to run the same deployment script, specifying that resources will not be behind a VNet nor have public accesses disabled. Let's consider the following use case: we want to deploy a VNet, a subnet, a storage account with no public access and a private endpoint for the table. The variable that will toggle security will be called behind_vnet , of type boolean. Let's implement this use case using Terraform . The code below does not contain everything, the purpose is to show the pattern and not how to deploy these resources. For more information on Terraform, please refer to the official documentation . There is no if per se in Terraform to define whether a specific resource should be deployed or not based on a variable value. However, we can use the count meta-argument. The strength of this meta-argument is that if its value is 0 , the block is skipped. Below are the code snippets for this deployment: variables.tf variable \"behind_vnet\" { type = bool } main.tf resource \"azurerm_virtual_network\" \"vnet\" { count = var.behind_vnet ? 1 : 0 name = \"MyVnet\" address_space = [ \"x.x.x.x/16\" ] resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" ... subnet { name = \"subnet_1\" address_prefix = \"x.x.x.x/24\" } } resource \"azurerm_storage_account\" \"storage_account\" { name = \"storage\" resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" tags = var.tags ... public_network_access_enabled = var.behind_vnet ? false : true } resource \"azurerm_private_endpoint\" \"storage_account_table_private_endpoint\" { count = var.behind_vnet ? 1 : 0 name = \"pe-storage\" subnet_id = azurerm_virtual_network.vnet[0].subnet[0].id ... private_service_connection { name = \"psc-storage\" private_connection_resource_id = azurerm_storage_account.storage_account.id subresource_names = [ \"table\" ] ... } private_dns_zone_group { name = \"privateDnsZoneGroup\" ... } } If we run terraform apply -var behind_vnet=true then all the resources above will be deployed, which is what we want for a pre-production or production environment. The instruction count = var.behind_vnet ? 1 : 0 will set count to 1 , therefore those blocks will be deployed. However, if we run terraform apply -var behind_vnet=false , the azurerm_virtual_network and azurerm_private_endpoint resources will be skipped (because count will be 0 ).
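For day-to-day use, the toggle can be wrapped in a small helper script so nobody has to remember the flag. Here is a minimal Python sketch, assuming Terraform is on the PATH and the variable is named behind_vnet as above; the environment names are made up for the example:

import subprocess
import sys

def apply(environment: str) -> int:
    # Development environments skip the VNet; anything else stays locked down.
    behind_vnet = "false" if environment == "dev" else "true"
    command = ["terraform", "apply", "-var", f"behind_vnet={behind_vnet}"]
    print("Running:", " ".join(command))
    return subprocess.run(command).returncode

if __name__ == "__main__":
    environment = sys.argv[1] if len(sys.argv) > 1 else "dev"
    sys.exit(apply(environment))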
The resource azurerm_storage_account will be created, with minor differences in some properties: for instance, here, public_network_access_enabled will be set to true (and this is the goal for a developer to be able to access resources created). The same pattern can be applied over and over for the entire infrastructure code. Conclusion With this approach, the same infrastructure code base can be used to target a production like environment with secured resources behind a VNet with no public accesses and also a more permissive development environment. However, there are a couple of trade-offs with this approach: if a resource has the count argument, it needs to be treated as a list, and not a single item. In the example above, if there is a need to reference the resource azurerm_virtual_network later in the code, azurerm_virtual_network.vnet.id will not work. The following must be used azurerm_virtual_network.vnet[0].id # First (and only) item of the collection The meta-argument count cannot be used with for_each for a whole block. That means that the use of loops to deploy multiple endpoints for instance will not work. Each private endpoints will need to be deployed individually.","title":"Toggle VNet On and Off for Production and Development Environment"},{"location":"developer-experience/toggle-vnet-dev-environment/#toggle-vnet-on-and-off-for-production-and-development-environment","text":"","title":"Toggle VNet On and Off for Production and Development Environment"},{"location":"developer-experience/toggle-vnet-dev-environment/#problem-statement","text":"When deploying resources on Azure in a secure environment, resources are usually created behind a Private Network (VNet), without public access and with private endpoints to consume resources. This is the recommended approach for pre-production or production environments. Accessing protected resources from a local machine implies one of the following options: Use a VPN Use a jump box With SSH activated (less secure) With Bastion (recommended approach) However, a developer may want to deploy a test environment (in a non-production subscription) for their tests during development phase, without the complexity of networking. In addition, infrastructure code should not be duplicated: it has to be the same whether resources are deployed in a production like environment or in development environment.","title":"Problem Statement"},{"location":"developer-experience/toggle-vnet-dev-environment/#option","text":"The idea is to offer, via a single boolean variable , the option to deploy resources behind a VNet or not using one infrastructure code base. Securing resources behind a VNet usually implies that public accesses are disabled and private endpoints are created. This is something to have in mind because, as a developer, public access must be activated in order to connect to this environment. The deployment pipeline will set these resources behind a VNet and will secure them by removing public accesses. Developers will be able to run the same deployment script, specifying that resources will not be behind a VNet nor have public accesses disabled. Let's consider the following use case: we want to deploy a VNet, a subnet, a storage account with no public access and a private endpoint for the table. The magic variable that will help toggling security will be called behind_vnet , of type boolean. Let's implement this use case using Terraform . 
The code below does not contain everything, the purpose is to show the pattern and not how to deploy these resources. For more information on Terraform, please refer to the official documentation . There is no if per se in Terraform to define whether a specific resource should be deployed or not based on a variable value. However, we can use the count meta-argument. The strength of this meta-argument is if its value is 0 , the block is skipped. Here is below the code snippets for this deployment: variables.tf variable \"behind_vnet\" { type = bool } main.tf resource \"azurerm_virtual_network\" \"vnet\" { count = var.behind_vnet ? 1 : 0 name = \"MyVnet\" address_space = [ x.x.x.x / 16 ] resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" ... subnet { name = \"subnet_1\" address_prefix = \"x.x.x.x/24\" } } resource \"azurerm_storage_account\" \"storage_account\" { name = \"storage\" resource_group_name = \"MyResourceGroup\" location = \"WestEurope\" tags = var.tags ... public_network_access_enabled = var.behind_vnet ? false : true } resource \"azurerm_private_endpoint\" \"storage_account_table_private_endpoint\" { count = var.behind_vnet ? 1 : 0 name = \"pe-storage\" subnet_id = azurerm_virtual_network.vnet[0].subnet[0].id ... private_service_connection { name = \"psc-storage\" private_connection_resource_id = azurerm_storage_account.storage_account.id subresource_names = [ \"table\" ] ... } private_dns_zone_group { name = \"privateDnsZoneGroup\" ... } } If we run terraform apply -var behind_vnet = true then all the resources above will be deployed, and it is what we want on a pre-production or production environment. The instruction count = var.behind_vnet ? 1 : 0 will set count with the value 1 , therefore blocks will be executed. However, if we run terraform apply -var behind_vnet = false the azurerm_virtual_network and azurerm_private_endpoint resources will be skipped (because count will be 0 ). The resource azurerm_storage_account will be created, with minor differences in some properties: for instance, here, public_network_access_enabled will be set to true (and this is the goal for a developer to be able to access resources created). The same pattern can be applied over and over for the entire infrastructure code.","title":"Option"},{"location":"developer-experience/toggle-vnet-dev-environment/#conclusion","text":"With this approach, the same infrastructure code base can be used to target a production like environment with secured resources behind a VNet with no public accesses and also a more permissive development environment. However, there are a couple of trade-offs with this approach: if a resource has the count argument, it needs to be treated as a list, and not a single item. In the example above, if there is a need to reference the resource azurerm_virtual_network later in the code, azurerm_virtual_network.vnet.id will not work. The following must be used azurerm_virtual_network.vnet[0].id # First (and only) item of the collection The meta-argument count cannot be used with for_each for a whole block. That means that the use of loops to deploy multiple endpoints for instance will not work. Each private endpoints will need to be deployed individually.","title":"Conclusion"},{"location":"documentation/","text":"Documentation Every software development project requires documentation. Agile Software Development values working software over comprehensive documentation . 
Still, projects should include the key information needed to understand the development and the use of the generated software. Documentation shouldn't be an afterthought. Different written documents and materials should be created during the whole life cycle of the project, as per the project needs. Goals Facilitate onboarding of new team members. Improve communication and collaboration between teams (especially when distributed across time zones). Improve the transition of the project to another team. Challenges When working in an engineering project, we typically encounter one or more of these challenges related to documentation (including some examples): Non-existent . No onboarding documentation, so it takes a long time to set up the environment when you join the project. No document in the wiki explaining existing repositories, so you cannot tell which of the 10 available repositories you should clone. No main README, so you don't know where to start when you clone a repository. No \"how to contribute\" section, so you don't know which is the branch policy, where to add new documents, etc. No code guidelines, so everyone follows different naming conventions, etc. Hidden . Impossible to find useful documentation as it\u2019s scattered all over the place. E.g., no idea how to compile, run and test the code as the README is hidden in a folder within a folder within a folder. Useful processes (e.g., grooming process) explained outside the backlog management tool and not linked anywhere. Decisions taken in different channels other than the backlog management tool and not recorded anywhere else. Incomplete . No clear branch policy, so everyone names their branches differently. Missing settings in the \"how to run this\" document that are required to run the application. Inaccurate . Documents not updated along with the code, so they don't mention the right folders, settings, etc. Obsolete . Design documents that don't apply anymore, sitting next to valid documents. Which one shows the latest decisions? Out of order (subject / date) . Documents not organized per subject/workstream so not easy to find relevant information when you change to a new workstream. Design decision logs out of order and without a date that helps to determine which is the final decision on something. Duplicate . No settings file available in a centralized place as a single source of truth, so developers must keep sharing their own versions, and we end up with many files that might or might not work. Afterthought . Key documents created several weeks into the project: onboarding, how to run the app, etc. Documents created last minute just before the end of a project, forgetting that they also help the team while working on the project. 
What Documentation Should Exist Project and Repositories Commit Messages Pull Requests Code Work Items REST APIs Engineering Feedback Best Practices Establishing and managing documentation Creating good documentation Replacing documentation with automation Tools Wikis Languages markdown mermaid How to automate simple checks Integration with Teams/Slack Recipes How to sync a wiki between repositories Using DocFx and Companion Tools to generate a Documentation website Deploy the DocFx Documentation website to an Azure Website automatically How to create a static website for your documentation based on MkDocs and Material for MkDocs Resources Software Documentation Types and Best Practices","title":"Documentation"},{"location":"documentation/#documentation","text":"Every software development project requires documentation. Agile Software Development values working software over comprehensive documentation . Still, projects should include the key information needed to understand the development and the use of the generated software. Documentation shouldn't be an afterthought. Different written documents and materials should be created during the whole life cycle of the project, as per the project needs.","title":"Documentation"},{"location":"documentation/#goals","text":"Facilitate onboarding of new team members. Improve communication and collaboration between teams (especially when distributed across time zones). Improve the transition of the project to another team.","title":"Goals"},{"location":"documentation/#challenges","text":"When working in an engineering project, we typically encounter one or more of these challenges related to documentation (including some examples): Non-existent . No onboarding documentation, so it takes a long time to set up the environment when you join the project. No document in the wiki explaining existing repositories, so you cannot tell which of the 10 available repositories you should clone. No main README, so you don't know where to start when you clone a repository. No \"how to contribute\" section, so you don't know which is the branch policy, where to add new documents, etc. No code guidelines, so everyone follows different naming conventions, etc. Hidden . Impossible to find useful documentation as it\u2019s scattered all over the place. E.g., no idea how to compile, run and test the code as the README is hidden in a folder within a folder within a folder. Useful processes (e.g., grooming process) explained outside the backlog management tool and not linked anywhere. Decisions taken in different channels other than the backlog management tool and not recorded anywhere else. Incomplete . No clear branch policy, so everyone names their branches differently. Missing settings in the \"how to run this\" document that are required to run the application. Inaccurate . Documents not updated along with the code, so they don't mention the right folders, settings, etc. Obsolete . Design documents that don't apply anymore, sitting next to valid documents. Which one shows the latest decisions? Out of order (subject / date) . Documents not organized per subject/workstream so not easy to find relevant information when you change to a new workstream. Design decision logs out of order and without a date that helps to determine which is the final decision on something. Duplicate . No settings file available in a centralized place as a single source of truth, so developers must keep sharing their own versions, and we end up with many files that might or might not work. 
Afterthought . Key documents created several weeks into the project: onboarding, how to run the app, etc. Documents created last minute just before the end of a project, forgetting that they also help the team while working on the project.","title":"Challenges"},{"location":"documentation/#what-documentation-should-exist","text":"Project and Repositories Commit Messages Pull Requests Code Work Items REST APIs Engineering Feedback","title":"What Documentation Should Exist"},{"location":"documentation/#best-practices","text":"Establishing and managing documentation Creating good documentation Replacing documentation with automation","title":"Best Practices"},{"location":"documentation/#tools","text":"Wikis Languages markdown mermaid How to automate simple checks Integration with Teams/Slack","title":"Tools"},{"location":"documentation/#recipes","text":"How to sync a wiki between repositories Using DocFx and Companion Tools to generate a Documentation website Deploy the DocFx Documentation website to an Azure Website automatically How to create a static website for your documentation based on MkDocs and Material for MkDocs","title":"Recipes"},{"location":"documentation/#resources","text":"Software Documentation Types and Best Practices","title":"Resources"},{"location":"documentation/best-practices/automation/","text":"Replacing Documentation with Automation You can document how to set up your dev machine with the right version of the framework required to run the code, which extensions are useful to develop the application with your editor, or how to configure your editor to launch and debug the application. If it is possible, a better solution is to provide the means to automate tool installs, application startup, etc., instead. Some examples are provided below: Dev Containers in Visual Studio Code The Visual Studio Code Remote - Containers extension lets you use a Docker container as a full-featured development environment. It allows you to open any folder inside (or mounted into) a container and take advantage of Visual Studio Code's full feature set. Additional information: Developing inside a Container . Launch Configurations and Tasks in Visual Studio Code Launch configurations allows you to configure and save debugging setup details. Tasks can be configured to run scripts and start processes so that many of these existing tools can be used from within VS Code without having to enter a command line or write new code.","title":"Replacing Documentation with Automation"},{"location":"documentation/best-practices/automation/#replacing-documentation-with-automation","text":"You can document how to set up your dev machine with the right version of the framework required to run the code, which extensions are useful to develop the application with your editor, or how to configure your editor to launch and debug the application. If it is possible, a better solution is to provide the means to automate tool installs, application startup, etc., instead. Some examples are provided below:","title":"Replacing Documentation with Automation"},{"location":"documentation/best-practices/automation/#dev-containers-in-visual-studio-code","text":"The Visual Studio Code Remote - Containers extension lets you use a Docker container as a full-featured development environment. It allows you to open any folder inside (or mounted into) a container and take advantage of Visual Studio Code's full feature set. 
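Even outside the editor, a short script can replace a written checklist of required tools and versions. The sketch below is a hedged Python example of that idea; the list of tools and the version flags are assumptions, not requirements taken from this playbook:

import shutil
import subprocess
import sys

# Illustrative only: replace with the tools your project actually needs.
REQUIRED_TOOLS = {
    "git": ["git", "--version"],
    "docker": ["docker", "--version"],
    "dotnet": ["dotnet", "--version"],
}

def main() -> int:
    missing = []
    for name, command in REQUIRED_TOOLS.items():
        if shutil.which(command[0]) is None:
            missing.append(name)
            continue
        version = subprocess.run(command, capture_output=True, text=True).stdout.strip()
        print(f"{name}: {version}")
    if missing:
        print("Missing tools: " + ", ".join(missing))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())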
Additional information: Developing inside a Container .","title":"Dev Containers in Visual Studio Code"},{"location":"documentation/best-practices/automation/#launch-configurations-and-tasks-in-visual-studio-code","text":"Launch configurations allows you to configure and save debugging setup details. Tasks can be configured to run scripts and start processes so that many of these existing tools can be used from within VS Code without having to enter a command line or write new code.","title":"Launch Configurations and Tasks in Visual Studio Code"},{"location":"documentation/best-practices/establish-and-manage/","text":"Establishing and Managing Documentation Documentation should be source-controlled. Pull Requests can be used to tell others about the changes, so they can be reviewed and discussed. E.g., Async Design Reviews . Tools: Wikis .","title":"Establishing and Managing Documentation"},{"location":"documentation/best-practices/establish-and-manage/#establishing-and-managing-documentation","text":"Documentation should be source-controlled. Pull Requests can be used to tell others about the changes, so they can be reviewed and discussed. E.g., Async Design Reviews . Tools: Wikis .","title":"Establishing and Managing Documentation"},{"location":"documentation/best-practices/good-documentation/","text":"Creating Good Documentation Review the Documentation Review Checklist for advice on how to write good documentation. Good documentation should follow good writing guidelines: Writing Style Guidelines .","title":"Creating Good Documentation"},{"location":"documentation/best-practices/good-documentation/#creating-good-documentation","text":"Review the Documentation Review Checklist for advice on how to write good documentation. Good documentation should follow good writing guidelines: Writing Style Guidelines .","title":"Creating Good Documentation"},{"location":"documentation/guidance/code/","text":"Code You might have heard more than once that you should write self-documenting code . This doesn't mean that you should never comment your code. There are two types of code comments, implementation comments and documentation comments. Implementation Comments They are used for internal documentation, and are intended for anyone who may need to maintain the code in the future, including your future self. There can be single line and multi-line comments (e.g., C# Comments ). Comments are human-readable and not executed, thus ignored by the compiler. So you could potentially add as many as you want. Now, the use of these comments is often considered a code smell. If you need to clarify your code, that may mean the code is too complex. So you should work towards the removal of the clarification by making the code simpler, easier to read, and understand. Still, these comments can be useful to give overviews of the code, or provide additional context information that is not available in the code itself. Examples of useful comments: Single line comment in C# that explains why that piece of code is there (from a private method in System.Text.Json.JsonSerializer ): // For performance, avoid obtaining actual byte count unless memory usage is higher than the threshold. Span < byte > utf8 = json . Length <= ( ArrayPoolMaxSizeBeforeUsingNormalAlloc / JsonConstants . MaxExpansionFactorWhileTranscoding ) ? ... 
Multi-line comment in C# that provides additional context (from a private method in System.Text.Json.Utf8JsonReader ): // Transcoding from UTF-16 to UTF-8 will change the length by somewhere between 1x and 3x. // Un-escaping the token value will at most shrink its length by 6x. // There is no point incurring the transcoding/un-escaping/comparing cost if: // - The token value is smaller than charTextLength // - The token value needs to be transcoded AND unescaped and it is more than 6x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, escaping = 6x => 6x factor // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, but they are represented as a single escaped hex value, \\uXXXX => 6x factor // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 4x, but the surrogate pair (2 characters) are represented by 16 bytes \\uXXXX\\uXXXX => 6x factor // - The token value needs to be transcoded, but NOT escaped and it is more than 3x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 2x, (surrogate pairs - 2 characters transcode to 4 UTF-8 bytes) if ( sourceLength < charTextLength || sourceLength / ( _stringHasEscaping ? JsonConstants . MaxExpansionFactorWhileEscaping : JsonConstants . MaxExpansionFactorWhileTranscoding ) > charTextLength ) { Documentation Comments Doc comments are a special kind of comment, added above the definition of any user-defined type or member, and are intended for anyone who may need to use those types or members in their own code. If, for example, you are building a library or framework, doc comments can be used to generate their documentation. This documentation should serve as API specification, and/or programming guide. Doc comments won't be included by the compiler in the final executable, as with single and multi-line comments. Example of a doc comment in C# (from Deserialize method in System.Text.Json.JsonSerializer ): /// <summary> /// Parse the text representing a single JSON value into a <typeparamref name=\"TValue\"/>. /// </summary> /// <returns>A <typeparamref name=\"TValue\"/> representation of the JSON value.</returns> /// <param name=\"json\">JSON text to parse.</param> /// <param name=\"options\">Options to control the behavior during parsing.</param> /// <exception cref=\"System.ArgumentNullException\"> /// <paramref name=\"json\"/> is <see langword=\"null\"/>. /// </exception> /// <exception cref=\"JsonException\"> /// The JSON is invalid. /// /// -or- /// /// <typeparamref name=\"TValue\" /> is not compatible with the JSON. /// /// -or- /// /// There is remaining data in the string beyond a single JSON value.</exception> /// <exception cref=\"NotSupportedException\"> /// There is no compatible <see cref=\"System.Text.Json.Serialization.JsonConverter\"/> /// for <typeparamref name=\"TValue\"/> or its serializable members. /// </exception> /// <remarks>Using a <see cref=\"string\"/> is not as efficient as using the /// UTF-8 methods since the implementation natively uses UTF-8. /// </remarks> [RequiresUnreferencedCode(SerializationUnreferencedCodeMessage)] public static TValue ? Deserialize < TValue > ( string json , JsonSerializerOptions ? options = null ) { In C# , doc comments can be processed by the compiler to generate XML documentation files. 
These files can be distributed alongside your libraries so that Visual Studio and other IDEs can use IntelliSense to show quick information about types or members. Additionally, these files can be run through tools like DocFx to generate API reference websites. More information: Recommended XML tags for C# documentation comments . In other languages, you may require external tools. For example, Java doc comments can be processed by Javadoc tool to generate HTML documentation files. More information: How to Write Doc Comments for the Javadoc Tool Javadoc Tool","title":"Code"},{"location":"documentation/guidance/code/#code","text":"You might have heard more than once that you should write self-documenting code . This doesn't mean that you should never comment your code. There are two types of code comments, implementation comments and documentation comments.","title":"Code"},{"location":"documentation/guidance/code/#implementation-comments","text":"They are used for internal documentation, and are intended for anyone who may need to maintain the code in the future, including your future self. There can be single line and multi-line comments (e.g., C# Comments ). Comments are human-readable and not executed, thus ignored by the compiler. So you could potentially add as many as you want. Now, the use of these comments is often considered a code smell. If you need to clarify your code, that may mean the code is too complex. So you should work towards the removal of the clarification by making the code simpler, easier to read, and understand. Still, these comments can be useful to give overviews of the code, or provide additional context information that is not available in the code itself. Examples of useful comments: Single line comment in C# that explains why that piece of code is there (from a private method in System.Text.Json.JsonSerializer ): // For performance, avoid obtaining actual byte count unless memory usage is higher than the threshold. Span < byte > utf8 = json . Length <= ( ArrayPoolMaxSizeBeforeUsingNormalAlloc / JsonConstants . MaxExpansionFactorWhileTranscoding ) ? ... Multi-line comment in C# that provides additional context (from a private method in System.Text.Json.Utf8JsonReader ): // Transcoding from UTF-16 to UTF-8 will change the length by somewhere between 1x and 3x. // Un-escaping the token value will at most shrink its length by 6x. // There is no point incurring the transcoding/un-escaping/comparing cost if: // - The token value is smaller than charTextLength // - The token value needs to be transcoded AND unescaped and it is more than 6x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, escaping = 6x => 6x factor // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, but they are represented as a single escaped hex value, \\uXXXX => 6x factor // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 4x, but the surrogate pair (2 characters) are represented by 16 bytes \\uXXXX\\uXXXX => 6x factor // - The token value needs to be transcoded, but NOT escaped and it is more than 3x larger than charTextLength // - For an ASCII UTF-16 characters, transcoding = 1x, // - For non-ASCII UTF-16 characters within the BMP, transcoding = 2-3x, // - For non-ASCII UTF-16 characters outside of the BMP, transcoding = 2x, (surrogate pairs - 2 characters transcode to 4 UTF-8 bytes) if ( sourceLength < charTextLength || sourceLength / ( _stringHasEscaping ? JsonConstants . MaxExpansionFactorWhileEscaping : JsonConstants . 
MaxExpansionFactorWhileTranscoding ) > charTextLength ) {","title":"Implementation Comments"},{"location":"documentation/guidance/code/#documentation-comments","text":"Doc comments are a special kind of comment, added above the definition of any user-defined type or member, and are intended for anyone who may need to use those types or members in their own code. If, for example, you are building a library or framework, doc comments can be used to generate their documentation. This documentation should serve as API specification, and/or programming guide. Doc comments won't be included by the compiler in the final executable, as with single and multi-line comments. Example of a doc comment in C# (from Deserialize method in System.Text.Json.JsonSerializer ): /// <summary> /// Parse the text representing a single JSON value into a <typeparamref name=\"TValue\"/>. /// </summary> /// <returns>A <typeparamref name=\"TValue\"/> representation of the JSON value.</returns> /// <param name=\"json\">JSON text to parse.</param> /// <param name=\"options\">Options to control the behavior during parsing.</param> /// <exception cref=\"System.ArgumentNullException\"> /// <paramref name=\"json\"/> is <see langword=\"null\"/>. /// </exception> /// <exception cref=\"JsonException\"> /// The JSON is invalid. /// /// -or- /// /// <typeparamref name=\"TValue\" /> is not compatible with the JSON. /// /// -or- /// /// There is remaining data in the string beyond a single JSON value.</exception> /// <exception cref=\"NotSupportedException\"> /// There is no compatible <see cref=\"System.Text.Json.Serialization.JsonConverter\"/> /// for <typeparamref name=\"TValue\"/> or its serializable members. /// </exception> /// <remarks>Using a <see cref=\"string\"/> is not as efficient as using the /// UTF-8 methods since the implementation natively uses UTF-8. /// </remarks> [RequiresUnreferencedCode(SerializationUnreferencedCodeMessage)] public static TValue ? Deserialize < TValue > ( string json , JsonSerializerOptions ? options = null ) { In C# , doc comments can be processed by the compiler to generate XML documentation files. These files can be distributed alongside your libraries so that Visual Studio and other IDEs can use IntelliSense to show quick information about types or members. Additionally, these files can be run through tools like DocFx to generate API reference websites. More information: Recommended XML tags for C# documentation comments . In other languages, you may require external tools. For example, Java doc comments can be processed by Javadoc tool to generate HTML documentation files. More information: How to Write Doc Comments for the Javadoc Tool Javadoc Tool","title":"Documentation Comments"},{"location":"documentation/guidance/engineering-feedback/","text":"Engineering Feedback Good engineering feedback is: Actionable Specific Detailed Includes assets (script, data, code, etc.) to reproduce scenario and validate solution Includes details about the customer scenario / what the customer was trying to achieve Refer to Microsoft Engineering Feedback for more details, including guidance , FAQ and examples .","title":"Engineering Feedback"},{"location":"documentation/guidance/engineering-feedback/#engineering-feedback","text":"Good engineering feedback is: Actionable Specific Detailed Includes assets (script, data, code, etc.) 
to reproduce scenario and validate solution Includes details about the customer scenario / what the customer was trying to achieve Refer to Microsoft Engineering Feedback for more details, including guidance , FAQ and examples .","title":"Engineering Feedback"},{"location":"documentation/guidance/project-and-repositories/","text":"Projects and Repositories Every source code repository should include documentation that is specific to it (e.g., in a Wiki within the repository), while the project itself should include general documentation that is common to all its associated repositories (e.g., in a Wiki within the backlog management tool). Documentation Specific to a Repository Introduction Getting started Onboarding Setup: programming language, frameworks, platforms, tools, etc. Sandbox environment Working agreement Contributing guide Structure: folders, projects, etc. How to compile, test, build, deploy the solution/each project Different OS versions Command line + editors/IDEs Design Decision Logs Architecture Decision Record (ADRs) Trade Studies Some sections in the documentation of the repository might point to the project\u2019s documentation (e.g., Onboarding, Working Agreement, Contributing Guide). Common Documentation to all Repositories Introduction Project Stakeholders Definitions Requirements Onboarding Repository guide Production, Spikes Team agreements Team Manifesto Short summary of expectations around the technical way of working and supported mindset in the team. E.g., ownership, respect, collaboration, transparency. Working Agreement How we work together as a team and what our expectations and principles are. E.g., communication, work-life balance, scrum rhythm, backlog management, code management. Definition of Done List of tasks that must be completed to close a user story, a sprint, or a milestone. Definition of Ready How complete a user story should be in order to be selected as candidate for estimation in the sprint planning. Contributing Guide Repo structure Design documents Branching and branch name strategy Merge and commit history strategy Pull Requests Code Review Process Code Review Checklist Language Specific Checklists Project Design High Level / Game Plan Milestone / Epic Design Review Design Review Recipes Milestone / Epic Design Review Template Feature / Story Design Review Template Task Design Review Template Decision Log Template Architecture Decision Record (ADR) Template ( Example 1 , Example 2 ) Trade Study Template","title":"Projects and Repositories"},{"location":"documentation/guidance/project-and-repositories/#projects-and-repositories","text":"Every source code repository should include documentation that is specific to it (e.g., in a Wiki within the repository), while the project itself should include general documentation that is common to all its associated repositories (e.g., in a Wiki within the backlog management tool).","title":"Projects and Repositories"},{"location":"documentation/guidance/project-and-repositories/#documentation-specific-to-a-repository","text":"Introduction Getting started Onboarding Setup: programming language, frameworks, platforms, tools, etc. Sandbox environment Working agreement Contributing guide Structure: folders, projects, etc. 
How to compile, test, build, deploy the solution/each project Different OS versions Command line + editors/IDEs Design Decision Logs Architecture Decision Record (ADRs) Trade Studies Some sections in the documentation of the repository might point to the project\u2019s documentation (e.g., Onboarding, Working Agreement, Contributing Guide).","title":"Documentation Specific to a Repository"},{"location":"documentation/guidance/project-and-repositories/#common-documentation-to-all-repositories","text":"Introduction Project Stakeholders Definitions Requirements Onboarding Repository guide Production, Spikes Team agreements Team Manifesto Short summary of expectations around the technical way of working and supported mindset in the team. E.g., ownership, respect, collaboration, transparency. Working Agreement How we work together as a team and what our expectations and principles are. E.g., communication, work-life balance, scrum rhythm, backlog management, code management. Definition of Done List of tasks that must be completed to close a user story, a sprint, or a milestone. Definition of Ready How complete a user story should be in order to be selected as candidate for estimation in the sprint planning. Contributing Guide Repo structure Design documents Branching and branch name strategy Merge and commit history strategy Pull Requests Code Review Process Code Review Checklist Language Specific Checklists Project Design High Level / Game Plan Milestone / Epic Design Review Design Review Recipes Milestone / Epic Design Review Template Feature / Story Design Review Template Task Design Review Template Decision Log Template Architecture Decision Record (ADR) Template ( Example 1 , Example 2 ) Trade Study Template","title":"Common Documentation to all Repositories"},{"location":"documentation/guidance/pull-requests/","text":"Pull Requests When we create Pull Requests , we must ensure they are properly documented: Title and Description Pull Request Description Pull Request Template Linked worked items Comments As an author, address all comments As a reviewer, make comments clear","title":"Pull Requests"},{"location":"documentation/guidance/pull-requests/#pull-requests","text":"When we create Pull Requests , we must ensure they are properly documented: Title and Description Pull Request Description Pull Request Template Linked worked items Comments As an author, address all comments As a reviewer, make comments clear","title":"Pull Requests"},{"location":"documentation/guidance/rest-apis/","text":"REST APIs When creating REST APIs , you can leverage the OpenAPI-Specification (OAI) (originally known as the Swagger Specification) to describe them: The OpenAPI Specification (OAS) defines a standard, programming language-agnostic interface description for HTTP APIs, which allows both humans and computers to discover and understand the capabilities of a service without requiring access to source code, additional documentation, or inspection of network traffic. When properly defined via OpenAPI, a consumer can understand and interact with the remote service with a minimal amount of implementation logic. Use cases for machine-readable API definition documents include, but are not limited to: interactive documentation; code generation for documentation, clients, and servers; and automation of test cases. OpenAPI documents describe an APIs services and are represented in either YAML or JSON formats. 
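Many web frameworks can also emit this document directly from application code. As a small illustration in Python (a hedged sketch using the third-party FastAPI package; the route and fields are invented for the example):

# pip install fastapi   (third-party dependency, used here purely as an illustration)
import json
from fastapi import FastAPI

app = FastAPI(title="Sample User API")

@app.get("/user/{user_id}")
def read_user(user_id: int):
    """Return a single user by id (dummy data for the sketch)."""
    return {"id": user_id, "username": f"user{user_id}"}

if __name__ == "__main__":
    # app.openapi() builds the OpenAPI document; when served (e.g. with uvicorn),
    # the same document is exposed at /openapi.json and rendered at /docs.
    print(json.dumps(app.openapi(), indent=2))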
These documents may either be produced and served statically or be generated dynamically from an application. There are implementations available for many languages like C#, including low-level tooling, editors, user interfaces, code generators, etc. Here you can find a list of known tooling for the different languages: OpenAPI-Specification/IMPLEMENTATIONS.md . Using Microsoft TypeSpec While the OpenAPI-Specification (OAI) is a popular method for defining and documenting RESTful APIs, there are other languages available that can simplify and expedite the documentation process. Microsoft TypeSpec is one such language that allows for the description of cloud service APIs and the generation of API description languages, client and service code, documentation, and other assets. Microsoft TypeSpec is a highly extensible language that offers a set of core primitives that can describe API shapes common among REST, OpenAPI, GraphQL, gRPC, and other protocols. This makes it a versatile option for developers who need to work with a range of different API styles and technologies. Microsoft TypeSpec is a widely adopted tool within Azure teams, particularly for generating OpenAPI Specifications in complex and interconnected APIs that span multiple teams. To ensure consistency across different parts of the API, teams commonly leverage shared libraries which contain reusable patterns. This makes easier to follow best practices rather than deviating from them. By promoting highly regular API designs that adhere to best practices by construction, TypeSpec can help improve the quality and consistency of APIs developed within an organization. Resources ASP.NET Core web API documentation with Swagger / OpenAPI . Microsoft TypeSpec . Design Patterns - REST API Guidance","title":"REST APIs"},{"location":"documentation/guidance/rest-apis/#rest-apis","text":"When creating REST APIs , you can leverage the OpenAPI-Specification (OAI) (originally known as the Swagger Specification) to describe them: The OpenAPI Specification (OAS) defines a standard, programming language-agnostic interface description for HTTP APIs, which allows both humans and computers to discover and understand the capabilities of a service without requiring access to source code, additional documentation, or inspection of network traffic. When properly defined via OpenAPI, a consumer can understand and interact with the remote service with a minimal amount of implementation logic. Use cases for machine-readable API definition documents include, but are not limited to: interactive documentation; code generation for documentation, clients, and servers; and automation of test cases. OpenAPI documents describe an APIs services and are represented in either YAML or JSON formats. These documents may either be produced and served statically or be generated dynamically from an application. There are implementations available for many languages like C#, including low-level tooling, editors, user interfaces, code generators, etc. Here you can find a list of known tooling for the different languages: OpenAPI-Specification/IMPLEMENTATIONS.md .","title":"REST APIs"},{"location":"documentation/guidance/rest-apis/#using-microsoft-typespec","text":"While the OpenAPI-Specification (OAI) is a popular method for defining and documenting RESTful APIs, there are other languages available that can simplify and expedite the documentation process. 
Microsoft TypeSpec is one such language that allows for the description of cloud service APIs and the generation of API description languages, client and service code, documentation, and other assets. Microsoft TypeSpec is a highly extensible language that offers a set of core primitives that can describe API shapes common among REST, OpenAPI, GraphQL, gRPC, and other protocols. This makes it a versatile option for developers who need to work with a range of different API styles and technologies. Microsoft TypeSpec is a widely adopted tool within Azure teams, particularly for generating OpenAPI Specifications in complex and interconnected APIs that span multiple teams. To ensure consistency across different parts of the API, teams commonly leverage shared libraries which contain reusable patterns. This makes easier to follow best practices rather than deviating from them. By promoting highly regular API designs that adhere to best practices by construction, TypeSpec can help improve the quality and consistency of APIs developed within an organization.","title":"Using Microsoft TypeSpec"},{"location":"documentation/guidance/rest-apis/#resources","text":"ASP.NET Core web API documentation with Swagger / OpenAPI . Microsoft TypeSpec . Design Patterns - REST API Guidance","title":"Resources"},{"location":"documentation/guidance/work-items/","text":"Work Items While many teams can work with a flat list of items, sometimes it helps to group related items into a hierarchical structure. You can use portfolio backlogs to bring more order to your backlog. Agile process backlog work item hierarchy: Scrum process backlog work item hierarchy: Bugs can be set at the same level as User Stories / Product Backlog Items or Tasks. Epics and Features User stories / Product Backlog Items roll up into Features , which typically represent a shippable deliverable that addresses a customer need e.g., \"Add shopping cart\". And Features roll up into Epics , which represent a business initiative to be accomplished e.g., \"Increase customer engagement\". Take that into account when naming them. Each Feature or Epic should include as much detail as the team needs to: Understand the scope. Estimate the work required. Develop tests. Ensure the end product meets acceptance criteria. Details that should be added: Value Area : Business (directly deliver customer value) vs. Architectural (technical services to implement business features). Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. Time Criticality : Higher values indicate an item is more time critical than items with lower values. Target Date by which the feature should be implemented. You may use work item tags to support queries and filtering. User Stories / Product Backlog Items Each User Story / Product Backlog Item should be sized so that they can be completed within a sprint. You should add the following details to the items: Title : Usually expressed as \"As a [persona], I want [to perform an action], so that [I can achieve an end result].\". Description : Provide enough detail to create shared understanding of scope and support estimation efforts. Focus on the user, what they want to accomplish, and why. Don't describe how to develop the product. Provide enough details so the team can write tasks and test cases to implement the item. Include Design Reviews. Acceptance Criteria : Define what \"Done\" means. 
Activity : Deployment, Design, Development, Documentation, Requirements, Testing. Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. Original Estimate : The amount of estimated work required to complete a task. Remember to use the Discussion section of the items to keep track of related comments, and mention individuals, groups, work items or pull requests when required. Tasks Each Task should be sized so that they can be completed within a day. You should at least add the following details to the items: Title . Description : Provide enough detail to create shared understanding of scope. Any developer should be able to take the item and know what needs to be implemented. Include Design Reviews. Reference to the working branch in related code repository. Remember to use the Discussion section of the tasks to keep track of related comments. Bugs You should use bugs to capture both the initial issue and ongoing discoveries. You should at least add the following details to the bug items: Title . Description . Steps to Reproduce . System Info / Found in Build : Software and system configuration that is relevant to the bug and tests to apply. Acceptance Criteria : Criteria to meet so the bug can be closed. Integrated in Build : Name of the build that incorporates the code that fixes the bug. Priority : 1: Product should not ship without the successful resolution of the work item. The bug should be addressed as soon as possible. 2: Product should not ship without the successful resolution of the work item, but it does not need to be addressed immediately. 3: Resolution of the work item is optional based on resources, time, and risk. Severity : 1 - Critical: Must fix. No acceptable alternative methods. 2 - High: Consider fix. An acceptable alternative method exists. 3 - Medium: (Default). 4 - Low. Issues / Impediments Don't confuse with bugs. They represent unplanned activities that may block work from getting done. For example: feature ambiguity, personnel or resource issues, problems with environments, or other risks that impact scope, quality, or schedule. In general, you link these items to user stories or other work items. Actions from Retrospectives After a retrospective, every action that requires work should be tracked with its own Task or Issue / Impediment. These items might be unparented (without link to parent backlog item or user story). Related information Best practices for Agile project management - Azure Boards | Microsoft Docs . Define features and epics, organize backlog items - Azure Boards | Microsoft Docs . Create your product backlog - Azure Boards | Microsoft Docs . Add tasks to support sprint planning - Azure Boards | Microsoft Docs . Define, capture, triage, and manage bugs or code defects - Azure Boards | Microsoft Docs . Add and manage issues or impediments - Azure Boards | Microsoft Docs .","title":"Work Items"},{"location":"documentation/guidance/work-items/#work-items","text":"While many teams can work with a flat list of items, sometimes it helps to group related items into a hierarchical structure. You can use portfolio backlogs to bring more order to your backlog. 
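To make the recommended details concrete, here is a purely illustrative sketch of a user story expressed as YAML; work items live in Azure Boards rather than in files, so the field names and values below are only an example of the information to capture:

title: "As a returning customer, I want to reuse a saved shipping address, so that I can check out faster."
description: >
  Covers selecting a previously saved address during checkout.
  Includes the design review for the address-selection UI.
acceptance_criteria:
  - A saved address can be selected on the checkout page
  - The selected address appears on the order confirmation
activity: Development
story_points: 3
business_value: 20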
Agile process backlog work item hierarchy: Scrum process backlog work item hierarchy: Bugs can be set at the same level as User Stories / Product Backlog Items or Tasks.","title":"Work Items"},{"location":"documentation/guidance/work-items/#epics-and-features","text":"User stories / Product Backlog Items roll up into Features , which typically represent a shippable deliverable that addresses a customer need e.g., \"Add shopping cart\". And Features roll up into Epics , which represent a business initiative to be accomplished e.g., \"Increase customer engagement\". Take that into account when naming them. Each Feature or Epic should include as much detail as the team needs to: Understand the scope. Estimate the work required. Develop tests. Ensure the end product meets acceptance criteria. Details that should be added: Value Area : Business (directly deliver customer value) vs. Architectural (technical services to implement business features). Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. Time Criticality : Higher values indicate an item is more time critical than items with lower values. Target Date by which the feature should be implemented. You may use work item tags to support queries and filtering.","title":"Epics and Features"},{"location":"documentation/guidance/work-items/#user-stories-product-backlog-items","text":"Each User Story / Product Backlog Item should be sized so that they can be completed within a sprint. You should add the following details to the items: Title : Usually expressed as \"As a [persona], I want [to perform an action], so that [I can achieve an end result].\". Description : Provide enough detail to create shared understanding of scope and support estimation efforts. Focus on the user, what they want to accomplish, and why. Don't describe how to develop the product. Provide enough details so the team can write tasks and test cases to implement the item. Include Design Reviews. Acceptance Criteria : Define what \"Done\" means. Activity : Deployment, Design, Development, Documentation, Requirements, Testing. Effort / Story Points / Size : Relative estimate of the amount of work required to complete the item. Business Value : Priority of an item compared to other items of the same type. Original Estimate : The amount of estimated work required to complete a task. Remember to use the Discussion section of the items to keep track of related comments, and mention individuals, groups, work items or pull requests when required.","title":"User Stories / Product Backlog Items"},{"location":"documentation/guidance/work-items/#tasks","text":"Each Task should be sized so that they can be completed within a day. You should at least add the following details to the items: Title . Description : Provide enough detail to create shared understanding of scope. Any developer should be able to take the item and know what needs to be implemented. Include Design Reviews. Reference to the working branch in related code repository. Remember to use the Discussion section of the tasks to keep track of related comments.","title":"Tasks"},{"location":"documentation/guidance/work-items/#bugs","text":"You should use bugs to capture both the initial issue and ongoing discoveries. You should at least add the following details to the bug items: Title . Description . Steps to Reproduce . 
System Info / Found in Build : Software and system configuration that is relevant to the bug and tests to apply. Acceptance Criteria : Criteria to meet so the bug can be closed. Integrated in Build : Name of the build that incorporates the code that fixes the bug. Priority : 1: Product should not ship without the successful resolution of the work item. The bug should be addressed as soon as possible. 2: Product should not ship without the successful resolution of the work item, but it does not need to be addressed immediately. 3: Resolution of the work item is optional based on resources, time, and risk. Severity : 1 - Critical: Must fix. No acceptable alternative methods. 2 - High: Consider fix. An acceptable alternative method exists. 3 - Medium: (Default). 4 - Low.","title":"Bugs"},{"location":"documentation/guidance/work-items/#issues-impediments","text":"Don't confuse with bugs. They represent unplanned activities that may block work from getting done. For example: feature ambiguity, personnel or resource issues, problems with environments, or other risks that impact scope, quality, or schedule. In general, you link these items to user stories or other work items.","title":"Issues / Impediments"},{"location":"documentation/guidance/work-items/#actions-from-retrospectives","text":"After a retrospective, every action that requires work should be tracked with its own Task or Issue / Impediment. These items might be unparented (without link to parent backlog item or user story).","title":"Actions from Retrospectives"},{"location":"documentation/guidance/work-items/#related-information","text":"Best practices for Agile project management - Azure Boards | Microsoft Docs . Define features and epics, organize backlog items - Azure Boards | Microsoft Docs . Create your product backlog - Azure Boards | Microsoft Docs . Add tasks to support sprint planning - Azure Boards | Microsoft Docs . Define, capture, triage, and manage bugs or code defects - Azure Boards | Microsoft Docs . Add and manage issues or impediments - Azure Boards | Microsoft Docs .","title":"Related information"},{"location":"documentation/recipes/deploy-docfx-azure-website/","text":"Deploy the DocFx Documentation Website to an Azure Website Automatically In the article Using DocFx and Companion Tools to generate a Documentation website the process is described to generate content of a documentation website using DocFx. This document describes how to setup an Azure Website to host the content and automate the deployment to it using a pipeline in Azure DevOps. The QuickStart sample that is provided for a quick setup of DocFx generation also contains the files explained in this document. Especially the .pipelines and infrastructure folders. The following steps can be followed when using the Quick Start folder. In the infrastructure folder you can find the Terraform files to create the website in an Azure environment. Out of the box, the script will create a website where the documentation content can be deployed to. 1. Install Terraform You can use tools like Chocolatey to install Terraform: choco install terraform 2. Set the Proper Variables Note: Make sure you modify the value of the app_name , rg_name and rg_location variables. The app_name value is appended by azurewebsites.net and must be unique. Otherwise the script will fail that it cannot create the website. In the Quick Start, authentication is disabled. If you want that enabled, make sure you have create an Application in the Azure AD and have the client ID . 
This client id must be set as the value of the client_id variable in variables.tf . In the main.tf make sure you uncomment the authentication settings in the app-service . For more information see Configure Azure AD authentication - Azure App Service . If you want to set a custom domain for your documentation website with an SSL certificate you have to do some extra steps. You have to create a Key Vault and store the certificate there. Next step is to uncomment and set the values in variables.tf . You also have to uncomment the necessary steps in main.tf . All is indicated by comment-boxes. For more information see Add a TLS/SSL certificate in Azure App Service . Some extra information on SSL certificate, custom domain and Azure App Service can be found in the following paragraphs. If you are familiar with that or don't need it, go ahead and continue with Step 3 . SSL Certificate To secure a website with a custom domain name and a certificate, you can find the steps to take in the article Add a TLS/SSL certificate in Azure App Service . That article also contains a description of ways to obtain a certificate and the requirements for a certificate. Usually you'll get a certificate from the customers IT department. If you want to start with a development certificate to test the process, you can create one yourself. You can do that in PowerShell with the script below. Replace: [YOUR DOMAIN] with the domain you would like to register, e.g. docs.somewhere.com [PASSWORD] with a password of the certificate. It's required for uploading a certificate in the Key Vault to have a password. You'll need this password in that step. [FILENAME] for the output file name of the certificate. You can even insert the path here where it should be store on your machine. You can store this script in a PowerShell script file (ps1 extension). $cert = New-SelfSignedCertificate -CertStoreLocation cert :\\ currentuser \\ my -Subject \"cn=[YOUR DOMAIN]\" -DnsName \"[YOUR DOMAIN]\" $pwd = ConvertTo-SecureString -String '[PASSWORD]' -Force -AsPlainText $path = 'cert:\\currentuser\\my\\' + $cert . thumbprint Export-PfxCertificate -cert $path -FilePath [FILENAME] . pfx -Password $pwd The certificate needs to be stored in the common Key Vault. Go to Settings > Certificates in the left menu of the Key Vault and click Generate/Import . Provide these details: Method of Certificate Creation: Import Certificate name: e.g. ssl-certificate Upload Certificate File: select the file on disc for this. Password: this is the [PASSWORD] we reference earlier. Custom Domain Registration To use a custom domain a few things need to be done. The process in the Azure portal is described in the article Tutorial: Map an existing custom DNS name to Azure App Service . An important part is described under the header Get a domain verification ID . This ID needs to be registered with the DNS description as a TXT record. Important to know is that this Custom Domain Verification ID is the same for all web resources in the same Azure subscription. See this StackOverflow issue . This means that this ID needs to be registered only once for one Azure Subscription. And this enables (re)creation of an App Service with the custom domain though script. Add Get-permissions for Microsoft Azure App Service The Azure App Service needs to access the Key Vault to get the certificate. This is needed for the first run, but also when the certificate is renewed in the Key Vault. 
For this purpose the Azure App Service accesses the Key Vault with the App Service resource provided identity. This identity can be found with the service principal name abfa0a7c-a6b6-4736-8310-5855508787cd or Microsoft Azure App Service and is of type Application . This ID is the same for all Azure subscriptions. It needs to have Get-permissions on secrets and certificates. For more information see this article Import a certificate from Key Vault . Add the Custom Domain and SSL Certificate to the App Service Once we have the SSL certificate and there is a complete DNS registration as described, we can uncomment the code in the Terraform script from the Quick Start folder to attach this to the App Service. In this script you need to reference the certificate in the common Key Vault and use it in the custom hostname binding. The custom hostname is assigned in the script as well. The settings ssl_state needs to be SniEnabled if you're using an SSL certificate. Now the creation of the authenticated website with a custom domain is automated. 3. Deploy Azure Resources from Your Local Machine Open up a command prompt. For the commands to be executed, you need to have a connection to your Azure subscription. This can be done using Azure Cli . Type this command: az login This will use the web browser to login to your account. You can check the connected subscription with this command: az account show If you have to change to another subscription, use this command where you replace [id] with the id of the subscription to select: az account set --subscription [ id ] Once this is done run this command to initialize: terraform init Now you can run the command to plan what the script will do. You run this command every time changes are made to the terraform scripts: terraform plan Inspect the result shown. If that is what you expect, apply these changes with this command: terraform apply When asked for approval, type \"yes\" and ENTER. You can also add the -auto-approve flag to the apply command. The deployment using Terraform is not included in the pipeline from the Quick Start folder as described in the next step, as that asks for more configuration. But of course that can always be added. 4. Deploy the Website from a Pipeline The best way to create the resources and deploy to it, is to do this automatically in a pipeline. For this purpose the .pipelines/documentation.yml pipeline is provided. This pipeline is built for an Azure DevOps environment. Create a pipeline and reference this YAML file. Note: the Quick Start folder contains a web.config that is needed for deployment to IIS or Azure App Service. This enables the use of the json file for search requests. If you don't have this in place, the search of text will never return anything and result in 404's under the hood. You have to create a Service Connection in your DevOps environment to connect to the Azure Subscription you want to deploy to. Note: set the variables AzureConnectionName to the name of the Service Connection and the AzureAppServiceName to the name you determined in the infrastructure/variables.tf . In the Quick Start folder the pipeline uses master as trigger, which means that any push being done to master triggers the pipeline. 
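The YAML below is only a rough sketch of the shape such a documentation pipeline can take; the actual .pipelines/documentation.yml in the QuickStart folder is more complete, and the task choices and placeholder values here are assumptions:

trigger:
  branches:
    include:
      - master                      # the QuickStart default; change this to your own branch

variables:
  AzureConnectionName: '<name of your Service Connection>'
  AzureAppServiceName: '<name set in infrastructure/variables.tf>'

steps:
  - script: choco install docfx markdownlint-cli -y
    displayName: Install DocFx and markdownlint-cli
  - script: markdownlint **/*.md
    displayName: Check markdown formatting
  - script: docfx docfx.json
    displayName: Generate the documentation website
  - task: AzureWebApp@1
    displayName: Deploy the generated _site folder to the Azure App Service
    inputs:
      azureSubscription: $(AzureConnectionName)
      appName: $(AzureAppServiceName)
      package: $(System.DefaultWorkingDirectory)/_site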
You will probably change this to another branch.","title":"Deploy the DocFx Documentation Website to an Azure Website Automatically"},{"location":"documentation/recipes/deploy-docfx-azure-website/#deploy-the-docfx-documentation-website-to-an-azure-website-automatically","text":"In the article Using DocFx and Companion Tools to generate a Documentation website the process is described to generate content of a documentation website using DocFx. This document describes how to setup an Azure Website to host the content and automate the deployment to it using a pipeline in Azure DevOps. The QuickStart sample that is provided for a quick setup of DocFx generation also contains the files explained in this document. Especially the .pipelines and infrastructure folders. The following steps can be followed when using the Quick Start folder. In the infrastructure folder you can find the Terraform files to create the website in an Azure environment. Out of the box, the script will create a website where the documentation content can be deployed to.","title":"Deploy the DocFx Documentation Website to an Azure Website Automatically"},{"location":"documentation/recipes/deploy-docfx-azure-website/#1-install-terraform","text":"You can use tools like Chocolatey to install Terraform: choco install terraform","title":"1. Install Terraform"},{"location":"documentation/recipes/deploy-docfx-azure-website/#2-set-the-proper-variables","text":"Note: Make sure you modify the value of the app_name , rg_name and rg_location variables. The app_name value is appended by azurewebsites.net and must be unique. Otherwise the script will fail that it cannot create the website. In the Quick Start, authentication is disabled. If you want that enabled, make sure you have create an Application in the Azure AD and have the client ID . This client id must be set as the value of the client_id variable in variables.tf . In the main.tf make sure you uncomment the authentication settings in the app-service . For more information see Configure Azure AD authentication - Azure App Service . If you want to set a custom domain for your documentation website with an SSL certificate you have to do some extra steps. You have to create a Key Vault and store the certificate there. Next step is to uncomment and set the values in variables.tf . You also have to uncomment the necessary steps in main.tf . All is indicated by comment-boxes. For more information see Add a TLS/SSL certificate in Azure App Service . Some extra information on SSL certificate, custom domain and Azure App Service can be found in the following paragraphs. If you are familiar with that or don't need it, go ahead and continue with Step 3 .","title":"2. Set the Proper Variables"},{"location":"documentation/recipes/deploy-docfx-azure-website/#ssl-certificate","text":"To secure a website with a custom domain name and a certificate, you can find the steps to take in the article Add a TLS/SSL certificate in Azure App Service . That article also contains a description of ways to obtain a certificate and the requirements for a certificate. Usually you'll get a certificate from the customers IT department. If you want to start with a development certificate to test the process, you can create one yourself. You can do that in PowerShell with the script below. Replace: [YOUR DOMAIN] with the domain you would like to register, e.g. docs.somewhere.com [PASSWORD] with a password of the certificate. It's required for uploading a certificate in the Key Vault to have a password. 
You'll need this password in that step. [FILENAME] for the output file name of the certificate. You can even insert the path here where it should be store on your machine. You can store this script in a PowerShell script file (ps1 extension). $cert = New-SelfSignedCertificate -CertStoreLocation cert :\\ currentuser \\ my -Subject \"cn=[YOUR DOMAIN]\" -DnsName \"[YOUR DOMAIN]\" $pwd = ConvertTo-SecureString -String '[PASSWORD]' -Force -AsPlainText $path = 'cert:\\currentuser\\my\\' + $cert . thumbprint Export-PfxCertificate -cert $path -FilePath [FILENAME] . pfx -Password $pwd The certificate needs to be stored in the common Key Vault. Go to Settings > Certificates in the left menu of the Key Vault and click Generate/Import . Provide these details: Method of Certificate Creation: Import Certificate name: e.g. ssl-certificate Upload Certificate File: select the file on disc for this. Password: this is the [PASSWORD] we reference earlier.","title":"SSL Certificate"},{"location":"documentation/recipes/deploy-docfx-azure-website/#custom-domain-registration","text":"To use a custom domain a few things need to be done. The process in the Azure portal is described in the article Tutorial: Map an existing custom DNS name to Azure App Service . An important part is described under the header Get a domain verification ID . This ID needs to be registered with the DNS description as a TXT record. Important to know is that this Custom Domain Verification ID is the same for all web resources in the same Azure subscription. See this StackOverflow issue . This means that this ID needs to be registered only once for one Azure Subscription. And this enables (re)creation of an App Service with the custom domain though script.","title":"Custom Domain Registration"},{"location":"documentation/recipes/deploy-docfx-azure-website/#add-get-permissions-for-microsoft-azure-app-service","text":"The Azure App Service needs to access the Key Vault to get the certificate. This is needed for the first run, but also when the certificate is renewed in the Key Vault. For this purpose the Azure App Service accesses the Key Vault with the App Service resource provided identity. This identity can be found with the service principal name abfa0a7c-a6b6-4736-8310-5855508787cd or Microsoft Azure App Service and is of type Application . This ID is the same for all Azure subscriptions. It needs to have Get-permissions on secrets and certificates. For more information see this article Import a certificate from Key Vault .","title":"Add Get-permissions for Microsoft Azure App Service"},{"location":"documentation/recipes/deploy-docfx-azure-website/#add-the-custom-domain-and-ssl-certificate-to-the-app-service","text":"Once we have the SSL certificate and there is a complete DNS registration as described, we can uncomment the code in the Terraform script from the Quick Start folder to attach this to the App Service. In this script you need to reference the certificate in the common Key Vault and use it in the custom hostname binding. The custom hostname is assigned in the script as well. The settings ssl_state needs to be SniEnabled if you're using an SSL certificate. Now the creation of the authenticated website with a custom domain is automated.","title":"Add the Custom Domain and SSL Certificate to the App Service"},{"location":"documentation/recipes/deploy-docfx-azure-website/#3-deploy-azure-resources-from-your-local-machine","text":"Open up a command prompt. 
For the commands to be executed, you need to have a connection to your Azure subscription. This can be done using Azure Cli . Type this command: az login This will use the web browser to login to your account. You can check the connected subscription with this command: az account show If you have to change to another subscription, use this command where you replace [id] with the id of the subscription to select: az account set --subscription [ id ] Once this is done run this command to initialize: terraform init Now you can run the command to plan what the script will do. You run this command every time changes are made to the terraform scripts: terraform plan Inspect the result shown. If that is what you expect, apply these changes with this command: terraform apply When asked for approval, type \"yes\" and ENTER. You can also add the -auto-approve flag to the apply command. The deployment using Terraform is not included in the pipeline from the Quick Start folder as described in the next step, as that asks for more configuration. But of course that can always be added.","title":"3. Deploy Azure Resources from Your Local Machine"},{"location":"documentation/recipes/deploy-docfx-azure-website/#4-deploy-the-website-from-a-pipeline","text":"The best way to create the resources and deploy to it, is to do this automatically in a pipeline. For this purpose the .pipelines/documentation.yml pipeline is provided. This pipeline is built for an Azure DevOps environment. Create a pipeline and reference this YAML file. Note: the Quick Start folder contains a web.config that is needed for deployment to IIS or Azure App Service. This enables the use of the json file for search requests. If you don't have this in place, the search of text will never return anything and result in 404's under the hood. You have to create a Service Connection in your DevOps environment to connect to the Azure Subscription you want to deploy to. Note: set the variables AzureConnectionName to the name of the Service Connection and the AzureAppServiceName to the name you determined in the infrastructure/variables.tf . In the Quick Start folder the pipeline uses master as trigger, which means that any push being done to master triggers the pipeline. You will probably change this to another branch.","title":"4. Deploy the Website from a Pipeline"},{"location":"documentation/recipes/static-website-with-mkdocs/","text":"How to Create a Static Website for Your Documentation Based on mkdocs and mkdocs-material MkDocs is a tool built to create static websites from raw markdown files. Other alternatives include Sphinx , and Jekyll . We used MkDocs to create ISE Engineering Fundamentals Playbook static website from the contents in the GitHub repository . Then we deployed it to GitHub Pages . We found MkDocs to be a good choice since: It's easy to set up and looks great even with the vanilla version. It works well with markdown, which is what we already have in the Playbook. It uses a Python stack which is friendly to many contributors of this Playbook. For comparison, Sphinx mainly generates docs from restructured-text (rst) format, and Jekyll is written in Ruby. To setup an MkDocs website, the main assets needed are: An mkdocs.yaml file, similar to the one we have in the Playbook . This is the configuration file that defines the appearance of the website, the navigation, the plugins used and more. A folder named docs (the default value for the directory) that contains the documentation source files. 
A GitHub Action for automatically generating the website (e.g. on every commit to main), similar to this one from the Playbook . A list of plugins used during the build phase of the website. We specified ours here . And these are the plugins we've used: - Material for MkDocs : Material design appearance and user experience. - pymdown-extensions : Improves the appearance of markdown based content. - mdx_truly_sane_lists : For defining the indent level for lists without having to refactor the entire documentation we already had in the Playbook. Setting up locally is very easy. See Getting Started with MkDocs for details. For publishing the website, there's a good integration with GitHub for storing the website as a GitHub Page . Resources MkDocs Plugins The best MkDocs plugins and customizations","title":"How to Create a Static Website for Your Documentation Based on mkdocs and mkdocs-material"},{"location":"documentation/recipes/static-website-with-mkdocs/#how-to-create-a-static-website-for-your-documentation-based-on-mkdocs-and-mkdocs-material","text":"MkDocs is a tool built to create static websites from raw markdown files. Other alternatives include Sphinx , and Jekyll . We used MkDocs to create ISE Engineering Fundamentals Playbook static website from the contents in the GitHub repository . Then we deployed it to GitHub Pages . We found MkDocs to be a good choice since: It's easy to set up and looks great even with the vanilla version. It works well with markdown, which is what we already have in the Playbook. It uses a Python stack which is friendly to many contributors of this Playbook. For comparison, Sphinx mainly generates docs from restructured-text (rst) format, and Jekyll is written in Ruby. To setup an MkDocs website, the main assets needed are: An mkdocs.yaml file, similar to the one we have in the Playbook . This is the configuration file that defines the appearance of the website, the navigation, the plugins used and more. A folder named docs (the default value for the directory) that contains the documentation source files. A GitHub Action for automatically generating the website (e.g. on every commit to main), similar to this one from the Playbook . A list of plugins used during the build phase of the website. We specified ours here . And these are the plugins we've used: - Material for MkDocs : Material design appearance and user experience. - pymdown-extensions : Improves the appearance of markdown based content. - mdx_truly_sane_lists : For defining the indent level for lists without having to refactor the entire documentation we already had in the Playbook. Setting up locally is very easy. See Getting Started with MkDocs for details. For publishing the website, there's a good integration with GitHub for storing the website as a GitHub Page .","title":"How to Create a Static Website for Your Documentation Based on mkdocs and mkdocs-material"},{"location":"documentation/recipes/static-website-with-mkdocs/#resources","text":"MkDocs Plugins The best MkDocs plugins and customizations","title":"Resources"},{"location":"documentation/recipes/sync-wiki-between-repos/","text":"How to Sync a Wiki Between Repositories This is a quick guide to mirroring a Project Wiki to another repository. 
# Clone the wiki git clone < source wiki repo url> # Add mirror repository as a remote cd < source wiki repo working folder> git remote add mirror <mirror repo that must already exist> Now each time you wish to sync run the following to get latest from the source wiki repo: # Get everything git pull -v Warning : Check that the output of the pull shows \"From source repo URL\". If this shows the mirror repo url then you've forgotten to reset the tracking. Run git branch -u origin/wikiMaster then continue. Then run this to push it to the mirror repo and reset the branch to track the source repo again: # Push all branches up to mirror remote git push -u mirror # Reset local to track source remote git branch -u origin/wikiMaster Your output should look like this when run: PS C:\\Git\\MyProject.wiki> git pull -v POST git-upload-pack (909 bytes) remote: Azure Repos remote: Found 5 objects to send. (0 ms) Unpacking objects: 100% (5/5), done. From https://..... wikiMaster -> origin/wikiMaster Updating 7412b94..a0f543b Fast-forward .../dffffds.md | 4 ++++ 1 file changed, 4 insertions(+) PS C:\\Git\\MyProject.wiki> git push -u mirror Enumerating objects: 9, done. Counting objects: 100% (9/9), done. Delta compression using up to 8 threads Compressing objects: 100% (5/5), done. Writing objects: 100% (5/5), 2.08 KiB | 2.08 MiB/s, done. Total 5 (delta 4), reused 0 (delta 0) remote: Analyzing objects... (5/5) (6 ms) remote: Storing packfile... done (48 ms) remote: Storing index... done (59 ms) To https://...... 7412b94..a0f543b wikiMaster -> wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'mirror'. PS C:\\Git\\MyProject.wiki> git branch -u origin/wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'origin'.","title":"How to Sync a Wiki Between Repositories"},{"location":"documentation/recipes/sync-wiki-between-repos/#how-to-sync-a-wiki-between-repositories","text":"This is a quick guide to mirroring a Project Wiki to another repository. # Clone the wiki git clone < source wiki repo url> # Add mirror repository as a remote cd < source wiki repo working folder> git remote add mirror <mirror repo that must already exist> Now each time you wish to sync run the following to get latest from the source wiki repo: # Get everything git pull -v Warning : Check that the output of the pull shows \"From source repo URL\". If this shows the mirror repo url then you've forgotten to reset the tracking. Run git branch -u origin/wikiMaster then continue. Then run this to push it to the mirror repo and reset the branch to track the source repo again: # Push all branches up to mirror remote git push -u mirror # Reset local to track source remote git branch -u origin/wikiMaster Your output should look like this when run: PS C:\\Git\\MyProject.wiki> git pull -v POST git-upload-pack (909 bytes) remote: Azure Repos remote: Found 5 objects to send. (0 ms) Unpacking objects: 100% (5/5), done. From https://..... wikiMaster -> origin/wikiMaster Updating 7412b94..a0f543b Fast-forward .../dffffds.md | 4 ++++ 1 file changed, 4 insertions(+) PS C:\\Git\\MyProject.wiki> git push -u mirror Enumerating objects: 9, done. Counting objects: 100% (9/9), done. Delta compression using up to 8 threads Compressing objects: 100% (5/5), done. Writing objects: 100% (5/5), 2.08 KiB | 2.08 MiB/s, done. Total 5 (delta 4), reused 0 (delta 0) remote: Analyzing objects... (5/5) (6 ms) remote: Storing packfile... done (48 ms) remote: Storing index... done (59 ms) To https://...... 
7412b94..a0f543b wikiMaster -> wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'mirror'. PS C:\\Git\\MyProject.wiki> git branch -u origin/wikiMaster Branch 'wikiMaster' set up to track remote branch 'wikiMaster' from 'origin'.","title":"How to Sync a Wiki Between Repositories"},{"location":"documentation/recipes/using-docfx-and-tools/","text":"Using DocFx and Companion Tools to Generate a Documentation Website If you want an easy way to have a website with all your documentation coming from Markdown files and comments coming from code, you can use DocFx . The website generated by DocFx also includes fast search capabilities. There are some gaps in the DocFx solution, but we've provided companion tools that help you fill those gaps. Also see the blog post Providing quality documentation in your project with DocFx and Companion Tools for more explanation about the solution. Prerequisites This document is best followed by cloning the sample from https://github.com/mtirionMSFT/DocFxQuickStart first. Copy the contents of the QuickStart folder to the root of your own repository to get started in your own environment. Quick Start TL;DR: If you want a really quick start using Azure DevOps and Azure App Service without reading the what and how, follow these steps: Azure DevOps: If you don't have it yet, create a project in Azure DevOps and create a Service Connection to your Azure environment . Clone the repository. QuickStart folder: Copy the contents of the QuickStart folder in that repository, which can be found on https://github.com/mtirionMSFT/DocFxQuickStart , to the root of the repository. Azure: Create a resource group in your Azure environment where the documentation website resources should be created. Create Azure resources: Fill in the default values in infrastructure/variables.tf and run the commands from Step 3 - Deploy Azure resources from your local machine to create the Azure Resources. Pipeline: Fill in the variables in .pipelines/documentation.yml , commit the changes and push the contents of the repository to your branch (possibly through a PR). Now you can create a pipeline in your Azure DevOps project that uses the .pipelines/documentation.yml and run it. Documents and Projects Folder Structure The easiest approach is to work with a mono repository where documentation and code live together. If that's not the case in your situation but you still want to combine multiple repositories into one documentation website, you'll have to clone all repositories first to be able to combine the information. In this recipe we'll assume a monorepo is used. 
In the steps below we'll consider the generation of the documentation website from this content structure: \u251c\u2500\u2500 .pipelines // Azure DevOps pipeline for automatic generation and deployment \u2502 \u251c\u2500\u2500 docs // all documents \u2502 \u251c\u2500\u2500 .attachments // all images and other attachments used by documents \u2502 \u251c\u2500\u2500 infrastructure // Terraform scripts for creation of the Azure website \u2502 \u251c\u2500\u2500 src // all projects \u2502 \u251c\u2500\u2500 build // build settings \u2502 \u251c\u2500\u2500 dotnet // .NET build settings \u2502 \u251c\u2500\u2500 Directory.Build.props // project settings for all .NET projects in sub folders \u2502 \u251c\u2500\u2500 [ Project folders ] \u2502 \u251c\u2500\u2500 x-cross \u2502 \u251c\u2500\u2500 toc.yml // Cross reference definition ( optional ) \u2502 \u251c\u2500\u2500 .markdownlint.json // Markdownlinter settings \u251c\u2500\u2500 docfx.json // DocFx configuration \u251c\u2500\u2500 index.md // Website landing page \u251c\u2500\u2500 toc.yml // Definition of the website header content links \u251c\u2500\u2500 web.config // web.config to enable search in deployed website We'll be using the DocLinkChecker tool to validate all links in documentation and for orphaned attachments. That's the reason we have all attachments in the .attachments folder. In the generated website from the QuickStart folder you'll see that the hierarchies of documentation and references is combined in the left table of contents. This is achieved by the definition and use of x-cross\\toc.yml . If you don't want the hierarchies combined, just remove the folder and file from your environment and (re)generate the website. A .markdownlint.json is included with the contents below. The MD013 setting is set to false to prevent checking for maximum line length. You can modify this file to your likings to include or exclude certain tests. { \"MD013\" : false } The contents of the .pipelines and infrastructure folders are explained in the recipe Deploy the DocFx Documentation website to an Azure Website automatically . Reference Documentation from Source Code DocFx can generate reference documentation from code, where C# and Typescript are supported best at the moment. In the QuickStart folder we only used C# projects. For DocFx to generate quality reference documentation, quality triple slash-comments are required. See Triple-slash (///) Code Comments Support . To enforce this, it's a good idea to enforce the use of StyleCop . There are a few steps that will give you an easy start with this. First, you can use the Directory.Build.props file in the /src folder in combination with the files in the build/dotnet folder. By having this, you enforce StyleCop in all Visual Studio project files in it's sub folders with a configuration of which rules should be used or ignored. You can tailor this to your needs of course. For more information, see Customize your build and Use rule sets to group code analysis rules . To make sure developers are forced to add the triple-slash comments by throwing compiler errors and to have the proper settings for the generation of documentation XML-files, add the TreatWarningsAsErrors and GenerateDocumentationFile settings to every .csproj file. You can add that in the first PropertyGroup settings like this: <Project Sdk= \"Microsoft.NET.Sdk\" > <PropertyGroup> ... <GenerateDocumentationFile> true </GenerateDocumentationFile> <TreatWarningsAsErrors> true </TreatWarningsAsErrors> </PropertyGroup> ... 
</Project> Now you are all set to generate documentation from your C# code. For more information about languages supported by DocFx and how to configure it, see Introduction to Multiple Languages Support . Note: You can also add a PropertyGroup definition with the two settings in Directory.Build.props to have that settings in all projects. But in that case it will also be inherited in your Test projects. 1. Install DocFx and markdownlint-cli Go to the DocFx website to the Download section and download the latest version of DocFx. Go to the github page of markdownlint-cli to find download and install options. You can also use tools like Chocolatey to install: choco install docfx choco install markdownlint-cli 2. Configure DocFx Configuration for DocFx is done in a docfx.json file. Store this file in the root of your repository. Note: You can store the docfx.json somewhere else in the hierarchy, but then you need to provide the path of the file as an argument to the docfx command so it can be located. Below is a good configuration to start with, where documentation is in the /docs folder and the sources are in the /src folder: { \"metadata\" : [ { \"src\" : [ { \"files\" : [ \"src/**.csproj\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"dest\" : \"reference\" , \"disableGitFeatures\" : false } ], \"build\" : { \"content\" : [ { \"files\" : [ \"reference/**\" ] }, { \"files\" : [ \"**.md\" , \"**/toc.yml\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"resource\" : [ { \"files\" : [ \"docs/.attachments/**\" ] }, { \"files\" : [ \"web.config\" ] } ], \"template\" : [ \"templates/cse\" ], \"globalMetadata\" : { \"_appTitle\" : \"CSE Documentation\" , \"_enableSearch\" : true }, \"markdownEngineName\" : \"markdig\" , \"dest\" : \"_site\" , \"xrefService\" : [ \"https://xref.learn.microsoft.com/query?uid={uid}\" ] } } 3. Setup Some Basic Documents We suggest starting with a basic documentation structure in the /docs folder. In the provided QuickStart folder we have a basic setup: \u251c\u2500\u2500 docs \u2502 \u251c\u2500\u2500 .attachments // All images and other attachments used by documents \u2502 \u2502 \u251c\u2500\u2500 architecture-decisions \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 decision-log.md // Sample index into all ADRs \u2502 \u2514\u2500\u2500 README.md // Landing page architecture decisions \u2502 \u2502 \u251c\u2500\u2500 getting-started \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // This recipe document. Replace the content with something meaningful to the project \u2502 \u2502 \u251c\u2500\u2500 guidelines \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 docs-guidelines.md // General documentation guidelines \u2502 \u2514\u2500\u2500 README.md // Landing page guidelines \u2502 \u2502 \u251c\u2500\u2500 templates // all templates like ADR template and such \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page templates \u2502 \u2502 \u251c\u2500\u2500 working-agreements \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page working agreements \u2502 \u2502 \u251c\u2500\u2500 .order // Providing a fixed order of files and directories \u2502 \u251c\u2500\u2500 index.md // Landing page documentation You can use templates like working agreements and such from the ISE Playbook . 
To have a proper landing page of your documentation website, you can use a markdown file called INDEX.MD in the root of your repository. Contents can be something like this: # ISE Documentation This is the landing page of the ISE Documentation website. This is the page to introduce everything on the website. You can add specific links that are important to provide direct access. > Try not to duplicate the links on the top of the page, unless it really makes sense. To get started with the setup of this website, read the getting started document with the title [ Using DocFx and Companion Tools ]( using-docfx-and-tools.md ). 4. Compile the Companion Tools and Run Them Note: To explain each step, we'll be going through the various steps in the next few paragraphs. In the provided sample, a batch-file called GenerateDocWebsite.cmd is included. This script will take all the necessary steps to compile the tools, execute the checks, generate the table of contents and execute docfx to generate the website. To check for proper markdown formatting the markdownlint-cli tool is used. The command takes it's configuration from the .markdownlint.json file in the root of the project. To check all markdown files, simply execute this command: markdownlint **/*.md In the QuickStart folder you should have copied in the two companion tools TocDocFxCreation and DocLinkChecker as described in the introduction of this article. You can compile the tools from Visual Studio, but you can also run dotnet build in both tool folders. The DocLinkChecker companion tool is used to validate what's in the docs folder. It validates links between documents and attachments in the docs folder and checks if there aren't orphaned attachments. An example of executing this tool, where the check of attachments is included: DocLinkChecker.exe -d ./docs -a The TocDocFxCreation tool is needed to generate a table of contents for your documentation, so users can navigate between folders and documents. If you have compiled the tool, use this command to generate a table of content file toc.yml . To generate a table of contents with the use of the .order files for determining the sequence of articles and to automatically generate index.md documents if no default document is available in a folder, this command can be used: TocDocFxCreation.exe -d ./docs -sri 5. Run DocFx to Generate the Website Run the docfx command to generate the website, by default in the _site folder. TIP: If you want to check the website in your local environment, provide the --serve option to either the docfx command or the GenerateDocWebsite script. A small webserver is launched that hosts your website, which is accessible on localhost. Style of the Website If you started with the QuickStart folder, the website is generated using a custom theme using material design and the Microsoft logo. You can change this to your likings. For more information see How-to: Create A Custom Template | DocFX website (dotnet.github.io) . Deploy to an Azure Website After you completed the steps, you should have a default website generated in the _site folder. But of course, you want this to be accessible for everyone. So, the next step is to create for instance an Azure Website and have a process to automatically generate and deploy the contents to that website. That process is described in the recipe Deploy the DocFx Documentation website to an Azure Website automatically . 
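For reference, the toc.yml files mentioned throughout this recipe (the website header definition at the root and the files generated by TocDocFxCreation) are plain DocFx table-of-contents files. A minimal hand-written sketch, with illustrative names only, looks roughly like this:

- name: Documentation
  href: docs/
- name: Reference
  href: reference/
- name: Getting Started
  href: docs/getting-started/README.md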
Resources DocFX - static documentation generator Deploy the DocFx Documentation website to an Azure Website automatically Providing quality documentation in your project with DocFx and Companion Tools Monorepo For Beginners","title":"Using DocFx and Companion Tools to Generate a Documentation Website"},{"location":"documentation/recipes/using-docfx-and-tools/#using-docfx-and-companion-tools-to-generate-a-documentation-website","text":"If you want an easy way to have a website with all your documentation coming from Markdown files and comments coming from code, you can use DocFx . The website generated by DocFx also includes fast search capabilities. There are some gaps in the DocFx solution, but we've provided companion tools that help you fill those gaps. Also see the blog post Providing quality documentation in your project with DocFx and Companion Tools for more explanation about the solution.","title":"Using DocFx and Companion Tools to Generate a Documentation Website"},{"location":"documentation/recipes/using-docfx-and-tools/#prerequisites","text":"This document is followed best by cloning the sample from https://github.com/mtirionMSFT/DocFxQuickStart first. Copy the contents of the QuickStart folder to the root of your own repository to get started in your own environment.","title":"Prerequisites"},{"location":"documentation/recipes/using-docfx-and-tools/#quick-start","text":"TLDR; If you want a really quick start using Azure DevOps and Azure App Service without reading the what and how, follow these steps: Azure DevOps: If you don't have it yet, create a project in Azure DevOps and create a Service Connection to your Azure environment . Clone the repository. QuickStart folder: Copy the contents of the QuickStart folder in there repository that can be found on https://github.com/mtirionMSFT/DocFxQuickStart to the root of the repository. Azure: Create a resource group in your Azure environment where the documentation website resources should be created. Create Azure resources: Fill in the default values in infrastructure/variables.tf and run the commands from Step 3 - Deploy Azure resources from your local machine to create the Azure Resources. Pipeline: Fill in the variables in .pipelines/documentation.yml , commit the changes and push the contents of the repository to your branch (possibly through a PR). Now you can create a pipeline in your Azure DevOps project that uses the .pipelines/documentation.yml and run it.","title":"Quick Start"},{"location":"documentation/recipes/using-docfx-and-tools/#documents-and-projects-folder-structure","text":"The easiest is to work with a mono repository where documentation and code live together. If that's not the case in your situation but you still want to combine multiple repositories into one documentation website, you'll have to clone all repositories first to be able to combine the information. In this recipe we'll assume a monorepo is used. 
In the steps below we'll consider the generation of the documentation website from this content structure: \u251c\u2500\u2500 .pipelines // Azure DevOps pipeline for automatic generation and deployment \u2502 \u251c\u2500\u2500 docs // all documents \u2502 \u251c\u2500\u2500 .attachments // all images and other attachments used by documents \u2502 \u251c\u2500\u2500 infrastructure // Terraform scripts for creation of the Azure website \u2502 \u251c\u2500\u2500 src // all projects \u2502 \u251c\u2500\u2500 build // build settings \u2502 \u251c\u2500\u2500 dotnet // .NET build settings \u2502 \u251c\u2500\u2500 Directory.Build.props // project settings for all .NET projects in sub folders \u2502 \u251c\u2500\u2500 [ Project folders ] \u2502 \u251c\u2500\u2500 x-cross \u2502 \u251c\u2500\u2500 toc.yml // Cross reference definition ( optional ) \u2502 \u251c\u2500\u2500 .markdownlint.json // Markdownlinter settings \u251c\u2500\u2500 docfx.json // DocFx configuration \u251c\u2500\u2500 index.md // Website landing page \u251c\u2500\u2500 toc.yml // Definition of the website header content links \u251c\u2500\u2500 web.config // web.config to enable search in deployed website We'll be using the DocLinkChecker tool to validate all links in documentation and for orphaned attachments. That's the reason we have all attachments in the .attachments folder. In the generated website from the QuickStart folder you'll see that the hierarchies of documentation and references is combined in the left table of contents. This is achieved by the definition and use of x-cross\\toc.yml . If you don't want the hierarchies combined, just remove the folder and file from your environment and (re)generate the website. A .markdownlint.json is included with the contents below. The MD013 setting is set to false to prevent checking for maximum line length. You can modify this file to your likings to include or exclude certain tests. { \"MD013\" : false } The contents of the .pipelines and infrastructure folders are explained in the recipe Deploy the DocFx Documentation website to an Azure Website automatically .","title":"Documents and Projects Folder Structure"},{"location":"documentation/recipes/using-docfx-and-tools/#reference-documentation-from-source-code","text":"DocFx can generate reference documentation from code, where C# and Typescript are supported best at the moment. In the QuickStart folder we only used C# projects. For DocFx to generate quality reference documentation, quality triple slash-comments are required. See Triple-slash (///) Code Comments Support . To enforce this, it's a good idea to enforce the use of StyleCop . There are a few steps that will give you an easy start with this. First, you can use the Directory.Build.props file in the /src folder in combination with the files in the build/dotnet folder. By having this, you enforce StyleCop in all Visual Studio project files in it's sub folders with a configuration of which rules should be used or ignored. You can tailor this to your needs of course. For more information, see Customize your build and Use rule sets to group code analysis rules . To make sure developers are forced to add the triple-slash comments by throwing compiler errors and to have the proper settings for the generation of documentation XML-files, add the TreatWarningsAsErrors and GenerateDocumentationFile settings to every .csproj file. You can add that in the first PropertyGroup settings like this: <Project Sdk= \"Microsoft.NET.Sdk\" > <PropertyGroup> ... 
<GenerateDocumentationFile> true </GenerateDocumentationFile> <TreatWarningsAsErrors> true </TreatWarningsAsErrors> </PropertyGroup> ... </Project> Now you are all set to generate documentation from your C# code. For more information about languages supported by DocFx and how to configure it, see Introduction to Multiple Languages Support . Note: You can also add a PropertyGroup definition with the two settings in Directory.Build.props to have that settings in all projects. But in that case it will also be inherited in your Test projects.","title":"Reference Documentation from Source Code"},{"location":"documentation/recipes/using-docfx-and-tools/#1-install-docfx-and-markdownlint-cli","text":"Go to the DocFx website to the Download section and download the latest version of DocFx. Go to the github page of markdownlint-cli to find download and install options. You can also use tools like Chocolatey to install: choco install docfx choco install markdownlint-cli","title":"1. Install DocFx and markdownlint-cli"},{"location":"documentation/recipes/using-docfx-and-tools/#2-configure-docfx","text":"Configuration for DocFx is done in a docfx.json file. Store this file in the root of your repository. Note: You can store the docfx.json somewhere else in the hierarchy, but then you need to provide the path of the file as an argument to the docfx command so it can be located. Below is a good configuration to start with, where documentation is in the /docs folder and the sources are in the /src folder: { \"metadata\" : [ { \"src\" : [ { \"files\" : [ \"src/**.csproj\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"dest\" : \"reference\" , \"disableGitFeatures\" : false } ], \"build\" : { \"content\" : [ { \"files\" : [ \"reference/**\" ] }, { \"files\" : [ \"**.md\" , \"**/toc.yml\" ], \"exclude\" : [ \"_site/**\" , \"**/bin/**\" , \"**/obj/**\" , \"**/[Tt]ests/**\" ] } ], \"resource\" : [ { \"files\" : [ \"docs/.attachments/**\" ] }, { \"files\" : [ \"web.config\" ] } ], \"template\" : [ \"templates/cse\" ], \"globalMetadata\" : { \"_appTitle\" : \"CSE Documentation\" , \"_enableSearch\" : true }, \"markdownEngineName\" : \"markdig\" , \"dest\" : \"_site\" , \"xrefService\" : [ \"https://xref.learn.microsoft.com/query?uid={uid}\" ] } }","title":"2. Configure DocFx"},{"location":"documentation/recipes/using-docfx-and-tools/#3-setup-some-basic-documents","text":"We suggest starting with a basic documentation structure in the /docs folder. In the provided QuickStart folder we have a basic setup: \u251c\u2500\u2500 docs \u2502 \u251c\u2500\u2500 .attachments // All images and other attachments used by documents \u2502 \u2502 \u251c\u2500\u2500 architecture-decisions \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 decision-log.md // Sample index into all ADRs \u2502 \u2514\u2500\u2500 README.md // Landing page architecture decisions \u2502 \u2502 \u251c\u2500\u2500 getting-started \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // This recipe document. 
Replace the content with something meaningful to the project \u2502 \u2502 \u251c\u2500\u2500 guidelines \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 docs-guidelines.md // General documentation guidelines \u2502 \u2514\u2500\u2500 README.md // Landing page guidelines \u2502 \u2502 \u251c\u2500\u2500 templates // all templates like ADR template and such \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page templates \u2502 \u2502 \u251c\u2500\u2500 working-agreements \u2502 \u2514\u2500\u2500 .order \u2502 \u2514\u2500\u2500 README.md // Landing page working agreements \u2502 \u2502 \u251c\u2500\u2500 .order // Providing a fixed order of files and directories \u2502 \u251c\u2500\u2500 index.md // Landing page documentation You can use templates like working agreements and such from the ISE Playbook . To have a proper landing page of your documentation website, you can use a markdown file called INDEX.MD in the root of your repository. Contents can be something like this: # ISE Documentation This is the landing page of the ISE Documentation website. This is the page to introduce everything on the website. You can add specific links that are important to provide direct access. > Try not to duplicate the links on the top of the page, unless it really makes sense. To get started with the setup of this website, read the getting started document with the title [ Using DocFx and Companion Tools ]( using-docfx-and-tools.md ).","title":"3. Setup Some Basic Documents"},{"location":"documentation/recipes/using-docfx-and-tools/#4-compile-the-companion-tools-and-run-them","text":"Note: To explain each step, we'll be going through the various steps in the next few paragraphs. In the provided sample, a batch-file called GenerateDocWebsite.cmd is included. This script will take all the necessary steps to compile the tools, execute the checks, generate the table of contents and execute docfx to generate the website. To check for proper markdown formatting the markdownlint-cli tool is used. The command takes it's configuration from the .markdownlint.json file in the root of the project. To check all markdown files, simply execute this command: markdownlint **/*.md In the QuickStart folder you should have copied in the two companion tools TocDocFxCreation and DocLinkChecker as described in the introduction of this article. You can compile the tools from Visual Studio, but you can also run dotnet build in both tool folders. The DocLinkChecker companion tool is used to validate what's in the docs folder. It validates links between documents and attachments in the docs folder and checks if there aren't orphaned attachments. An example of executing this tool, where the check of attachments is included: DocLinkChecker.exe -d ./docs -a The TocDocFxCreation tool is needed to generate a table of contents for your documentation, so users can navigate between folders and documents. If you have compiled the tool, use this command to generate a table of content file toc.yml . To generate a table of contents with the use of the .order files for determining the sequence of articles and to automatically generate index.md documents if no default document is available in a folder, this command can be used: TocDocFxCreation.exe -d ./docs -sri","title":"4. Compile the Companion Tools and Run Them"},{"location":"documentation/recipes/using-docfx-and-tools/#5-run-docfx-to-generate-the-website","text":"Run the docfx command to generate the website, by default in the _site folder. 
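Putting the commands from the previous steps together gives a rough sketch of what a script such as GenerateDocWebsite.cmd automates. The exact contents of the script in the QuickStart folder may differ; the relative tool paths below are assumptions based on the QuickStart layout.

```shell
# Check all markdown files against the rules in .markdownlint.json
markdownlint **/*.md

# Build the companion tools (paths assumed from the QuickStart layout)
dotnet build ./DocLinkChecker
dotnet build ./TocDocFxCreation

# Validate links and attachments, then generate the table of contents from the .order files
DocLinkChecker.exe -d ./docs -a
TocDocFxCreation.exe -d ./docs -sri

# Generate the website into the default _site folder
docfx docfx.json
```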
TIP: If you want to check the website in your local environment, provide the --serve option to either the docfx command or the GenerateDocWebsite script. A small webserver is launched that hosts your website, which is accessible on localhost.","title":"5. Run DocFx to Generate the Website"},{"location":"documentation/recipes/using-docfx-and-tools/#style-of-the-website","text":"If you started with the QuickStart folder, the website is generated using a custom theme using material design and the Microsoft logo. You can change this to your likings. For more information see How-to: Create A Custom Template | DocFX website (dotnet.github.io) .","title":"Style of the Website"},{"location":"documentation/recipes/using-docfx-and-tools/#deploy-to-an-azure-website","text":"After you completed the steps, you should have a default website generated in the _site folder. But of course, you want this to be accessible for everyone. So, the next step is to create for instance an Azure Website and have a process to automatically generate and deploy the contents to that website. That process is described in the recipe Deploy the DocFx Documentation website to an Azure Website automatically .","title":"Deploy to an Azure Website"},{"location":"documentation/recipes/using-docfx-and-tools/#resources","text":"DocFX - static documentation generator Deploy the DocFx Documentation website to an Azure Website automatically Providing quality documentation in your project with DocFx and Companion Tools Monorepo For Beginners","title":"Resources"},{"location":"documentation/tools/automation/","text":"How to Automate Simple Checks If you want to automate some checks on your Markdown documents, there are several tools that you could leverage. For example: Code Analysis / Linting markdownlint to verify Markdown syntax and enforce rules that make the text more readable. markdown-link-check to extract links from markdown texts and check whether each link is alive (200 OK) or dead. write-good to check English prose. Docker image for node-markdown-spellcheck , a lightweight docker image to spellcheck markdown files. static code analysis VS Code Extensions Write Good Linter to get grammar and language advice while editing a document. markdownlint to examine Markdown documents and get warnings for rule violations while editing. Automation pre-commit to use Git hook scripts to identify simple issues before submitting our code or documentation for review. Check Build validation to automate linting for PRs. Check CI Pipeline for better documentation for a sample pipeline with markdownlint , markdown-link-check and write-good . Sample output: On Linting Rules The team needs to be clear what linting rules are required and shouldn't be overridden with tooling or comments. The team should have consensus on when to override tooling rules.","title":"How to Automate Simple Checks"},{"location":"documentation/tools/automation/#how-to-automate-simple-checks","text":"If you want to automate some checks on your Markdown documents, there are several tools that you could leverage. For example: Code Analysis / Linting markdownlint to verify Markdown syntax and enforce rules that make the text more readable. markdown-link-check to extract links from markdown texts and check whether each link is alive (200 OK) or dead. write-good to check English prose. Docker image for node-markdown-spellcheck , a lightweight docker image to spellcheck markdown files. 
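If you want to try the command-line checks listed above locally, the sketch below shows one way to install and run them with npm. The package names are assumptions based on each project's documentation, so verify them against the linked repositories before relying on this.

```shell
# Install the checkers globally (package names assumed from the projects' docs)
npm install -g markdownlint-cli markdown-link-check write-good

# Lint markdown syntax, check for dead links, and review the prose of a document
markdownlint **/*.md
markdown-link-check docs/README.md
write-good docs/README.md
```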
static code analysis VS Code Extensions Write Good Linter to get grammar and language advice while editing a document. markdownlint to examine Markdown documents and get warnings for rule violations while editing. Automation pre-commit to use Git hook scripts to identify simple issues before submitting our code or documentation for review. Check Build validation to automate linting for PRs. Check CI Pipeline for better documentation for a sample pipeline with markdownlint , markdown-link-check and write-good . Sample output:","title":"How to Automate Simple Checks"},{"location":"documentation/tools/automation/#on-linting-rules","text":"The team needs to be clear what linting rules are required and shouldn't be overridden with tooling or comments. The team should have consensus on when to override tooling rules.","title":"On Linting Rules"},{"location":"documentation/tools/integrations/","text":"Integration with Teams/Slack Monitor your Azure repositories and receive notifications in your channel whenever code is pushed/checked in and whenever a pull request (PR) is created, updated, or a merge is attempted. Azure Repos with Microsoft Teams Azure Repos with Slack","title":"Integration with Teams/Slack"},{"location":"documentation/tools/integrations/#integration-with-teamsslack","text":"Monitor your Azure repositories and receive notifications in your channel whenever code is pushed/checked in and whenever a pull request (PR) is created, updated, or a merge is attempted. Azure Repos with Microsoft Teams Azure Repos with Slack","title":"Integration with Teams/Slack"},{"location":"documentation/tools/languages/","text":"Languages Markdown Markdown is one of the most popular markup languages to add rich formatting, tables and images to your documentation using plain text documents. Markdown files (.md) can be source-controlled along with your code. More information: Getting Started Cheat Sheet Basic Syntax Extended Syntax Wiki Markdown Syntax Tools: Markdown and Visual Studio Code How to automate simple checks Mermaid Mermaid lets you create diagrams using text definitions that can later be rendered with a diagramming and charting tool. Mermaid files (.mmd) can be source-controlled along with your code. It's also recommended to include image files (.png) with the rendered diagrams under source control. Your markdown files should link the image files, so they can be read without the need of a Mermaid rendering tool (e.g., during Pull Request review). Example Mermaid Diagram This is an example of a Mermaid flowchart diagram written as code. graph LR A[Diagram Idea] -->|Write mermaid code| B(mermaid.mmd file) B -->|Add to source control| C{Code repo} B -->|Export as .png| G(.png file of diagram) G -->|Add to source control| C This is an example of how it can be rendered as an image. More information: About Mermaid Diagram syntax Tools: Mermaid Live Editor Markdown Preview Mermaid Support for Visual Studio Code","title":"Languages"},{"location":"documentation/tools/languages/#languages","text":"","title":"Languages"},{"location":"documentation/tools/languages/#markdown","text":"Markdown is one of the most popular markup languages to add rich formatting, tables and images to your documentation using plain text documents. Markdown files (.md) can be source-controlled along with your code. 
More information: Getting Started Cheat Sheet Basic Syntax Extended Syntax Wiki Markdown Syntax Tools: Markdown and Visual Studio Code How to automate simple checks","title":"Markdown"},{"location":"documentation/tools/languages/#mermaid","text":"Mermaid lets you create diagrams using text definitions that can later be rendered with a diagramming and charting tool. Mermaid files (.mmd) can be source-controlled along with your code. It's also recommended to include image files (.png) with the rendered diagrams under source control. Your markdown files should link the image files, so they can be read without the need of a Mermaid rendering tool (e.g., during Pull Request review).","title":"Mermaid"},{"location":"documentation/tools/languages/#example-mermaid-diagram","text":"This is an example of a Mermaid flowchart diagram written as code. graph LR A[Diagram Idea] -->|Write mermaid code| B(mermaid.mmd file) B -->|Add to source control| C{Code repo} B -->|Export as .png| G(.png file of diagram) G -->|Add to source control| C This is an example of how it can be rendered as an image. More information: About Mermaid Diagram syntax Tools: Mermaid Live Editor Markdown Preview Mermaid Support for Visual Studio Code","title":"Example Mermaid Diagram"},{"location":"documentation/tools/wikis/","text":"Wikis Use a team project wiki to share information with other team members. When you provision a wiki from scratch, a new Git repository stores your Markdown files, images, attachments, and sequence of pages. This wiki supports collaborative editing of its content and structure. In Azure DevOps, you have the following options for maintaining wiki content : Provision a wiki for your team project. This option supports only one wiki for the team project. Publish Markdown files defined in a Git repository to a wiki. With this option, you can maintain several versioned wikis to support your content needs. More information: About Wikis, READMEs, and Markdown . Provisioned wikis vs. published code as a wiki . Create a Wiki for your project . Manage wikis . Wikis vs. Digital Notebooks (e.g., OneNote) When you work on a project, you may decide to document relevant details or record important decisions about the project in a digital notebook. Tools like OneNote allows you to easily organize, navigate and search your notes. You can provide type, highlighting, or ink annotations to your notes. These notes can easily be shared and created together with others. Still, Wikis greatly facilitate the process of establishing and managing documentation by allowing us to source control the documentation.","title":"Wikis"},{"location":"documentation/tools/wikis/#wikis","text":"Use a team project wiki to share information with other team members. When you provision a wiki from scratch, a new Git repository stores your Markdown files, images, attachments, and sequence of pages. This wiki supports collaborative editing of its content and structure. In Azure DevOps, you have the following options for maintaining wiki content : Provision a wiki for your team project. This option supports only one wiki for the team project. Publish Markdown files defined in a Git repository to a wiki. With this option, you can maintain several versioned wikis to support your content needs. More information: About Wikis, READMEs, and Markdown . Provisioned wikis vs. published code as a wiki . Create a Wiki for your project . 
Manage wikis .","title":"Wikis"},{"location":"documentation/tools/wikis/#wikis-vs-digital-notebooks-eg-onenote","text":"When you work on a project, you may decide to document relevant details or record important decisions about the project in a digital notebook. Tools like OneNote allow you to easily organize, navigate and search your notes. You can provide type, highlighting, or ink annotations to your notes. These notes can easily be shared and created together with others. Still, Wikis greatly facilitate the process of establishing and managing documentation by allowing us to source control the documentation.","title":"Wikis vs. Digital Notebooks (e.g., OneNote)"},{"location":"engineering-feedback/","text":"Microsoft Engineering Feedback Why is it Important to Submit Microsoft Engineering Feedback Engineering Feedback captures the \"voice of the customer\" and is an important mechanism to provide actionable insights and help Microsoft product groups continuously improve the platform and cloud services to enable all customers to be as productive as possible. Please note that Engineering Feedback is an asynchronous (i.e. not real-time) method to capture and aggregate friction points across multiple customers and code-with engagements. Therefore, if you need to report a service outage, or an immediately-blocking bug, you should file an official Azure support ticket and, if possible, reference the ticket id in the feedback that you submit later. Even if the feedback has already been raised directly with a product group or through online channels like GitHub or Stack Overflow, it is still important to raise it via Microsoft Engineering feedback, so it can be consolidated with other customer projects that have the same feedback to help with prioritization. When to Submit Engineering Feedback Capturing and providing high-quality actionable Engineering Feedback is an integral ongoing part of all code-with engagements. It is recommended to submit feedback on an ongoing basis instead of batching it up for submission at the end of the engagement. You should jot down the details of the feedback close to the time when you encounter the specific blockers, challenges, and friction since that is when it is freshest in your mind.
They want to use Azure Functions to process as many messages per second as possible with minimum latency and in a given order. Adding Specifics Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. Batch receiving is not supported in Azure Functions Service Bus Trigger. Making it Actionable Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in the Azure Functions Service Bus Trigger. The impact and workaround was choosing containers over Functions. The desired outcome is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. For real-world examples please follow Feedback Examples . How to Submit Engineering Feedback Please follow the Engineering Feedback Guidance to ensure that you provide feedback that can be triaged and processed most efficiently. Please review the Frequently Asked Questions page for additional information on the engineering feedback process.","title":"Microsoft Engineering Feedback"},{"location":"engineering-feedback/#microsoft-engineering-feedback","text":"","title":"Microsoft Engineering Feedback"},{"location":"engineering-feedback/#why-is-it-important-to-submit-microsoft-engineering-feedback","text":"Engineering Feedback captures the \"voice of the customer\" and is an important mechanism to provide actionable insights and help Microsoft product groups continuously improve the platform and cloud services to enable all customers to be as productive as possible. Please note that Engineering Feedback is an asynchronous (i.e. not real-time) method to capture and aggregate friction points across multiple customers and code-with engagements. Therefore, if you need to report a service outage, or an immediately-blocking bug, you should file an official Azure support ticket and, if possible, reference the ticket id in the feedback that you submit later. Even if the feedback has already been raised directly with a product group or on through online channels like GitHub or Stack Overflow, it is still important to raise it via Microsoft Engineering feedback, so it can be consolidated with other customer projects that have the same feedback to help with prioritization.","title":"Why is it Important to Submit Microsoft Engineering Feedback"},{"location":"engineering-feedback/#when-to-submit-engineering-feedback","text":"Capturing and providing high-quality actionable Engineering Feedback is an integral ongoing part of all code-with engagements. It is recommended to submit feedback on an ongoing basis instead of batching it up for submission at the end of the engagement. You should jot down the details of the feedback close to the time when you encounter the specific blockers, challenges, and friction since that is when it is freshest in your mind. 
The project team can then decide how to prioritize and when to submit the feedback into the official CSE Feedback system (accessible to ISE team members) during each sprint.","title":"When to Submit Engineering Feedback"},{"location":"engineering-feedback/#what-is-good-and-high-quality-engineering-feedback","text":"Good engineering feedback provides enough information for those who are not part of the code-with engagement to understand the customer pain, the associated product issues, the impact and priority of these issues, and any potential workarounds that exist to minimize that impact.","title":"What is Good and High-quality Engineering Feedback"},{"location":"engineering-feedback/#high-quality-engineering-feedback-is","text":"Goal Oriented - states what the customer is trying to accomplish Specific - details the scenario, observation, or challenge faced by the customer Actionable - includes the necessary clarifying information to enable a decision","title":"High-Quality Engineering Feedback is"},{"location":"engineering-feedback/#examples-of-good-engineering-feedback","text":"For example, here is an evolution of transforming a fictitious feedback with the above high-quality engineering feedback guidance in mind: Stage Feedback Evolution Initial feedback Azure Functions Service Bus Trigger is slow for in-order scenarios Making it Goal Oriented Customer requests batch receiving for Azure Functions Service Bus trigger with sessions enabled to better support higher throughput messaging. They want to use Azure Functions to process as many messages per second as possible with minimum latency and in a given order. Adding Specifics Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. Batch receiving is not supported in Azure Functions Service Bus Trigger. Making it Actionable Customer scenario was to receive a total of 250 messages/second from 50 producers with requirement for ordering per producer & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in the Azure Functions Service Bus Trigger. The impact and workaround was choosing containers over Functions. The desired outcome is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. For real-world examples please follow Feedback Examples .","title":"Examples of Good Engineering Feedback"},{"location":"engineering-feedback/#how-to-submit-engineering-feedback","text":"Please follow the Engineering Feedback Guidance to ensure that you provide feedback that can be triaged and processed most efficiently. Please review the Frequently Asked Questions page for additional information on the engineering feedback process.","title":"How to Submit Engineering Feedback"},{"location":"engineering-feedback/feedback-examples/","text":"Engineering Feedback Examples The following are real-world examples of Engineering Feedback that have led to product improvements and unblocked customers. Windows Server Container Support for Azure Kubernetes Service The Azure Kubernetes Service should have first class Windows container support so solutions that require Windows workloads can be deployed on a wildly popular container orchestration platform. 
The need was to be able to deploy Windows Server containers on AKS, the managed Azure Kubernetes Service. According to this FAQ (and in parallel confirmation) it is not available yet. We tried to deploy anyway as a test, and it did not work \u2013 the deployment would be pending without success. More than a dozen large partners/customers are blocked in deploying Windows workloads to AKS due to a lack of support for Windows Server containers. They need this feature so solutions requiring Windows workloads can be deployed to this popular container orchestration platform. We are seeing an emergence of companies beginning to try Windows containers as an option to move their Windows workloads to the cloud.\u202f Gartner is claiming that 80% of enterprise apps run on Windows. Containers have become the de facto deployment mechanism in the industry, and deployment consistency and speed are a few of the important factors companies are looking for. Enabling Windows applications and ensuring that developers have a good experience when moving their workloads to Azure via Windows containers is key to keeping existing Windows customers within the Azure ecosystem and driving Azure adoption for new workloads. We are also seeing increased interest, particularly among enterprise customers, in using a single orchestrator control plane for managing both Linux and Windows workloads. This feedback was created as a high priority feedback and followed up internally until addressed. Here is the announcement . Support Batch Receiving with Sessions in Azure Functions Service Bus Trigger Customer scenario was to receive a total of 250 messages per second from 50 producers with requirement for ordering & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in Azure Functions Service Bus Trigger. The impact (and workaround) was choosing containers over Functions. The Acceptance Criteria is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed. Stream Analytics - No Support for Zero-Downtime Scale-Down In order to update the Streaming Unit number in Stream Analytics you need to stop the service and wait for minutes for it to restart. This is unacceptable to customers who need near real-time analysis. In order to have a job re-started, up to 2 minutes are needed and this is not acceptable for a real-time streaming solution. It would also be optimal if scale-up and scale-down could be done automatically, by setting threshold values that when reached increase or decrease automatically the amount of RU available. This feedback is for customers' request for zero-downtime scale-down capability in stream analytics. Problem Statement: In order to update the \"Streaming Unit\" number, partners must stop the service and wait until it restarts. The partner needs to be able to update the number without stopping the service. Desired Experience: Partners should be able to update the Streaming Unit number without stopping the associated service. This feedback was created as a high priority feedback and followed up until addressed in December 2019.
Python Support for Azure Functions Several customers already use Python as part of their workflow, and would like to be able to use Python for Azure Functions. This is especially true since many of them already have scripts running on other clouds and services. In addition, Python support has been in Preview for a very long time, and it's missing a lot of functionality. This feature request is one of the most asked, and has a huge upside potential to pull through Machine Learning (ML) based workloads. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed. Here is the announcement .","title":"Engineering Feedback Examples"},{"location":"engineering-feedback/feedback-examples/#engineering-feedback-examples","text":"The following are real-world examples of Engineering Feedback that have led to product improvements and unblocked customers.","title":"Engineering Feedback Examples"},{"location":"engineering-feedback/feedback-examples/#windows-server-container-support-for-azure-kubernetes-service","text":"The Azure Kubernetes Service should have first class Windows container support so solutions that require Windows workloads can be deployed on a wildly popular container orchestration platform. The need was to be able to deploy Windows Server containers on AKS, the managed Azure Kubernetes Service. According to this FAQ (and in parallel confirmation) it is not available yet. We tried to deploy anyway as a test, and it did not work \u2013 the deployment would be pending without success. More than a dozen large partners/customers are blocked in deploying Windows workloads to AKS due to a lack of support for Windows Server containers. They need this feature so solutions requiring Windows workloads can be deployed to this popular container orchestration platform. We are seeing an emergence of companies beginning to try Windows containers as an option to move their Windows workloads to the cloud.\u202f Gartner is claiming that 80% of enterprise apps run on Windows. Containers have become the de facto deployment mechanism in the industry, and deployment consistency and speed are a few of the important factors companies are looking for. Enabling Windows applications and ensuring that developers have a good experience when moving their workloads to Azure via Windows containers is key to keeping existing Windows customers within the Azure ecosystem and driving Azure adoption for new workloads. We are also seeing increased interest, particularly among enterprise customers, in using a single orchestrator control plane for managing both Linux and Windows workloads. This feedback was created as a high priority feedback and followed up internally until addressed. Here is the announcement .","title":"Windows Server Container Support for Azure Kubernetes Service"},{"location":"engineering-feedback/feedback-examples/#support-batch-receiving-with-sessions-in-azure-functions-service-bus-trigger","text":"Customer scenario was to receive a total of 250 messages per second from 50 producers with requirement for ordering & minimum latency, using a Service Bus topic with sessions enabled for ordering. According to Microsoft documentation , batch receiving is recommended for better performance but this is not currently supported in Azure Functions Service Bus Trigger. The impact (and workaround) was choosing containers over Functions.
The Acceptance Criteria is for Azure Functions to support Service Bus sessions with batch and non-batch processing for all Azure Functions GA languages. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed.","title":"Support Batch Receiving with Sessions in Azure Functions Service Bus Trigger"},{"location":"engineering-feedback/feedback-examples/#stream-analytics-no-support-for-zero-downtime-scale-down","text":"In order to update the Streaming Unit number in Stream Analytics you need to stop the service and wait for minutes for it to restart. This unacceptable by customers who need near real-time analysis\u200b. In order to have a job re-started, up to 2 minutes are needed and this is not acceptable for a real-time streaming solution. It would also be optimal if scale-up and scale-down could be done automatically, by setting threshold values that when reached increase or decrease automatically the amount of RU available. This feedback is for customers' request for zero down-time scale-down capability in stream analytics. Problem Statement: In order to update the \"Streaming Unit\" number, partners must stop the service and wait until it restarts. The partner needs to be able to update the number without stopping the service. Desired Experience: Partners should be able to update the Streaming Unit number without stopping the associated service. This feedback was created as a high priority feedback and followed up until addressed in December 2019.","title":"Stream Analytics - No Support for Zero-Downtime Scale-Down"},{"location":"engineering-feedback/feedback-examples/#python-support-for-azure-functions","text":"Several customers already use Python as part of their workflow, and would like to be able to use Python for Azure Functions. This is specially true since many of them are already have scripts running on other clouds and services. In addition, Python support has been in Preview for a very long time, and it's missing a lot of functionality. This feature request is one of the most asked, and a huge upside potential to pull through Machine Learning (ML) based workloads. This feedback was created as a feedback with the Azure Functions product group and also followed up internally until addressed. Here is the announcement .","title":"Python Support for Azure Functions"},{"location":"engineering-feedback/feedback-faq/","text":"Engineering Feedback Frequently Asked Questions (F.A.Q.) The questions below are common questions related to the feedback process. The answers are intended to help both Microsoft employees and customers. When Should I Submit Feedback vs. Creating an Issue on GitHub, UserVoice, or Sending an Email Directly to a Microsoft Employee? It is appropriate to do both. As a customer or Microsoft employee, you are empowered to create an issue or submit feedback via the medium appropriate for service. In addition to an issue on GitHub, feedback on UserVoice, or a personal email, Microsoft employees in CSE should submit feedback via CSE Feedback. In doing so, please reference the GitHub issue, UserVoice feedback, or email by including a link to the item or attaching the email. Submitting to ISE Feedback allows the ISE Feedback team to coalesce feedback across a wide range of sources, and thus create a unified case to submit to the appropriate Azure engineering team(s). How can a Customer Track the Status of a Specific Feedback Item? 
At this time, customers are not able to directly track the status of feedback submitted via ISE Feedback. The ISE Feedback process is internal to Microsoft, and as such, available only to Microsoft employees. Customers may request an update from their ISE engineering partner or Microsoft account representative(s). Customers can also submit their feedback directly via GitHub or UserVoice (as appropriate for the specific service), and inform their ISE engineering partner. The ISE engineer should submit the feedback via the ISE Feedback process, and in doing so reference the previously created issue. Customers can follow the GitHub or UserVoice item to be alerted on updates. How can a Microsoft Employee Track the Status of a Specific Feedback Item? The easiest way for a Microsoft employee within ISE to track a specific feedback item is to follow the feedback (a work item) in Azure DevOps. As a Microsoft Employee Within ISE, if I Submit a Feedback and Move to Another Dev Crew Engagement, how Would my Customer get an Update on that Feedback? If the feedback is also submitted via GitHub or UserVoice, the customer may elect to follow that item for publicly available updates. The customer may also contact their Microsoft account representative to request an update. As a Microsoft Employee Within ISE, what Should I Expect/Do After Submitting Feedback via ISE Feedback? After submitting the feedback, it is recommended to follow the feedback (a work item) in Azure DevOps. If you have configured Azure DevOps notifications to send an email on work item updates, you will receive an email when the feedback is updated. If more information about the feedback is needed, a member of the ISE Feedback team will contact you to gather more information. How/When are Feedback Aggregated? Members of the ISE Feedback team will make a best effort to triage and review new ISE Feedback items within two weeks of the original submission date. If there is similarity across multiple feedback items, a member of the ISE Feedback team may decide to create a new feedback item which is an aggregate of similar items. This is done to aid in the creation of a unified feedback item to present to the appropriate Microsoft engineering team. On a monthly basis, the ISE Feedback team will review all feedback and generate a report consisting of the highest priority feedback. The report is presented to appropriate ISE and Microsoft leadership teams.","title":"Engineering Feedback Frequently Asked Questions (F.A.Q.)"},{"location":"engineering-feedback/feedback-faq/#engineering-feedback-frequently-asked-questions-faq","text":"The questions below are common questions related to the feedback process. The answers are intended to help both Microsoft employees and customers.","title":"Engineering Feedback Frequently Asked Questions (F.A.Q.)"},{"location":"engineering-feedback/feedback-faq/#when-should-i-submit-feedback-vs-creating-an-issue-on-github-uservoice-or-sending-an-email-directly-to-a-microsoft-employee","text":"It is appropriate to do both. As a customer or Microsoft employee, you are empowered to create an issue or submit feedback via the medium appropriate for service. In addition to an issue on GitHub, feedback on UserVoice, or a personal email, Microsoft employees in CSE should submit feedback via CSE Feedback. In doing so, please reference the GitHub issue, UserVoice feedback, or email by including a link to the item or attaching the email. 
Submitting to ISE Feedback allows the ISE Feedback team to coalesce feedback across a wide range of sources, and thus create a unified case to submit to the appropriate Azure engineering team(s).","title":"When Should I Submit Feedback vs. Creating an Issue on GitHub, UserVoice, or Sending an Email Directly to a Microsoft Employee?"},{"location":"engineering-feedback/feedback-faq/#how-can-a-customer-track-the-status-of-a-specific-feedback-item","text":"At this time, customers are not able to directly track the status of feedback submitted via ISE Feedback. The ISE Feedback process is internal to Microsoft, and as such, available only to Microsoft employees. Customers may request an update from their ISE engineering partner or Microsoft account representative(s). Customers can also submit their feedback directly via GitHub or UserVoice (as appropriate for the specific service), and inform their ISE engineering partner. The ISE engineer should submit the feedback via the ISE Feedback process, and in doing so reference the previously created issue. Customers can follow the GitHub or UserVoice item to be alerted on updates.","title":"How can a Customer Track the Status of a Specific Feedback Item?"},{"location":"engineering-feedback/feedback-faq/#how-can-a-microsoft-employee-track-the-status-of-a-specific-feedback-item","text":"The easiest way for a Microsoft employee within ISE to track a specific feedback item is to follow the feedback (a work item) in Azure DevOps.","title":"How can a Microsoft Employee Track the Status of a Specific Feedback Item?"},{"location":"engineering-feedback/feedback-faq/#as-a-microsoft-employee-within-ise-if-i-submit-a-feedback-and-move-to-another-dev-crew-engagement-how-would-my-customer-get-an-update-on-that-feedback","text":"If the feedback is also submitted via GitHub or UserVoice, the customer may elect to follow that item for publicly available updates. The customer may also contact their Microsoft account representative to request an update.","title":"As a Microsoft Employee Within ISE, if I Submit a Feedback and Move to Another Dev Crew Engagement, how Would my Customer get an Update on that Feedback?"},{"location":"engineering-feedback/feedback-faq/#as-a-microsoft-employee-within-ise-what-should-i-expectdo-after-submitting-feedback-via-ise-feedback","text":"After submitting the feedback, it is recommended to follow the feedback (a work item) in Azure DevOps. If you have configured Azure DevOps notifications to send an email on work item updates, you will receive an email when the feedback is updated. If more information about the feedback is needed, a member of the ISE Feedback team will contact you to gather more information.","title":"As a Microsoft Employee Within ISE, what Should I Expect/Do After Submitting Feedback via ISE Feedback?"},{"location":"engineering-feedback/feedback-faq/#howwhen-are-feedback-aggregated","text":"Members of the ISE Feedback team will make a best effort to triage and review new ISE Feedback items within two weeks of the original submission date. If there is similarity across multiple feedback items, a member of the ISE Feedback team may decide to create a new feedback item which is an aggregate of similar items. This is done to aid in the creation of a unified feedback item to present to the appropriate Microsoft engineering team. On a monthly basis, the ISE Feedback team will review all feedback and generate a report consisting of the highest priority feedback. 
The report is presented to appropriate ISE and Microsoft leadership teams.","title":"How/When are Feedback Aggregated?"},{"location":"engineering-feedback/feedback-guidance/","text":"Engineering Feedback Guidance The following guidance provides a minimum set of details that will result in actionable engineering feedback. Ensure that you provide as much detail for each of the following sections as relevant and possible. Title Provide a meaningful and descriptive title. There is no need to include the Azure service in the title as this will be included as part of the Categorization section. Good examples: Supported X versions not documented Require all-in-one Y story Summary Summarize the feedback in a short paragraph. Categorization Azure Service Which Azure service does this feedback item refer to? If there are multiple Azure services involved, pick the primary service and include the details of the others in the Notes section. Type Select one of the following to describe what type of feedback is being provided: Business Blocker (e.g. No SLA on X, Service Y not GA, Service A not in Region B) Technical Blocker (e.g. Accelerated networking not available on Service X) Documentation (e.g. Instructions for configuring scenario X missing) Feature Request (e.g. Enable simple integration to X on Service Y) Stage Select one of the following to describe the lifecycle stage of the engagement that has generated this feedback: Production Staging Testing Development Impact Describe the impact to the customer and engagement that this feedback implies. Time Frame Provide a time frame that this feedback item needs to be resolved within (if relevant). Priority Please provide the customer perspective priority of the feedback. Feedback is prioritized at one of the following four levels: P0 - Impact is critical and large : Needs to be addressed immediately; impact is critical and large in scope (i.e. major service outage; makes service or functions unusable/unavailable to a high portion of addressable space; no known workaround). P1 - Impact is high and significant : Needs to be addressed quickly; impacts a large percentage of addressable space and impedes progress. A partial workaround exists or is overly painful. P2 - Impact is moderate and varies in scope : Needs to be addressed in a reasonable time frame (i.e. issues that are impeding adoption and usage with no reasonable workarounds). For example, feedback may be related to a feature-level issue to solve for friction. P3 - Impact is low : Issue can be addressed when able or eventually (i.e. relevant to core addressable space but issue does not impede progress or has reasonable workaround). For example, feedback may be related to feature ideas or opportunities. Reproduction Steps The reproduction steps are important since they help confirm and replay the issue, and are essential in demonstrating success once there is a resolution. Pre-requisites Provide a clear set of all conditions and pre-requisites required before following the set of reproduction steps. These could include: Platform (e.g. AKS 1.16.4 cluster with Azure CNI, Ubuntu 19.04 VM) Services (e.g. Azure Key Vault, Azure Monitor) Networking (e.g. VNET with subnet) Steps Provide a clear set of repeatable steps that will allow for this feedback to be reproduced. This can take the form of: Scripts (e.g. bash, PowerShell, terraform, arm template) Command line instructions (e.g. az, helm, terraform) Screen shots (e.g.
azure portal screens) Notes Include items like architecture diagrams, screenshots, logs, traces etc which can help with understanding your notes and the feedback item. Also include details about the scenario customer/partner verbatim as much as possible in the main content. What Didn't Work Describe what didn't work or what feature gap you identified. What was Your Expectation or the Desired Outcome Describe what you expected to happen. What was the outcome that was expected? Describe the Steps you Took Provide a clear description of the steps taken and the outcome/description at each point.","title":"Engineering Feedback Guidance"},{"location":"engineering-feedback/feedback-guidance/#engineering-feedback-guidance","text":"The following guidance provides a minimum set of details that will result in actionable engineering feedback. Ensure that you provide as much detail for each of the following sections as relevant and possible.","title":"Engineering Feedback Guidance"},{"location":"engineering-feedback/feedback-guidance/#title","text":"Provide a meaningful and descriptive title. There is no need to include the Azure service in the title as this will be included as part of the Categorization section. Good examples: Supported X versions not documented Require all-in-one Y story","title":"Title"},{"location":"engineering-feedback/feedback-guidance/#summary","text":"Summarize the feedback in a short paragraph.","title":"Summary"},{"location":"engineering-feedback/feedback-guidance/#categorization","text":"","title":"Categorization"},{"location":"engineering-feedback/feedback-guidance/#azure-service","text":"Which Azure service does this feedback item refer to? If there are multiple Azure services involved, pick the primary service and include the details of the others in the Notes section.","title":"Azure Service"},{"location":"engineering-feedback/feedback-guidance/#type","text":"Select one of the following to describe what type of feedback is being provided: Business Blocker (e.g. No SLA on X, Service Y not GA, Service A not in Region B) Technical Blocker (e.g. Accelerated networking not available on Service X) Documentation (e.g. Instructions for configuring scenario X missing) Feature Request (e.g. Enable simple integration to X on Service Y)","title":"Type"},{"location":"engineering-feedback/feedback-guidance/#stage","text":"Select one of the following to describe the lifecycle stage of the engagement that has generated this feedback: Production Staging Testing Development","title":"Stage"},{"location":"engineering-feedback/feedback-guidance/#impact","text":"Describe the impact to the customer and engagement that this feedback implies.","title":"Impact"},{"location":"engineering-feedback/feedback-guidance/#time-frame","text":"Provide a time frame that this feedback item needs to be resolved within (if relevant).","title":"Time Frame"},{"location":"engineering-feedback/feedback-guidance/#priority","text":"Please provide the customer perspective priority of the feedback. Feedback is prioritized at one of the following four levels: P0 - Impact is critical and large : Needs to be addressed immediately; impact is critical and large in scope (i.e. major service outage; makes service or functions unusable/unavailable to a high portion of addressable space; no known workaround). P1 - Impact is high and significant : Needs to be addressed quickly; impacts a large percentage of addressable space and impedes progress. A partial workaround exists or is overly painful. 
P2 - Impact is moderate and varies in scope : Needs to be addressed in a reasonable time frame (i.e. issues that are impeding adoption and usage with no reasonable workarounds). For example, feedback may be related to feature-level issue to solve for friction. P3 - Impact is low : Issue can be address when able or eventually (i.e. relevant to core addressable space but issue does not impede progress or has reasonable workaround). For example, feedback may be related to feature ideas or opportunities.","title":"Priority"},{"location":"engineering-feedback/feedback-guidance/#reproduction-steps","text":"The reproduction steps are important since they help confirm and replay the issue, and are essential in demonstrating success once there is a resolution.","title":"Reproduction Steps"},{"location":"engineering-feedback/feedback-guidance/#pre-requisites","text":"Provide a clear set of all conditions and pre-requisites required before following the set of reproduction steps. These could include: Platform (e.g. AKS 1.16.4 cluster with Azure CNI, Ubuntu 19.04 VM) Services (e.g. Azure Key Vault, Azure Monitor) Networking (e.g. VNET with subnet)","title":"Pre-requisites"},{"location":"engineering-feedback/feedback-guidance/#steps","text":"Provide a clear set of repeatable steps that will allow for this feedback to be reproduced. This can take the form of: Scripts (e.g. bash, PowerShell, terraform, arm template) Command line instructions (e.g. az, helm, terraform) Screen shots (e.g. azure portal screens)","title":"Steps"},{"location":"engineering-feedback/feedback-guidance/#notes","text":"Include items like architecture diagrams, screenshots, logs, traces etc which can help with understanding your notes and the feedback item. Also include details about the scenario customer/partner verbatim as much as possible in the main content.","title":"Notes"},{"location":"engineering-feedback/feedback-guidance/#what-didnt-work","text":"Describe what didn't work or what feature gap you identified.","title":"What Didn't Work"},{"location":"engineering-feedback/feedback-guidance/#what-was-your-expectation-or-the-desired-outcome","text":"Describe what you expected to happen. What was the outcome that was expected?","title":"What was Your Expectation or the Desired Outcome"},{"location":"engineering-feedback/feedback-guidance/#describe-the-steps-you-took","text":"Provide a clear description of the steps taken and the outcome/description at each point.","title":"Describe the Steps you Took"},{"location":"machine-learning/","text":"Machine Learning Fundamentals at ISE This guideline documents the Machine Learning (ML) practices in ISE. ISE works with customers on developing ML models and putting them in production, with an emphasis on engineering and research best practices throughout the project's life cycle. Goals Provide a set of ML practices to follow in an ML project. Provide clarity on ML process and how it fits within a software engineering project. Provide best practices for the different stages of an ML project. How to use these Fundamentals If you are starting a new ML project, consider reading through the general guidance documents . For specific aspects of an ML project, refer to the guidelines for different project phases . ML Project Phases The diagram below shows different phases in an ideal ML project. Due to practical constraints and requirements, it might not always be possible to have a project structured in such a manner, however best practices should be followed for each individual phase. 
Envisioning : Initial problem understanding, customer goals and objectives. Feasibility Study : Assess whether the problem in question is feasible to solve satisfactorily using ML with the available data. Model Milestone : There is a basic model that is achieving the minimum required performance, both in terms of ML performance and system performance. Using the knowledge gathered to this milestone, define the scope, objectives, high-level architecture, definition of done and plan for the entire project. Model(s) experimentation : Tools and best practices for conducting successful model experimentation. Model(s) Operationalization : Model readiness for production checklist. General Guidance ML Process Guidance ML Fundamentals checklist Data Exploration Agile ML development Testing Data Science and ML Ops code Profiling Machine Learning and ML Ops code Responsible AI Program Management for ML projects Resources Model Operationalization","title":"Machine Learning Fundamentals at ISE"},{"location":"machine-learning/#machine-learning-fundamentals-at-ise","text":"This guideline documents the Machine Learning (ML) practices in ISE. ISE works with customers on developing ML models and putting them in production, with an emphasis on engineering and research best practices throughout the project's life cycle.","title":"Machine Learning Fundamentals at ISE"},{"location":"machine-learning/#goals","text":"Provide a set of ML practices to follow in an ML project. Provide clarity on ML process and how it fits within a software engineering project. Provide best practices for the different stages of an ML project.","title":"Goals"},{"location":"machine-learning/#how-to-use-these-fundamentals","text":"If you are starting a new ML project, consider reading through the general guidance documents . For specific aspects of an ML project, refer to the guidelines for different project phases .","title":"How to use these Fundamentals"},{"location":"machine-learning/#ml-project-phases","text":"The diagram below shows different phases in an ideal ML project. Due to practical constraints and requirements, it might not always be possible to have a project structured in such a manner, however best practices should be followed for each individual phase. Envisioning : Initial problem understanding, customer goals and objectives. Feasibility Study : Assess whether the problem in question is feasible to solve satisfactorily using ML with the available data. Model Milestone : There is a basic model that is achieving the minimum required performance, both in terms of ML performance and system performance. Using the knowledge gathered to this milestone, define the scope, objectives, high-level architecture, definition of done and plan for the entire project. Model(s) experimentation : Tools and best practices for conducting successful model experimentation. 
Model(s) Operationalization : Model readiness for production checklist.","title":"ML Project Phases"},{"location":"machine-learning/#general-guidance","text":"ML Process Guidance ML Fundamentals checklist Data Exploration Agile ML development Testing Data Science and ML Ops code Profiling Machine Learning and ML Ops code Responsible AI Program Management for ML projects","title":"General Guidance"},{"location":"machine-learning/#resources","text":"Model Operationalization","title":"Resources"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/","text":"Agile Development Considerations for ML Projects Overview When running ML projects, we follow the Agile methodology for software development with some adaptations, as we acknowledge that research and experimentation are sometimes difficult to plan and estimate. Goals Run and manage ML projects effectively Create effective collaboration between the ML team and the other teams working on the project To learn more about how ISE runs the Agile process for software development teams, refer to this doc . Within this framework, the team follows these Agile ceremonies: Backlog management Retrospectives Scrum of Scrums (where applicable) Sprint planning Stand-ups Working agreement Agile Process During Exploration and Experimentation While acknowledging the fact that ML user stories and research spikes are less predictable than software development ones, we strive to have a deliverable for every user story in every sprint. User stories and spikes are usually estimated using T-shirt sizes or similar, and not in actual days/hours. ML design sessions should be included in each sprint. Examples of ML Deliverables for each Sprint Working code (e.g. models, pipelines, exploratory code) Documentation of new hypotheses, and the acceptance or rejection of previous hypotheses as part of a Hypothesis Driven Analysis (HDA). For more information see Hypothesis Driven Development on Barry Oreilly's website Exploratory Data Analysis (EDA) results and learnings documented Collaboration Between Data Scientists and Software Developers Data scientists and software developers work together on the project. The team uses one backlog and attends the same Agile ceremonies. In cases where the project has many participants, we will divide into working groups, but still have the entire team join the Agile ceremonies. If possible, the feasibility study and initial model experimentation take place before the operationalization work kicks off. Everyone shares the accountability for the MLOps solution. The ML model interface (API) is determined as early as possible, to allow the developers to consider its integration into the production pipeline. MLOps artifacts are developed with continuous collaboration and review by the data scientists, to ensure the appropriate approaches for experimentation and productization are used. 
Retrospectives and sprint planning are performed on the entire team level, and not the specific work groups level.","title":"Agile Development Considerations for ML Projects"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#agile-development-considerations-for-ml-projects","text":"","title":"Agile Development Considerations for ML Projects"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#overview","text":"When running ML projects, we follow the Agile methodology for software development with some adaptations, as we acknowledge that research and experimentation are sometimes difficult to plan and estimate.","title":"Overview"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#goals","text":"Run and manage ML projects effectively Create effective collaboration between the ML team and the other teams working on the project To learn more about how ISE runs the Agile process for software development teams, refer to this doc . Within this framework, the team follows these Agile ceremonies: Backlog management Retrospectives Scrum of Scrums (where applicable) Sprint planning Stand-ups Working agreement","title":"Goals"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#agile-process-during-exploration-and-experimentation","text":"While acknowledging the fact that ML user stories and research spikes are less predictable than software development ones, we strive to have a deliverable for every user story in every sprint. User stories and spikes are usually estimated using T-shirt sizes or similar, and not in actual days/hours. ML design sessions should be included in each sprint.","title":"Agile Process During Exploration and Experimentation"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#examples-of-ml-deliverables-for-each-sprint","text":"Working code (e.g. models, pipelines, exploratory code) Documentation of new hypotheses, and the acceptance or rejection of previous hypotheses as part of a Hypothesis Driven Analysis (HDA). For more information see Hypothesis Driven Development on Barry Oreilly's website Exploratory Data Analysis (EDA) results and learnings documented","title":"Examples of ML Deliverables for each Sprint"},{"location":"machine-learning/agile-development-considerations-for-ml-projects/#collaboration-between-data-scientists-and-software-developers","text":"Data scientists and software developers work together on the project. The team uses one backlog and attends the same Agile ceremonies. In cases where the project has many participants, we will divide into working groups, but still have the entire team join the Agile ceremonies. If possible, the feasibility study and initial model experimentation take place before the operationalization work kicks off. Everyone shares the accountability for the MLOps solution. The ML model interface (API) is determined as early as possible, to allow the developers to consider its integration into the production pipeline. MLOps artifacts are developed with continuous collaboration and review by the data scientists, to ensure the appropriate approaches for experimentation and productization are used. 
Retrospectives and sprint planning are performed on the entire team level, and not the specific work groups level.","title":"Collaboration Between Data Scientists and Software Developers"},{"location":"machine-learning/data-exploration/","text":"Data Exploration After envisioning , and typically as part of the ML feasibility study , the next step is to confirm resource access and then dive deep into the available data through data exploration workshops. Purpose of the Data Exploration Workshop The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop Accessing Resources Prior to diving into data exploration workshops, it is important to confirm that you have access to the necessary resources (including data). Below is an example list of questions to consider before starting a data exploration workshop. What are the requirements for an account to be set up in order for the team to access data and compute resources? Are there security requirements around accessing resources (Subscriptions, Azure Resources, project management, etc.) such as VPN, 2FA, jump boxes, etc.? Data access: * Is it on-prem or on Azure already? * If it is on-prem, can we move the needed data to Azure under the appropriate subscription? Who has permission to move the data? * Is the data access approved from a legal/compliance perspective? Computation: * Is a VPN needed for the project team to access these computation nodes (Virtual Machines, Databricks clusters, etc) from their work PCs/Macs? * Any restrictions on accessing the source data system from these computation nodes? * If we want to create some compute resources, who has permissions to do so? Source code repository: * Do you have any preference on source code repository location? Backlog management and work planning: * Do you have any preference on backlog management and work planning, such as Azure DevOps, Jira or anything else? * If an existing system, are special accounts / system setups required to access? Programming Language: * Is Python/PySpark a preferred language? * Is there any internal approval processes for the Python/PySpark libraries we want to use for this engagement? Data Exploration Workshop Key objectives of the exploration workshops include the following: Understand and document the features, location, and availability of the data. What order of magnitude is the current data (e.g., GB, TB)? Is this all relevant? How does the organization decide when to collect additional data or purchase external data? Are there any examples of this? Understand the quality of the data. Is there already a data validation strategy in place? What data has been used so far to analyze recent data-driven projects? What has been found to be most useful? What was not useful? How was this judged? What additional internal data may provide insights useful for data-driven decision-making for proposed projects? What external data could be useful? What are the possible constraints or challenges in accessing or incorporating this data? How was the data collected? Are there any obvious biases due to how the data was collected? 
What changes to data collection, coding, integration, etc has occurred in the last 2 years that may impact the interpretation or availability of the collected data","title":"Data Exploration"},{"location":"machine-learning/data-exploration/#data-exploration","text":"After envisioning , and typically as part of the ML feasibility study , the next step is to confirm resource access and then dive deep into the available data through data exploration workshops.","title":"Data Exploration"},{"location":"machine-learning/data-exploration/#purpose-of-the-data-exploration-workshop","text":"The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop","title":"Purpose of the Data Exploration Workshop"},{"location":"machine-learning/data-exploration/#accessing-resources","text":"Prior to diving into data exploration workshops, it is important to confirm that you have access to the necessary resources (including data). Below is an example list of questions to consider before starting a data exploration workshop. What are the requirements for an account to be set up in order for the team to access data and compute resources? Are there security requirements around accessing resources (Subscriptions, Azure Resources, project management, etc.) such as VPN, 2FA, jump boxes, etc.? Data access: * Is it on-prem or on Azure already? * If it is on-prem, can we move the needed data to Azure under the appropriate subscription? Who has permission to move the data? * Is the data access approved from a legal/compliance perspective? Computation: * Is a VPN needed for the project team to access these computation nodes (Virtual Machines, Databricks clusters, etc) from their work PCs/Macs? * Any restrictions on accessing the source data system from these computation nodes? * If we want to create some compute resources, who has permissions to do so? Source code repository: * Do you have any preference on source code repository location? Backlog management and work planning: * Do you have any preference on backlog management and work planning, such as Azure DevOps, Jira or anything else? * If an existing system, are special accounts / system setups required to access? Programming Language: * Is Python/PySpark a preferred language? * Is there any internal approval processes for the Python/PySpark libraries we want to use for this engagement?","title":"Accessing Resources"},{"location":"machine-learning/data-exploration/#data-exploration-workshop","text":"Key objectives of the exploration workshops include the following: Understand and document the features, location, and availability of the data. What order of magnitude is the current data (e.g., GB, TB)? Is this all relevant? How does the organization decide when to collect additional data or purchase external data? Are there any examples of this? Understand the quality of the data. Is there already a data validation strategy in place? What data has been used so far to analyze recent data-driven projects? What has been found to be most useful? What was not useful? How was this judged? 
What additional internal data may provide insights useful for data-driven decision-making for proposed projects? What external data could be useful? What are the possible constraints or challenges in accessing or incorporating this data? How was the data collected? Are there any obvious biases due to how the data was collected? What changes to data collection, coding, integration, etc has occurred in the last 2 years that may impact the interpretation or availability of the collected data","title":"Data Exploration Workshop"},{"location":"machine-learning/envisioning-and-problem-formulation/","text":"Envisioning and Problem Formulation Before beginning a data science investigation, we need to define a problem statement which the data science team can explore; this problem statement can have a significant influence on whether the project is likely to be successful. Envisioning Goals The main goals of the envisioning process are: Establish a clear understanding of the problem domain and the underlying business objective Define how a potential solution would be used and how its performance should be measured Determine what data is available to solve the problem Understand the capabilities and working practices of the data science team Ensure all parties have the same understanding of the scope and next steps (e.g., onboarding, data exploration workshop) The envisioning process usually entails a series of 'envisioning' sessions where the data science team work alongside subject-matter experts to formulate the problem in such a way that there is a shared understanding of the problem domain, a clear goal, and a predefined approach to evaluating a potential solution. Understanding the Problem Domain Generally, before defining a project scope for a data science investigation, we must first understand the problem domain: What is the problem? Why does the problem need to be solved? Does this problem require a machine learning solution? How would a potential solution be used? However, establishing this understanding can prove difficult, especially for those unfamiliar with the problem domain. To ease this process, we can approach problems in a structured way by taking the following steps: Identify a measurable problem and define this in business terms. The objective should be clear, and we should have a good understanding of the factors that we can control - that can be used as inputs - and how they affect the objective. Be as specific as possible. Decide how the performance of a solution should be measured and identify whether this is possible within the restrictions of this problem. Make sure it aligns with the business objective and that you have identified the data required to evaluate the solution. Note that the data required to evaluate a solution may differ from the data needed to create a solution. Thinking about the solution as a black box, detail the function that a solution to this problem should perform to fulfil the objective and verify that the relevant data is available to solve the problem. One way of approaching this is by thinking about how a subject-matter expert could solve the problem manually, and the data that would be required; if a human subject-matter expert is unable to solve the problem given the available data, this is indicative that additional information is required and/or more data needs to be collected. 
Based on the available data, define specific hypothesis statements - which can be proved or disproved - to guide the exploration of the data science team. Where possible, each hypothesis statement should have a clearly defined success criteria (e.g., with an accuracy of over 60% ), however, this is not always possible - especially for projects where no solution to the problem currently exists. In these cases, the measure of success could be based on a subject-matter expert verifying that the results meet their expectations. Document all the above information, to ensure alignment between stakeholders and establish a clear understanding of the problem to be solved. Try to ensure that as much relevant domain knowledge is captured as possible, and that the features present in available data - and the way that the data was collected - are clearly explained, such that they can be understood by a non-subject matter expert. Once an understanding of the problem domain has been established, it may be necessary to break down the overall problem into smaller, meaningful chunks of work to maintain team focus and ensure a realistic project scope within the given time frame. Listening to the End User These problems are complex and require understanding from a variety of perspectives. It is not uncommon for the stakeholders to not be the end user of the solution framework. In these cases, listening to the actual end users is critical to the success of the project. The following questions can help guide discussion in understanding the stakeholders' perspectives: Who is the end user? What is the current practice related to the business problem? What's the performance of the current solution? What are their pain points? What is their toughest problem? What is the state of the data used to build the solution? How does the end user or SME envision the solution? Envisioning Guidance During envisioning sessions, the following may prove useful for guiding the discussion. Many of these points are taken directly, or adapted from, [1] and [2] . Problem Framing Define the objective in business terms. How will the solution be used? What are the current solutions/workarounds (if any)? What work has been done in this area so far? Does this solution need to fit into an existing system? How should performance be measured? Is the performance measure aligned with the business objective? What would be the minimum performance needed to reach the business objective? Are there any known constraints around non-functional requirements that would have to be taken into account? (e.g., computation times) Frame this problem (supervised/unsupervised, online/offline, etc.) Is human expertise available? How would you solve the problem manually? Are there any restrictions on the type of approaches which can be used? (e.g., does the solution need to be completely explainable?) List the assumptions you or others have made so far. Verify these assumptions if possible. Define some initial hypothesis statements to be explored. Highlight and discuss any responsible AI concerns if appropriate. Workflow What data science skills exist in the organization? How many data scientists/engineers would be available to work on this project? In what capacity would these resources be available (full-time, part-time, etc.)? What does the team's current workflow practices look like? Do they work on the cloud/on-prem? In notebooks/IDE? Is version control used? How are data, experiments and models currently tracked? Does the team employ an Agile methodology? 
How is work tracked? Are there any ML solutions currently running in production? Who is responsible for maintaining these solutions? Who would be responsible for maintaining a solution produced during this project? Are there any restrictions on tooling that must/cannot be used? Example: A Recommendation Engine Problem To illustrate how the above process can be applied to a tangible problem domain, as an example, consider that we are looking at implementing a recommendation engine for a clothing retailer. This example was, in part, inspired by [3] . Often, the objective may be simply presented, in a form such as \"to improve sales\". However, whilst this is ultimately the main goal, we would benefit from being more specific here. Suppose that we were to deploy a solution in November and then observed a December sales surge; how would we be able to distinguish how much of this was as a result of the new recommendation engine, as opposed to the fact that December is a peak buying season? A better objective, in this case, would be \"to drive additional sales by presenting the customer with items that they would not otherwise have purchased without the recommendation \". Here, the inputs that we can control are the choice of items that are presented to each customer, and the order in which they are displayed; considering factors such as how frequently these should change, seasonality, etc. The data required to evaluate a potential solution in this case would be which recommendations resulted in new sales, and an estimation of a customer's likeliness to purchase a specific item without a recommendation. Note that, whilst this data could also be used to build a recommendation engine, it is unlikely that this data will be available before a recommendation system has been implemented, so it is likely that we will have to use an alternate data source to build the model. We can get an initial idea of how to approach a solution to this problem by considering how it would be solved by a subject-matter expert. Thinking of how a personal stylist may provide a recommendation, they are likely to recommend items based on one or more of the following: generally popular items items similar to those liked/purchased by the customer items that were liked/purchased by similar customers items which are complementary to those owned by the customer Whilst this list is by no means exhaustive, it provides a good indication of the data that is likely to be useful to us: item sales data customer purchase histories customer demographics item descriptions and tags previous outfits, or sets, which have been curated by the stylist We would then be able to use this data to explore: a method of measuring similarity between items a method of measuring similarity between customers a method of measuring how complementary items are relative to one another which can be used to create and rank recommendations. Depending on the project scope, and available data, one or more of these areas could be selected to create hypotheses to be explored by the data science team. Some examples of such hypothesis statements could be: From the descriptions of each item, we can determine a measure of similarity between different items to a degree of accuracy which is specified by a stylist. Based on the behavior of customers with similar purchasing histories, we are able to predict certain items that a customer is likely to purchase; with a certainty which is greater than random choice. 
Using sets of items which have previously been sold together, we can formulate rules around the features which determine whether items are complementary or not, which can be verified by a stylist. Next Steps To ensure clarity and alignment, it is useful to summarize the envisioning stage findings focusing on proposed detailed scenarios, assumptions and agreed decisions as well as next steps. We suggest confirming that you have access to all necessary resources (including data) as a next step before proceeding with data exploration workshops. Below are the links to the exit document template and to some questions which may be helpful in confirming resource access. Summary of Scope Exit Document Template List of Resource Access Questions List of Data Exploration Workshop Questions Resources Many of the ideas presented here - and much more - were inspired by, and can be found in the following resources; all of which are highly recommended. Aur\u00e9lien G\u00e9ron's Machine learning project checklist Fast.ai's Data project checklist Designing great data products. Jeremy Howard, Margit Zwemer and Mike Loukides","title":"Envisioning and Problem Formulation"},{"location":"machine-learning/envisioning-and-problem-formulation/#envisioning-and-problem-formulation","text":"Before beginning a data science investigation, we need to define a problem statement which the data science team can explore; this problem statement can have a significant influence on whether the project is likely to be successful.","title":"Envisioning and Problem Formulation"},{"location":"machine-learning/envisioning-and-problem-formulation/#envisioning-goals","text":"The main goals of the envisioning process are: Establish a clear understanding of the problem domain and the underlying business objective Define how a potential solution would be used and how its performance should be measured Determine what data is available to solve the problem Understand the capabilities and working practices of the data science team Ensure all parties have the same understanding of the scope and next steps (e.g., onboarding, data exploration workshop) The envisioning process usually entails a series of 'envisioning' sessions where the data science team work alongside subject-matter experts to formulate the problem in such a way that there is a shared understanding of the problem domain, a clear goal, and a predefined approach to evaluating a potential solution.","title":"Envisioning Goals"},{"location":"machine-learning/envisioning-and-problem-formulation/#understanding-the-problem-domain","text":"Generally, before defining a project scope for a data science investigation, we must first understand the problem domain: What is the problem? Why does the problem need to be solved? Does this problem require a machine learning solution? How would a potential solution be used? However, establishing this understanding can prove difficult, especially for those unfamiliar with the problem domain. To ease this process, we can approach problems in a structured way by taking the following steps: Identify a measurable problem and define this in business terms. The objective should be clear, and we should have a good understanding of the factors that we can control - that can be used as inputs - and how they affect the objective. Be as specific as possible. Decide how the performance of a solution should be measured and identify whether this is possible within the restrictions of this problem. 
Make sure it aligns with the business objective and that you have identified the data required to evaluate the solution. Note that the data required to evaluate a solution may differ from the data needed to create a solution. Thinking about the solution as a black box, detail the function that a solution to this problem should perform to fulfil the objective and verify that the relevant data is available to solve the problem. One way of approaching this is by thinking about how a subject-matter expert could solve the problem manually, and the data that would be required; if a human subject-matter expert is unable to solve the problem given the available data, this is indicative that additional information is required and/or more data needs to be collected. Based on the available data, define specific hypothesis statements - which can be proved or disproved - to guide the exploration of the data science team. Where possible, each hypothesis statement should have a clearly defined success criteria (e.g., with an accuracy of over 60% ), however, this is not always possible - especially for projects where no solution to the problem currently exists. In these cases, the measure of success could be based on a subject-matter expert verifying that the results meet their expectations. Document all the above information, to ensure alignment between stakeholders and establish a clear understanding of the problem to be solved. Try to ensure that as much relevant domain knowledge is captured as possible, and that the features present in available data - and the way that the data was collected - are clearly explained, such that they can be understood by a non-subject matter expert. Once an understanding of the problem domain has been established, it may be necessary to break down the overall problem into smaller, meaningful chunks of work to maintain team focus and ensure a realistic project scope within the given time frame.","title":"Understanding the Problem Domain"},{"location":"machine-learning/envisioning-and-problem-formulation/#listening-to-the-end-user","text":"These problems are complex and require understanding from a variety of perspectives. It is not uncommon for the stakeholders to not be the end user of the solution framework. In these cases, listening to the actual end users is critical to the success of the project. The following questions can help guide discussion in understanding the stakeholders' perspectives: Who is the end user? What is the current practice related to the business problem? What's the performance of the current solution? What are their pain points? What is their toughest problem? What is the state of the data used to build the solution? How does the end user or SME envision the solution?","title":"Listening to the End User"},{"location":"machine-learning/envisioning-and-problem-formulation/#envisioning-guidance","text":"During envisioning sessions, the following may prove useful for guiding the discussion. Many of these points are taken directly, or adapted from, [1] and [2] .","title":"Envisioning Guidance"},{"location":"machine-learning/envisioning-and-problem-formulation/#problem-framing","text":"Define the objective in business terms. How will the solution be used? What are the current solutions/workarounds (if any)? What work has been done in this area so far? Does this solution need to fit into an existing system? How should performance be measured? Is the performance measure aligned with the business objective? 
What would be the minimum performance needed to reach the business objective? Are there any known constraints around non-functional requirements that would have to be taken into account? (e.g., computation times) Frame this problem (supervised/unsupervised, online/offline, etc.) Is human expertise available? How would you solve the problem manually? Are there any restrictions on the type of approaches which can be used? (e.g., does the solution need to be completely explainable?) List the assumptions you or others have made so far. Verify these assumptions if possible. Define some initial hypothesis statements to be explored. Highlight and discuss any responsible AI concerns if appropriate.","title":"Problem Framing"},{"location":"machine-learning/envisioning-and-problem-formulation/#workflow","text":"What data science skills exist in the organization? How many data scientists/engineers would be available to work on this project? In what capacity would these resources be available (full-time, part-time, etc.)? What does the team's current workflow practices look like? Do they work on the cloud/on-prem? In notebooks/IDE? Is version control used? How are data, experiments and models currently tracked? Does the team employ an Agile methodology? How is work tracked? Are there any ML solutions currently running in production? Who is responsible for maintaining these solutions? Who would be responsible for maintaining a solution produced during this project? Are there any restrictions on tooling that must/cannot be used?","title":"Workflow"},{"location":"machine-learning/envisioning-and-problem-formulation/#example-a-recommendation-engine-problem","text":"To illustrate how the above process can be applied to a tangible problem domain, as an example, consider that we are looking at implementing a recommendation engine for a clothing retailer. This example was, in part, inspired by [3] . Often, the objective may be simply presented, in a form such as \"to improve sales\". However, whilst this is ultimately the main goal, we would benefit from being more specific here. Suppose that we were to deploy a solution in November and then observed a December sales surge; how would we be able to distinguish how much of this was as a result of the new recommendation engine, as opposed to the fact that December is a peak buying season? A better objective, in this case, would be \"to drive additional sales by presenting the customer with items that they would not otherwise have purchased without the recommendation \". Here, the inputs that we can control are the choice of items that are presented to each customer, and the order in which they are displayed; considering factors such as how frequently these should change, seasonality, etc. The data required to evaluate a potential solution in this case would be which recommendations resulted in new sales, and an estimation of a customer's likeliness to purchase a specific item without a recommendation. Note that, whilst this data could also be used to build a recommendation engine, it is unlikely that this data will be available before a recommendation system has been implemented, so it is likely that we will have to use an alternate data source to build the model. We can get an initial idea of how to approach a solution to this problem by considering how it would be solved by a subject-matter expert. 
Thinking of how a personal stylist may provide a recommendation, they are likely to recommend items based on one or more of the following: generally popular items items similar to those liked/purchased by the customer items that were liked/purchased by similar customers items which are complementary to those owned by the customer Whilst this list is by no means exhaustive, it provides a good indication of the data that is likely to be useful to us: item sales data customer purchase histories customer demographics item descriptions and tags previous outfits, or sets, which have been curated by the stylist We would then be able to use this data to explore: a method of measuring similarity between items a method of measuring similarity between customers a method of measuring how complementary items are relative to one another which can be used to create and rank recommendations. Depending on the project scope, and available data, one or more of these areas could be selected to create hypotheses to be explored by the data science team. Some examples of such hypothesis statements could be: From the descriptions of each item, we can determine a measure of similarity between different items to a degree of accuracy which is specified by a stylist. Based on the behavior of customers with similar purchasing histories, we are able to predict certain items that a customer is likely to purchase; with a certainty which is greater than random choice. Using sets of items which have previously been sold together, we can formulate rules around the features which determine whether items are complementary or not, which can be verified by a stylist.","title":"Example: A Recommendation Engine Problem"},{"location":"machine-learning/envisioning-and-problem-formulation/#next-steps","text":"To ensure clarity and alignment, it is useful to summarize the envisioning stage findings focusing on proposed detailed scenarios, assumptions and agreed decisions as well as next steps. We suggest confirming that you have access to all necessary resources (including data) as a next step before proceeding with data exploration workshops. Below are the links to the exit document template and to some questions which may be helpful in confirming resource access. Summary of Scope Exit Document Template List of Resource Access Questions List of Data Exploration Workshop Questions","title":"Next Steps"},{"location":"machine-learning/envisioning-and-problem-formulation/#resources","text":"Many of the ideas presented here - and much more - were inspired by, and can be found in the following resources; all of which are highly recommended. Aur\u00e9lien G\u00e9ron's Machine learning project checklist Fast.ai's Data project checklist Designing great data products. Jeremy Howard, Margit Zwemer and Mike Loukides","title":"Resources"},{"location":"machine-learning/envisioning-summary-template/","text":"Generic Envisioning Summary Purpose of this Template This is an example of an envisioning summary completed after envisioning sessions have concluded. It summarizes the materials reviewed, application scenarios discussed and decided, and the next steps in the process. Summary of Envisioning Introduction This document is to summarize what we have discussed in these envisioning sessions, and what we have decided to work on in this machine learning (ML) engagement. With this document, we hope that everyone can be on the same page regarding the scope of this ML engagement, and will ensure a successful start for the project. 
Materials Shared with the Team List materials shared with you here. The list below contains some examples. You will want to be more specific. Business vision statement Sample Data Current problem statement Also discuss: How the current solution is built and implemented Details about the current state of the systems and processes. Applications Scenarios that Can Help [People] Achieve [Task] The following application scenarios were discussed: Scenario 1: Scenario 2: Add more scenarios as needed For each scenario, provide an appropriately descriptive name and then follow up with more details. For each scenario, discuss: What problem statement was discussed How we propose to solve the problem (there may be several proposals) Who would use the solution What would it look like to use our solution? An example of how it would bring value to the end user. Selected Scenario for this ML Engagement Which scenario was selected? Why was this scenario prioritised over the others? Will other scenarios be considered in the future? When will we revisit them / what conditions need to be met to pursue them? More Details of the Scope for Selected Scenario What is in scope? What data is available? Which performance metric to use? Bar of performance metrics What are deliverables? What\u2019s Next? Legal Documents to be Signed State documents and timeline Responsible AI Review Plan when to conduct a responsible AI process. What are the prerequisites to start this process? Data Exploration Workshop A data exploration workshop is planned for DATE RANGE . This data exploration workshop will be X - Y days, not including the time to gain access to resources. The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop ML Feasibility Study til [date] Objectives State what we expect to be the objective in the feasibility study Timeline Give a possible timeline for the feasibility study Personnel Needed What sorts of people/roles are needed for the feasibility study? What\u2019s After ML Feasibility Study Detail here Summary of Timeline Below is a high-level summary of the upcoming timeline: Discuss dates for the data exploration workshop, and feasibility study along with any to-do items such as starting responsible AI process, identifying engineering resources. We suggest using a concise bulleted list or a table to easily convey the information.","title":"Generic Envisioning Summary"},{"location":"machine-learning/envisioning-summary-template/#generic-envisioning-summary","text":"","title":"Generic Envisioning Summary"},{"location":"machine-learning/envisioning-summary-template/#purpose-of-this-template","text":"This is an example of an envisioning summary completed after envisioning sessions have concluded. 
It summarizes the materials reviewed, application scenarios discussed and decided, and the next steps in the process.","title":"Purpose of this Template"},{"location":"machine-learning/envisioning-summary-template/#summary-of-envisioning","text":"","title":"Summary of Envisioning"},{"location":"machine-learning/envisioning-summary-template/#introduction","text":"This document is to summarize what we have discussed in these envisioning sessions, and what we have decided to work on in this machine learning (ML) engagement. With this document, we hope that everyone can be on the same page regarding the scope of this ML engagement, and will ensure a successful start for the project.","title":"Introduction"},{"location":"machine-learning/envisioning-summary-template/#materials-shared-with-the-team","text":"List materials shared with you here. The list below contains some examples. You will want to be more specific. Business vision statement Sample Data Current problem statement Also discuss: How the current solution is built and implemented Details about the current state of the systems and processes.","title":"Materials Shared with the Team"},{"location":"machine-learning/envisioning-summary-template/#applications-scenarios-that-can-help-people-achieve-task","text":"The following application scenarios were discussed: Scenario 1: Scenario 2: Add more scenarios as needed For each scenario, provide an appropriately descriptive name and then follow up with more details. For each scenario, discuss: What problem statement was discussed How we propose to solve the problem (there may be several proposals) Who would use the solution What would it look like to use our solution? An example of how it would bring value to the end user.","title":"Applications Scenarios that Can Help [People] Achieve [Task]"},{"location":"machine-learning/envisioning-summary-template/#selected-scenario-for-this-ml-engagement","text":"Which scenario was selected? Why was this scenario prioritised over the others? Will other scenarios be considered in the future? When will we revisit them / what conditions need to be met to pursue them?","title":"Selected Scenario for this ML Engagement"},{"location":"machine-learning/envisioning-summary-template/#more-details-of-the-scope-for-selected-scenario","text":"What is in scope? What data is available? Which performance metric to use? Bar of performance metrics What are deliverables?","title":"More Details of the Scope for Selected Scenario"},{"location":"machine-learning/envisioning-summary-template/#whats-next","text":"","title":"What\u2019s Next?"},{"location":"machine-learning/envisioning-summary-template/#legal-documents-to-be-signed","text":"State documents and timeline","title":"Legal Documents to be Signed"},{"location":"machine-learning/envisioning-summary-template/#responsible-ai-review","text":"Plan when to conduct a responsible AI process. What are the prerequisites to start this process?","title":"Responsible AI Review"},{"location":"machine-learning/envisioning-summary-template/#data-exploration-workshop","text":"A data exploration workshop is planned for DATE RANGE . This data exploration workshop will be X - Y days, not including the time to gain access to resources. 
The purpose of the data exploration workshop is as follows: Ensure the team can access the data and compute resources that are necessary for the ML feasibility study Ensure that the data provided is of quality and is relevant to the ML solution Make sure that the project team has a good understanding of the data Make sure that the SMEs (Subject Matter Experts) needed are present for Data Exploration Workshop List people needed for the data exploration workshop","title":"Data Exploration Workshop"},{"location":"machine-learning/envisioning-summary-template/#ml-feasibility-study-til-date","text":"","title":"ML Feasibility Study til [date]"},{"location":"machine-learning/envisioning-summary-template/#objectives","text":"State what we expect to be the objective in the feasibility study","title":"Objectives"},{"location":"machine-learning/envisioning-summary-template/#timeline","text":"Give a possible timeline for the feasibility study","title":"Timeline"},{"location":"machine-learning/envisioning-summary-template/#personnel-needed","text":"What sorts of people/roles are needed for the feasibility study?","title":"Personnel Needed"},{"location":"machine-learning/envisioning-summary-template/#whats-after-ml-feasibility-study","text":"Detail here","title":"What\u2019s After ML Feasibility Study"},{"location":"machine-learning/envisioning-summary-template/#summary-of-timeline","text":"Below is a high-level summary of the upcoming timeline: Discuss dates for the data exploration workshop, and feasibility study along with any to-do items such as starting responsible AI process, identifying engineering resources. We suggest using a concise bulleted list or a table to easily convey the information.","title":"Summary of Timeline"},{"location":"machine-learning/feasibility-studies/","text":"Feasibility Studies The main goal of feasibility studies is to assess whether it is feasible to solve the problem satisfactorily using ML with the available data. We want to avoid investing too much in the solution before we have: Sufficient evidence that a solution would be the best technical solution given the business case Sufficient evidence that a solution is compatible with the problem context Sufficient evidence that a solution is possible Some vetted direction on what a solution should look like This effort ensures quality solutions backed by the appropriate, thorough amount of consideration and evidence. When are Feasibility Studies Useful? Every engagement can benefit from a feasibility study early in the project. Architectural discussions can still occur in parallel as the team works towards gaining a solid understanding and definition of what will be built. Feasibility studies can last between 4-16 weeks, depending on specific problem details, volume of data, state of the data etc. Starting with a 4-week milestone might be useful, during which it can be determined how much more time, if any, is required for completion. Who Collaborates on Feasibility Studies? Collaboration from individuals with diverse skill sets is desired at this stage, including data scientists, data engineers, software engineers, PMs, human experience researchers, and domain experts. It embraces the use of engineering fundamentals, with some flexibility. For example, not all experimentation requires full test coverage and code review. Experimentation is typically not part of a CI/CD pipeline. 
Artifacts may live in the main branch as a folder excluded from the CI/CD pipeline, or as a separate experimental branch, depending on customer/team preferences. What do Feasibility Studies Entail? Problem Definition and Desired Outcome Ensure that the problem is complex enough that coding rules or manual scaling is unrealistic Clear definition of the problem from business and technical perspectives Deep Contextual Understanding Confirm that the following questions can be answered based on what was learned during the Discovery Phase of the project. For items that can not be satisfactorily answered, undertake additional investigation to answer. Understanding the people who are using and/or affected by the solution Understanding the contextual forces at play around the problem, including goals, culture, and historical context To accomplish this a researcher will: Collaborate with customers and colleagues to explore the landscape of people who relate to and may be affected by the problem space being explored (Users, stakeholders, subject matter experts, etc) Formulate the research question(s) to be addressed Select and design research to best serve the research question(s) Identify and select representative research participants across the problem space with whom to conduct the research Construct a research plan and necessary preparation documents for the selected research method(s) Conduct research activity with the participants via the selected method(s) Synthesize, analyze, and interpret research findings Where relevant, build frameworks, artifacts and processes that help explore the findings and implications of the research across the team Share what was uncovered and understood, and the implications thereof across the engagement team and relevant stakeholders. If the above research was conducted during the Discovery phase, it should be reviewed, and any substantial knowledge gaps should be identified and filled by following the above process. Data Access Verify that the full team has access to the data Set up a dedicated and/or restricted environment if required Perform any required de-identification or redaction of sensitive information Understand data access requirements (retention, role-based access, etc.) Data Discovery Hold a data exploration workshop and deep dive with domain experts Understand data availability and confirm the team's access Understand the data dictionary, if available Understand the quality of the data. Is there already a data validation strategy in place? Ensure required data is present in reasonable volumes For supervised problems (most common), assess the availability of labels or data that can be used to effectively approximate labels If applicable, ensure all data can be joined as required and understand how Ideally obtain or create an entity relationship diagram (ERD) Potentially uncover new useful data sources Architecture Discovery Clear picture of existing architecture Infrastructure spikes Concept Ideation and Iteration Develop value proposition(s) for users and stakeholders based on the contextual understanding developed through the discovery process (e.g. 
key elements of value, benefits) As relevant, make use of Co-creation with team Co-creation with users and stakeholders As relevant, create vignettes, narratives or other materials to communicate the concept Identify the next set of hypotheses or unknowns to be tested (see concept testing) Revisit and iterate on the concept throughout discovery as understanding of the problem space evolves Exploratory Data Analysis (EDA) Data deep dive Understand feature and label value distributions Understand correlations among features and between features and labels Understand data specific problem constraints like missing values, categorical cardinality, potential for data leakage etc. Identify any gaps in data that couldn't be identified in the data discovery phase Pave the way of further understanding of what techniques are applicable Establish a mutual understanding of what data is in or out of scope for feasibility, ensuring that the data in scope is significant for the business Data Pre-Processing Happens during EDA and hypothesis testing Feature engineering Sampling Scaling and/or discretization Noise handling Hypothesis Testing Design several potential solutions using theoretically applicable algorithms and techniques, starting with the simplest reasonable baseline Train model(s) Evaluate performance and determine if satisfactory Tweak experimental solution designs based on outcomes Iterate Thoroughly document each step and outcome, plus any resulting hypotheses for easy following of the decision-making process Concept Testing Where relevant, to test the value proposition, concepts or aspects of the experience Plan user, stakeholder and expert research Develop and design necessary research materials Synthesize and evaluate feedback to incorporate into concept development Continue to iterate and test different elements of the concept as necessary, including testing to best serve RAI goals and guidelines Ensure that the proposed solution and framing are compatible with and acceptable to affected people Ensure that the proposed solution and framing is compatible with existing business goals and context Risk Assessment Identification and assessment of risks and constraints Responsible AI Consideration of responsible AI principles Understanding of users and stakeholders\u2019 contexts, needs and concerns to inform development of RAI Testing AI concept and experience elements with users and stakeholders Discussion and feedback from diverse perspectives around any responsible AI concerns Output of a Feasibility Study The main outcome is a feasibility study report, with a recommendation on next steps: If there is not enough evidence to support the hypothesis that this problem can be solved using ML, as aligned with the pre-determined performance measures and business impact: We detail the gaps and challenges that prevented us from reaching a positive outcome We may scope down the project, if applicable We may look at re-scoping the problem taking into account the findings of the feasibility study We assess the possibility to collect more data or improve data quality If there is enough evidence to support the hypothesis that this problem can be solved using ML Provide recommendations and technical assets for moving to the operationalization phase","title":"Feasibility Studies"},{"location":"machine-learning/feasibility-studies/#feasibility-studies","text":"The main goal of feasibility studies is to assess whether it is feasible to solve the problem satisfactorily using ML with the available data. 
We want to avoid investing too much in the solution before we have: Sufficient evidence that a solution would be the best technical solution given the business case Sufficient evidence that a solution is compatible with the problem context Sufficient evidence that a solution is possible Some vetted direction on what a solution should look like This effort ensures quality solutions backed by the appropriate, thorough amount of consideration and evidence.","title":"Feasibility Studies"},{"location":"machine-learning/feasibility-studies/#when-are-feasibility-studies-useful","text":"Every engagement can benefit from a feasibility study early in the project. Architectural discussions can still occur in parallel as the team works towards gaining a solid understanding and definition of what will be built. Feasibility studies can last between 4-16 weeks, depending on specific problem details, volume of data, state of the data etc. Starting with a 4-week milestone might be useful, during which it can be determined how much more time, if any, is required for completion.","title":"When are Feasibility Studies Useful?"},{"location":"machine-learning/feasibility-studies/#who-collaborates-on-feasibility-studies","text":"Collaboration from individuals with diverse skill sets is desired at this stage, including data scientists, data engineers, software engineers, PMs, human experience researchers, and domain experts. It embraces the use of engineering fundamentals, with some flexibility. For example, not all experimentation requires full test coverage and code review. Experimentation is typically not part of a CI/CD pipeline. Artifacts may live in the main branch as a folder excluded from the CI/CD pipeline, or as a separate experimental branch, depending on customer/team preferences.","title":"Who Collaborates on Feasibility Studies?"},{"location":"machine-learning/feasibility-studies/#what-do-feasibility-studies-entail","text":"","title":"What do Feasibility Studies Entail?"},{"location":"machine-learning/feasibility-studies/#problem-definition-and-desired-outcome","text":"Ensure that the problem is complex enough that coding rules or manual scaling is unrealistic Clear definition of the problem from business and technical perspectives","title":"Problem Definition and Desired Outcome"},{"location":"machine-learning/feasibility-studies/#deep-contextual-understanding","text":"Confirm that the following questions can be answered based on what was learned during the Discovery Phase of the project. For items that can not be satisfactorily answered, undertake additional investigation to answer. 
Understanding the people who are using and/or affected by the solution Understanding the contextual forces at play around the problem, including goals, culture, and historical context To accomplish this a researcher will: Collaborate with customers and colleagues to explore the landscape of people who relate to and may be affected by the problem space being explored (Users, stakeholders, subject matter experts, etc) Formulate the research question(s) to be addressed Select and design research to best serve the research question(s) Identify and select representative research participants across the problem space with whom to conduct the research Construct a research plan and necessary preparation documents for the selected research method(s) Conduct research activity with the participants via the selected method(s) Synthesize, analyze, and interpret research findings Where relevant, build frameworks, artifacts and processes that help explore the findings and implications of the research across the team Share what was uncovered and understood, and the implications thereof across the engagement team and relevant stakeholders. If the above research was conducted during the Discovery phase, it should be reviewed, and any substantial knowledge gaps should be identified and filled by following the above process.","title":"Deep Contextual Understanding"},{"location":"machine-learning/feasibility-studies/#data-access","text":"Verify that the full team has access to the data Set up a dedicated and/or restricted environment if required Perform any required de-identification or redaction of sensitive information Understand data access requirements (retention, role-based access, etc.)","title":"Data Access"},{"location":"machine-learning/feasibility-studies/#data-discovery","text":"Hold a data exploration workshop and deep dive with domain experts Understand data availability and confirm the team's access Understand the data dictionary, if available Understand the quality of the data. Is there already a data validation strategy in place? Ensure required data is present in reasonable volumes For supervised problems (most common), assess the availability of labels or data that can be used to effectively approximate labels If applicable, ensure all data can be joined as required and understand how Ideally obtain or create an entity relationship diagram (ERD) Potentially uncover new useful data sources","title":"Data Discovery"},{"location":"machine-learning/feasibility-studies/#architecture-discovery","text":"Clear picture of existing architecture Infrastructure spikes","title":"Architecture Discovery"},{"location":"machine-learning/feasibility-studies/#concept-ideation-and-iteration","text":"Develop value proposition(s) for users and stakeholders based on the contextual understanding developed through the discovery process (e.g. 
key elements of value, benefits) As relevant, make use of Co-creation with team Co-creation with users and stakeholders As relevant, create vignettes, narratives or other materials to communicate the concept Identify the next set of hypotheses or unknowns to be tested (see concept testing) Revisit and iterate on the concept throughout discovery as understanding of the problem space evolves","title":"Concept Ideation and Iteration"},{"location":"machine-learning/feasibility-studies/#exploratory-data-analysis-eda","text":"Data deep dive Understand feature and label value distributions Understand correlations among features and between features and labels Understand data specific problem constraints like missing values, categorical cardinality, potential for data leakage etc. Identify any gaps in data that couldn't be identified in the data discovery phase Pave the way of further understanding of what techniques are applicable Establish a mutual understanding of what data is in or out of scope for feasibility, ensuring that the data in scope is significant for the business","title":"Exploratory Data Analysis (EDA)"},{"location":"machine-learning/feasibility-studies/#data-pre-processing","text":"Happens during EDA and hypothesis testing Feature engineering Sampling Scaling and/or discretization Noise handling","title":"Data Pre-Processing"},{"location":"machine-learning/feasibility-studies/#hypothesis-testing","text":"Design several potential solutions using theoretically applicable algorithms and techniques, starting with the simplest reasonable baseline Train model(s) Evaluate performance and determine if satisfactory Tweak experimental solution designs based on outcomes Iterate Thoroughly document each step and outcome, plus any resulting hypotheses for easy following of the decision-making process","title":"Hypothesis Testing"},{"location":"machine-learning/feasibility-studies/#concept-testing","text":"Where relevant, to test the value proposition, concepts or aspects of the experience Plan user, stakeholder and expert research Develop and design necessary research materials Synthesize and evaluate feedback to incorporate into concept development Continue to iterate and test different elements of the concept as necessary, including testing to best serve RAI goals and guidelines Ensure that the proposed solution and framing are compatible with and acceptable to affected people Ensure that the proposed solution and framing is compatible with existing business goals and context","title":"Concept Testing"},{"location":"machine-learning/feasibility-studies/#risk-assessment","text":"Identification and assessment of risks and constraints","title":"Risk Assessment"},{"location":"machine-learning/feasibility-studies/#responsible-ai","text":"Consideration of responsible AI principles Understanding of users and stakeholders\u2019 contexts, needs and concerns to inform development of RAI Testing AI concept and experience elements with users and stakeholders Discussion and feedback from diverse perspectives around any responsible AI concerns","title":"Responsible AI"},{"location":"machine-learning/feasibility-studies/#output-of-a-feasibility-study","text":"The main outcome is a feasibility study report, with a recommendation on next steps: If there is not enough evidence to support the hypothesis that this problem can be solved using ML, as aligned with the pre-determined performance measures and business impact: We detail the gaps and challenges that prevented us from reaching a positive outcome We 
may scope down the project, if applicable We may look at re-scoping the problem taking into account the findings of the feasibility study We assess the possibility to collect more data or improve data quality If there is enough evidence to support the hypothesis that this problem can be solved using ML Provide recommendations and technical assets for moving to the operationalization phase","title":"Output of a Feasibility Study"},{"location":"machine-learning/ml-fundamentals-checklist/","text":"ML Fundamentals Checklist This checklist helps ensure that our ML projects meet our ML Fundamentals. The items below are not sequential, but rather organized by different parts of an ML project. Data Quality and Governance There is access to data. Labels exist for dataset of interest. Data quality evaluation. Able to track data lineage. Understanding of where the data is coming from and any policies related to data access. Gather Security and Compliance requirements. Feasibility Study A feasibility study was performed to assess if the data supports the proposed tasks. Rigorous Exploratory data analysis was performed (including analysis of data distribution). Hypotheses were tested producing sufficient evidence to either support or reject that an ML approach is feasible to solve the problem. ROI estimation and risk analysis was performed for the project. ML outputs/assets can be integrated within the production system. Recommendations on how to proceed have been documented. Evaluation and Metrics Clear definition of how performance will be measured. The evaluation metrics are somewhat connected to the success criteria. The metrics can be calculated with the datasets available. Evaluation flow can be applied to all versions of the model. Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis. Model Baseline Well-defined baseline model exists and its performance is calculated. ( More details on well defined baselines ) The performance of other ML models can be compared with the model baseline. Experimentation setup Well-defined train/test dataset with labels. Reproducible and logged experiments in an environment accessible by all data scientists to quickly iterate. Defined experiments/hypothesis to test. Results of experiments are documented. Model hyper parameters are tuned systematically. Same performance evaluation metrics and consistent datasets are used when comparing candidate models. Production Model readiness checklist reviewed. Model reviews were performed (covering model debugging, reviews of training and evaluation approaches, model performance). Data pipeline for inferencing, including an end-to-end tests. SLAs requirements for models are gathered and documented. Monitoring of data feeds and model output. Ensure consistent schema is used across the system with expected input/output defined for each component of the pipelines (data processing as well as models). Responsible AI reviewed.","title":"ML Fundamentals Checklist"},{"location":"machine-learning/ml-fundamentals-checklist/#ml-fundamentals-checklist","text":"This checklist helps ensure that our ML projects meet our ML Fundamentals. The items below are not sequential, but rather organized by different parts of an ML project.","title":"ML Fundamentals Checklist"},{"location":"machine-learning/ml-fundamentals-checklist/#data-quality-and-governance","text":"There is access to data. Labels exist for dataset of interest. Data quality evaluation. 
Able to track data lineage. Understanding of where the data is coming from and any policies related to data access. Gather Security and Compliance requirements.","title":"Data Quality and Governance"},{"location":"machine-learning/ml-fundamentals-checklist/#feasibility-study","text":"A feasibility study was performed to assess if the data supports the proposed tasks. Rigorous Exploratory data analysis was performed (including analysis of data distribution). Hypotheses were tested producing sufficient evidence to either support or reject that an ML approach is feasible to solve the problem. ROI estimation and risk analysis was performed for the project. ML outputs/assets can be integrated within the production system. Recommendations on how to proceed have been documented.","title":"Feasibility Study"},{"location":"machine-learning/ml-fundamentals-checklist/#evaluation-and-metrics","text":"Clear definition of how performance will be measured. The evaluation metrics are somewhat connected to the success criteria. The metrics can be calculated with the datasets available. Evaluation flow can be applied to all versions of the model. Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis.","title":"Evaluation and Metrics"},{"location":"machine-learning/ml-fundamentals-checklist/#model-baseline","text":"Well-defined baseline model exists and its performance is calculated. ( More details on well defined baselines ) The performance of other ML models can be compared with the model baseline.","title":"Model Baseline"},{"location":"machine-learning/ml-fundamentals-checklist/#experimentation-setup","text":"Well-defined train/test dataset with labels. Reproducible and logged experiments in an environment accessible by all data scientists to quickly iterate. Defined experiments/hypothesis to test. Results of experiments are documented. Model hyper parameters are tuned systematically. Same performance evaluation metrics and consistent datasets are used when comparing candidate models.","title":"Experimentation setup"},{"location":"machine-learning/ml-fundamentals-checklist/#production","text":"Model readiness checklist reviewed. Model reviews were performed (covering model debugging, reviews of training and evaluation approaches, model performance). Data pipeline for inferencing, including an end-to-end tests. SLAs requirements for models are gathered and documented. Monitoring of data feeds and model output. Ensure consistent schema is used across the system with expected input/output defined for each component of the pipelines (data processing as well as models). Responsible AI reviewed.","title":"Production"},{"location":"machine-learning/ml-model-checklist/","text":"ML Model Production Checklist The purpose of this checklist is to make sure that: The team assessed if the model is ready for production before moving to the scoring process The team has prepared a production plan for the model The checklist provides guidelines for creating this production plan. It should be used by teams/organizations that already built/trained an ML model and are now considering putting it into production. Checklist Before putting an individual ML model into production, the following aspects should be considered: Is there a well defined baseline? Is the model performing better than the baseline? Are machine learning performance metrics defined for both training and scoring? Is the model benchmarked? 
Can ground truth be obtained or inferred in production? Has the data distribution of training, testing and validation sets been analyzed? Have goals and hard limits for performance, speed of prediction and costs been established so they can be considered if trade-offs need to be made? How will the model be integrated into other systems, and what impact will it have? How will incoming data quality be monitored? How will drift in data characteristics be monitored? How will performance be monitored? Have any ethical concerns been taken into account? Please note that there might be scenarios where it is not possible to check all the items on this checklist. However, it is advised to go through all items and make informed decisions based on your specific use case. Will Your Model Performance be Different in Production than During the Training Phase Once deployed into production, the model might be performing much worse than expected. This poor performance could be a result of: The data to be scored in production is significantly different from the train and test datasets The feature engineering steps are different or inconsistent in production compared to the training process The performance measure is not consistent (for example your test set covers several months of data where the performance metric for production has been calculated for one month of data) Is there a Well-Defined Baseline? Is the Model Performing Better than the Baseline? A good way to think of a model baseline is the simplest model one can come up with: either a simple threshold, a random guess or a very basic linear model. This baseline is the reference point your model needs to outperform. A well-defined baseline is different for each problem type and there is no one size fits all approach. As an example, let's consider some common types of machine learning problems: Classification : Predicting between a positive and a negative class. Either the class with the most observations or a simple logistic regression model can be the baseline. Regression : Predicting the house prices in a city. The average house price for the last year or last month, a simple linear regression model, or the previous median house price in a neighborhood could be the baseline. Image classification : Building an image classifier to distinguish between cats and no cats in an image. If your classes are unbalanced: 70% cats and 30% no cats and if you always predict cats, your naive classifier has 70% accuracy and this can be your baseline. If your classes are balanced: 52% cats and 48% no cats, then a simple convolutional architecture can be the baseline (1 conv layer + 1 max pooling + 1 dense). Additionally, human accuracy at labelling can also be the baseline in an image classification scenario. Some questions to ask when comparing to a baseline: How does your model compare to a random guess? How does your model performance compare to applying a simple threshold? How does your model compare with always predicting the most common value? Note : In some cases, human parity might be too ambitious as a baseline, but this should be decided on a case by case basis. Human accuracy is one of the available options, but not the only one. Resources: \"How To Get Baseline Results And Why They Matter\" article \"Always start with a stupid model, no exceptions.\" article Are Machine Learning Performance Metrics Defined for Both Training and Scoring? The methodology of translating the training metrics to scoring metrics should be well-defined and understood. 
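To make this concrete, below is a toy sketch (synthetic, hypothetical daily sales data, not from the original checklist) of the point developed in the rest of this section: the same set of predictions can produce a very different RMSE depending on the window over which the metric is computed.

```python
# Toy illustration: the evaluation window changes the reported RMSE.
import numpy as np

rng = np.random.default_rng(0)
days = 365
actual = 100 + 10 * np.sin(np.arange(days) / 30) + rng.normal(0, 5, days)
# Hypothetical predictions whose error slowly grows while the model is live.
predicted = actual + rng.normal(0, 5, days) + np.linspace(0, 20, days)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print("RMSE, first month in production:", rmse(actual[:30], predicted[:30]))
print("RMSE, full first year:          ", rmse(actual, predicted))
```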
Depending on the data type and model, the model metrics calculation might differ in production and in training. For example, the training procedure calculated metrics for a long period of time (a year, a decade) with different seasonal characteristics while the scoring procedure will calculate the metrics per a restricted time interval (for example a week, a month, a quarter). Well-defined ML performance metrics are essential in production so that a decrease or increase in model performance can be accurately detected. Things to consider: In forecasting, if you change the period of assessing the performance, from one month to a year for example, then you might get a different result. For example, if your model is predicting sales of a product per day and the RMSE (Root Mean Squared Error) is very low for the first month the model is in production. As the model is live for longer, the RMSE is increasing, becoming 10x the RMSE for the first year compared to the first month. In a classification scenario, the overall accuracy is good, but the model is performing poorly for some subgroups. For example, a classifier has an accuracy of 80% overall, but only 55% for the 20-30 age group. If this is a significant age group for the production data, then your accuracy might suffer greatly when in production. In scene classification scenario, the model is trying to identify a specific scene in a video, and the model has been trained and tested (80-20 split) on 50000 segments where half are segments containing the scene and half of the segments do not contain the scene. The accuracy on the training set is 85% and 84% on the test set. However, when an entire video is scored, scores are obtained on all segments, and we expect few segments to contain the scene. The accuracy for an entire video is not comparable with the training/test set procedure in this case, hence different metrics should be considered. If sampling techniques (over-sampling, under-sampling) are used to train model when classes are imbalanced, ensure the metrics used during training are comparable with the ones used in scoring. If the number of samples used for training and testing is small, the performance metrics might change significantly as new data is scored. Is the Model Benchmarked? The trained model to be put into production is well benchmarked if machine learning performance metrics (such as accuracy, recall, RMSE or whatever is appropriate) are measured on the train and test set. Furthermore, the train and test set split should be well documented and reproducible. Can Ground Truth be Obtained or Inferred in Production? Without a reliable ground truth, the machine learning metrics cannot be calculated. It is important to identify if the ground truth can be obtained as the model is scoring new data by either manual or automatic means. If the ground truth cannot be obtained systematically, other proxies and methodology should be investigated in order to obtain some measure of model performance. One option is to use humans to manually label samples. One important aspect of human labelling is to take into account the human accuracy. If there are two different individuals labelling an image, the labels will likely be different for some samples. It is important to understand how the labels were obtained to assess the reliability of the ground truth (that is why we talk about human accuracy). 
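One simple way to put a number on labeller disagreement is an inter-annotator agreement measure such as Cohen's kappa. The sketch below uses hypothetical labels from two annotators and scikit-learn; it is only an illustration of the idea, not a prescribed process.

```python
# Quantify agreement between two human labellers as a proxy for label reliability.
from sklearn.metrics import cohen_kappa_score

labeller_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
labeller_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

print("Cohen's kappa:", cohen_kappa_score(labeller_a, labeller_b))
# Low agreement suggests the ground truth (and any metric computed from it) is noisy.
```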
For clarity, let's consider the following examples (by no means an exhaustive list): Forecasting : Forecasting scenarios are an example of machine learning problems where the ground truth could be obtained in most cases even though a delay might occur. For example, for a model predicting the sales of ice cream in a local shop, the ground truth will be obtained as the sales are happening, but it might appear in the system at a later time than as the model prediction. Recommender systems : For recommender system, obtaining the ground truth is a complex problem in most cases as there is no way of identifying the ideal recommendation. For a retail website for example, click/not click, buy/not buy or other user interaction with recommendation can be used as ground truth proxies. Object detection in images : For an object detection model, as new images are scored, there are no new labels being generated automatically. One option to obtain the ground truth for the new images is to use people to manually label the images. Human labelling is costly, time-consuming and not 100% accurate, so in most cases, only a subset of images can be labelled. These samples can be chosen at random or by using active learning techniques of selecting the most informative unlabeled samples. Has the Data Distribution of Training, Testing and Validation Sets Been Analyzed? The data distribution of your training, test and validation (if applicable) dataset (including labels) should be analyzed to ensure they all come from the same distribution. If this is not the case, some options to consider are: re-shuffling, re-sampling, modifying the data, more samples need to be gathered or features removed from the dataset. Significant differences in the data distributions of the different datasets can greatly impact the performance of the model. Some potential questions to ask: How much does the training and test data represent the end result? Is the distribution of each individual feature consistent across all your datasets? (i.e. same representation of age groups, gender, race etc.) Is there any data lineage information? Where did the data come from? How was the data collected? Can collection and labelling be automated? Resources: \"Splitting into train, dev and test\" tutorial Have Goals and Hard Limits for Performance, Speed of Prediction and Costs been Established, so they can be Considered if Trade-Offs Need to be Made? Some machine learning models achieve high ML performance, but they are costly and time-consuming to run. In those cases, a less performant and cheaper model could be preferred. Hence, it is important to calculate the model performance metrics (accuracy, precision, recall, RMSE etc), but also to gather data on how expensive it will be to run the model and how long it will take to run. Once this data is gathered, an informed decision should be made on what model to productionize. System metrics to consider: CPU/GPU/memory usage Cost per prediction Time taken to make a prediction How Will the Model be Integrated into Other Systems, and what Impact will it Have? Machine Learning models do not exist in isolation, but rather they are part of a much larger system. These systems could be old, proprietary systems or new systems being developed as a results of the creation a new machine learning model. In both of those cases, it is important to understand where the actual model is going to fit in, what output is expected from the model and how that output is going to be used by the larger system. 
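One lightweight way to make the expected model input and output explicit for the surrounding system is to agree on a typed contract with the owning teams. The sketch below is a hypothetical example (field names invented for illustration) using plain Python dataclasses.

```python
# Hypothetical contract between the model service and the larger system.
from dataclasses import dataclass
from typing import List

@dataclass
class PredictionRequest:
    customer_id: str
    features: List[float]

@dataclass
class PredictionResponse:
    customer_id: str
    score: float        # model output consumed by the downstream system
    model_version: str  # helps trace which model produced the prediction
```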
Additionally, it is essential to decide if the model will be used for batch and/or real-time inference as production paths might differ. Possible questions to assess model impact: Is there a human in the loop? How is feedback collected through the system? (for example how do we know if a prediction is wrong) Is there a fallback mechanism when things go wrong? Is the system transparent that there is a model making a prediction and what data is used to make this prediction? What is the cost of a wrong prediction? How Will Incoming Data Quality be Monitored? As data systems become increasingly complex in the mainstream, it is especially vital to employ data quality monitoring, alerting and rectification protocols. Following data validation best practices can prevent insidious issues from creeping into machine learning models that, at best, reduce the usefulness of the model, and at worst, introduce harm. Data validation, reduces the risk of data downtime (increasing headroom) and technical debt and supports long-term success of machine learning models and other applications that rely on the data. Data validation best practices include: Employing automated data quality testing processes at each stage of the data pipeline Re-routing data that fails quality tests to a separate data store for diagnosis and resolution Employing end-to-end data observability on data freshness, distribution, volume, schema and lineage Note that data validation is distinct from data drift detection. Data validation detects errors in the data (ex. a datum is outside of the expected range), while data drift detection uncovers legitimate changes in the data that are truly representative of the phenomenon being modeled (ex. user preferences change). Data validation issues should trigger re-routing and rectification, while data drift should trigger adaptation or retraining of a model. Resources: \"Data Quality Fundamentals\" by Moses et al. How Will Drift in Data Characteristics be Monitored? Data drift detection uncovers legitimate changes in incoming data that are truly representative of the phenomenon being modeled,and are not erroneous (ex. user preferences change). It is imperative to understand if the new data in production will be significantly different from the data in the training phase. It is also important to check that the data distribution information can be obtained for any of the new data coming in. Drift monitoring can inform when changes are occurring and what their characteristics are (ex. abrupt vs gradual) and guide effective adaptation or retraining strategies to maintain performance. Possible questions to ask: What are some examples of drift, or deviation from the norm, that have been experience in the past or that might be expected? Is there a drift detection strategy in place? Does it align with expected types of changes? Are there warnings when anomalies in input data are occurring? Is there an adaptation strategy in place? Does it align with expected types of changes? Resources: \"Learning Under Concept Drift: A Review\" by Lu at al. Understanding dataset shift How Will Performance be Monitored? It is important to define how the model will be monitored when it is in production and how that data is going to be used to make decisions. For example, when will a model need retraining as the performance has degraded and how to identify what are the underlying causes of this degradation could be part of this monitoring methodology. Ideally, model monitoring should be done automatically. 
However, if this is not possible, then there should be a manual periodical check of the model performance. Model monitoring should lead to: Ability to identify changes in model performance Warnings when anomalies in model output are occurring Retraining decisions and adaptation strategy Have any Ethical Concerns Been Taken into Account? Every ML project goes through the Responsible AI process to ensure that it upholds Microsoft's 6 Responsible AI principles .","title":"ML Model Production Checklist"},{"location":"machine-learning/ml-model-checklist/#ml-model-production-checklist","text":"The purpose of this checklist is to make sure that: The team assessed if the model is ready for production before moving to the scoring process The team has prepared a production plan for the model The checklist provides guidelines for creating this production plan. It should be used by teams/organizations that already built/trained an ML model and are now considering putting it into production.","title":"ML Model Production Checklist"},{"location":"machine-learning/ml-model-checklist/#checklist","text":"Before putting an individual ML model into production, the following aspects should be considered: Is there a well defined baseline? Is the model performing better than the baseline? Are machine learning performance metrics defined for both training and scoring? Is the model benchmarked? Can ground truth be obtained or inferred in production? Has the data distribution of training, testing and validation sets been analyzed? Have goals and hard limits for performance, speed of prediction and costs been established so they can be considered if trade-offs need to be made? How will the model be integrated into other systems, and what impact will it have? How will incoming data quality be monitored? How will drift in data characteristics be monitored? How will performance be monitored? Have any ethical concerns been taken into account? Please note that there might be scenarios where it is not possible to check all the items on this checklist. However, it is advised to go through all items and make informed decisions based on your specific use case.","title":"Checklist"},{"location":"machine-learning/ml-model-checklist/#will-your-model-performance-be-different-in-production-than-during-the-training-phase","text":"Once deployed into production, the model might be performing much worse than expected. This poor performance could be a result of: The data to be scored in production is significantly different from the train and test datasets The feature engineering steps are different or inconsistent in production compared to the training process The performance measure is not consistent (for example your test set covers several months of data where the performance metric for production has been calculated for one month of data)","title":"Will Your Model Performance be Different in Production than During the Training Phase"},{"location":"machine-learning/ml-model-checklist/#is-there-a-well-defined-baseline-is-the-model-performing-better-than-the-baseline","text":"A good way to think of a model baseline is the simplest model one can come up with: either a simple threshold, a random guess or a very basic linear model. This baseline is the reference point your model needs to outperform. A well-defined baseline is different for each problem type and there is no one size fits all approach. 
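As a minimal, illustrative sketch (synthetic data, not part of the original checklist), the snippet below shows one way to establish such a reference point with scikit-learn's DummyClassifier next to a very basic linear model; a candidate model is only worth taking forward if it clearly beats both.

```python
# Establish simple baselines before comparing a more complex candidate model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline 1: always predict the most common class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Baseline 2: a very basic linear model.
linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("majority-class baseline:", accuracy_score(y_test, majority.predict(X_test)))
print("logistic regression baseline:", accuracy_score(y_test, linear.predict(X_test)))
```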
As an example, let's consider some common types of machine learning problems: Classification : Predicting between a positive and a negative class. Either the class with the most observations or a simple logistic regression model can be the baseline. Regression : Predicting the house prices in a city. The average house price for the last year or last month, a simple linear regression model, or the previous median house price in a neighborhood could be the baseline. Image classification : Building an image classifier to distinguish between cats and no cats in an image. If your classes are unbalanced: 70% cats and 30% no cats and if you always predict cats, your naive classifier has 70% accuracy and this can be your baseline. If your classes are balanced: 52% cats and 48% no cats, then a simple convolutional architecture can be the baseline (1 conv layer + 1 max pooling + 1 dense). Additionally, human accuracy at labelling can also be the baseline in an image classification scenario. Some questions to ask when comparing to a baseline: How does your model compare to a random guess? How does your model performance compare to applying a simple threshold? How does your model compare with always predicting the most common value? Note : In some cases, human parity might be too ambitious as a baseline, but this should be decided on a case by case basis. Human accuracy is one of the available options, but not the only one. Resources: \"How To Get Baseline Results And Why They Matter\" article \"Always start with a stupid model, no exceptions.\" article","title":"Is there a Well-Defined Baseline? Is the Model Performing Better than the Baseline?"},{"location":"machine-learning/ml-model-checklist/#are-machine-learning-performance-metrics-defined-for-both-training-and-scoring","text":"The methodology of translating the training metrics to scoring metrics should be well-defined and understood. Depending on the data type and model, the model metrics calculation might differ in production and in training. For example, the training procedure calculated metrics for a long period of time (a year, a decade) with different seasonal characteristics while the scoring procedure will calculate the metrics per a restricted time interval (for example a week, a month, a quarter). Well-defined ML performance metrics are essential in production so that a decrease or increase in model performance can be accurately detected. Things to consider: In forecasting, if you change the period of assessing the performance, from one month to a year for example, then you might get a different result. For example, if your model is predicting sales of a product per day and the RMSE (Root Mean Squared Error) is very low for the first month the model is in production. As the model is live for longer, the RMSE is increasing, becoming 10x the RMSE for the first year compared to the first month. In a classification scenario, the overall accuracy is good, but the model is performing poorly for some subgroups. For example, a classifier has an accuracy of 80% overall, but only 55% for the 20-30 age group. If this is a significant age group for the production data, then your accuracy might suffer greatly when in production. In scene classification scenario, the model is trying to identify a specific scene in a video, and the model has been trained and tested (80-20 split) on 50000 segments where half are segments containing the scene and half of the segments do not contain the scene. 
The accuracy on the training set is 85% and 84% on the test set. However, when an entire video is scored, scores are obtained on all segments, and we expect few segments to contain the scene. The accuracy for an entire video is not comparable with the training/test set procedure in this case, hence different metrics should be considered. If sampling techniques (over-sampling, under-sampling) are used to train model when classes are imbalanced, ensure the metrics used during training are comparable with the ones used in scoring. If the number of samples used for training and testing is small, the performance metrics might change significantly as new data is scored.","title":"Are Machine Learning Performance Metrics Defined for Both Training and Scoring?"},{"location":"machine-learning/ml-model-checklist/#is-the-model-benchmarked","text":"The trained model to be put into production is well benchmarked if machine learning performance metrics (such as accuracy, recall, RMSE or whatever is appropriate) are measured on the train and test set. Furthermore, the train and test set split should be well documented and reproducible.","title":"Is the Model Benchmarked?"},{"location":"machine-learning/ml-model-checklist/#can-ground-truth-be-obtained-or-inferred-in-production","text":"Without a reliable ground truth, the machine learning metrics cannot be calculated. It is important to identify if the ground truth can be obtained as the model is scoring new data by either manual or automatic means. If the ground truth cannot be obtained systematically, other proxies and methodology should be investigated in order to obtain some measure of model performance. One option is to use humans to manually label samples. One important aspect of human labelling is to take into account the human accuracy. If there are two different individuals labelling an image, the labels will likely be different for some samples. It is important to understand how the labels were obtained to assess the reliability of the ground truth (that is why we talk about human accuracy). For clarity, let's consider the following examples (by no means an exhaustive list): Forecasting : Forecasting scenarios are an example of machine learning problems where the ground truth could be obtained in most cases even though a delay might occur. For example, for a model predicting the sales of ice cream in a local shop, the ground truth will be obtained as the sales are happening, but it might appear in the system at a later time than as the model prediction. Recommender systems : For recommender system, obtaining the ground truth is a complex problem in most cases as there is no way of identifying the ideal recommendation. For a retail website for example, click/not click, buy/not buy or other user interaction with recommendation can be used as ground truth proxies. Object detection in images : For an object detection model, as new images are scored, there are no new labels being generated automatically. One option to obtain the ground truth for the new images is to use people to manually label the images. Human labelling is costly, time-consuming and not 100% accurate, so in most cases, only a subset of images can be labelled. 
These samples can be chosen at random or by using active learning techniques of selecting the most informative unlabeled samples.","title":"Can Ground Truth be Obtained or Inferred in Production?"},{"location":"machine-learning/ml-model-checklist/#has-the-data-distribution-of-training-testing-and-validation-sets-been-analyzed","text":"The data distribution of your training, test and validation (if applicable) dataset (including labels) should be analyzed to ensure they all come from the same distribution. If this is not the case, some options to consider are: re-shuffling, re-sampling, modifying the data, more samples need to be gathered or features removed from the dataset. Significant differences in the data distributions of the different datasets can greatly impact the performance of the model. Some potential questions to ask: How much does the training and test data represent the end result? Is the distribution of each individual feature consistent across all your datasets? (i.e. same representation of age groups, gender, race etc.) Is there any data lineage information? Where did the data come from? How was the data collected? Can collection and labelling be automated? Resources: \"Splitting into train, dev and test\" tutorial","title":"Has the Data Distribution of Training, Testing and Validation Sets Been Analyzed?"},{"location":"machine-learning/ml-model-checklist/#have-goals-and-hard-limits-for-performance-speed-of-prediction-and-costs-been-established-so-they-can-be-considered-if-trade-offs-need-to-be-made","text":"Some machine learning models achieve high ML performance, but they are costly and time-consuming to run. In those cases, a less performant and cheaper model could be preferred. Hence, it is important to calculate the model performance metrics (accuracy, precision, recall, RMSE etc), but also to gather data on how expensive it will be to run the model and how long it will take to run. Once this data is gathered, an informed decision should be made on what model to productionize. System metrics to consider: CPU/GPU/memory usage Cost per prediction Time taken to make a prediction","title":"Have Goals and Hard Limits for Performance, Speed of Prediction and Costs been Established, so they can be Considered if Trade-Offs Need to be Made?"},{"location":"machine-learning/ml-model-checklist/#how-will-the-model-be-integrated-into-other-systems-and-what-impact-will-it-have","text":"Machine Learning models do not exist in isolation, but rather they are part of a much larger system. These systems could be old, proprietary systems or new systems being developed as a results of the creation a new machine learning model. In both of those cases, it is important to understand where the actual model is going to fit in, what output is expected from the model and how that output is going to be used by the larger system. Additionally, it is essential to decide if the model will be used for batch and/or real-time inference as production paths might differ. Possible questions to assess model impact: Is there a human in the loop? How is feedback collected through the system? (for example how do we know if a prediction is wrong) Is there a fallback mechanism when things go wrong? Is the system transparent that there is a model making a prediction and what data is used to make this prediction? 
What is the cost of a wrong prediction?","title":"How Will the Model be Integrated into Other Systems, and what Impact will it Have?"},{"location":"machine-learning/ml-model-checklist/#how-will-incoming-data-quality-be-monitored","text":"As data systems become increasingly complex in the mainstream, it is especially vital to employ data quality monitoring, alerting and rectification protocols. Following data validation best practices can prevent insidious issues from creeping into machine learning models that, at best, reduce the usefulness of the model, and at worst, introduce harm. Data validation, reduces the risk of data downtime (increasing headroom) and technical debt and supports long-term success of machine learning models and other applications that rely on the data. Data validation best practices include: Employing automated data quality testing processes at each stage of the data pipeline Re-routing data that fails quality tests to a separate data store for diagnosis and resolution Employing end-to-end data observability on data freshness, distribution, volume, schema and lineage Note that data validation is distinct from data drift detection. Data validation detects errors in the data (ex. a datum is outside of the expected range), while data drift detection uncovers legitimate changes in the data that are truly representative of the phenomenon being modeled (ex. user preferences change). Data validation issues should trigger re-routing and rectification, while data drift should trigger adaptation or retraining of a model. Resources: \"Data Quality Fundamentals\" by Moses et al.","title":"How Will Incoming Data Quality be Monitored?"},{"location":"machine-learning/ml-model-checklist/#how-will-drift-in-data-characteristics-be-monitored","text":"Data drift detection uncovers legitimate changes in incoming data that are truly representative of the phenomenon being modeled,and are not erroneous (ex. user preferences change). It is imperative to understand if the new data in production will be significantly different from the data in the training phase. It is also important to check that the data distribution information can be obtained for any of the new data coming in. Drift monitoring can inform when changes are occurring and what their characteristics are (ex. abrupt vs gradual) and guide effective adaptation or retraining strategies to maintain performance. Possible questions to ask: What are some examples of drift, or deviation from the norm, that have been experience in the past or that might be expected? Is there a drift detection strategy in place? Does it align with expected types of changes? Are there warnings when anomalies in input data are occurring? Is there an adaptation strategy in place? Does it align with expected types of changes? Resources: \"Learning Under Concept Drift: A Review\" by Lu at al. Understanding dataset shift","title":"How Will Drift in Data Characteristics be Monitored?"},{"location":"machine-learning/ml-model-checklist/#how-will-performance-be-monitored","text":"It is important to define how the model will be monitored when it is in production and how that data is going to be used to make decisions. For example, when will a model need retraining as the performance has degraded and how to identify what are the underlying causes of this degradation could be part of this monitoring methodology. Ideally, model monitoring should be done automatically. 
However, if this is not possible, then there should be a manual periodical check of the model performance. Model monitoring should lead to: Ability to identify changes in model performance Warnings when anomalies in model output are occurring Retraining decisions and adaptation strategy","title":"How Will Performance be Monitored?"},{"location":"machine-learning/ml-model-checklist/#have-any-ethical-concerns-been-taken-into-account","text":"Every ML project goes through the Responsible AI process to ensure that it upholds Microsoft's 6 Responsible AI principles .","title":"Have any Ethical Concerns Been Taken into Account?"},{"location":"machine-learning/model-experimentation/","text":"Model Experimentation Overview Machine learning model experimentation involves uncertainty around the expected model results and future operationalization. To handle this uncertainty as much as possible, we propose a semi-structured process, balancing between engineering/research best practices and rapid model/data exploration. Model Experimentation Goals Performance : Find the best performing solution Operationalization : Keep an eye towards production, making sure that operationalization is feasible Code quality Maintain code and artifacts quality Reproducibility : Keep research active by allowing experiment tracking and reproducibility Collaboration : Foster the collaboration and joint work of multiple people on the team Model Experimentation Challenges Trial and error process : Difficult to plan and estimate durations and capacity. Quick and dirty : We want to fail fast and get a sense of what\u2019s working efficiently. Collaboration : How do we form a team-wide trial and error process and effective brainstorming. Code quality : How do we maintain the quality of non-production code during research. Operationalization : Switching between approaches might have a significant impact on operationalization (e.g. GPU/CPU, batch/online, parallel/sequential, runtime environments). Creating an experimentation framework which facilitates rapid experimentation , collaboration , experiment and model reproducibility , evaluation and defined APIs , and lets each team member focus on the model development and improvement, while trusting the framework to do the rest. The following tools and guidelines are aimed at achieving experimentation goals as well as addressing the aforementioned challenges. Tools and Guidelines for Successful Model Experimentation Virtual environments Source control and folder/package structure Experiment tracking Datasets and models abstractions Model evaluation Virtual Environments In languages like Python and R, it is always advised to employ virtual environments. Virtual environments facilitate reproducibility, collaboration and productization. Virtual environments allow us to be consistent across our local dev envs as well as with compute resources. These environments' configuration files can be used to build the code from source in an consistent way. For more details on why we need virtual environments visit this blog post . Which Virtual Environment Framework should I Choose All virtual environments frameworks create isolation, some also propose dependency management and additional features. Decision on which framework to use depends on the complexity of the development environment (dependencies and other required resources) and on the ease of use of the framework. 
Types of Virtual Environments In ISE, we often choose from either venv , Conda or Poetry , depending on the project requirements and complexity. venv is included in Python, is the easiest to use, but lacks more advanced features like dependency management. Conda is a popular package, dependency and environment management framework. It supports multiple stacks (Python, R) and multiple versions of the same environment (e.g. multiple Python versions). Conda maintains its own package repository, therefore some packages might not be downloaded and managed directly through Conda . Poetry is a Python dependency management system which manages dependencies in a standard way using pyproject.toml files and lock files. Similar to Conda , Poetry 's dependency resolution process is sometimes slow (see FAQ ), but in cases where dependency issues are common or tricky, it provides a robust way to create reproducible and stable environments. Expected Outcomes for Virtual Environments Setup Documentation describing how to create the selected virtual environment and how to install dependencies. Environment configuration files if applicable (e.g. requirements.txt for venv , environment.yml for Conda or pyrpoject.toml for Poetry ). Virtual Environments Benefits Productization Collaboration Reproducibility Source Control and Folder or Package Structure Applied ML projects often contain source code, notebooks, devops scripts, documentation, scientific resources, datasets and more. We recommend coming up with an agreed folder structure to keep resources tidy. Consider deciding upon a generic folder structure for projects (e.g. which contains the folders data , src , docs and notebooks ), or adopt popular structures like the CookieCutter Data Science folder structure. Source control should be applied to allow collaboration, versioning, code reviews, traceability and backup. In data science projects, source control should be used for code, and the storing and versioning of other artifacts (e.g. data, scientific literature) should be decided upon depending on the scenario. Folder Structure and Source Control Expected Outcomes Defined folder structure for all users to use, pushed to the repo. .gitignore file determining which folders should be synced with git and which should be kept locally. For example, this one . Determine how notebooks are stored and versioned (e.g. strip output from Jupyter notebooks ) Source Control and Folder Structure Benefits Collaboration Reproducibility Code quality Experiment Tracking Experiment tracking tools allow data scientists and researchers to keep track of previous experiments for better understanding of the experimentation process and for the reproducibility of experiments or models. Types of Experiment Tracking Frameworks Experiment tracking frameworks differ by the set of features they provide for collecting experiment metadata, and comparing and analyzing experiments. In ISE, we mainly use MLFlow on Databricks or Azure ML Experimentation . Note that some experiment tracking frameworks require a deployment, while others are SaaS. Experiment Tracking Outcomes Decide on an experiment tracking framework Ensure it is accessible to all users Document set-up on local environments Define datasets and evaluation in a way which will allow the comparison of all experiments. Consistency across datasets and evaluation is paramount for experiment comparison . Ensure full reproducibility by assuring that all required details are tracked (i.e. 
dataset names and versions, parameters, code, environment) Experiment Tracking Benefits Model performance Reproducibility Collaboration Code quality Datasets and Models Abstractions By creating abstractions to building blocks (e.g., datasets, models, evaluators), we allow the easy introduction of new logic into the experimentation pipeline while keeping the agreed upon experimentation flow intact. These abstractions can be created using different mechanisms. For example, we can use Object-Oriented Programming (OOP) solutions like abstract classes: An example from scikit-learn describing the creation of new estimators compatible with the API . An example from PyTorch on extending the abstract Dataset class . Abstraction Outcomes Different building blocks have defined APIs allowing them to be replaced or extended. Replacing building blocks does not break the original experimentation flow. Mock building blocks are used for unit tests APIs/mocks are shared with the engineering teams for integration with other modules. Abstraction Benefits Collaboration Code quality Reproducibility Operationalization Model performance Model Evaluation When deciding on the evaluation of the ML model/process, consider the following checklist: Evaluation logic is approved by all stakeholders. Relationship between evaluation logic and business KPIs is analyzed and decided. Evaluation flow is applicable for all present and future models (i.e. does not assume some prediction structure or method-specific process). Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis. Evaluation Development Process Outcomes Evaluation strategy is agreed upon all stakeholders Research and discussion on various evaluation methods and metrics is documented. The code holding the logic and data structures for evaluation is reviewed and tested. Documentation on how to apply evaluation is reviewed. Performance metrics are automatically tracked into the experiment tracker. Evaluation Development Process Benefits Model performance Code quality Collaboration Reproducibility","title":"Model Experimentation"},{"location":"machine-learning/model-experimentation/#model-experimentation","text":"","title":"Model Experimentation"},{"location":"machine-learning/model-experimentation/#overview","text":"Machine learning model experimentation involves uncertainty around the expected model results and future operationalization. To handle this uncertainty as much as possible, we propose a semi-structured process, balancing between engineering/research best practices and rapid model/data exploration.","title":"Overview"},{"location":"machine-learning/model-experimentation/#model-experimentation-goals","text":"Performance : Find the best performing solution Operationalization : Keep an eye towards production, making sure that operationalization is feasible Code quality Maintain code and artifacts quality Reproducibility : Keep research active by allowing experiment tracking and reproducibility Collaboration : Foster the collaboration and joint work of multiple people on the team","title":"Model Experimentation Goals"},{"location":"machine-learning/model-experimentation/#model-experimentation-challenges","text":"Trial and error process : Difficult to plan and estimate durations and capacity. Quick and dirty : We want to fail fast and get a sense of what\u2019s working efficiently. Collaboration : How do we form a team-wide trial and error process and effective brainstorming. 
Code quality : How do we maintain the quality of non-production code during research. Operationalization : Switching between approaches might have a significant impact on operationalization (e.g. GPU/CPU, batch/online, parallel/sequential, runtime environments). Creating an experimentation framework which facilitates rapid experimentation , collaboration , experiment and model reproducibility , evaluation and defined APIs , and lets each team member focus on the model development and improvement, while trusting the framework to do the rest. The following tools and guidelines are aimed at achieving experimentation goals as well as addressing the aforementioned challenges.","title":"Model Experimentation Challenges"},{"location":"machine-learning/model-experimentation/#tools-and-guidelines-for-successful-model-experimentation","text":"Virtual environments Source control and folder/package structure Experiment tracking Datasets and models abstractions Model evaluation","title":"Tools and Guidelines for Successful Model Experimentation"},{"location":"machine-learning/model-experimentation/#virtual-environments","text":"In languages like Python and R, it is always advised to employ virtual environments. Virtual environments facilitate reproducibility, collaboration and productization. Virtual environments allow us to be consistent across our local dev envs as well as with compute resources. These environments' configuration files can be used to build the code from source in an consistent way. For more details on why we need virtual environments visit this blog post .","title":"Virtual Environments"},{"location":"machine-learning/model-experimentation/#which-virtual-environment-framework-should-i-choose","text":"All virtual environments frameworks create isolation, some also propose dependency management and additional features. Decision on which framework to use depends on the complexity of the development environment (dependencies and other required resources) and on the ease of use of the framework.","title":"Which Virtual Environment Framework should I Choose"},{"location":"machine-learning/model-experimentation/#types-of-virtual-environments","text":"In ISE, we often choose from either venv , Conda or Poetry , depending on the project requirements and complexity. venv is included in Python, is the easiest to use, but lacks more advanced features like dependency management. Conda is a popular package, dependency and environment management framework. It supports multiple stacks (Python, R) and multiple versions of the same environment (e.g. multiple Python versions). Conda maintains its own package repository, therefore some packages might not be downloaded and managed directly through Conda . Poetry is a Python dependency management system which manages dependencies in a standard way using pyproject.toml files and lock files. Similar to Conda , Poetry 's dependency resolution process is sometimes slow (see FAQ ), but in cases where dependency issues are common or tricky, it provides a robust way to create reproducible and stable environments.","title":"Types of Virtual Environments"},{"location":"machine-learning/model-experimentation/#expected-outcomes-for-virtual-environments-setup","text":"Documentation describing how to create the selected virtual environment and how to install dependencies. Environment configuration files if applicable (e.g. 
requirements.txt for venv , environment.yml for Conda or pyrpoject.toml for Poetry ).","title":"Expected Outcomes for Virtual Environments Setup"},{"location":"machine-learning/model-experimentation/#virtual-environments-benefits","text":"Productization Collaboration Reproducibility","title":"Virtual Environments Benefits"},{"location":"machine-learning/model-experimentation/#source-control-and-folder-or-package-structure","text":"Applied ML projects often contain source code, notebooks, devops scripts, documentation, scientific resources, datasets and more. We recommend coming up with an agreed folder structure to keep resources tidy. Consider deciding upon a generic folder structure for projects (e.g. which contains the folders data , src , docs and notebooks ), or adopt popular structures like the CookieCutter Data Science folder structure. Source control should be applied to allow collaboration, versioning, code reviews, traceability and backup. In data science projects, source control should be used for code, and the storing and versioning of other artifacts (e.g. data, scientific literature) should be decided upon depending on the scenario.","title":"Source Control and Folder or Package Structure"},{"location":"machine-learning/model-experimentation/#folder-structure-and-source-control-expected-outcomes","text":"Defined folder structure for all users to use, pushed to the repo. .gitignore file determining which folders should be synced with git and which should be kept locally. For example, this one . Determine how notebooks are stored and versioned (e.g. strip output from Jupyter notebooks )","title":"Folder Structure and Source Control Expected Outcomes"},{"location":"machine-learning/model-experimentation/#source-control-and-folder-structure-benefits","text":"Collaboration Reproducibility Code quality","title":"Source Control and Folder Structure Benefits"},{"location":"machine-learning/model-experimentation/#experiment-tracking","text":"Experiment tracking tools allow data scientists and researchers to keep track of previous experiments for better understanding of the experimentation process and for the reproducibility of experiments or models.","title":"Experiment Tracking"},{"location":"machine-learning/model-experimentation/#types-of-experiment-tracking-frameworks","text":"Experiment tracking frameworks differ by the set of features they provide for collecting experiment metadata, and comparing and analyzing experiments. In ISE, we mainly use MLFlow on Databricks or Azure ML Experimentation . Note that some experiment tracking frameworks require a deployment, while others are SaaS.","title":"Types of Experiment Tracking Frameworks"},{"location":"machine-learning/model-experimentation/#experiment-tracking-outcomes","text":"Decide on an experiment tracking framework Ensure it is accessible to all users Document set-up on local environments Define datasets and evaluation in a way which will allow the comparison of all experiments. Consistency across datasets and evaluation is paramount for experiment comparison . Ensure full reproducibility by assuring that all required details are tracked (i.e. 
dataset names and versions, parameters, code, environment)","title":"Experiment Tracking Outcomes"},{"location":"machine-learning/model-experimentation/#experiment-tracking-benefits","text":"Model performance Reproducibility Collaboration Code quality","title":"Experiment Tracking Benefits"},{"location":"machine-learning/model-experimentation/#datasets-and-models-abstractions","text":"By creating abstractions to building blocks (e.g., datasets, models, evaluators), we allow the easy introduction of new logic into the experimentation pipeline while keeping the agreed upon experimentation flow intact. These abstractions can be created using different mechanisms. For example, we can use Object-Oriented Programming (OOP) solutions like abstract classes: An example from scikit-learn describing the creation of new estimators compatible with the API . An example from PyTorch on extending the abstract Dataset class .","title":"Datasets and Models Abstractions"},{"location":"machine-learning/model-experimentation/#abstraction-outcomes","text":"Different building blocks have defined APIs allowing them to be replaced or extended. Replacing building blocks does not break the original experimentation flow. Mock building blocks are used for unit tests APIs/mocks are shared with the engineering teams for integration with other modules.","title":"Abstraction Outcomes"},{"location":"machine-learning/model-experimentation/#abstraction-benefits","text":"Collaboration Code quality Reproducibility Operationalization Model performance","title":"Abstraction Benefits"},{"location":"machine-learning/model-experimentation/#model-evaluation","text":"When deciding on the evaluation of the ML model/process, consider the following checklist: Evaluation logic is approved by all stakeholders. Relationship between evaluation logic and business KPIs is analyzed and decided. Evaluation flow is applicable for all present and future models (i.e. does not assume some prediction structure or method-specific process). Evaluation code is unit-tested and reviewed by all team members. Evaluation flow facilitates further results and error analysis.","title":"Model Evaluation"},{"location":"machine-learning/model-experimentation/#evaluation-development-process-outcomes","text":"Evaluation strategy is agreed upon all stakeholders Research and discussion on various evaluation methods and metrics is documented. The code holding the logic and data structures for evaluation is reviewed and tested. Documentation on how to apply evaluation is reviewed. Performance metrics are automatically tracked into the experiment tracker.","title":"Evaluation Development Process Outcomes"},{"location":"machine-learning/model-experimentation/#evaluation-development-process-benefits","text":"Model performance Code quality Collaboration Reproducibility","title":"Evaluation Development Process Benefits"},{"location":"machine-learning/profiling-ml-and-mlops-code/","text":"Profiling Machine Learning and MLOps Code Data Science projects, especially the ones that involve Deep Learning techniques, usually are resource intensive. One model training iteration might be multiple hours long. Although large data volumes processing genuinely takes time, minor bugs and suboptimal implementation of some functional pieces might cause extra resources consumption. Profiling can be used to identify performance bottlenecks and see which functions are the costliest in the application code. 
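Tying back to the experiment tracking outcomes above, here is a minimal, hypothetical sketch of logging run details with MLflow, one of the frameworks named earlier; the experiment name, parameters and metric values are purely illustrative, not prescriptive.

import mlflow

# Group runs under a named experiment so they can be compared later
mlflow.set_experiment('price-prediction-baseline')

with mlflow.start_run():
    # Track everything needed to reproduce the run: data version, parameters, environment details
    mlflow.log_param('dataset_version', '2024-06-01')
    mlflow.log_param('model_type', 'random_forest')
    mlflow.log_param('n_estimators', 100)

    # ... training and evaluation code goes here ...

    # Performance metrics are tracked automatically into the experiment tracker
    mlflow.log_metric('rmse', 0.42)
    mlflow.log_metric('r2', 0.87)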
Based on the outputs of the profiler, one can focus on the largest and easiest-to-resolve inefficiencies and therefore achieve better code performance. Although profiling follows the same principles as any other software project, the purpose of this document is to provide profiling samples for the most common scenarios in MLOps/Data Science projects. Below are some common scenarios in MLOps/Data Science projects, along with suggestions on how to profile them. Generic Python profiling PyTorch model training profiling Azure Machine Learning pipeline profiling Generic Python Profiling Usually an MLOps/Data Science solution contains plain Python code serving different purposes (e.g. data processing) along with specialized model training code. Although many Machine Learning frameworks provide their own profiler, sometimes it is also useful to profile the whole solution. There are two types of profilers: deterministic (all events are tracked, e.g. cProfile ) and statistical (sampling with regular intervals, e.g., py-spy ). The sample below shows an example of a deterministic profiler. There are many options for generic deterministic Python code profiling. One of the default options is the built-in cProfile profiler. Using cProfile one can easily profile either a Python script or just a chunk of code. This profiling tool produces a file that can be either visualized using open source tools or analyzed using the pstats.Stats class. The latter option requires setting up filtering and sorting parameters for a better analysis experience. Below you can find an example of using cProfile to profile a chunk of code. import cProfile # Start profiling profiler = cProfile . Profile () profiler . enable () # -- YOUR CODE GOES HERE --- # Stop profiling profiler . disable () # Write profiler results to a stats file that can be visualized or analyzed later profiler . dump_stats ( \"profiler_results.prof\" ) You can also run cProfile outside of the Python script using the following command: python -m cProfile [ -o output_file ] [ -s sort_order ] ( -m module | myscript.py ) Note: one epoch of model training is usually enough for profiling. There's no need to run more epochs and produce additional cost. Refer to The Python Profilers for further details. PyTorch Model Training Profiling PyTorch 1.8 includes an updated PyTorch profiler that is supplied together with the PyTorch distribution and doesn't require any additional installation. Using the PyTorch profiler one can record CPU-side operations as well as CUDA kernel launches on the GPU side. The profiler can visualize analysis results using the TensorBoard plugin as well as provide suggestions on bottlenecks and potential code improvements. with torch . profiler . profile ( # Limit number of training steps included in profiling schedule = torch . profiler . schedule ( wait = 1 , warmup = 1 , active = 3 , repeat = 2 ), # Automatically saves profiling results to disk on_trace_ready = torch . profiler . tensorboard_trace_handler , with_stack = True ) as profiler : for step , data in enumerate ( trainloader , 0 ): # -- TRAINING STEP CODE GOES HERE --- profiler . step () The tensorboard_trace_handler can be used to generate result files for TensorBoard. Those can be visualized by installing the TensorBoard plugin ( torch_tb_profiler ) and running TensorBoard on your log directory. pip install torch_tb_profiler tensorboard --logdir = <LOG_DIR_PATH> # Navigate to `http://localhost:6006/#pytorch_profiler` Note: make sure to provide the right parameters to the torch.profiler.schedule . 
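A hedged reading of those schedule parameters, matching the values used in the snippet above:

import torch

# Each profiling cycle skips `wait` steps, traces `warmup` steps without recording,
# then records `active` steps; the whole cycle is repeated `repeat` times.
schedule = torch.profiler.schedule(
    wait = 1,     # steps skipped entirely at the start of each cycle
    warmup = 1,   # steps traced but discarded while tracing overhead stabilizes
    active = 3,   # steps actually recorded in the profile
    repeat = 2    # number of wait/warmup/active cycles to run
)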
Usually you would need several steps of training to be profiled rather than the whole epoch. More information on PyTorch profiler : PyTorch Profiler Recipe Introducing PyTorch Profiler - the new and improved performance tool Azure Machine Learning Pipeline Profiling In our projects we often use Azure Machine Learning pipelines to train Machine Learning models. Most of the profilers can also be used in conjunction with Azure Machine Learning. For a profiler to be used with Azure Machine Learning, it should meet the following criteria: Turning the profiler on/off can be achieved by passing a parameter to the script ran by Azure Machine Learning The profiler produces a file as an output In general, a recipe for using profilers with Azure Machine Learning is the following: (Optional) If you're using profiling with an Azure Machine Learning pipeline, you might want to add --profile Boolean flag as a pipeline parameter Use one of the profilers described above or any other profiler that can produce a file as an output Inside of your Python script, create step output folder, e.g.: output_dir = \"./outputs/profiler_results\" os . makedirs ( output_dir , exist_ok = True ) Run your training pipeline Once the pipeline is completed, navigate to Azure ML portal and open details of the step that contains training code. The results can be found in the Outputs+logs tab, under outputs/profiler_results folder. You might want to download the results and visualize it locally. Note: it's not recommended to run profilers simultaneously. Profiles also consume resources, therefore a simultaneous run might significantly affect the results.","title":"Profiling Machine Learning and MLOps Code"},{"location":"machine-learning/profiling-ml-and-mlops-code/#profiling-machine-learning-and-mlops-code","text":"Data Science projects, especially the ones that involve Deep Learning techniques, usually are resource intensive. One model training iteration might be multiple hours long. Although large data volumes processing genuinely takes time, minor bugs and suboptimal implementation of some functional pieces might cause extra resources consumption. Profiling can be used to identify performance bottlenecks and see which functions are the costliest in the application code. Based on the outputs of the profiler, one can focus on largest and easiest-to-resolve inefficiencies and therefore achieve better code performance. Although profiling follows the same principles of any other software project, the purpose of this document is to provide profiling samples for the most common scenarios in MLOps/Data Science projects. Below are some common scenarios in MLOps/Data Science projects, along with suggestions on how to profile them. Generic Python profiling PyTorch model training profiling Azure Machine Learning pipeline profiling","title":"Profiling Machine Learning and MLOps Code"},{"location":"machine-learning/profiling-ml-and-mlops-code/#generic-python-profiling","text":"Usually an MLOps/Data Science solution contains plain Python code serving different purposes (e.g. data processing) along with specialized model training code. Although many Machine Learning frameworks provide their own profiler, sometimes it is also useful to profile the whole solution. There are two types of profilers: deterministic (all events are tracked, e.g. cProfile ) and statistical (sampling with regular intervals, e.g., py-spy ). The sample below shows an example of a deterministic profiler. 
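For the statistical flavour mentioned above, a sampling profiler such as py-spy can be attached from the command line without modifying the training code; a minimal sketch, assuming py-spy is installed from PyPI and the script name is illustrative:

pip install py-spy
py-spy record -o profile.svg -- python train.py    # sample a full run and write a flame graph
py-spy top --pid <PID>                              # live view of an already running process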
There are many options of generic deterministic Python code profiling. One of the default options for profiling used to be a built-in cProfile profiler. Using cProfile one can easily profile either a Python script or just a chunk of code. This profiling tool produces a file that can be either visualized using open source tools or analyzed using stats.Stats class. The latter option requires setting up filtering and sorting parameters for better analysis experience. Below you can find an example of using cProfile to profile a chunk of code. import cProfile # Start profiling profiler = cProfile . Profile () profiler . enable () # -- YOUR CODE GOES HERE --- # Stop profiling profiler . disable () # Write profiler results to an html file profiler . dump_stats ( \"profiler_results.prof\" ) You can also run cProfile outside of the Python script using the following command: python -m cProfile [ -o output_file ] [ -s sort_order ] ( -m module | myscript.py ) Note: one epoch of model training is usually enough for profiling. There's no need to run more epochs and produce additional cost. Refer to The Python Profilers for further details.","title":"Generic Python Profiling"},{"location":"machine-learning/profiling-ml-and-mlops-code/#pytorch-model-training-profiling","text":"PyTorch 1.8 includes an updated PyTorch profiler that is supplied together with the PyTorch distribution and doesn't require any additional installation. Using PyTorch profiler one can record CPU side operations as well as CUDA kernel launches on GPU side. The profiler can visualize analysis results using TensorBoard plugin as well as provide suggestions on bottlenecks and potential code improvements. with torch . profiler . profile ( # Limit number of training steps included in profiling schedule = torch . profiler . schedule ( wait = 1 , warmup = 1 , active = 3 , repeat = 2 ), # Automatically saves profiling results to disk on_trace_ready = torch . profiler . tensorboard_trace_handler , with_stack = True ) as profiler : for step , data in enumerate ( trainloader , 0 ): # -- TRAINING STEP CODE GOES HERE --- profiler . step () The tensorboard_trace_handler can be used to generate result files for TensorBoard. Those can be visualized by installing TensorBoard. plugin and running TensorBoard on your log directory. pip install torch_tb_profiler tensorboard --logdir = <LOG_DIR_PATH> # Navigate to `http://localhost:6006/#pytorch_profiler` Note: make sure to provide the right parameters to the torch.profiler.schedule . Usually you would need several steps of training to be profiled rather than the whole epoch. More information on PyTorch profiler : PyTorch Profiler Recipe Introducing PyTorch Profiler - the new and improved performance tool","title":"PyTorch Model Training Profiling"},{"location":"machine-learning/profiling-ml-and-mlops-code/#azure-machine-learning-pipeline-profiling","text":"In our projects we often use Azure Machine Learning pipelines to train Machine Learning models. Most of the profilers can also be used in conjunction with Azure Machine Learning. 
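As a minimal, hypothetical sketch of the recipe described next, wiring an optional profiling flag into a training script and writing the results to the step outputs folder (the flag name and output path follow the example below; everything else is illustrative):

import argparse
import cProfile
import os

parser = argparse.ArgumentParser()
# Optional pipeline parameter used to switch profiling on or off
parser.add_argument('--profile', action='store_true')
args = parser.parse_args()

profiler = cProfile.Profile()
if args.profile:
    profiler.enable()

# -- TRAINING CODE GOES HERE --

if args.profile:
    profiler.disable()
    # Azure ML uploads the ./outputs folder, so the results appear in the Outputs+logs tab
    output_dir = './outputs/profiler_results'
    os.makedirs(output_dir, exist_ok=True)
    profiler.dump_stats(os.path.join(output_dir, 'profiler_results.prof'))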
For a profiler to be used with Azure Machine Learning, it should meet the following criteria: Turning the profiler on/off can be achieved by passing a parameter to the script run by Azure Machine Learning The profiler produces a file as an output In general, a recipe for using profilers with Azure Machine Learning is the following: (Optional) If you're using profiling with an Azure Machine Learning pipeline, you might want to add a --profile Boolean flag as a pipeline parameter Use one of the profilers described above or any other profiler that can produce a file as an output Inside your Python script, create a step output folder, e.g.: output_dir = \"./outputs/profiler_results\" os . makedirs ( output_dir , exist_ok = True ) Run your training pipeline Once the pipeline is completed, navigate to the Azure ML portal and open the details of the step that contains the training code. The results can be found in the Outputs+logs tab, under the outputs/profiler_results folder. You might want to download the results and visualize them locally. Note: it's not recommended to run several profilers simultaneously. Profilers also consume resources, therefore a simultaneous run might significantly affect the results.","title":"Azure Machine Learning Pipeline Profiling"},{"location":"machine-learning/proposed-ml-process/","text":"Proposed ML Process Introduction The objective of this document is to provide guidance to produce machine learning (ML) applications that are based on code, data and models that can be reproduced and reliably released to production environments. When developing ML applications, we consider the following approaches: Best practices in ML engineering: The ML application development should use engineering fundamentals to ensure high quality software deliverables. The ML application should be reliably released into production, leveraging automation as much as possible. The ML application can be deployed into production at any time. This makes the decision about when to release it a business decision rather than a technical one. Best practices in ML research: All artifacts, specifically data, code and ML models, should be versioned and managed using standard tools and workflows, in order to facilitate continuous research and development. While the model outputs can be non-deterministic and hard to reproduce, the process of releasing ML software into production should be reproducible. Responsible AI aspects are carefully analyzed and addressed. Cross-functional team: A cross-functional team consisting of different skill sets in data science, data engineering, development, operations, and industry domain specialists is required. ML process The proposed ML development process consists of: Data and problem understanding Responsible AI assessment Feasibility study Baseline model experimentation Model evaluation and experimentation Model operationalization * Unit and Integration testing * Deployment * Monitoring and Observability Version Control During all stages of the process, it is suggested that artifacts should be version-controlled . Typically, the process is iterative and versioned artifacts can assist in traceability and reviewing. Understanding the Problem Define the business problem for the ML project: Agree on the success criteria with the customer. Identify potential data sources and determine the availability of these sources. Define performance evaluation metrics on ground truth data. Conduct a Responsible AI assessment to ensure development and deployment of the ML solution in a responsible manner. 
Conduct a feasibility study to assess whether the business problem is feasible to solve satisfactorily using ML with the available data. The objective of the feasibility study is to mitigate potential over-investment by ensuring sufficient evidence that ML is possible and would be the best solution. The study also provides initial indications of what the ML solution should look like. This ensures quality solutions supported by thorough consideration and evidence. Refer to feasibility study . Exploratory data analysis is performed and discussed with the team Typical output : Data exploration source code (Jupyter notebooks/scripts) and slides/docs Initial ML model code (Jupyter notebook or scripts) Initial solution architecture with initial data engineering requirements Data dictionary (if not yet available) List of assumptions Baseline Model Experimentation Data preparation: creating data source connectors, determining storage services to be used and potential versioning of raw datasets. Feature engineering: create new features from raw source data to increase the predictive power of the learning algorithm. The features should capture additional information that is not apparent in the original feature set. Split data into training, validation and test sets: creating training, validation, and test datasets with ground truth to develop ML models. This would entail joining or merging various feature engineered datasets. The training dataset is used to train the model to find the patterns between its features and labels (ground truth). The validation dataset is used to assess the model architecture, and the test data is used to confirm the prediction quality of the model. Initial code to create access data sources, transform raw data into features and model training as well as scoring. During this phase, experiment code (Jupyter notebooks or scripts) and accompanying utility code should be version-controlled using tools such as ADO (Azure DevOps). Typical output : Rough Jupyter notebooks or scripts in Python or R, initial results from baseline model. For more information on experimentation, refer to the experimentation section. Model Evaluation Compare the effectiveness of different algorithms on the given problem. Typical output : Evaluation flow is fully set up . Reproducible experiments for the different approaches experimented with. Model Operationalization Taking \"experimental\" code and preparing it, so it can be deployed. This includes data pre-processing, featurization code, training model code (if required to be trained using CI/CD) and model inference code. Typical output : Production-grade code (Preferably in the form of a package) for: Data preprocessing / post processing Serving a model Training a model CI/CD scripts. Reproducibility steps for the model in production. See more in the ML model checklist . Unit and Integration Testing Ensuring that production code behaves in the way we expect it to, and that its results match those we saw during the Model Evaluation and Experimentation phases. Refer to ML testing post for further details. Typical output : Test suite with unit and end-to-end tests is created and completes successfully. Deployment Responsible AI considerations such as bias and fairness analysis. Additionally, explainability/interpretability of the model should also be considered. It is recommended for a human-in-the-loop to verify the model and manually approve deployment to production. 
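Returning to the train, validation and test split mentioned under baseline model experimentation above, a minimal sketch using scikit-learn; the column names, ratios and random_state are illustrative:

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({'area': [25, 40, 60, 80, 100, 120],
                     'price': [2500, 3900, 5800, 7700, 9500, 11400]})

# First hold out a test set, then split the remainder into train and validation (roughly 60/20/20 overall)
train_val, test = train_test_split(data, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)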
Getting the model into production where it can start adding value by serving predictions. Typical artifacts are APIs for accessing the model and integrating the model to the solution architecture. Additionally, certain scenarios may require training the model periodically in production. Reproducibility steps of the production model are available. Typical output : model readiness checklist is completed. Monitoring and Observability This is the final phase, where we ensure our model is doing what we expect it to in production. Read more about ML observability . Read more about Azure ML's offerings around ML models production monitoring . It is recommended to consider incorporating data drift monitoring process in the production solution. This will assist in detecting potential changes in new datasets presented for inference that may significantly impact model performance. For more info on detecting data drift with Azure ML see the Microsoft docs article on how to monitor datasets . Typical output : Logging and monitoring scripts and tools set up, permissions for users to access monitoring tools.","title":"Proposed ML Process"},{"location":"machine-learning/proposed-ml-process/#proposed-ml-process","text":"","title":"Proposed ML Process"},{"location":"machine-learning/proposed-ml-process/#introduction","text":"The objective of this document is to provide guidance to produce machine learning (ML) applications that are based on code, data and models that can be reproduced and reliably released to production environments. When developing ML applications, we consider the following approaches: Best practices in ML engineering: The ML application development should use engineering fundamentals to ensure high quality software deliverables. The ML application should be reliability released into production, leveraging automation as much as possible. The ML application can be deployed into production at any time. This makes the decision about when to release it a business decision rather than a technical one. Best practices in ML research: All artifacts, specifically data, code and ML models, should be versioned and managed using standard tools and workflows, in order to facilitate continuous research and development. While the model outputs can be non-deterministic and hard to reproduce, the process of releasing ML software into production should be reproducible. Responsible AI aspects are carefully analyzed and addressed. Cross-functional team: A cross-functional team consisting of different skill sets in data science, data engineering, development, operations, and industry domain specialists is required.","title":"Introduction"},{"location":"machine-learning/proposed-ml-process/#ml-process","text":"The proposed ML development process consists of: Data and problem understanding Responsible AI assessment Feasibility study Baseline model experimentation Model evaluation and experimentation Model operationalization * Unit and Integration testing * Deployment * Monitoring and Observability","title":"ML process"},{"location":"machine-learning/proposed-ml-process/#version-control","text":"During all stages of the process, it is suggested that artifacts should be version-controlled . Typically, the process is iterative and versioned artifacts can assist in traceability and reviewing.","title":"Version Control"},{"location":"machine-learning/proposed-ml-process/#understanding-the-problem","text":"Define the business problem for the ML project: Agree on the success criteria with the customer. 
Identify potential data sources and determine the availability of these sources. Define performance evaluation metrics on ground truth data Conduct a Responsible AI assessment to ensure development and deployment of the ML solution in a responsible manner. Conduct a feasibility study to assess whether the business problem is feasible to solve satisfactorily using ML with the available data. The objective of the feasibility study is to mitigate potential over-investment by ensuring sufficient evidence that ML is possible and would be the best solution. The study also provides initial indications of what the ML solution should look like. This ensures quality solutions supported by thorough consideration and evidence. Refer to feasibility study . Exploratory data analysis is performed and discussed with the team Typical output : Data exploration source code (Jupyter notebooks/scripts) and slides/docs Initial ML model code (Jupyter notebook or scripts) Initial solution architecture with initial data engineering requirements Data dictionary (if not yet available) List of assumptions","title":"Understanding the Problem"},{"location":"machine-learning/proposed-ml-process/#baseline-model-experimentation","text":"Data preparation: creating data source connectors, determining storage services to be used and potential versioning of raw datasets. Feature engineering: create new features from raw source data to increase the predictive power of the learning algorithm. The features should capture additional information that is not apparent in the original feature set. Split data into training, validation and test sets: creating training, validation, and test datasets with ground truth to develop ML models. This would entail joining or merging various feature engineered datasets. The training dataset is used to train the model to find the patterns between its features and labels (ground truth). The validation dataset is used to assess the model architecture, and the test data is used to confirm the prediction quality of the model. Initial code to create access data sources, transform raw data into features and model training as well as scoring. During this phase, experiment code (Jupyter notebooks or scripts) and accompanying utility code should be version-controlled using tools such as ADO (Azure DevOps). Typical output : Rough Jupyter notebooks or scripts in Python or R, initial results from baseline model. For more information on experimentation, refer to the experimentation section.","title":"Baseline Model Experimentation"},{"location":"machine-learning/proposed-ml-process/#model-evaluation","text":"Compare the effectiveness of different algorithms on the given problem. Typical output : Evaluation flow is fully set up . Reproducible experiments for the different approaches experimented with.","title":"Model Evaluation"},{"location":"machine-learning/proposed-ml-process/#model-operationalization","text":"Taking \"experimental\" code and preparing it, so it can be deployed. This includes data pre-processing, featurization code, training model code (if required to be trained using CI/CD) and model inference code. Typical output : Production-grade code (Preferably in the form of a package) for: Data preprocessing / post processing Serving a model Training a model CI/CD scripts. Reproducibility steps for the model in production. 
See more in the ML model checklist .","title":"Model Operationalization"},{"location":"machine-learning/proposed-ml-process/#unit-and-integration-testing","text":"Ensuring that production code behaves in the way we expect it to, and that its results match those we saw during the Model Evaluation and Experimentation phases. Refer to ML testing post for further details. Typical output : Test suite with unit and end-to-end tests is created and completes successfully.","title":"Unit and Integration Testing"},{"location":"machine-learning/proposed-ml-process/#deployment","text":"Responsible AI considerations such as bias and fairness analysis. Additionally, explainability/interpretability of the model should also be considered. It is recommended for a human-in-the-loop to verify the model and manually approve deployment to production. Getting the model into production where it can start adding value by serving predictions. Typical artifacts are APIs for accessing the model and integrating the model to the solution architecture. Additionally, certain scenarios may require training the model periodically in production. Reproducibility steps of the production model are available. Typical output : model readiness checklist is completed.","title":"Deployment"},{"location":"machine-learning/proposed-ml-process/#monitoring-and-observability","text":"This is the final phase, where we ensure our model is doing what we expect it to in production. Read more about ML observability . Read more about Azure ML's offerings around ML models production monitoring . It is recommended to consider incorporating data drift monitoring process in the production solution. This will assist in detecting potential changes in new datasets presented for inference that may significantly impact model performance. For more info on detecting data drift with Azure ML see the Microsoft docs article on how to monitor datasets . Typical output : Logging and monitoring scripts and tools set up, permissions for users to access monitoring tools.","title":"Monitoring and Observability"},{"location":"machine-learning/responsible-ai/","text":"Responsible AI in ISE Microsoft's Responsible AI principles Every ML project in ISE goes through a Responsible AI (RAI) assessment to ensure that it upholds Microsoft's 6 Responsible AI principles : Fairness Reliability & Safety Privacy & Security Inclusiveness Transparency Accountability Every project goes through the RAI process, whether we are building a new ML model from scratch, or putting an existing model in production. ISE's Responsible AI process The process begins as soon as we start a prospective project. We start to complete a Responsible AI review document, and an impact assessment, which provides a structured way to explore topics such as: Can the problem be addressed with a non-technical (e.g. social) solution? Can the problem be solved without AI? Would simpler technology suffice? Will the team have access to domain experts (e.g. doctors, refugees) in the field where the AI is applicable? Who are the stakeholders in this project? Who does the AI impact? Are there any vulnerable groups affected? What are the possible benefits and harms to each stakeholder? How can the technology be misused, and what can go wrong? Has the team analyzed the input data properly to make sure that the training data is suitable for machine learning? Is the training data an accurate representation of data that will be used as input in production? Is there a good representation of all users? 
Is there a fall-back mechanism (a human in the loop, or a way to revert decisions based on the model)? Does data used by the model for training or scoring contain PII? What measures have been taken to remove sensitive data? Does the model impact consequential decisions, like blocking people from getting jobs, loans, health care etc. or in the cases where it may, have appropriate ethical considerations been discussed? Have measures for re-training been considered? How can we address any concerns that arise, and how can we mitigate risk? At this point we research available tools and resources , such as InterpretML or Fairlearn , that we may use on the project. We may change the project scope or re-define the ML problem definition if necessary. The Responsible AI review documents remain living documents that we re-visit and update throughout project development, through the feasibility study , as the model is developed and prepared for production, and new information unfolds. The documents can be used and expanded once the model is deployed, and monitored in production.","title":"Responsible AI in ISE"},{"location":"machine-learning/responsible-ai/#responsible-ai-in-ise","text":"","title":"Responsible AI in ISE"},{"location":"machine-learning/responsible-ai/#microsofts-responsible-ai-principles","text":"Every ML project in ISE goes through a Responsible AI (RAI) assessment to ensure that it upholds Microsoft's 6 Responsible AI principles : Fairness Reliability & Safety Privacy & Security Inclusiveness Transparency Accountability Every project goes through the RAI process, whether we are building a new ML model from scratch, or putting an existing model in production.","title":"Microsoft's Responsible AI principles"},{"location":"machine-learning/responsible-ai/#ises-responsible-ai-process","text":"The process begins as soon as we start a prospective project. We start to complete a Responsible AI review document, and an impact assessment, which provides a structured way to explore topics such as: Can the problem be addressed with a non-technical (e.g. social) solution? Can the problem be solved without AI? Would simpler technology suffice? Will the team have access to domain experts (e.g. doctors, refugees) in the field where the AI is applicable? Who are the stakeholders in this project? Who does the AI impact? Are there any vulnerable groups affected? What are the possible benefits and harms to each stakeholder? How can the technology be misused, and what can go wrong? Has the team analyzed the input data properly to make sure that the training data is suitable for machine learning? Is the training data an accurate representation of data that will be used as input in production? Is there a good representation of all users? Is there a fall-back mechanism (a human in the loop, or a way to revert decisions based on the model)? Does data used by the model for training or scoring contain PII? What measures have been taken to remove sensitive data? Does the model impact consequential decisions, like blocking people from getting jobs, loans, health care etc. or in the cases where it may, have appropriate ethical considerations been discussed? Have measures for re-training been considered? How can we address any concerns that arise, and how can we mitigate risk? At this point we research available tools and resources , such as InterpretML or Fairlearn , that we may use on the project. We may change the project scope or re-define the ML problem definition if necessary. 
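As one hedged illustration of what using such a tool can look like, a minimal Fairlearn sketch that compares a model's accuracy across a sensitive attribute; all data and column names are hypothetical, and the API shown reflects recent Fairlearn releases:

from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

# Hypothetical ground truth, predictions and sensitive attribute values
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
sex = ['F', 'F', 'F', 'M', 'M', 'M', 'M', 'F']

# Accuracy broken down by group highlights potential fairness gaps
frame = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred, sensitive_features=sex)
print(frame.overall)    # accuracy over the full dataset
print(frame.by_group)   # accuracy per value of the sensitive attribute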
The Responsible AI review documents remain living documents that we re-visit and update throughout project development, through the feasibility study , as the model is developed and prepared for production, and new information unfolds. The documents can be used and expanded once the model is deployed, and monitored in production.","title":"ISE's Responsible AI process"},{"location":"machine-learning/testing-data-science-and-mlops-code/","text":"Testing Data Science and MLOps Code The purpose of this document is to provide samples of tests for the most common operations in MLOps/Data Science projects. Testing the code used for MLOps or data science projects follows the same principles of any other software project. Some scenarios might seem different or more difficult to test. The best way to approach this is to always have a test design session, where the focus is on the input/outputs, exceptions and testing the behavior of data transformations. Designing the tests first makes it easier to test as it forces a more modular style, where each function has one purpose, and extracting common functionality functions and modules. Below are some common operations in MLOps or Data Science projects, along with suggestions on how to test them. Saving and loading data Transforming data Model load or predict Data validation Model testing Saving and Loading Data Reading and writing to csv, reading images or loading audio files are common scenarios encountered in MLOps projects. Example: Verify that a Load Function Calls read_csv if the File Exists utils.py def load_data ( filename : str ) -> pd . DataFrame : if os . path . isfile ( filename ): df = pd . read_csv ( filename , index_col = 'ID' ) return df return None There's no need to test the read_csv function, or the isfile functions, we can leave testing them to the pandas and os developers. The only thing we need to test here is the logic in this function, i.e. that load_data loads the file if the file exists with the right index column, and doesn't load the file if it doesn't exist, and that it returns the expected results. One way to do this would be to provide a sample file and call the function, and verify that the output is None or a DataFrame . This requires separate files to be present, or not present, for the tests to run. This can cause the same test to run on one machine and then fail on a build server which is not a desired behavior. A much better way is to mock calls to isfile , and read_csv . Instead of calling the real function, we will return a predefined return value, or call a stub that doesn't have any side effects. This way no files are needed in the repository to execute the test, and the test will always work the same, independent of what machine it runs on. Note: Below we mock the specific os and pd functions referenced in the utils file, any others are left unaffected and would run as normal. test_utils.py import utils from mock import patch @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_calls_read_csv_if_exists ( mock_isfile , mock_read_csv ): # arrange # always return true for isfile utils . os . path . isfile . return_value = True filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is called with the correct parameters utils . pd . read_csv . assert_called_once_with ( filename , index_col = 'ID' ) Similarly, we can verify that it's called 0 or multiple times. 
In the example below where we verify that it's not called if the file doesn't exist @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_does_not_call_read_csv_if_not_exists ( mock_isfile , mock_read_csv ): # arrange # file doesn't exist utils . os . path . isfile . return_value = False filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is not called assert utils . pd . read_csv . call_count == 0 Example: Using the Same Sample Data for Multiple Tests If more than one test will use the same sample data, fixtures are a good way to reuse this sample data. The sample data can be the contents of a json file, or a csv, or a DataFrame, or even an image. Note: The sample data is still hard coded if possible, and does not need to be large. Only add as much sample data as required for the tests to make the tests readable. Use the fixture to return the sample data, and add this as a parameter to the tests where you want to use the sample data. import pytest @pytest . fixture def house_features_json (): return { 'area' : 25 , 'price' : 2500 , 'rooms' : np . nan } def test_clean_features_cleans_nan_values ( house_features_json ): cleaned_features = clean_features ( house_features_json ) assert cleaned_features [ 'rooms' ] == 0 def test_extract_features_extracts_price_per_area ( house_features_json ): extracted_features = extract_features ( house_features_json ) assert extracted_features [ 'price_per_area' ] == 100 Transforming Data For cleaning and transforming data, test fixed input and output, but try to limit each test to one verification. For example, create one test to verify the output shape of the data. def test_resize_image_generates_the_correct_size (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , 100 , 100 ) # assert resized_image . shape [: 2 ] = ( 100 , 100 ) and one to verify that any padding is made appropriately def test_resize_image_pads_correctly (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # Act resized_image = utils . resize_image ( original_image , 100 , 100 ) # Assert assert resized_image [ 0 ][ 0 ][ 0 ][ 0 ] == 0 assert resized_image [ 0 ][ 0 ][ 2 ][ 0 ] == 1 To test different inputs and expected outputs automatically, use parametrize @pytest . mark . parametrize ( 'orig_height, orig_width, expected_height, expected_width' , [ # smaller than target ( 10 , 10 , 20 , 20 ), # larger than target ( 20 , 20 , 10 , 10 ), # wider than target ( 10 , 20 , 10 , 10 ) ]) def test_resize_image_generates_the_correct_size ( orig_height , orig_width , expected_height , expected_width ): # Arrange original_image = np . ones (( orig_height , orig_width , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , expected_height , expected_width ) # assert resized_image . shape [: 2 ] = ( expected_height , expected_width ) Model Load or Predict When unit testing we should mock model load and model predictions similarly to mocking file access. There may be cases when you want to load your model to do smoke tests, or integration tests. Since these will often take a bit longer to run it's important to be able to separate them from unit tests so that the developers on the team can still run unit tests as part of their test driven development. One way to do this is using marks @pytest . mark . 
longrunning def test_integration_between_two_systems (): # this might take a while Run all tests that are not marked longrunning pytest -v -m \"not longrunning\" Basic Unit Tests for ML Models ML unit tests are not intended to check the accuracy or performance of a model. Unit tests for an ML model is for code quality checks - for example: Does the model accept the correct inputs and produce the correctly shaped outputs? Do the weights of the model update when running fit ? To do this, the ML model tests do not strictly follow best practices of standard Unit tests - not all outside calls are mocked. These tests are much closer to a narrow integration test . However, the benefits of having simple tests for the ML model help to stop a poorly configured model from spending hours in training, while still producing poor results. Examples of how to implement these tests (for Deep Learning models) include: Build a model and compare the shape of input layers to that of an example source of data. Then, compare the output layer shape to the expected output. Initialize the model and record the weights of each layer. Then, run a single epoch of training on a dummy data set, and compare the weights of the \"trained model\" - only check if the values have changed. Train the model on a dummy dataset for a single epoch, and then validate with dummy data - only validate that the prediction is formatted correctly, this model will not be accurate. Data Validation An important part of the unit testing is to include test cases for data validation. For example, no data supplied, images that are not in the expected format, data containing null values or outliers to make sure that the data processing pipeline is robust. Model Testing Apart from unit testing code, we can also test, debug and validate our models in different ways during the training process Some options to consider at this stage: Adversarial and Boundary tests to increase robustness Verifying accuracy for under-represented classes","title":"Testing Data Science and MLOps Code"},{"location":"machine-learning/testing-data-science-and-mlops-code/#testing-data-science-and-mlops-code","text":"The purpose of this document is to provide samples of tests for the most common operations in MLOps/Data Science projects. Testing the code used for MLOps or data science projects follows the same principles of any other software project. Some scenarios might seem different or more difficult to test. The best way to approach this is to always have a test design session, where the focus is on the input/outputs, exceptions and testing the behavior of data transformations. Designing the tests first makes it easier to test as it forces a more modular style, where each function has one purpose, and extracting common functionality functions and modules. Below are some common operations in MLOps or Data Science projects, along with suggestions on how to test them. Saving and loading data Transforming data Model load or predict Data validation Model testing","title":"Testing Data Science and MLOps Code"},{"location":"machine-learning/testing-data-science-and-mlops-code/#saving-and-loading-data","text":"Reading and writing to csv, reading images or loading audio files are common scenarios encountered in MLOps projects.","title":"Saving and Loading Data"},{"location":"machine-learning/testing-data-science-and-mlops-code/#example-verify-that-a-load-function-calls-read_csv-if-the-file-exists","text":"utils.py def load_data ( filename : str ) -> pd . DataFrame : if os . path . 
isfile ( filename ): df = pd . read_csv ( filename , index_col = 'ID' ) return df return None There's no need to test the read_csv function, or the isfile functions, we can leave testing them to the pandas and os developers. The only thing we need to test here is the logic in this function, i.e. that load_data loads the file if the file exists with the right index column, and doesn't load the file if it doesn't exist, and that it returns the expected results. One way to do this would be to provide a sample file and call the function, and verify that the output is None or a DataFrame . This requires separate files to be present, or not present, for the tests to run. This can cause the same test to run on one machine and then fail on a build server which is not a desired behavior. A much better way is to mock calls to isfile , and read_csv . Instead of calling the real function, we will return a predefined return value, or call a stub that doesn't have any side effects. This way no files are needed in the repository to execute the test, and the test will always work the same, independent of what machine it runs on. Note: Below we mock the specific os and pd functions referenced in the utils file, any others are left unaffected and would run as normal. test_utils.py import utils from mock import patch @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_calls_read_csv_if_exists ( mock_isfile , mock_read_csv ): # arrange # always return true for isfile utils . os . path . isfile . return_value = True filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is called with the correct parameters utils . pd . read_csv . assert_called_once_with ( filename , index_col = 'ID' ) Similarly, we can verify that it's called 0 or multiple times. In the example below where we verify that it's not called if the file doesn't exist @patch ( 'utils.os.path.isfile' ) @patch ( 'utils.pd.read_csv' ) def test_load_data_does_not_call_read_csv_if_not_exists ( mock_isfile , mock_read_csv ): # arrange # file doesn't exist utils . os . path . isfile . return_value = False filename = 'file.csv' # act _ = utils . load_data ( filename ) # assert # check that read_csv is not called assert utils . pd . read_csv . call_count == 0","title":"Example: Verify that a Load Function Calls read_csv if the File Exists"},{"location":"machine-learning/testing-data-science-and-mlops-code/#example-using-the-same-sample-data-for-multiple-tests","text":"If more than one test will use the same sample data, fixtures are a good way to reuse this sample data. The sample data can be the contents of a json file, or a csv, or a DataFrame, or even an image. Note: The sample data is still hard coded if possible, and does not need to be large. Only add as much sample data as required for the tests to make the tests readable. Use the fixture to return the sample data, and add this as a parameter to the tests where you want to use the sample data. import pytest @pytest . fixture def house_features_json (): return { 'area' : 25 , 'price' : 2500 , 'rooms' : np . 
nan } def test_clean_features_cleans_nan_values ( house_features_json ): cleaned_features = clean_features ( house_features_json ) assert cleaned_features [ 'rooms' ] == 0 def test_extract_features_extracts_price_per_area ( house_features_json ): extracted_features = extract_features ( house_features_json ) assert extracted_features [ 'price_per_area' ] == 100","title":"Example: Using the Same Sample Data for Multiple Tests"},{"location":"machine-learning/testing-data-science-and-mlops-code/#transforming-data","text":"For cleaning and transforming data, test fixed input and output, but try to limit each test to one verification. For example, create one test to verify the output shape of the data. def test_resize_image_generates_the_correct_size (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , 100 , 100 ) # assert resized_image . shape [: 2 ] = ( 100 , 100 ) and one to verify that any padding is made appropriately def test_resize_image_pads_correctly (): # Arrange original_image = np . ones (( 10 , 5 , 2 , 3 )) # Act resized_image = utils . resize_image ( original_image , 100 , 100 ) # Assert assert resized_image [ 0 ][ 0 ][ 0 ][ 0 ] == 0 assert resized_image [ 0 ][ 0 ][ 2 ][ 0 ] == 1 To test different inputs and expected outputs automatically, use parametrize @pytest . mark . parametrize ( 'orig_height, orig_width, expected_height, expected_width' , [ # smaller than target ( 10 , 10 , 20 , 20 ), # larger than target ( 20 , 20 , 10 , 10 ), # wider than target ( 10 , 20 , 10 , 10 ) ]) def test_resize_image_generates_the_correct_size ( orig_height , orig_width , expected_height , expected_width ): # Arrange original_image = np . ones (( orig_height , orig_width , 2 , 3 )) # act resized_image = utils . resize_image ( original_image , expected_height , expected_width ) # assert resized_image . shape [: 2 ] = ( expected_height , expected_width )","title":"Transforming Data"},{"location":"machine-learning/testing-data-science-and-mlops-code/#model-load-or-predict","text":"When unit testing we should mock model load and model predictions similarly to mocking file access. There may be cases when you want to load your model to do smoke tests, or integration tests. Since these will often take a bit longer to run it's important to be able to separate them from unit tests so that the developers on the team can still run unit tests as part of their test driven development. One way to do this is using marks @pytest . mark . longrunning def test_integration_between_two_systems (): # this might take a while Run all tests that are not marked longrunning pytest -v -m \"not longrunning\"","title":"Model Load or Predict"},{"location":"machine-learning/testing-data-science-and-mlops-code/#basic-unit-tests-for-ml-models","text":"ML unit tests are not intended to check the accuracy or performance of a model. Unit tests for an ML model is for code quality checks - for example: Does the model accept the correct inputs and produce the correctly shaped outputs? Do the weights of the model update when running fit ? To do this, the ML model tests do not strictly follow best practices of standard Unit tests - not all outside calls are mocked. These tests are much closer to a narrow integration test . However, the benefits of having simple tests for the ML model help to stop a poorly configured model from spending hours in training, while still producing poor results. 
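A hedged PyTorch-style sketch of the first two checks listed next, verifying the output shape and that a single optimization step changes the weights; the tiny stand-in model and dummy data are purely illustrative:

import torch
from torch import nn, optim

def test_model_output_shape_and_weights_update():
    model = nn.Linear(4, 2)                 # stand-in for the real model
    x = torch.randn(8, 4)                   # dummy batch with the expected input shape
    y = torch.randint(0, 2, (8,))           # dummy labels

    # Shape check: correct inputs produce correctly shaped outputs
    assert model(x).shape == (8, 2)

    # Weight-update check: one training step on dummy data should change the weights
    weights_before = model.weight.clone()
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    optimizer.step()
    assert not torch.equal(weights_before, model.weight)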
Examples of how to implement these tests (for Deep Learning models) include: Build a model and compare the shape of input layers to that of an example source of data. Then, compare the output layer shape to the expected output. Initialize the model and record the weights of each layer. Then, run a single epoch of training on a dummy data set, and compare the weights of the \"trained model\" - only check if the values have changed. Train the model on a dummy dataset for a single epoch, and then validate with dummy data - only validate that the prediction is formatted correctly, this model will not be accurate.","title":"Basic Unit Tests for ML Models"},{"location":"machine-learning/testing-data-science-and-mlops-code/#data-validation","text":"An important part of the unit testing is to include test cases for data validation. For example, no data supplied, images that are not in the expected format, data containing null values or outliers to make sure that the data processing pipeline is robust.","title":"Data Validation"},{"location":"machine-learning/testing-data-science-and-mlops-code/#model-testing","text":"Apart from unit testing code, we can also test, debug and validate our models in different ways during the training process Some options to consider at this stage: Adversarial and Boundary tests to increase robustness Verifying accuracy for under-represented classes","title":"Model Testing"},{"location":"machine-learning/tpm-considerations-for-ml-projects/","text":"TPM considerations for Machine Learning projects In this document, we explore some of the Program Management considerations for Machine Learning (ML) projects and suggest recommendations for Technical Program Managers (TPM) to effectively work with Data and Applied Machine Learning engineering teams. Determine the Need for Machine Learning in the Project In Artificial Intelligence (AI) projects, the ML component is generally a part of an overall business problem and NOT the problem itself. Determine the overall business problem first and then evaluate if ML can help address a part of the problem space. Few considerations for identifying the right fit for the project: Engage experts in human experience and employ techniques such as Design Thinking and Problem Formulation to understand the customer needs and human behavior first. Identify the right stakeholders from both business and technical leadership and invite them to these workshops. The outcome should be end-user scenarios and personas to determine the real needs of the users. Focus on System Design principles to identify the architectural components, entities, interfaces, constraints. Ask the right questions early and explore design alternatives with the engineering team. Think hard about the costs of ML and whether we are solving a repetitive problem at scale. Many a times, customer problems can be solved with data analytics, dashboards, or rule-based algorithms as the first phase of the project. Set Expectations for High Ambiguity in ML components ML projects can be plagued with a phenomenon we can call as the \" Death by Unknowns \". Unlike software engineering projects, ML focused projects can result in quick success early (aka sudden decrease in error rate), but this may flatten eventually. Few things to consider: Set clear expectations : Identify the performance metrics and discuss on a \"good enough\" prediction rate that will bring value to the business. 
An 80% \"good enough\" rate may save business costs and increase productivity but if going from 80 to 95% would require unimaginable cost and effort. Is it worth it? Can it be a progressive road map? Create a smaller team and undertake a feasibility analysis through techniques like EDA (Exploratory Data Analysis). A feasibility study is much cheaper to evaluate data quality, customer constraints and model feasibility. It allows a TPM to better understand customer use cases and current environment and can act as a fail-fast mechanism. Note that feasibility should be shorter (in weeks) else it misses the point of saving costs. As in any project, there will be new needs (additional data sources, technical constraints, hiring data labelers, business users time etc.). Incorporate Agile techniques to fail fast and minimize cost and schedule surprises. Notebooks != ML Production Notebooks are a great way to kick start Data Analytics and Applied Machine Learning efforts, however for a production releases, additional constraints should be considered: Understand the end-end flow of data management , how data will be made available (ingestion flows), what's the frequency, storage, retention of data. Plan user stories and design spikes around these flows to ensure a robust ML pipeline is developed. Engineering team should follow the same rigor in building ML projects as in any software engineering project. We at ISE (Industry Solutions Engineering) have built a good set of resources from our learnings in our ISE Engineering Playbook . Think about the how the model will be deployed, for example, are there technical constraints due to an edge device, or network constraints that will prevent updating the model. Understanding of the environment is critical, refer to the Model Production Checklist as a reference to determine model deployment choices. ML Focussed projects are not a \"one-shot\" release solution, they need to be nurtured, evolved, and improved over time. Plan for a continuous improvement lifecycle, the initial phases can be model feasibility and validation to get the good enough prediction rate, the later phases can be then be scaling and improving the models through feedback loops and fresh data sets. Garbage Data In -> Garbage Model Out Data quality is a major factor in affecting model performance and production roll-out, consider the following: Conduct a data exploration workshop and generate a report on data quality that includes missing values, duplicates, unlabeled data, expired or not valid data, incomplete data (e.g., only having male representation in a people dataset). Identify data source reliability to ensure data is coming from a production source. (e.g., are the images from a production or industrial camera or taken from an iPhone/Android phone.) Identify data acquisition constraints : Determine how the data is being obtained and the constraints around it. Some example may include legal, contractual, Privacy, regulation, ethics constraints. These can significantly slow down production roll out if not captured in the early phases of the project. Determine data volumes : Identify if we have enough data for sampling the required business use case and how will the data be improved over time. The thumb rule here is that data should be enough for generalization to avoid overfitting. Plan for Unique Roles in AI projects An ML Project has multiple stages, and each stage may require additional roles. 
For example, Design Research & Designers for Human Experience, a Data Engineer for Data Collection and Feature Engineering, a Data Labeler for labeling structured data, engineers for MLOps and model deployment, and the list can go on. As a TPM, factor in having these resources available at the right time to avoid any schedule risks. Feature Engineering and Hyperparameter Tuning Feature Engineering enables the transformation of data so that it becomes usable for an algorithm. Creating the right features is an art and may require experimentation as well as domain expertise. Allocate time for domain experts to help with improving and identifying the best features. For example, for a natural language processing engine for text extraction of financial documents, we may involve financial researchers, run a relevance judgment exercise, and provide a feedback loop to evaluate model performance. Responsible AI Considerations Bias in machine learning could be the number one reason a model does not perform to its intended needs. Plan to incorporate Responsible AI principles from Day 1 to ensure fairness, security, privacy, and transparency of the models. For example, for a person recognition algorithm, if the data source is only feeding a specific skin type, then production scenarios may not provide good results. PM Fundamentals Core to a TPM role are the fundamentals that include bringing clarity to the team, design thinking, driving the team to the right technical decisions, managing risk, managing stakeholders, backlog management, and project management. These are a TPM's superpowers . A TPM can complement the machine learning team by ensuring the problem and customer needs are understood, a holistic system design is evaluated, stakeholder expectations are managed, and customer objectives are driven. Here are some references that may help: The T in a TPM The TPM Don't M*ck up framework The mind of a TPM ML Learning Journey for a TPM","title":"TPM considerations for Machine Learning projects"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#tpm-considerations-for-machine-learning-projects","text":"In this document, we explore some of the Program Management considerations for Machine Learning (ML) projects and suggest recommendations for Technical Program Managers (TPM) to effectively work with Data and Applied Machine Learning engineering teams.","title":"TPM considerations for Machine Learning projects"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#determine-the-need-for-machine-learning-in-the-project","text":"In Artificial Intelligence (AI) projects, the ML component is generally a part of an overall business problem and NOT the problem itself. Determine the overall business problem first and then evaluate if ML can help address a part of the problem space. A few considerations for identifying the right fit for the project: Engage experts in human experience and employ techniques such as Design Thinking and Problem Formulation to understand the customer needs and human behavior first. Identify the right stakeholders from both business and technical leadership and invite them to these workshops. The outcome should be end-user scenarios and personas to determine the real needs of the users. Focus on System Design principles to identify the architectural components, entities, interfaces, and constraints. Ask the right questions early and explore design alternatives with the engineering team. Think hard about the costs of ML and whether we are solving a repetitive problem at scale.
Many times, customer problems can be solved with data analytics, dashboards, or rule-based algorithms as the first phase of the project.","title":"Determine the Need for Machine Learning in the Project"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#set-expectations-for-high-ambiguity-in-ml-components","text":"ML projects can be plagued with a phenomenon we can call \"Death by Unknowns\". Unlike software engineering projects, ML-focused projects can result in quick success early (aka a sudden decrease in error rate), but this may flatten eventually. A few things to consider: Set clear expectations : Identify the performance metrics and agree on a \"good enough\" prediction rate that will bring value to the business. An 80% \"good enough\" rate may save business costs and increase productivity, but going from 80% to 95% may require enormous cost and effort. Is it worth it? Can it be a progressive road map? Create a smaller team and undertake a feasibility analysis through techniques like EDA (Exploratory Data Analysis). A feasibility study is a much cheaper way to evaluate data quality, customer constraints, and model feasibility. It allows a TPM to better understand customer use cases and the current environment, and it can act as a fail-fast mechanism. Note that a feasibility study should be short (weeks, not months), or it misses the point of saving costs. As in any project, there will be new needs (additional data sources, technical constraints, hiring data labelers, business users' time, etc.). Incorporate Agile techniques to fail fast and minimize cost and schedule surprises.","title":"Set Expectations for High Ambiguity in ML components"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#notebooks-ml-production","text":"Notebooks are a great way to kick start Data Analytics and Applied Machine Learning efforts; however, for production releases, additional constraints should be considered: Understand the end-to-end flow of data management : how data will be made available (ingestion flows), and the frequency, storage, and retention of data. Plan user stories and design spikes around these flows to ensure a robust ML pipeline is developed. The engineering team should follow the same rigor in building ML projects as in any software engineering project. We at ISE (Industry Solutions Engineering) have built a good set of resources from our learnings in our ISE Engineering Playbook . Think about how the model will be deployed; for example, are there technical constraints due to an edge device, or network constraints that will prevent updating the model? Understanding the environment is critical; refer to the Model Production Checklist as a reference to determine model deployment choices. ML-focused projects are not a \"one-shot\" release solution; they need to be nurtured, evolved, and improved over time.
Plan for a continuous improvement lifecycle: the initial phases can focus on model feasibility and validation to reach the good enough prediction rate, and later phases can then scale and improve the models through feedback loops and fresh data sets.","title":"Notebooks != ML Production"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#garbage-data-in-garbage-model-out","text":"Data quality is a major factor affecting model performance and production roll-out; consider the following: Conduct a data exploration workshop and generate a report on data quality that includes missing values, duplicates, unlabeled data, expired or invalid data, and incomplete data (e.g., only having male representation in a people dataset). Identify data source reliability to ensure data is coming from a production source (e.g., are the images from a production or industrial camera, or taken from an iPhone/Android phone?). Identify data acquisition constraints : Determine how the data is being obtained and the constraints around it. Some examples include legal, contractual, privacy, regulatory, and ethics constraints. These can significantly slow down production roll-out if not captured in the early phases of the project. Determine data volumes : Identify if we have enough data for sampling the required business use case and how the data will be improved over time. The rule of thumb here is that there should be enough data for the model to generalize and avoid overfitting.","title":"Garbage Data In -> Garbage Model Out"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#plan-for-unique-roles-in-ai-projects","text":"An ML Project has multiple stages, and each stage may require additional roles. For example, Design Research & Designers for Human Experience, a Data Engineer for Data Collection and Feature Engineering, a Data Labeler for labeling structured data, engineers for MLOps and model deployment, and the list can go on. As a TPM, factor in having these resources available at the right time to avoid any schedule risks.","title":"Plan for Unique Roles in AI projects"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#feature-engineering-and-hyperparameter-tuning","text":"Feature Engineering enables the transformation of data so that it becomes usable for an algorithm. Creating the right features is an art and may require experimentation as well as domain expertise. Allocate time for domain experts to help with improving and identifying the best features. For example, for a natural language processing engine for text extraction of financial documents, we may involve financial researchers, run a relevance judgment exercise, and provide a feedback loop to evaluate model performance.","title":"Feature Engineering and Hyperparameter Tuning"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#responsible-ai-considerations","text":"Bias in machine learning could be the number one reason a model does not perform to its intended needs. Plan to incorporate Responsible AI principles from Day 1 to ensure fairness, security, privacy, and transparency of the models.
For example, for a person recognition algorithm, if the data source is only feeding a specific skin type, then production scenarios may not provide good results.","title":"Responsible AI Considerations"},{"location":"machine-learning/tpm-considerations-for-ml-projects/#pm-fundamentals","text":"Core to a TPM role are the fundamentals that include bringing clarity to the team, design thinking, driving the team to the right technical decisions, managing risk, managing stakeholders, backlog management, and project management. These are a TPM's superpowers . A TPM can complement the machine learning team by ensuring the problem and customer needs are understood, a holistic system design is evaluated, stakeholder expectations are managed, and customer objectives are driven. Here are some references that may help: The T in a TPM The TPM Don't M*ck up framework The mind of a TPM ML Learning Journey for a TPM","title":"PM Fundamentals"},{"location":"non-functional-requirements/accessibility/","text":"Accessibility Accessibility is a critical component of any successful project and ensures the solutions we build are usable and enjoyed by as many people as possible. While meeting accessibility compliance standards is required, accessibility is much broader than compliance alone. Accessibility is about using techniques like inclusive design to infuse different perspectives and the full range of human diversity into the products we build. By incorporating accessibility into your project from the initial envisioning through MVP and beyond, you are promoting a more inclusive environment for your team and helping close the \"Disability Divide\" that exists for many people living with disabilities. Getting Started If you are new to accessibility or are looking for an overview of accessibility fundamentals, Microsoft Learn offers a great training course that covers a broad range of topics from creating accessible content in Office to designing accessibility features in your own apps. You can learn more about the course or get started at Microsoft Learn: Accessibility Fundamentals . Inclusive Design Inclusive design is a methodology that embraces the full range of human diversity as a resource to help build better products and services. Inclusive design complements accessibility, going beyond accessibility compliance standards to ensure products are usable and enjoyed by all people. By leveraging the inclusive design methodology early in a project, you can expect a more inclusive and better solution for everyone. The Microsoft Inclusive Design website offers a variety of resources for incorporating inclusive design in your projects, including inclusive design activities that can be used in envisioning and architecture design sessions. The Microsoft Inclusive Design methodology includes the following principles: Recognize Exclusion Designing for inclusivity not only opens up our products and services to more people, it also reflects how people really are. All humans grow and adapt to the world around them and we want our designs to reflect that. Solve for One, Extend to Many Everyone has abilities, and limits to those abilities. Designing for people with permanent disabilities actually results in designs that benefit people universally. Constraints are a beautiful thing. Learn from Diversity Human beings are the real experts in adapting to diversity. Inclusive design puts people in the center from the very start of the process, and those fresh, diverse perspectives are the key to true insight.
Tools Accessibility Insights Accessibility Insights is a free, open-source solution for identifying accessibility issues in Windows, Android, and web applications. Accessibility Insights can identify a broad range of accessibility issues including problems with missing image alt tags, heading organization, tab order, color contrast, and many more. In addition, you can use Accessibility Insights to simulate color blindness to ensure your user interface is accessible to those who experience some form of color blindness. You can download Accessibility Insights here: https://accessibilityinsights.io/downloads/ Accessibility Linter Deque Systems are web accessibility experts that provide accessibility training and tools to many organizations, including Microsoft. One of the many tools offered by Deque is the axe Accessibility Linter for VS Code . This VS Code extension uses the axe-core rules engine to identify accessibility issues in HTML, Angular, React, Markdown, and Vue. Using an accessibility linter can help ensure accessibility issues get addressed early in the development lifecycle. Practices Accessibility Testing Accessibility testing is a specialized subset of software testing and includes automated tools and manual testing processes that vary from project to project. In addition to tools like Accessibility Insights discussed earlier, there are many other solutions for accessibility testing. The W3C provides a comprehensive list of evaluation and testing tools on their website at https://www.w3.org/WAI/ER/tools/ . If you are looking to add automated testing to your Azure Pipelines, you may want to consider the Accessibility Testing extension built by Drew Lewis, a former Microsoft employee. It's important to keep in mind that automated tooling alone is not enough - make sure to augment your automated tests with manual ones. Accessibility Insights (linked above) can guide users through some manual testing steps. Code and Documentation Basics Before you get to testing, you can make some small changes in how you write code and documentation. Document! Beyond text documentation, this also means code comments, clear variable and file naming, and pipeline or script outputs that clearly report success or failure and give details. Avoid running words together in all lowercase for variable and file names, hashtags, neologisms, etc. Use camelCase, snake_case, or other methods of creating separation between words. Introduce abbreviations by spelling the full term out, then the abbreviation in parentheses. Use headers effectively to break up content by topic. Don't use more than one h1 per page, and don't skip levels (e.g., don't use an h3 directly under an h1). Avoid using formatting to make something look like a header when it's not. Use descriptive link text. Avoid attaching a link to phrases like \"Read more\" and ensure that the text directly states what it links to. Link text should be able to stand on its own. When including images or diagrams, add alt text. This should never just be \"Image\" or \"Diagram\" (or similar). In your description, highlight the purpose of the image or diagram in the page and what it is intended to convey. Prefer tabs to spaces when possible. This allows users to default to their preferred tab width, so users with a range of vision can all take in code easily.
Resources Microsoft Accessibility Technology & Tools Web Content Accessibility Guidelines (WCAG) Accessibility Guidelines and Requirements | Microsoft Style Guide Google Developer Style Guide: Write Accessible Documentation","title":"Accessibility"},{"location":"non-functional-requirements/accessibility/#accessibility","text":"Accessibility is a critical component of any successful project and ensures the solutions we build are usable and enjoyed by as many people as possible. While meeting accessibility compliance standards is required, accessibility is much broader than compliance alone. Accessibility is about using techniques like inclusive design to infuse different perspectives and the full range of human diversity into the products we build. By incorporating accessibility into your project from the initial envisioning through MVP and beyond, you are promoting a more inclusive environment for your team and helping close the \"Disability Divide\" that exists for many people living with disabilities.","title":"Accessibility"},{"location":"non-functional-requirements/accessibility/#getting-started","text":"If you are new to accessibility or are looking for an overview of accessibility fundamentals, Microsoft Learn offers a great training course that covers a broad range of topics from creating accessible content in Office to designing accessibility features in your own apps. You can learn more about the course or get started at Microsoft Learn: Accessibility Fundamentals .","title":"Getting Started"},{"location":"non-functional-requirements/accessibility/#inclusive-design","text":"Inclusive design is a methodology that embraces the full range of human diversity as a resource to help build better products and services. Inclusive design compliments accessibility going beyond accessibility compliance standards to ensure products are usable and enjoyed by all people. By leveraging the inclusive design methodology early in a project, you can expect a more inclusive and better solution for everyone. The Microsoft Inclusive Design website offers a variety of resources for incorporating inclusive design in your projects including inclusive design activities that can be used in envisioning and architecture design sessions. The Microsoft Inclusive Design methodology includes the following principles:","title":"Inclusive Design"},{"location":"non-functional-requirements/accessibility/#recognize-exclusion","text":"Designing for inclusivity not only opens up our products and services to more people, it also reflects how people really are. All humans grow and adapt to the world around them and we want our designs to reflect that.","title":"Recognize Exclusion"},{"location":"non-functional-requirements/accessibility/#solve-for-one-extend-to-many","text":"Everyone has abilities, and limits to those abilities. Designing for people with permanent disabilities actually results in designs that benefit people universally. Constraints are a beautiful thing.","title":"Solve for One, Extend to Many"},{"location":"non-functional-requirements/accessibility/#learn-from-diversity","text":"Human beings are the real experts in adapting to diversity. 
Inclusive design puts people in the center from the very start of the process, and those fresh, diverse perspectives are the key to true insight.","title":"Learn from Diversity"},{"location":"non-functional-requirements/accessibility/#tools","text":"","title":"Tools"},{"location":"non-functional-requirements/accessibility/#accessibility-insights","text":"Accessibility Insights is a free, open-source solution for identifying accessibility issues in Windows, Android, and web applications. Accessibility Insights can identify a broad range of accessibility issues including problems with missing image alt tags, heading organization, tab order, color contrast, and many more. In addition, you can use Accessibility Insights to simulate color blindness to ensure your user interface is accessible to those that experience some form of color blindness. You can download Accessibility Insights here: https://accessibilityinsights.io/downloads/","title":"Accessibility Insights"},{"location":"non-functional-requirements/accessibility/#accessibility-linter","text":"Deque Systems are web accessibility experts that provide accessibility training and tools to many organizations including Microsoft. One of the many tools offered by Deque is the axe Accessibility Linter for VS Code . This VS Code extension use the axe-core rules engine to identify accessibility issues in HTML, Angular, React, Markdown, and Vue. Using an accessibility linter can help ensure accessibility issues get addressed early in the development lifecycle.","title":"Accessibility Linter"},{"location":"non-functional-requirements/accessibility/#practices","text":"","title":"Practices"},{"location":"non-functional-requirements/accessibility/#accessibility-testing","text":"Accessibility testing is a specialized subset of software testing and includes automated tools and manual testing processes that vary from project to project. In addition to tools like Accessibility Insights discussed earlier, there are many other solutions for accessibility testing. The W3C provides a comprehensive list of evaluation and testing tools on their website at https://www.w3.org/WAI/ER/tools/ . If you are looking to add automated testing to your Azure Pipelines, you may want to consider the Accessibility Testing extension built by Drew Lewis, a former Microsoft employee. It's important to keep in mind that automated tooling alone is not enough - make sure to augment your automated tests with manual ones. Accessibility Insights (linked above) can guide users through some manual testing steps.","title":"Accessibility Testing"},{"location":"non-functional-requirements/accessibility/#code-and-documentation-basics","text":"Before you get to testing, you can make some small changes in how you write code and documentation. Document! Beyond text documentation, this also means code comments, clear variable and file naming, and pipeline or script outputs that clearly report success or failure and give details. Avoid small case for variable and file names, hashtags, neologisms, etc. Use camelCase, snake_case, or other methods of creating separation between words. Introduce abbreviations by spelling the full term out, then the abbreviation in parentheses. Use headers effectively to break up content by topic. Don't use more than one h1 per page, and don't skip levels (e.g. use an h3 directly under an h1). Avoid using formatting to make something look like a header when it's not. Use descriptive link text. 
Avoid attaching a link to phrases like \"Read more\" and ensure that the text directly states what it links to. Link text should be able to stand on its own. When including images or diagrams, add alt text. This should never just be \"Image\" or \"Diagram\" (or similar). In your description, highlight the purpose of the image or diagram in the page and what it is intended to convey. Prefer tabs to spaces when possible. This allows users to default to their preferred tab width, so users with a range of vision can all take in code easily.","title":"Code and Documentation Basics"},{"location":"non-functional-requirements/accessibility/#resources","text":"Microsoft Accessibility Technology & Tools Web Content Accessibility Guidelines (WCAG) Accessibility Guidelines and Requirements | Microsoft Style Guide Google Developer Style Guide: Write Accessible Documentation","title":"Resources"},{"location":"non-functional-requirements/availability/","text":"Availability Availability refers to the degree to which a system is operational and accessible when needed for use. It is a critical non-functional requirement that ensures users can rely on the system to perform its intended functions without unexpected downtime. High availability is vital for maintaining user trust and satisfaction, especially in industries where service interruptions can lead to significant financial losses or even jeopardize safety. Achieving high availability often involves strategies like redundancy, failover mechanisms, and robust maintenance practices to minimize both planned and unplanned outages. In essence, availability ensures that the system is there when users need it, which is fundamental for any service-oriented or mission-critical application. Characteristics Uptime: This is the proportion of time the system is operational and accessible. It's often measured as a percentage over a specific period (e.g., 99.99% uptime). Redundancy: Implementing backup components or systems that can take over in case of a failure. This ensures continuous operation even if one part fails. Fault Tolerance: The system's ability to continue operating correctly even when part of it fails. This typically involves designing systems that can handle failures gracefully without significant impact on availability. Failover Mechanisms: Automatic switching to a standby system or component when the primary one fails. This minimizes downtime and maintains availability. Scalability: The system's capacity to handle increasing loads without compromising availability. This often involves scaling resources up or out to meet demand. Maintenance and Monitoring: Regular maintenance and real-time monitoring help to detect issues early and address them before they cause downtime. Proactive maintenance schedules and monitoring tools are crucial for maintaining high availability. Recovery Time Objective (RTO) and Recovery Point Objective (RPO): RTO is the maximum acceptable time to restore service after an outage, while RPO is the maximum acceptable amount of data loss measured in time. These metrics guide the design of disaster recovery plans to ensure availability. Service Level Agreements (SLAs): Formal agreements that specify the expected level of service availability and the penalties or compensations if these levels are not met. SLAs help set clear expectations and accountability. Implementations Implementing availability involves various strategies and technologies designed to ensure that a system remains operational and accessible. 
Here are some examples: Redundant Systems: Deploying duplicate hardware and software systems that can take over if the primary system fails. For instance, using multiple servers in different geographic locations ensures that if one server goes down, another can handle the load. Load Balancing: Distributing incoming network traffic across multiple servers so that no single server becomes a bottleneck. This not only improves performance but also enhances availability by ensuring that if one server fails, the others can take over the traffic. Failover Mechanisms: Implementing automatic failover processes that switch operations to a backup system when a failure is detected. For example, in a database system, using a hot standby database that immediately takes over if the primary database fails. Clustering: Using a group of servers (a cluster) that work together to provide a service. If one server in the cluster fails, others can pick up the load without interrupting the service. This is commonly used in web hosting and database management. Geographic Distribution: Placing copies of data and services in multiple, geographically dispersed data centers. This approach not only improves access speed for users around the world but also protects against regional failures due to natural disasters or other localized issues. Data Replication: Continuously copying and synchronizing data across multiple locations. Techniques like database replication and distributed file systems ensure that data is always available even if one site goes down. Disaster Recovery Plans: Developing and regularly testing comprehensive disaster recovery plans that include steps for restoring services and data in case of a catastrophic failure. These plans often include off-site backups and detailed procedures for quickly bringing systems back online. Real-Time Monitoring and Alerts: Implementing monitoring tools that constantly check the health of the system and send alerts if something goes wrong. This enables quick response to potential issues before they lead to significant downtime. Scheduled Maintenance Windows: Planning and communicating scheduled maintenance periods during off-peak hours to minimize the impact on users. Systems can be designed to perform maintenance tasks without taking the entire service offline. High Availability Software Architectures: Designing software with high availability in mind, using principles like microservices architecture, which isolates different functions of an application. This isolation ensures that a failure in one component doesn\u2019t bring down the entire system. Resources Recommendations for highly available multi-region design Recommendations for using availability zones and regions","title":"Availability"},{"location":"non-functional-requirements/availability/#availability","text":"Availability refers to the degree to which a system is operational and accessible when needed for use. It is a critical non-functional requirement that ensures users can rely on the system to perform its intended functions without unexpected downtime. High availability is vital for maintaining user trust and satisfaction, especially in industries where service interruptions can lead to significant financial losses or even jeopardize safety. Achieving high availability often involves strategies like redundancy, failover mechanisms, and robust maintenance practices to minimize both planned and unplanned outages. 
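The failover and redundancy items in the implementation list above can be illustrated with a small client-side sketch: try a primary endpoint first and fall back to a standby when it is unhealthy. The endpoint URLs and retry policy below are placeholder assumptions, not a production-ready client.

```python
# Minimal client-side failover sketch across redundant endpoints (placeholder URLs).
import requests

ENDPOINTS = [
    "https://primary.example.com",   # assumed primary region
    "https://standby.example.com",   # assumed standby region
]


def get_with_failover(path: str, timeout: float = 2.0) -> requests.Response:
    last_error = None
    for base_url in ENDPOINTS:
        try:
            response = requests.get(base_url + path, timeout=timeout)
            response.raise_for_status()
            return response              # first healthy endpoint wins
        except requests.RequestException as exc:
            last_error = exc             # try the next redundant endpoint
    raise RuntimeError("all endpoints unavailable") from last_error


# Example usage: get_with_failover("/health")
```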
In essence, availability ensures that the system is there when users need it, which is fundamental for any service-oriented or mission-critical application.","title":"Availability"},{"location":"non-functional-requirements/availability/#characteristics","text":"Uptime: This is the proportion of time the system is operational and accessible. It's often measured as a percentage over a specific period (e.g., 99.99% uptime). Redundancy: Implementing backup components or systems that can take over in case of a failure. This ensures continuous operation even if one part fails. Fault Tolerance: The system's ability to continue operating correctly even when part of it fails. This typically involves designing systems that can handle failures gracefully without significant impact on availability. Failover Mechanisms: Automatic switching to a standby system or component when the primary one fails. This minimizes downtime and maintains availability. Scalability: The system's capacity to handle increasing loads without compromising availability. This often involves scaling resources up or out to meet demand. Maintenance and Monitoring: Regular maintenance and real-time monitoring help to detect issues early and address them before they cause downtime. Proactive maintenance schedules and monitoring tools are crucial for maintaining high availability. Recovery Time Objective (RTO) and Recovery Point Objective (RPO): RTO is the maximum acceptable time to restore service after an outage, while RPO is the maximum acceptable amount of data loss measured in time. These metrics guide the design of disaster recovery plans to ensure availability. Service Level Agreements (SLAs): Formal agreements that specify the expected level of service availability and the penalties or compensations if these levels are not met. SLAs help set clear expectations and accountability.","title":"Characteristics"},{"location":"non-functional-requirements/availability/#implementations","text":"Implementing availability involves various strategies and technologies designed to ensure that a system remains operational and accessible. Here are some examples: Redundant Systems: Deploying duplicate hardware and software systems that can take over if the primary system fails. For instance, using multiple servers in different geographic locations ensures that if one server goes down, another can handle the load. Load Balancing: Distributing incoming network traffic across multiple servers so that no single server becomes a bottleneck. This not only improves performance but also enhances availability by ensuring that if one server fails, the others can take over the traffic. Failover Mechanisms: Implementing automatic failover processes that switch operations to a backup system when a failure is detected. For example, in a database system, using a hot standby database that immediately takes over if the primary database fails. Clustering: Using a group of servers (a cluster) that work together to provide a service. If one server in the cluster fails, others can pick up the load without interrupting the service. This is commonly used in web hosting and database management. Geographic Distribution: Placing copies of data and services in multiple, geographically dispersed data centers. This approach not only improves access speed for users around the world but also protects against regional failures due to natural disasters or other localized issues. Data Replication: Continuously copying and synchronizing data across multiple locations. 
Techniques like database replication and distributed file systems ensure that data is always available even if one site goes down. Disaster Recovery Plans: Developing and regularly testing comprehensive disaster recovery plans that include steps for restoring services and data in case of a catastrophic failure. These plans often include off-site backups and detailed procedures for quickly bringing systems back online. Real-Time Monitoring and Alerts: Implementing monitoring tools that constantly check the health of the system and send alerts if something goes wrong. This enables quick response to potential issues before they lead to significant downtime. Scheduled Maintenance Windows: Planning and communicating scheduled maintenance periods during off-peak hours to minimize the impact on users. Systems can be designed to perform maintenance tasks without taking the entire service offline. High Availability Software Architectures: Designing software with high availability in mind, using principles like microservices architecture, which isolates different functions of an application. This isolation ensures that a failure in one component doesn\u2019t bring down the entire system.","title":"Implementations"},{"location":"non-functional-requirements/availability/#resources","text":"Recommendations for highly available multi-region design Recommendations for using availability zones and regions","title":"Resources"},{"location":"non-functional-requirements/capacity/","text":"Capacity Capacity defines the maximum load or volume that a system can handle while maintaining specified performance criteria. This attribute is crucial for ensuring that the system can support the anticipated number of users, transactions, or data volume without degradation in performance. Characteristics Maximum Load: Capacity defines the upper limit of user activity or workload that the system can handle without performance degradation. This includes peak loads during high-demand periods. Scalability: The system's capacity should be scalable, meaning it can be expanded or upgraded to accommodate increased workload or data volume as the organization grows. Resource Management: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are critical for maintaining capacity. Performance Criteria: Capacity is defined within specific performance criteria, such as response time, throughput, and transaction processing rates, ensuring that the system maintains acceptable performance levels under load. Load Balancing: Systems with high capacity often employ load balancing techniques to distribute workload evenly across servers or resources, optimizing performance and avoiding overload. Failover and Redundancy: Capacity planning may include provisions for failover mechanisms and redundancy to ensure continuity of service and minimal downtime in case of hardware failures or traffic spikes. Monitoring and Testing: Continuous monitoring and periodic load testing are essential to verify that the system's capacity meets expected levels and to identify potential bottlenecks or performance issues proactively. Load testing is one of the critical methods used to ensure that the system can handle expected loads. Capacity Planning: Effective capacity management involves forecasting future needs based on growth projections and historical usage patterns, allowing for timely upgrades or adjustments to infrastructure and resources. 
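The monitoring and testing characteristic above calls for periodic load testing to confirm that the system meets its capacity targets. A minimal sketch using Locust (one of many load-testing tools) is shown below; the host, endpoints, and task weights are placeholder assumptions for your own system under test.

```python
# Minimal Locust load-test sketch (save as loadtest.py); endpoints are placeholders.
from locust import HttpUser, task, between


class BrowsingUser(HttpUser):
    wait_time = between(1, 3)   # seconds a simulated user waits between actions

    @task(3)
    def list_items(self):
        self.client.get("/api/items")       # assumed read-heavy endpoint

    @task(1)
    def view_item(self):
        self.client.get("/api/items/42")    # assumed detail endpoint
```

Run it with, for example, `locust -f loadtest.py --host https://staging.example.com` and compare the observed throughput, response times, and error rates against the targets defined during capacity planning.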
Implementations Capacity is typically implemented through a combination of architectural design, infrastructure planning, and performance optimization strategies. For example: Scalable Architecture: Designing the system with scalability in mind allows it to handle increased load by adding resources (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using distributed systems, microservices architecture, and load balancing mechanisms to distribute workload across multiple servers or instances. It is also important to plan for scalability with a forward-looking approach, typically anticipating the needs for at least the next 6 months, to ensure the system can accommodate future growth and demand. Resource Allocation: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are crucial. This can include techniques like resource pooling, where resources are shared among multiple users or tasks to optimize utilization. Caching: Utilizing caching mechanisms (e.g., in-memory caching, content delivery networks) to store frequently accessed data or computations can reduce the load on backend services and improve response times, thereby enhancing overall capacity. Database Optimization: Ensure that data is modeled efficiently to support optimal performance and scalability. Optimizing database queries, indexing frequently accessed data, and using database scaling techniques (e.g., sharding, replication) can improve the system's ability to handle large volumes of data and concurrent transactions. Load Balancing: Implementing load balancers to evenly distribute incoming traffic across multiple servers or instances helps prevent overload on any single component and ensures efficient resource utilization. Auto-scaling: Leveraging auto-scaling capabilities provided by cloud platforms allows the system to automatically adjust its capacity based on real-time demand. This ensures that additional resources are provisioned during peak periods and scaled down during low traffic times, optimizing cost and performance. Performance Monitoring and Tuning: Continuous monitoring of system performance metrics (e.g., CPU usage, memory utilization, response times) helps identify bottlenecks and areas for optimization. Tuning configurations, optimizing code, and conducting performance testing are essential to maintain and improve system capacity over time. High Availability and Fault Tolerance: Implementing strategies such as redundant servers, failover mechanisms, and disaster recovery plans ensures that the system remains available and operational even in the event of hardware failures or other disruptions. Capacity Planning: Conducting thorough capacity planning based on anticipated growth, usage patterns, and business requirements helps forecast resource needs and proactively scale the system to meet future demands. Resources Performance Testing","title":"Capacity"},{"location":"non-functional-requirements/capacity/#capacity","text":"Capacity defines the maximum load or volume that a system can handle while maintaining specified performance criteria. 
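As a small illustration of the caching strategy listed above, the sketch below memoizes an expensive lookup in process memory with Python's functools.lru_cache. A real deployment would more likely use a shared cache (for example, Redis or a CDN), so treat this as a minimal, assumption-laden example rather than a recommended architecture.

```python
# Minimal in-process caching sketch; the lookup function is a placeholder.
from functools import lru_cache


@lru_cache(maxsize=1024)
def get_product_details(product_id: int) -> dict:
    # Placeholder for an expensive database or downstream service call.
    return {"id": product_id, "name": f"product-{product_id}"}


get_product_details(42)                    # first call hits the "backend"
get_product_details(42)                    # second call is served from the cache
print(get_product_details.cache_info())    # shows one hit and one miss
```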
This attribute is crucial for ensuring that the system can support the anticipated number of users, transactions, or data volume without degradation in performance.","title":"Capacity"},{"location":"non-functional-requirements/capacity/#characteristics","text":"Maximum Load: Capacity defines the upper limit of user activity or workload that the system can handle without performance degradation. This includes peak loads during high-demand periods. Scalability: The system's capacity should be scalable, meaning it can be expanded or upgraded to accommodate increased workload or data volume as the organization grows. Resource Management: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are critical for maintaining capacity. Performance Criteria: Capacity is defined within specific performance criteria, such as response time, throughput, and transaction processing rates, ensuring that the system maintains acceptable performance levels under load. Load Balancing: Systems with high capacity often employ load balancing techniques to distribute workload evenly across servers or resources, optimizing performance and avoiding overload. Failover and Redundancy: Capacity planning may include provisions for failover mechanisms and redundancy to ensure continuity of service and minimal downtime in case of hardware failures or traffic spikes. Monitoring and Testing: Continuous monitoring and periodic load testing are essential to verify that the system's capacity meets expected levels and to identify potential bottlenecks or performance issues proactively. Load testing is one of the critical methods used to ensure that the system can handle expected loads. Capacity Planning: Effective capacity management involves forecasting future needs based on growth projections and historical usage patterns, allowing for timely upgrades or adjustments to infrastructure and resources.","title":"Characteristics"},{"location":"non-functional-requirements/capacity/#implementations","text":"Capacity is typically implemented through a combination of architectural design, infrastructure planning, and performance optimization strategies. For example: Scalable Architecture: Designing the system with scalability in mind allows it to handle increased load by adding resources (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using distributed systems, microservices architecture, and load balancing mechanisms to distribute workload across multiple servers or instances. It is also important to plan for scalability with a forward-looking approach, typically anticipating the needs for at least the next 6 months, to ensure the system can accommodate future growth and demand. Resource Allocation: Efficient allocation and management of resources such as CPU, memory, disk space, and network bandwidth are crucial. This can include techniques like resource pooling, where resources are shared among multiple users or tasks to optimize utilization. Caching: Utilizing caching mechanisms (e.g., in-memory caching, content delivery networks) to store frequently accessed data or computations can reduce the load on backend services and improve response times, thereby enhancing overall capacity. Database Optimization: Ensure that data is modeled efficiently to support optimal performance and scalability. 
Optimizing database queries, indexing frequently accessed data, and using database scaling techniques (e.g., sharding, replication) can improve the system's ability to handle large volumes of data and concurrent transactions. Load Balancing: Implementing load balancers to evenly distribute incoming traffic across multiple servers or instances helps prevent overload on any single component and ensures efficient resource utilization. Auto-scaling: Leveraging auto-scaling capabilities provided by cloud platforms allows the system to automatically adjust its capacity based on real-time demand. This ensures that additional resources are provisioned during peak periods and scaled down during low traffic times, optimizing cost and performance. Performance Monitoring and Tuning: Continuous monitoring of system performance metrics (e.g., CPU usage, memory utilization, response times) helps identify bottlenecks and areas for optimization. Tuning configurations, optimizing code, and conducting performance testing are essential to maintain and improve system capacity over time. High Availability and Fault Tolerance: Implementing strategies such as redundant servers, failover mechanisms, and disaster recovery plans ensures that the system remains available and operational even in the event of hardware failures or other disruptions. Capacity Planning: Conducting thorough capacity planning based on anticipated growth, usage patterns, and business requirements helps forecast resource needs and proactively scale the system to meet future demands.","title":"Implementations"},{"location":"non-functional-requirements/capacity/#resources","text":"Performance Testing","title":"Resources"},{"location":"non-functional-requirements/compliance/","text":"Compliance Compliance refers to the adherence to regulatory standards, legal requirements, and organizational policies that govern the handling of data, security practices, and operational procedures. It ensures that the software solution meets specific industry regulations (such as GDPR, HIPAA, PCI-DSS) and internal governance frameworks. Characteristics Regulatory Adherence: Compliance requires the software system to adhere to specific regulatory frameworks relevant to its industry or geographic region. This includes laws and regulations related to data protection, privacy, security, financial transactions, healthcare, and more. Data Privacy: Ensuring that the system handles sensitive data in accordance with privacy laws and regulations, such as implementing encryption, access controls, data anonymization, and secure data storage practices. This includes proper management of Personally Identifiable Information (PII) and encapsulation of secrets to prevent unauthorized access and ensure compliance with data protection standards. Security Standards: Compliance mandates adherence to security standards and best practices to protect against unauthorized access, data breaches, and cyber threats. This involves implementing measures such as firewalls, intrusion detection systems, secure authentication mechanisms, and regular security audits. Auditability: The system must be designed and operated in a way that allows for comprehensive auditing and logging of activities. This ensures that compliance with regulations can be verified through audit trails and compliance reports. Documentation: Comprehensive documentation of policies, procedures, and controls related to compliance requirements is essential. 
This includes documenting data handling processes, security measures, incident response plans, and compliance assessments. Risk Management: Implementing risk assessment and management practices to identify, assess, and mitigate risks associated with non-compliance. This involves conducting risk assessments regularly and implementing controls to manage identified risks effectively. Change Management: Compliance requires robust change management processes to ensure that any updates or modifications to the software system do not compromise regulatory compliance. This includes testing changes thoroughly and obtaining necessary approvals. Implementations Implementing compliance involves a systematic approach that integrates regulatory requirements, organizational policies, and best practices into the development, deployment, and operation phases. Here are common strategies and practices used to implement compliance: Compliance Framework Selection: Choosing and adopting a compliance framework or standards (e.g., ISO 27001, NIST Cybersecurity Framework) that aligns with the organization's compliance obligations and provides guidelines for implementing controls. Privacy by Design: Integrating privacy considerations into the software design and development process. This includes conducting privacy impact assessments, implementing data minimization techniques, and ensuring user consent mechanisms are in place where required. Audit and Monitoring: Establishing mechanisms for continuous monitoring, auditing, and logging of activities within the software system to ensure compliance with regulatory requirements. This includes maintaining audit trails, generating compliance reports, and conducting regular security assessments. Documentation and Record Keeping: Maintaining comprehensive documentation of compliance efforts, including policies, procedures, audit reports, risk assessments, and compliance certifications. Resources General Data Protection Regulation (GDPR) Purview Compliance Manager","title":"Compliance"},{"location":"non-functional-requirements/compliance/#compliance","text":"Compliance refers to the adherence to regulatory standards, legal requirements, and organizational policies that govern the handling of data, security practices, and operational procedures. It ensures that the software solution meets specific industry regulations (such as GDPR, HIPAA, PCI-DSS) and internal governance frameworks.","title":"Compliance"},{"location":"non-functional-requirements/compliance/#characteristics","text":"Regulatory Adherence: Compliance requires the software system to adhere to specific regulatory frameworks relevant to its industry or geographic region. This includes laws and regulations related to data protection, privacy, security, financial transactions, healthcare, and more. Data Privacy: Ensuring that the system handles sensitive data in accordance with privacy laws and regulations, such as implementing encryption, access controls, data anonymization, and secure data storage practices. This includes proper management of Personally Identifiable Information (PII) and encapsulation of secrets to prevent unauthorized access and ensure compliance with data protection standards. Security Standards: Compliance mandates adherence to security standards and best practices to protect against unauthorized access, data breaches, and cyber threats. This involves implementing measures such as firewalls, intrusion detection systems, secure authentication mechanisms, and regular security audits. 
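To make the audit and monitoring practice above more concrete, here is a minimal structured audit-logging sketch in Python. The event fields and logger configuration are illustrative assumptions, not a compliance-approved schema; real systems would also need tamper-evident storage and retention controls.

```python
# Minimal structured audit-logging sketch; field names are illustrative only.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
audit_logger = logging.getLogger("audit")


def audit_event(actor: str, action: str, resource: str) -> None:
    # Emit one JSON line per auditable action so logs can be indexed and reviewed.
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,
    }))


audit_event("alice@example.com", "read", "customer-record/123")
```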
Auditability: The system must be designed and operated in a way that allows for comprehensive auditing and logging of activities. This ensures that compliance with regulations can be verified through audit trails and compliance reports. Documentation: Comprehensive documentation of policies, procedures, and controls related to compliance requirements is essential. This includes documenting data handling processes, security measures, incident response plans, and compliance assessments. Risk Management: Implementing risk assessment and management practices to identify, assess, and mitigate risks associated with non-compliance. This involves conducting risk assessments regularly and implementing controls to manage identified risks effectively. Change Management: Compliance requires robust change management processes to ensure that any updates or modifications to the software system do not compromise regulatory compliance. This includes testing changes thoroughly and obtaining necessary approvals.","title":"Characteristics"},{"location":"non-functional-requirements/compliance/#implementations","text":"Implementing compliance involves a systematic approach that integrates regulatory requirements, organizational policies, and best practices into the development, deployment, and operation phases. Here are common strategies and practices used to implement compliance: Compliance Framework Selection: Choosing and adopting a compliance framework or standards (e.g., ISO 27001, NIST Cybersecurity Framework) that aligns with the organization's compliance obligations and provides guidelines for implementing controls. Privacy by Design: Integrating privacy considerations into the software design and development process. This includes conducting privacy impact assessments, implementing data minimization techniques, and ensuring user consent mechanisms are in place where required. Audit and Monitoring: Establishing mechanisms for continuous monitoring, auditing, and logging of activities within the software system to ensure compliance with regulatory requirements. This includes maintaining audit trails, generating compliance reports, and conducting regular security assessments. Documentation and Record Keeping: Maintaining comprehensive documentation of compliance efforts, including policies, procedures, audit reports, risk assessments, and compliance certifications.","title":"Implementations"},{"location":"non-functional-requirements/compliance/#resources","text":"General Data Protection Regulation (GDPR) Purview Compliance Manager","title":"Resources"},{"location":"non-functional-requirements/data-integrity/","text":"Data Integrity Data Integrity is the maintenance and assurance of the quality of data over its entire lifecycle. This includes the many facets of data quality such as, but not limited to, consistency, accuracy, and reliability. The benefits of this NFR are significant, as it ensures that data is trustworthy and reliable for decision-making, analysis, and reporting. Characteristics Accuracy: Data should be correct and free from errors or inconsistencies. Are the column data types correct? Are numeric values rounded off correctly? Completeness: All required data should be present and not missing any essential components. Consistency: Data should be consistent across different databases, applications, or time periods. Validity: Data should conform to defined rules, constraints, or standards. Invalid data should be rejected or flagged for correction. 
Reliability: Data should be trustworthy and dependable for decision-making and analysis. Timeliness: Data should be up-to-date and reflect the most current information available. Security: Data should be protected from unauthorized access, alteration, or deletion to maintain its integrity. Auditability: Changes to data should be tracked and logged, allowing for accountability and traceability. Transparency: Processes for data collection, storage, and manipulation should be transparent and understandable. Redundancy: Data should have backups or redundancy measures in place to prevent loss or corruption. Compliance: Data handling practices should comply with relevant regulations, standards, and industry best practices. Uniqueness: Data should be unique and not duplicated within the same dataset. Referential integrity: Does every row that depends on a dimension in the fact table actually have its associated dimension? (i.e., foreign keys without a primary) For example, let's say the dimension is \"city\"- then if we have a fact table referencing Seattle, and then delete the Seattle dimension, we need to go delete Seattle from the facts Orderliness: Data should be organized in a logical and consistent manner, making it easy to search, retrieve, and analyze. Implementations Data validation: Implement validation rules at the data entry points to ensure that only accurate and valid data is accepted into the system. This includes checks for data type, format, range, and consistency. Data logging and auditing: Implement logging mechanisms to record all data-related activities, including data modifications, access attempts, and system events. Regularly review audit logs to detect any unauthorized or suspicious activities. Data quality monitoring: Establish data quality monitoring processes to continuously evaluate the accuracy, completeness, and consistency of data. Implement automated checks and alerts to identify and address data quality issues in real-time. Database constraints: Utilize database constraints such as primary keys, foreign keys, unique constraints, and check constraints to enforce data integrity rules at the database level. Regular data backups: Implement regular backups of data to prevent loss in case of system failures, errors, or security breaches. Ensure that backup procedures are automated, monitored, and regularly tested. Resources Great Expectations : A framework to build data validations and test the quality of your data.","title":"Data Integrity"},{"location":"non-functional-requirements/data-integrity/#data-integrity","text":"Data Integrity is the maintenance and assurance of the quality of data over its entire lifecycle. This includes the many facets of data quality such as, but not limited to, consistency, accuracy, and reliability. The benefits of this NFR are significant, as it ensures that data is trustworthy and reliable for decision-making, analysis, and reporting.","title":"Data Integrity"},{"location":"non-functional-requirements/data-integrity/#characteristics","text":"Accuracy: Data should be correct and free from errors or inconsistencies. Are the column data types correct? Are numeric values rounded off correctly? Completeness: All required data should be present and not missing any essential components. Consistency: Data should be consistent across different databases, applications, or time periods. Validity: Data should conform to defined rules, constraints, or standards. Invalid data should be rejected or flagged for correction. 
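The validation, uniqueness, and referential-integrity checks described above can be automated with simple dataframe assertions, or with a framework such as Great Expectations from the resources list. The sketch below uses pandas with made-up fact and dimension tables that echo the Seattle example; the column names and rules are illustrative assumptions.

```python
# Minimal data-quality check sketch with pandas (toy fact/dimension tables).
import pandas as pd

dim_city = pd.DataFrame({"city_id": [1, 2], "city": ["Seattle", "Oslo"]})
fact_orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "city_id": [1, 2, 3], "amount": [9.5, 12.0, None]}
)


def check_data_quality(facts: pd.DataFrame, dims: pd.DataFrame) -> list:
    issues = []
    if facts["order_id"].duplicated().any():
        issues.append("duplicate order_id values (uniqueness)")
    if facts["amount"].isna().any():
        issues.append("missing amount values (completeness)")
    orphans = set(facts["city_id"]) - set(dims["city_id"])
    if orphans:
        # Fact rows that reference a dimension row which does not exist.
        issues.append(f"orphaned city_id values {sorted(orphans)} (referential integrity)")
    return issues


print(check_data_quality(fact_orders, dim_city))
# Reports the missing amount and the fact row referencing the unknown city_id 3.
```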
Reliability: Data should be trustworthy and dependable for decision-making and analysis. Timeliness: Data should be up-to-date and reflect the most current information available. Security: Data should be protected from unauthorized access, alteration, or deletion to maintain its integrity. Auditability: Changes to data should be tracked and logged, allowing for accountability and traceability. Transparency: Processes for data collection, storage, and manipulation should be transparent and understandable. Redundancy: Data should have backups or redundancy measures in place to prevent loss or corruption. Compliance: Data handling practices should comply with relevant regulations, standards, and industry best practices. Uniqueness: Data should be unique and not duplicated within the same dataset. Referential integrity: Does every row that depends on a dimension in the fact table actually have its associated dimension? (i.e., foreign keys without a primary) For example, let's say the dimension is \"city\"- then if we have a fact table referencing Seattle, and then delete the Seattle dimension, we need to go delete Seattle from the facts Orderliness: Data should be organized in a logical and consistent manner, making it easy to search, retrieve, and analyze.","title":"Characteristics"},{"location":"non-functional-requirements/data-integrity/#implementations","text":"Data validation: Implement validation rules at the data entry points to ensure that only accurate and valid data is accepted into the system. This includes checks for data type, format, range, and consistency. Data logging and auditing: Implement logging mechanisms to record all data-related activities, including data modifications, access attempts, and system events. Regularly review audit logs to detect any unauthorized or suspicious activities. Data quality monitoring: Establish data quality monitoring processes to continuously evaluate the accuracy, completeness, and consistency of data. Implement automated checks and alerts to identify and address data quality issues in real-time. Database constraints: Utilize database constraints such as primary keys, foreign keys, unique constraints, and check constraints to enforce data integrity rules at the database level. Regular data backups: Implement regular backups of data to prevent loss in case of system failures, errors, or security breaches. Ensure that backup procedures are automated, monitored, and regularly tested.","title":"Implementations"},{"location":"non-functional-requirements/data-integrity/#resources","text":"Great Expectations : A framework to build data validations and test the quality of your data.","title":"Resources"},{"location":"non-functional-requirements/disaster-recovery/","text":"Disaster Recovery and Continuity Disaster Recovery (DR) focuses on the processes and technologies required to restore IT systems and data after a catastrophic event, such as a natural disaster, cyber attack, or hardware failure. It involves regular backups, failover procedures, and recovery plans that enable a swift return to normal operations. Business Continuity (BC), on the other hand, encompasses a broader scope, ensuring that essential business functions can continue during and after a disaster. This includes not only IT systems but also processes, personnel, and physical infrastructure. Together, DR and BC strategies are vital for minimizing downtime, protecting data integrity, and maintaining customer trust and operational stability. 
They ensure that an organization can quickly recover from disruptions and continue providing critical services, safeguarding both its reputation and financial health. Characteristics Recovery Time Objective (RTO) : This defines the maximum acceptable amount of time it should take to restore a system after a disaster. RTO sets the target for how quickly systems and applications must be back online to minimize impact on the business. Recovery Point Objective (RPO) : This specifies the maximum acceptable amount of data loss measured in time. RPO determines how frequently data backups should occur to ensure that data loss remains within acceptable limits. Backup and Restore Procedures : Effective DR involves robust backup procedures, including regular, automated backups of critical data and systems. These backups must be stored securely, often in off-site or cloud locations, and tested regularly to ensure they can be restored as needed. Failover Mechanisms : These are automated processes that switch operations to a standby system or site in the event of a failure. Failover mechanisms ensure continuity of service by redirecting workloads to backup systems without significant downtime. Redundancy : DR plans often include redundant systems and infrastructure to eliminate single points of failure. This can involve duplicate hardware, network paths, and data storage locations. Disaster Recovery Plan (DRP) : A comprehensive DRP outlines the specific steps, roles, and responsibilities involved in responding to a disaster. It includes detailed procedures for data recovery, system restoration, and communication protocols. Testing and Drills : Regular testing and simulation drills are essential to validate the effectiveness of the DR plan. This helps identify potential weaknesses and ensures that staff are familiar with the recovery procedures. Communication Plan : Effective DR includes a clear communication strategy for notifying stakeholders, including employees, customers, and partners, about the status of recovery efforts and expected timelines for restoration. Scalability : The DR plan should be scalable to accommodate changes in the business environment, such as growth in data volume or expansion to new geographic locations. This ensures that the recovery strategy remains effective as the organization evolves. Compliance and Regulatory Requirements : DR plans must adhere to relevant industry standards and regulatory requirements, ensuring that recovery processes meet legal and compliance obligations. Cost Considerations : Balancing the costs associated with implementing and maintaining DR capabilities against the potential losses from downtime and data loss is crucial. Effective DR planning considers cost-efficiency while ensuring robust protection. Implementations Implementing disaster recovery (DR) involves a combination of strategies, technologies, and practices designed to restore systems and data quickly and effectively after a catastrophic event. Here are some examples: Cloud Backups : Store backup copies of data in the cloud, ensuring they are accessible from anywhere and providing geographic redundancy. Disaster Recovery as a Service (DRaaS) : Utilize DRaaS providers that offer comprehensive disaster recovery solutions, including automated failover to cloud-based systems. Failover and Redundancy : Hot Site : Maintain a fully operational, geographically separate duplicate of your primary site that can take over immediately in case of a disaster. 
Cold Site : Have an alternate site with necessary infrastructure but without active systems or data, ready to be brought online when needed. Warm Site : A compromise between hot and cold sites, with partially prepared systems that require some setup before use. Virtualization and Snapshots : Virtual Machine (VM) Snapshots : Regularly take snapshots of virtual machines, allowing for quick rollback to a known good state. VM Replication : Continuously replicate VMs to a secondary location, ensuring up-to-date copies are ready to take over if the primary site fails. Automated Failover Systems : High Availability Clusters : Implement clusters of servers that automatically detect failures and shift workloads to healthy nodes without manual intervention. Load Balancers : Use load balancers to distribute traffic across multiple servers, ensuring continuous service availability even if one server fails. Data Replication : Ensure that data is simultaneously written to primary and secondary locations, maintaining real-time consistency between sites. Regular Testing and Drills : Conduct regular simulation drills to test the effectiveness of the DR plan and to ensure that all team members are familiar with their roles. Comprehensive Documentation : Develop run books with step-by-step instructions for executing the DR plan, tailored to specific scenarios and systems. Resources Azure Site Recovery","title":"Disaster Recovery and Continuity"},{"location":"non-functional-requirements/disaster-recovery/#disaster-recovery-and-continuity","text":"Disaster Recovery (DR) focuses on the processes and technologies required to restore IT systems and data after a catastrophic event, such as a natural disaster, cyber attack, or hardware failure. It involves regular backups, failover procedures, and recovery plans that enable a swift return to normal operations. Business Continuity (BC), on the other hand, encompasses a broader scope, ensuring that essential business functions can continue during and after a disaster. This includes not only IT systems but also processes, personnel, and physical infrastructure. Together, DR and BC strategies are vital for minimizing downtime, protecting data integrity, and maintaining customer trust and operational stability. They ensure that an organization can quickly recover from disruptions and continue providing critical services, safeguarding both its reputation and financial health.","title":"Disaster Recovery and Continuity"},{"location":"non-functional-requirements/disaster-recovery/#characteristics","text":"Recovery Time Objective (RTO) : This defines the maximum acceptable amount of time it should take to restore a system after a disaster. RTO sets the target for how quickly systems and applications must be back online to minimize impact on the business. Recovery Point Objective (RPO) : This specifies the maximum acceptable amount of data loss measured in time. RPO determines how frequently data backups should occur to ensure that data loss remains within acceptable limits. Backup and Restore Procedures : Effective DR involves robust backup procedures, including regular, automated backups of critical data and systems. These backups must be stored securely, often in off-site or cloud locations, and tested regularly to ensure they can be restored as needed. Failover Mechanisms : These are automated processes that switch operations to a standby system or site in the event of a failure. 
Failover mechanisms ensure continuity of service by redirecting workloads to backup systems without significant downtime. Redundancy : DR plans often include redundant systems and infrastructure to eliminate single points of failure. This can involve duplicate hardware, network paths, and data storage locations. Disaster Recovery Plan (DRP) : A comprehensive DRP outlines the specific steps, roles, and responsibilities involved in responding to a disaster. It includes detailed procedures for data recovery, system restoration, and communication protocols. Testing and Drills : Regular testing and simulation drills are essential to validate the effectiveness of the DR plan. This helps identify potential weaknesses and ensures that staff are familiar with the recovery procedures. Communication Plan : Effective DR includes a clear communication strategy for notifying stakeholders, including employees, customers, and partners, about the status of recovery efforts and expected timelines for restoration. Scalability : The DR plan should be scalable to accommodate changes in the business environment, such as growth in data volume or expansion to new geographic locations. This ensures that the recovery strategy remains effective as the organization evolves. Compliance and Regulatory Requirements : DR plans must adhere to relevant industry standards and regulatory requirements, ensuring that recovery processes meet legal and compliance obligations. Cost Considerations : Balancing the costs associated with implementing and maintaining DR capabilities against the potential losses from downtime and data loss is crucial. Effective DR planning considers cost-efficiency while ensuring robust protection.","title":"Characteristics"},{"location":"non-functional-requirements/disaster-recovery/#implementations","text":"Implementing disaster recovery (DR) involves a combination of strategies, technologies, and practices designed to restore systems and data quickly and effectively after a catastrophic event. Here are some examples: Cloud Backups : Store backup copies of data in the cloud, ensuring they are accessible from anywhere and providing geographic redundancy. Disaster Recovery as a Service (DRaaS) : Utilize DRaaS providers that offer comprehensive disaster recovery solutions, including automated failover to cloud-based systems. Failover and Redundancy : Hot Site : Maintain a fully operational, geographically separate duplicate of your primary site that can take over immediately in case of a disaster. Cold Site : Have an alternate site with necessary infrastructure but without active systems or data, ready to be brought online when needed. Warm Site : A compromise between hot and cold sites, with partially prepared systems that require some setup before use. Virtualization and Snapshots : Virtual Machine (VM) Snapshots : Regularly take snapshots of virtual machines, allowing for quick rollback to a known good state. VM Replication : Continuously replicate VMs to a secondary location, ensuring up-to-date copies are ready to take over if the primary site fails. Automated Failover Systems : High Availability Clusters : Implement clusters of servers that automatically detect failures and shift workloads to healthy nodes without manual intervention. Load Balancers : Use load balancers to distribute traffic across multiple servers, ensuring continuous service availability even if one server fails. 
Data Replication : Ensure that data is simultaneously written to primary and secondary locations, maintaining real-time consistency between sites. Regular Testing and Drills : Conduct regular simulation drills to test the effectiveness of the DR plan and to ensure that all team members are familiar with their roles. Comprehensive Documentation : Develop run books with step-by-step instructions for executing the DR plan, tailored to specific scenarios and systems.","title":"Implementations"},{"location":"non-functional-requirements/disaster-recovery/#resources","text":"Azure Site Recovery","title":"Resources"},{"location":"non-functional-requirements/internationalization/","text":"Internationalization and Localization Internationalization (i18n) and Localization (l10n) refer to the design and adaptation of software systems to support multiple languages, cultures, and regions, ensuring usability and compliance with local preferences and regulations. Characteristics Main Characteristics of Internationalization Text Externalization: Moving all user-facing text to external resource files to facilitate easy translation. Unicode Support: Using Unicode or another character encoding that supports all necessary scripts and characters. Date and Time Formatting: Designing the system to handle various date and time formats. Number and Currency Formatting: Ensuring that numbers and currencies can be displayed according to local conventions. Locale-Sensitive Data Processing: Adapting data processing to respect locale-specific rules, such as sorting and case conversion. Bidirectional Text Support: Supporting both left-to-right (LTR) and right-to-left (RTL) text orientations where necessary. Main Characteristics of Localization Translation: Converting text and UI elements to the target language. Cultural Adaptation: Adapting content and design elements to align with local cultural norms and expectations. Legal and Regulatory Compliance: Ensuring that the application meets local legal requirements, such as privacy laws and accessibility standards. Testing in Context: Testing the localized version of the application in its intended locale to ensure proper functionality and usability. Localized User Interfaces: Adjusting the layout and design to accommodate text expansion or contraction and to suit cultural preferences. Help and Documentation: Providing user assistance and documentation in the target language and context. Implementations Resource Bundles: Using resource bundles to store locale-specific text and data. Translation Management Systems: Employing tools and platforms to manage translations and streamline the localization workflow. Locale-Aware Libraries: Leveraging libraries and frameworks that provide built-in support for handling locale-specific data. Automated Testing: Implementing automated tests to verify that the software behaves correctly in different locales. Continuous Localization: Integrating localization processes into the continuous integration/continuous deployment (CI/CD) pipeline to keep translations up-to-date. Coordinated Universal Time: When dealing with times, it is essential to always use UTC for internal storage and processing. Using UTC helps avoid issues related to time zone differences, daylight saving time changes, and other regional time adjustments. 
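A minimal sketch of the UTC guidance above, and of the consistent-internal-representation point that follows, using only the Python standard library; the `Europe/Berlin` zone, the display format, and the minor-unit price are illustrative choices, not requirements:

```python
from datetime import datetime, timezone
from decimal import Decimal
from zoneinfo import ZoneInfo

# Store and process timestamps in UTC; convert to a user's zone only for display.
created_at = datetime.now(timezone.utc)              # internal, canonical value
local_view = created_at.astimezone(ZoneInfo("Europe/Berlin"))
print(local_view.strftime("%d.%m.%Y %H:%M"))         # locale-specific rendering

# Keep money in a consistent internal representation (exact minor units) and
# apply locale-specific formatting only at the presentation layer.
price_minor_units = Decimal("1999")                  # 19.99 stored as cents
print(f"{price_minor_units / 100:.2f} EUR")          # formatting happens at display time
```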
Consistent Internal Representation: Store numbers and currency values in a consistent internal representation, such as a standardized numeric format or a base currency, and apply locale-specific formatting only when displaying data to the user. This prevents errors during calculations and data processing.","title":"Internationalization and Localization"},{"location":"non-functional-requirements/internationalization/#internationalization-and-localization","text":"Internationalization (i18n) and Localization (l10n) refer to the design and adaptation of software systems to support multiple languages, cultures, and regions, ensuring usability and compliance with local preferences and regulations.","title":"Internationalization and Localization"},{"location":"non-functional-requirements/internationalization/#characteristics","text":"","title":"Characteristics"},{"location":"non-functional-requirements/internationalization/#main-characteristics-of-internationalization","text":"Text Externalization: Moving all user-facing text to external resource files to facilitate easy translation. Unicode Support: Using Unicode or another character encoding that supports all necessary scripts and characters. Date and Time Formatting: Designing the system to handle various date and time formats. Number and Currency Formatting: Ensuring that numbers and currencies can be displayed according to local conventions. Locale-Sensitive Data Processing: Adapting data processing to respect locale-specific rules, such as sorting and case conversion. Bidirectional Text Support: Supporting both left-to-right (LTR) and right-to-left (RTL) text orientations where necessary.","title":"Main Characteristics of Internationalization"},{"location":"non-functional-requirements/internationalization/#main-characteristics-of-localization","text":"Translation: Converting text and UI elements to the target language. Cultural Adaptation: Adapting content and design elements to align with local cultural norms and expectations. Legal and Regulatory Compliance: Ensuring that the application meets local legal requirements, such as privacy laws and accessibility standards. Testing in Context: Testing the localized version of the application in its intended locale to ensure proper functionality and usability. Localized User Interfaces: Adjusting the layout and design to accommodate text expansion or contraction and to suit cultural preferences. Help and Documentation: Providing user assistance and documentation in the target language and context.","title":"Main Characteristics of Localization"},{"location":"non-functional-requirements/internationalization/#implementations","text":"Resource Bundles: Using resource bundles to store locale-specific text and data. Translation Management Systems: Employing tools and platforms to manage translations and streamline the localization workflow. Locale-Aware Libraries: Leveraging libraries and frameworks that provide built-in support for handling locale-specific data. Automated Testing: Implementing automated tests to verify that the software behaves correctly in different locales. Continuous Localization: Integrating localization processes into the continuous integration/continuous deployment (CI/CD) pipeline to keep translations up-to-date. Coordinated Universal Time: When dealing with times, it is essential to always use UTC for internal storage and processing. Using UTC helps avoid issues related to time zone differences, daylight saving time changes, and other regional time adjustments. 
Consistent Internal Representation: Store numbers and currency values in a consistent internal representation, such as a standardized numeric format or a base currency, and apply locale-specific formatting only when displaying data to the user. This prevents errors during calculations and data processing.","title":"Implementations"},{"location":"non-functional-requirements/interoperability/","text":"Interoperability Interoperability refers to the ability of different software components or systems to seamlessly exchange and use information. It involves ensuring that the software can integrate effectively with other systems, regardless of their operating platforms, programming languages, or data formats. Characteristics Standardization: Adherence to industry standards, protocols, and specifications that enable consistent and compatible interactions between different software components or systems. Compatibility: The ability of systems to work together without requiring extensive modifications or adaptations, ensuring that data and operations can be shared effectively. Interface Definition: Well-defined interfaces and APIs that facilitate communication and data exchange between systems, abstracting complexities and promoting ease of integration. Data Format Consistency: Consistent handling and interpretation of data formats, ensuring that information exchanged between systems remains accurate and meaningful. Platform Agnosticism: Capability to operate across different hardware platforms, operating systems, and environments without dependency on specific technologies or configurations. Implementations An interoperable solution facilitates seamless communication and data exchange between heterogeneous systems. Here are some of the implementations: Providing RESTful APIs. Using data formats and standards such as JSON schemas. Utilizing libraries and frameworks that provide cross-platform support and abstraction layers for common functionalities. Adhering to industry standards (e.g., ISO, IEEE) and governance frameworks that define interoperability requirements, protocols, and best practices for seamless integration.","title":"Interoperability"},{"location":"non-functional-requirements/interoperability/#interoperability","text":"Interoperability refers to the ability of different software components or systems to seamlessly exchange and use information. It involves ensuring that the software can integrate effectively with other systems, regardless of their operating platforms, programming languages, or data formats.","title":"Interoperability"},{"location":"non-functional-requirements/interoperability/#characteristics","text":"Standardization: Adherence to industry standards, protocols, and specifications that enable consistent and compatible interactions between different software components or systems. Compatibility: The ability of systems to work together without requiring extensive modifications or adaptations, ensuring that data and operations can be shared effectively. Interface Definition: Well-defined interfaces and APIs that facilitate communication and data exchange between systems, abstracting complexities and promoting ease of integration. Data Format Consistency: Consistent handling and interpretation of data formats, ensuring that information exchanged between systems remains accurate and meaningful. 
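One way to make the data format consistency and JSON-schema points above concrete is to validate every exchanged message against a shared contract on both sides of an integration. A sketch assuming the `jsonschema` package and a hypothetical order contract:

```python
from jsonschema import ValidationError, validate

# Hypothetical message contract shared between two services; JSON Schema is
# used here as one widely supported standard for describing the format.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "integer", "minimum": 1},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "currency", "amount"],
}

message = {"order_id": 42, "currency": "USD", "amount": 19.99}

try:
    validate(instance=message, schema=ORDER_SCHEMA)  # both producer and consumer run this check
except ValidationError as err:
    print(f"incompatible payload: {err.message}")
```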
Platform Agnosticism: Capability to operate across different hardware platforms, operating systems, and environments without dependency on specific technologies or configurations.","title":"Characteristics"},{"location":"non-functional-requirements/interoperability/#implementations","text":"An interoperable solution facilitates seamless communication and data exchange between heterogeneous systems. Here are some of the implementations: Providing RESTful APIs. Using data formats and standards such as JSON schemas. Utilizing libraries and frameworks that provide cross-platform support and abstraction layers for common functionalities. Adhering to industry standards (e.g., ISO, IEEE) and governance frameworks that define interoperability requirements, protocols, and best practices for seamless integration.","title":"Implementations"},{"location":"non-functional-requirements/maintainability/","text":"Maintainability Maintainability is the ease with which a software system can be modified, updated, extended, or repaired over time. It impacts the long-term viability and sustainability of a software system. A maintainable system is one that is easy to understand, has clear and modular code, is well-documented, and has a low risk of introducing errors when changes are made. Characteristics Modularity: The software is divided into discrete, independent modules or components, each with a clear and specific functionality. This makes it easier to modify or replace individual parts without affecting the entire system. Readability: Code is written clearly and concisely, following consistent naming conventions, coding standards, and documentation practices. Readable code is easier for developers to understand, troubleshoot, and enhance. Testability: The software is designed to support thorough testing, with components that can be tested independently. This includes unit tests, integration tests, and automated testing frameworks that facilitate ongoing validation of the software's behavior. Documentation: Comprehensive and up-to-date documentation is provided, docstrings, design documents, user manuals, and API references. Good documentation helps developers understand the system's structure, functionality, and dependencies. Simplicity: The design and implementation of the software are kept as simple as possible, avoiding unnecessary complexity. Simple systems are easier to understand, maintain, and extend. Consistency: Consistent use of design patterns, coding practices, language best practices, and architectural principles throughout the software. Consistency reduces the learning curve for new developers and helps maintain uniform quality across the codebase. Configurability: The software allows configuration through external files or settings rather than hard-coded values. This makes it easier to adapt the software to different environments or requirements without changing the code. Dependency Management: Proper management of dependencies ensures that external libraries or components can be updated or replaced without major disruptions. This includes using dependency injection, version control, and modular design. Additionally, version management for your own code will ensure consistent and reliable releases. Error Handling and Logging: Robust error handling and logging mechanisms are in place to facilitate debugging and maintenance. This includes meaningful error messages, exception handling, and comprehensive logging of system events and errors. 
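A small sketch of the error-handling and logging characteristic above, using Python's standard `logging` and `json` modules; the settings-file scenario and fallback behavior are illustrative:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger(__name__)

def load_settings(path: str) -> dict:
    """Read a settings file, emitting meaningful error messages and full tracebacks."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        logger.error("Settings file %s is missing; falling back to defaults", path)
        return {}
    except json.JSONDecodeError:
        logger.exception("Settings file %s is not valid JSON", path)
        raise
```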
Implementations Implementing maintainability in software systems involves adopting practices, tools, and methodologies that facilitate efficient modification, extension, and troubleshooting of the software over its lifecycle. Consistent Naming Conventions: Use meaningful and consistent names for variables, functions, classes, and other entities. Code Formatting: Follow consistent code formatting rules to enhance readability. Code Reviews: Conduct regular code reviews to ensure adherence to standards and to share knowledge among team members. External Documentation: Maintain up-to-date documentation, including design documents, user manuals, and API references . There are tools to assist with that like Swagger or Postman. README Files: Provide README files in repositories to guide new developers on setup, usage, and contribution guidelines. Automated Testing: Provide unit test, end-to-end tests, smoke and integration tests as well as continuous integration practices. Code Refactoring: Regularly refactor code to improve its structure, readability, and maintainability without changing its external behavior. Implementing pre-commit hooks in the pipelines to automate the monitoring of code refactoring tasks, like forcing coding standards, run static code analysis, linting, etc.","title":"Maintainability"},{"location":"non-functional-requirements/maintainability/#maintainability","text":"Maintainability is the ease with which a software system can be modified, updated, extended, or repaired over time. It impacts the long-term viability and sustainability of a software system. A maintainable system is one that is easy to understand, has clear and modular code, is well-documented, and has a low risk of introducing errors when changes are made.","title":"Maintainability"},{"location":"non-functional-requirements/maintainability/#characteristics","text":"Modularity: The software is divided into discrete, independent modules or components, each with a clear and specific functionality. This makes it easier to modify or replace individual parts without affecting the entire system. Readability: Code is written clearly and concisely, following consistent naming conventions, coding standards, and documentation practices. Readable code is easier for developers to understand, troubleshoot, and enhance. Testability: The software is designed to support thorough testing, with components that can be tested independently. This includes unit tests, integration tests, and automated testing frameworks that facilitate ongoing validation of the software's behavior. Documentation: Comprehensive and up-to-date documentation is provided, docstrings, design documents, user manuals, and API references. Good documentation helps developers understand the system's structure, functionality, and dependencies. Simplicity: The design and implementation of the software are kept as simple as possible, avoiding unnecessary complexity. Simple systems are easier to understand, maintain, and extend. Consistency: Consistent use of design patterns, coding practices, language best practices, and architectural principles throughout the software. Consistency reduces the learning curve for new developers and helps maintain uniform quality across the codebase. Configurability: The software allows configuration through external files or settings rather than hard-coded values. This makes it easier to adapt the software to different environments or requirements without changing the code. 
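To illustrate the configurability point above: resolve settings from the environment (or an external settings file) rather than hard-coding them, so the same code adapts to different environments without changes. A minimal sketch with hypothetical setting names and defaults:

```python
import os

# Defaults live in one place; each deployment overrides them through the
# environment (or an external settings file) instead of code changes.
DEFAULTS = {"API_BASE_URL": "http://localhost:8080", "REQUEST_TIMEOUT_SECONDS": "30"}

def get_setting(name: str) -> str:
    """Resolve a setting from the environment, falling back to the documented default."""
    return os.environ.get(name, DEFAULTS[name])

timeout = int(get_setting("REQUEST_TIMEOUT_SECONDS"))
base_url = get_setting("API_BASE_URL")
print(f"calling {base_url} with a {timeout}s timeout")
```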
Dependency Management: Proper management of dependencies ensures that external libraries or components can be updated or replaced without major disruptions. This includes using dependency injection, version control, and modular design. Additionally, version management for your own code will ensure consistent and reliable releases. Error Handling and Logging: Robust error handling and logging mechanisms are in place to facilitate debugging and maintenance. This includes meaningful error messages, exception handling, and comprehensive logging of system events and errors.","title":"Characteristics"},{"location":"non-functional-requirements/maintainability/#implementations","text":"Implementing maintainability in software systems involves adopting practices, tools, and methodologies that facilitate efficient modification, extension, and troubleshooting of the software over its lifecycle. Consistent Naming Conventions: Use meaningful and consistent names for variables, functions, classes, and other entities. Code Formatting: Follow consistent code formatting rules to enhance readability. Code Reviews: Conduct regular code reviews to ensure adherence to standards and to share knowledge among team members. External Documentation: Maintain up-to-date documentation, including design documents, user manuals, and API references . There are tools to assist with that like Swagger or Postman. README Files: Provide README files in repositories to guide new developers on setup, usage, and contribution guidelines. Automated Testing: Provide unit test, end-to-end tests, smoke and integration tests as well as continuous integration practices. Code Refactoring: Regularly refactor code to improve its structure, readability, and maintainability without changing its external behavior. Implementing pre-commit hooks in the pipelines to automate the monitoring of code refactoring tasks, like forcing coding standards, run static code analysis, linting, etc.","title":"Implementations"},{"location":"non-functional-requirements/performance/","text":"Performance Performance refers to the responsiveness, efficiency, and speed with which a system completes tasks and processes user requests. It encompasses several key metrics such as response time, throughput, latency, and resource utilization. Characteristics Response Time: The time taken by the system to respond to user interactions or requests. Lower response times indicate better performance and user responsiveness. Throughput: The rate at which the system can process and handle a certain volume of transactions or requests within a given time frame. Higher throughput signifies greater processing capacity and efficiency. Latency: The delay or time lag experienced between initiating a request and receiving a response. Low latency is crucial for real-time applications to ensure timely interactions. Scalability: The system's ability to handle increasing workload or user demand by scaling resources (horizontal or vertical scaling) without impacting performance negatively. Concurrency: The system's capability to handle multiple concurrent users or tasks efficiently without significant degradation in performance. This involves managing resources such as CPU, memory, and network bandwidth effectively. Resource Utilization: Efficient utilization of hardware resources (e.g., CPU, memory, disk) to maximize performance without unnecessary overhead or bottlenecks. 
Stability: Consistency and reliability of performance over time and under varying conditions, ensuring predictable behavior and minimal downtime. Fault Tolerance: The system's ability to continue operating or recover gracefully from failures or disruptions without significant impact on performance or user experience. Load Handling: How well the system manages and distributes workload during peak usage periods to maintain optimal performance levels. Implementations Implementing performance involves a combination of architectural decisions, coding practices, infrastructure setup, and optimization techniques. For example: Efficient Algorithms and Data Structures: Choosing algorithms and data structures that are optimized for the specific tasks and operations performed by the system can significantly improve performance. This includes selecting algorithms with lower time complexity (e.g., O(1), O(log n)) for critical operations. Code Optimization: Writing efficient and optimized code reduces execution time and resource consumption. Techniques such as minimizing loops, reducing unnecessary computations, and using appropriate data types can improve performance. Concurrency: Implementing concurrency models such as threads and async-await techniques optimizes task execution by allowing the system to handle multiple operations simultaneously. Parallel Programming: Enables tasks to be divided into smaller subtasks that can execute concurrently on multi-core processors. This method improves computational efficiency and accelerates the completion of tasks. Caching: Implementing caching mechanisms (e.g., in-memory caching, content delivery networks) to store and retrieve frequently accessed data or computations reduces the need to fetch data from slower storage systems, thereby improving response time and overall system performance. Database Optimization: Optimizing database queries, indexing frequently accessed data, denormalizing data where appropriate, and using database scaling techniques (e.g., sharding, replication) can enhance database performance and reduce latency. Network Optimization: Minimizing network latency by optimizing network protocols, reducing the number of network requests, compressing data where feasible, and leveraging content delivery networks (CDNs) for static content delivery. Load Balancing: Distributing incoming traffic evenly across multiple servers or instances using load balancers ensures optimal resource utilization and prevents overload on any single component, improving overall system performance and availability. Scalable Architecture: Designing the system with scalability in mind allows it to handle increased workload by adding resources dynamically (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using microservices architecture, containerization (e.g., Docker), and orchestration tools (e.g., Kubernetes) for efficient resource management. Performance Testing: Performing rigorous performance tests to pinpoint bottlenecks, measure critical metrics like response time and throughput, and validate system performance across varying load scenarios. Continuous Monitoring: Implementing ongoing monitoring of performance metrics to identify performance degradation. Resources Automated Testing","title":"Performance"},{"location":"non-functional-requirements/performance/#performance","text":"Performance refers to the responsiveness, efficiency, and speed with which a system completes tasks and processes user requests. 
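For the concurrency technique listed above, a short async-await sketch using Python's `asyncio`; `fetch_user` is a placeholder for any I/O-bound call such as a database query or HTTP request:

```python
import asyncio

async def fetch_user(user_id: int) -> dict:
    # Placeholder for an I/O-bound call (database query, HTTP request, ...).
    await asyncio.sleep(0.1)
    return {"id": user_id}

async def main() -> None:
    # Overlap independent I/O-bound requests instead of waiting on each in turn.
    users = await asyncio.gather(*(fetch_user(i) for i in range(10)))
    print(f"fetched {len(users)} users concurrently")

asyncio.run(main())
```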
It encompasses several key metrics such as response time, throughput, latency, and resource utilization.","title":"Performance"},{"location":"non-functional-requirements/performance/#characteristics","text":"Response Time: The time taken by the system to respond to user interactions or requests. Lower response times indicate better performance and user responsiveness. Throughput: The rate at which the system can process and handle a certain volume of transactions or requests within a given time frame. Higher throughput signifies greater processing capacity and efficiency. Latency: The delay or time lag experienced between initiating a request and receiving a response. Low latency is crucial for real-time applications to ensure timely interactions. Scalability: The system's ability to handle increasing workload or user demand by scaling resources (horizontal or vertical scaling) without impacting performance negatively. Concurrency: The system's capability to handle multiple concurrent users or tasks efficiently without significant degradation in performance. This involves managing resources such as CPU, memory, and network bandwidth effectively. Resource Utilization: Efficient utilization of hardware resources (e.g., CPU, memory, disk) to maximize performance without unnecessary overhead or bottlenecks. Stability: Consistency and reliability of performance over time and under varying conditions, ensuring predictable behavior and minimal downtime. Fault Tolerance: The system's ability to continue operating or recover gracefully from failures or disruptions without significant impact on performance or user experience. Load Handling: How well the system manages and distributes workload during peak usage periods to maintain optimal performance levels.","title":"Characteristics"},{"location":"non-functional-requirements/performance/#implementations","text":"Implementing performance involves a combination of architectural decisions, coding practices, infrastructure setup, and optimization techniques. For example: Efficient Algorithms and Data Structures: Choosing algorithms and data structures that are optimized for the specific tasks and operations performed by the system can significantly improve performance. This includes selecting algorithms with lower time complexity (e.g., O(1), O(log n)) for critical operations. Code Optimization: Writing efficient and optimized code reduces execution time and resource consumption. Techniques such as minimizing loops, reducing unnecessary computations, and using appropriate data types can improve performance. Concurrency: Implementing concurrency models such as threads and async-await techniques optimizes task execution by allowing the system to handle multiple operations simultaneously. Parallel Programming: Enables tasks to be divided into smaller subtasks that can execute concurrently on multi-core processors. This method improves computational efficiency and accelerates the completion of tasks. Caching: Implementing caching mechanisms (e.g., in-memory caching, content delivery networks) to store and retrieve frequently accessed data or computations reduces the need to fetch data from slower storage systems, thereby improving response time and overall system performance. Database Optimization: Optimizing database queries, indexing frequently accessed data, denormalizing data where appropriate, and using database scaling techniques (e.g., sharding, replication) can enhance database performance and reduce latency. 
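The caching technique in the implementations above can be as simple as memoizing expensive lookups in memory. A sketch using `functools.lru_cache`; the exchange-rate lookup is a stand-in for any slow call, and the cache size is an arbitrary example:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def exchange_rate(currency: str) -> float:
    # Placeholder for a slow lookup (remote API, database); results are cached
    # in memory so repeated requests for the same key skip the expensive call.
    time.sleep(0.5)
    return {"USD": 1.0, "EUR": 1.08}.get(currency, 1.0)

exchange_rate("EUR")            # slow: first call populates the cache
exchange_rate("EUR")            # fast: served from the in-memory cache
print(exchange_rate.cache_info())
```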
Network Optimization: Minimizing network latency by optimizing network protocols, reducing the number of network requests, compressing data where feasible, and leveraging content delivery networks (CDNs) for static content delivery. Load Balancing: Distributing incoming traffic evenly across multiple servers or instances using load balancers ensures optimal resource utilization and prevents overload on any single component, improving overall system performance and availability. Scalable Architecture: Designing the system with scalability in mind allows it to handle increased workload by adding resources dynamically (horizontal scaling) or upgrading existing resources (vertical scaling). This involves using microservices architecture, containerization (e.g., Docker), and orchestration tools (e.g., Kubernetes) for efficient resource management. Performance Testing: Performing rigorous performance tests to pinpoint bottlenecks, measure critical metrics like response time and throughput, and validate system performance across varying load scenarios. Continuous Monitoring: Implementing ongoing monitoring of performance metrics to identify performance degradation.","title":"Implementations"},{"location":"non-functional-requirements/performance/#resources","text":"Automated Testing","title":"Resources"},{"location":"non-functional-requirements/portability/","text":"Portability Portability refers to the ease with which software can be transferred and used in different environments or platforms without requiring significant modification. This includes moving the software across various hardware, operating systems, cloud services, or development frameworks while maintaining its functionality, performance, and usability. Characteristics Platform Independence: The ability of the software to run on different operating systems, hardware architectures, and devices without requiring major changes. Minimal Modification: The need for minimal code changes or reconfiguration when moving the software to a different environment. Standard Compliance: Adherence to industry standards and protocols to ensure compatibility across different systems and platforms. Environment Abstraction: Use of abstraction layers or frameworks that isolate the software from specific platform details, making it easier to adapt to different environments. Configuration Flexibility: Ease of modifying configuration settings to suit different environments without altering the core software code. Dependency Management: Efficient handling of external dependencies, ensuring that required libraries, tools, and services are available or can be easily obtained in the new environment. Packaging and Distribution: Efficient packaging methods, such as containerization (e.g., Docker), that encapsulate the software and its dependencies to facilitate deployment in diverse environments. Modular Design: Designing the software in a modular way, where components can be independently developed, tested, and deployed, enhancing the ease of porting parts of the system. Implementations Containerization Docker: Packaging applications and their dependencies into containers, ensuring consistent behavior across different environments. Kubernetes: Orchestrating containerized applications for deployment across various cloud providers and on-premises infrastructures. Virtual Machines Java Virtual Machine (JVM): Writing software in Java or other JVM languages to run on any system with a compatible JVM. 
VirtualBox or VMware: Using virtual machines to create consistent runtime environments regardless of the underlying hardware. Platform-Agnostic Languages Python, JavaScript, and Go: Utilizing programming languages known for their cross-platform capabilities to ensure code runs on multiple operating systems with little to no modification. However, it's important to select a programming language that aligns with the project's requirements and team expertise. Standardized Interfaces and Protocols APIs: Designing APIs with standardized protocols (e.g., REST, GraphQL) to facilitate interaction between different systems. Data Interchange Formats: Using common data formats like JSON, XML, or Protocol Buffers to ensure data can be exchanged and understood across different systems. Other Practices Debugging and Troubleshooting: Local debugging provides direct access to debugging tools and logs, making it easier to diagnose and resolve issues quickly. CI/CD Integration: Implementing a CI/CD pipeline to automate the building, testing, and packaging of the solution enhances portability by ensuring consistent and reliable deployments across various platforms and environments.","title":"Portability"},{"location":"non-functional-requirements/portability/#portability","text":"Portability refers to the ease with which software can be transferred and used in different environments or platforms without requiring significant modification. This includes moving the software across various hardware, operating systems, cloud services, or development frameworks while maintaining its functionality, performance, and usability.","title":"Portability"},{"location":"non-functional-requirements/portability/#characteristics","text":"Platform Independence: The ability of the software to run on different operating systems, hardware architectures, and devices without requiring major changes. Minimal Modification: The need for minimal code changes or reconfiguration when moving the software to a different environment. Standard Compliance: Adherence to industry standards and protocols to ensure compatibility across different systems and platforms. Environment Abstraction: Use of abstraction layers or frameworks that isolate the software from specific platform details, making it easier to adapt to different environments. Configuration Flexibility: Ease of modifying configuration settings to suit different environments without altering the core software code. Dependency Management: Efficient handling of external dependencies, ensuring that required libraries, tools, and services are available or can be easily obtained in the new environment. Packaging and Distribution: Efficient packaging methods, such as containerization (e.g., Docker), that encapsulate the software and its dependencies to facilitate deployment in diverse environments. Modular Design: Designing the software in a modular way, where components can be independently developed, tested, and deployed, enhancing the ease of porting parts of the system.","title":"Characteristics"},{"location":"non-functional-requirements/portability/#implementations","text":"","title":"Implementations"},{"location":"non-functional-requirements/portability/#containerization","text":"Docker: Packaging applications and their dependencies into containers, ensuring consistent behavior across different environments. 
Kubernetes: Orchestrating containerized applications for deployment across various cloud providers and on-premises infrastructures.","title":"Containerization"},{"location":"non-functional-requirements/portability/#virtual-machines","text":"Java Virtual Machine (JVM): Writing software in Java or other JVM languages to run on any system with a compatible JVM. VirtualBox or VMware: Using virtual machines to create consistent runtime environments regardless of the underlying hardware.","title":"Virtual Machines"},{"location":"non-functional-requirements/portability/#platform-agnostic-languages","text":"Python, JavaScript, and Go: Utilizing programming languages known for their cross-platform capabilities to ensure code runs on multiple operating systems with little to no modification. However, it's important to select a programming language that aligns with the project's requirements and team expertise.","title":"Platform-Agnostic Languages"},{"location":"non-functional-requirements/portability/#standardized-interfaces-and-protocols","text":"APIs: Designing APIs with standardized protocols (e.g., REST, GraphQL) to facilitate interaction between different systems. Data Interchange Formats: Using common data formats like JSON, XML, or Protocol Buffers to ensure data can be exchanged and understood across different systems.","title":"Standardized Interfaces and Protocols"},{"location":"non-functional-requirements/portability/#other-practices","text":"Debugging and Troubleshooting: Local debugging provides direct access to debugging tools and logs, making it easier to diagnose and resolve issues quickly. CI/CD Integration: Implementing a CI/CD pipeline to automate the building, testing, and packaging of the solution enhances portability by ensuring consistent and reliable deployments across various platforms and environments.","title":"Other Practices"},{"location":"non-functional-requirements/reliability/","text":"Reliability All the other ISE Engineering Fundamentals work towards a more reliable infrastructure. Automated integration and deployment ensures code is properly tested, and helps remove human error, while slow releases build confidence in the code. Observability helps more quickly pinpoint errors when they arise to get back to a stable state, and so on. However, there are some additional steps we can take, that don't neatly fit into the previous categories, to help ensure a more reliable solution. We'll explore these below. Remove \"Foot-Guns\" Prevent your dev team from shooting themselves in the foot. People make mistakes; any mistake made in production is not the fault of that person, it's the collective fault of the system to not prevent that mistake from happening. Check out the below list for some common tooling to remove these foot guns: In Kubernetes, leverage Admission Controllers to prevent \"bad things\" from happening. You can create custom controllers using the Webhook Admission controller. Gatekeeper is a pre-built Webhook Admission controller, leveraging OPA underneath the hood, with support for some out-of-the-box protections If a user ever makes a mistake, don't ask: \"how could somebody possibly do that?\", do ask: \"how can we prevent this from happening in the future?\" Autoscaling Whenever possible, leverage autoscaling for your deployments. Vertical autoscaling can scale your VMs by tuning parameters like CPU, disk, and RAM, while horizontal autoscaling can tune the number of running images backing your deployments. 
Autoscaling can help your system respond to inorganic growth in traffic, and prevent failing requests due to resource starvation. Note: In environments like K8s, both horizontal and vertical autoscaling are offered as a native solution. The VMs backing each Pod, however, may also need autoscaling to handle an increase in the number of Pods. It should also be noted that the parameters that affect autoscaling can be difficult to tune. Typical metrics like CPU or RAM utilization, or request rate may not be enough. Sometimes you might want to consider custom metrics, like cache eviction rate. Load shedding & DOS Protection Often we think of Denial of Service [DOS] attacks as an act from a malicious actor, so we place some load shedding at the gates to our system and call it a day. In reality, many DOS attacks are unintentional, and self-inflicted. A bad deployment that takes down a Cache results in hammering downstream services. Polling from a distributed system synchronizes and results in a thundering herd . A misconfiguration results in an error which triggers clients to retry uncontrollably. Requests append to a stored object until it is so big that future reads crash the server. The list goes on. Follow these steps to protect yourself: Add a jitter (random) to any action that occurs from a non-user triggered flow (ie: add a random duration to the sleep in a cron, or job that continuously polls a downstream service). Implement exponential backoff retry policies in your client code Add load shedding to your servers (yes, your internal microservices too). This can be configured easily when leveraging a sidecar like envoy. Be careful when deserializing user requests, and use buffer limits. ie: HTTP/gRPC Servers can set limits on how much data will get read from the socket. Set alerts for utilization, servers restarting, or going offline to detect when your system may be failing. These types of errors can result in Cascading Failures, where a non-critical portion of your system takes down the entire service. Plan accordingly, and make sure to put extra thought into how your system might degrade during failures. Backup Data Data gets lost, corrupted, or accidentally deleted. It happens. Take data backups to help get your system back up online as soon as possible. It can happen in the application stack, with code deleting or corrupting data, or at the storage layer by losing the volumes, or losing encryption keys. Consider things like: How long will it take to restore data. How much data loss can you tolerate. How long will it take you to notice there is data loss. Look into the difference between snapshot and incremental backups. A good policy might be to take incremental backups on a period of N, and a snapshot backup on a period of M (where N < M). Target Uptime & Failing Gracefully It's a known fact that systems cannot target 100% uptime. There are too many factors in today's software systems to achieve this, many outside of our control. Even a service that never gets updated and is 100% bug free will fail. Upstream DNS servers have issues all the time. Hardware breaks. Power outages, backup generators fail. The world is chaotic. Good services target some number of \"9's\" of uptime. ie: 99.99% uptime means that the system has a \"budget\" of 4 minutes and 22 seconds of downtime each month. Some months might achieve 100% uptime, which means that budget gets rolled over to the next month. What uptime means is different for everybody, and up to the service to define. 
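The jitter and exponential-backoff advice in the load-shedding steps above can be captured in a small retry helper. A sketch; `call_with_backoff` and its parameters are illustrative rather than a prescribed implementation, and production code would typically retry only on transient errors:

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff plus jitter, so many clients
    do not synchronize their retries into a thundering herd."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# A polling job would likewise add jitter to its sleep interval, e.g.
# time.sleep(poll_interval + random.uniform(0, poll_interval * 0.1))
```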
A good practice is to use any leftover budget at the end of the period (ie: year, quarter), to intentionally take that service down, and ensure that the rest of your systems fail as expected. Oftentimes other engineers and services come to rely on that additional achieved availability, and it can be healthy to ensure that systems fail gracefully. We can build graceful failure (or graceful degradation) into our software stack by anticipating failures. Some tactics include: Failover to healthy services Leader Election can be used to keep healthy services on standby in case the leader experiences issues. Entire cluster failover can redirect traffic to another region or availability zone. Propagate downstream failures of dependent services up the stack via health checks, so that your ingress points can re-route to healthy services. Circuit breakers can bail early on requests vs. propagating errors throughout the system. Consider using a well-known, tested library such as Polly (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns. Practice None of the above recommendations will work if they are not tested . Your backups are meaningless if you don't know how to mount them. Your cluster failover and other mitigations will regress over time if they are not tested. Here are some tips to test the above: Maintain Playbooks No software service is complete without playbooks to navigate the developers through unfamiliar territory. Playbooks should be thorough and cover all known failure scenarios and mitigations. Run Maintenance Exercises Take the time to fabricate scenarios, and run a D&D style campaign to solve your issues. This can be as elaborate as spinning up a new environment and injecting errors, or as simple as asking the \"players\" to navigate to a dashboard and describing what they would see in the fabricated scenario (small amounts of imagination required). The playbooks should easily navigate the user to the correct solution/mitigation. If not, update your playbooks. Chaos Testing Leverage automated chaos testing to see how things break. You can read this playbook's article on fault injection testing for more information on developing a hypothesis-driven suite of automated chaos tests. The following list of chaos testing tools as well as this section in the article linked above have more details on available platforms and tooling for this purpose: Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An Openshift-specific chaos tool, maintained by Redhat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Many service meshes, like Linkerd , offer fault injection tooling through the use of their sidecars. Chaos Mesh Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing. Analyze All Failures Writing up a post-mortem is a great way to document the root causes, and action items for your failures. They're also a great way to track recurring issues, and create a strong case for prioritizing fixes. 
This can even be tied into your regular Agile restrospectives .","title":"Reliability"},{"location":"non-functional-requirements/reliability/#reliability","text":"All the other ISE Engineering Fundamentals work towards a more reliable infrastructure. Automated integration and deployment ensures code is properly tested, and helps remove human error, while slow releases build confidence in the code. Observability helps more quickly pinpoint errors when they arise to get back to a stable state, and so on. However, there are some additional steps we can take, that don't neatly fit into the previous categories, to help ensure a more reliable solution. We'll explore these below.","title":"Reliability"},{"location":"non-functional-requirements/reliability/#remove-foot-guns","text":"Prevent your dev team from shooting themselves in the foot. People make mistakes; any mistake made in production is not the fault of that person, it's the collective fault of the system to not prevent that mistake from happening. Check out the below list for some common tooling to remove these foot guns: In Kubernetes, leverage Admission Controllers to prevent \"bad things\" from happening. You can create custom controllers using the Webhook Admission controller. Gatekeeper is a pre-built Webhook Admission controller, leveraging OPA underneath the hood, with support for some out-of-the-box protections If a user ever makes a mistake, don't ask: \"how could somebody possibly do that?\", do ask: \"how can we prevent this from happening in the future?\"","title":"Remove \"Foot-Guns\""},{"location":"non-functional-requirements/reliability/#autoscaling","text":"Whenever possible, leverage autoscaling for your deployments. Vertical autoscaling can scale your VMs by tuning parameters like CPU, disk, and RAM, while horizontal autoscaling can tune the number of running images backing your deployments. Autoscaling can help your system respond to inorganic growth in traffic, and prevent failing requests due to resource starvation. Note: In environments like K8s, both horizontal and vertical autoscaling are offered as a native solution. The VMs backing each Pod however, may also need autoscaling to handle an increase in the number of Pods. It should also be noted that the parameters that affect autoscaling can be difficult to tune. Typical metrics like CPU or RAM utilization, or request rate may not be enough. Sometimes you might want to consider custom metrics, like cache eviction rate.","title":"Autoscaling"},{"location":"non-functional-requirements/reliability/#load-shedding-dos-protection","text":"Often we think of Denial of Service [DOS] attacks as an act from a malicious actor, so we place some load shedding at the gates to our system and call it a day. In reality, many DOS attacks are unintentional, and self-inflicted. A bad deployment that takes down a Cache results in hammering downstream services. Polling from a distributed system synchronizes and results in a thundering herd . A misconfiguration results in an error which triggers clients to retry uncontrollably. Requests append to a stored object until it is so big that future reads crash the server. The list goes on. Follow these steps to protect yourself: Add a jitter (random) to any action that occurs from a non-user triggered flow (ie: add a random duration to the sleep in a cron, or job that continuously polls a downstream service). 
Implement exponential backoff retry policies in your client code Add load shedding to your servers (yes, your internal microservices too). This can be configured easily when leveraging a sidecar like Envoy. Be careful when deserializing user requests, and use buffer limits. ie: HTTP/gRPC Servers can set limits on how much data will get read from the socket. Set alerts for utilization, servers restarting, or going offline to detect when your system may be failing. These types of errors can result in Cascading Failures, where a non-critical portion of your system takes down the entire service. Plan accordingly, and make sure to put extra thought into how your system might degrade during failures.","title":"Load shedding & DOS Protection"},{"location":"non-functional-requirements/reliability/#backup-data","text":"Data gets lost, corrupted, or accidentally deleted. It happens. Take data backups to help get your system back up online as soon as possible. It can happen in the application stack, with code deleting or corrupting data, or at the storage layer by losing the volumes, or losing encryption keys. Consider things like: How long will it take to restore data? How much data loss can you tolerate? How long will it take you to notice there is data loss? Look into the difference between snapshot and incremental backups. A good policy might be to take incremental backups on a period of N, and a snapshot backup on a period of M (where N < M).","title":"Backup Data"},{"location":"non-functional-requirements/reliability/#target-uptime-failing-gracefully","text":"It's a known fact that systems cannot target 100% uptime. There are too many factors in today's software systems to achieve this, many outside of our control. Even a service that never gets updated and is 100% bug free will fail. Upstream DNS servers have issues all the time. Hardware breaks. Power outages, backup generators fail. The world is chaotic. Good services target some number of \"9's\" of uptime. ie: 99.99% uptime means that the system has a \"budget\" of 4 minutes and 22 seconds of downtime each month. Some months might achieve 100% uptime, which means that budget gets rolled over to the next month. What uptime means is different for everybody, and up to the service to define. A good practice is to use any leftover budget at the end of the period (ie: year, quarter), to intentionally take that service down, and ensure that the rest of your systems fail as expected. Oftentimes other engineers and services come to rely on that additional achieved availability, and it can be healthy to ensure that systems fail gracefully. We can build graceful failure (or graceful degradation) into our software stack by anticipating failures. Some tactics include: Failover to healthy services Leader Election can be used to keep healthy services on standby in case the leader experiences issues. Entire cluster failover can redirect traffic to another region or availability zone. Propagate downstream failures of dependent services up the stack via health checks, so that your ingress points can re-route to healthy services. Circuit breakers can bail early on requests vs. propagating errors throughout the system.
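As a concrete illustration of the exponential backoff and jitter guidance above, here is a minimal sketch (Python; the operation and delay values are placeholders, and a production system would normally use a vetted resilience library rather than hand-rolled code):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a callable with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff capped at max_delay; full jitter avoids
            # synchronized retries (the thundering herd described above).
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```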
Consider using a well-known, tested library such as Polly (.NET) that enables configurable implementations of this and other common resilience and transient fault-handling patterns.","title":"Target Uptime & Failing Gracefully"},{"location":"non-functional-requirements/reliability/#practice","text":"None of the above recommendations will work if they are not tested . Your backups are meaningless if you don't know how to mount them. Your cluster failover and other mitigations will regress over time if they are not tested. Here are some tips to test the above:","title":"Practice"},{"location":"non-functional-requirements/reliability/#maintain-playbooks","text":"No software service is complete without playbooks to navigate the developers through unfamiliar territory. Playbooks should be thorough and cover all known failure scenarios and mitigations.","title":"Maintain Playbooks"},{"location":"non-functional-requirements/reliability/#run-maintenance-exercises","text":"Take the time to fabricate scenarios, and run a D&D style campaign to solve your issues. This can be as elaborate as spinning up a new environment and injecting errors, or as simple as asking the \"players\" to navigate to a dashboard and describing what they would see in the fabricated scenario (small amounts of imagination required). The playbooks should easily navigate the user to the correct solution/mitigation. If not, update your playbooks.","title":"Run Maintenance Exercises"},{"location":"non-functional-requirements/reliability/#chaos-testing","text":"Leverage automated chaos testing to see how things break. You can read this playbook's article on fault injection testing for more information on developing a hypothesis-driven suite of automated chaos tests. The following list of chaos testing tools as well as this section in the article linked above have more details on available platforms and tooling for this purpose: Azure Chaos Studio - An in-preview tool for orchestrating controlled fault injection experiments on Azure resources. Chaos toolkit - A declarative, modular chaos platform with many extensions, including the Azure actions and probes kit . Kraken - An OpenShift-specific chaos tool, maintained by Red Hat. Chaos Monkey - The Netflix platform which popularized chaos engineering (doesn't support Azure OOTB). Many service meshes, like Linkerd , offer fault injection tooling through the use of their sidecars. Chaos Mesh Simmy - A .NET library for chaos testing and fault injection integrated with the Polly library for resilience engineering. This ISE dev blog post provides code snippets as an example of how to use Polly and Simmy to implement a hypothesis-driven approach to resilience and chaos testing.","title":"Chaos Testing"},{"location":"non-functional-requirements/reliability/#analyze-all-failures","text":"Writing up a post-mortem is a great way to document the root causes, and action items for your failures. They're also a great way to track recurring issues, and create a strong case for prioritizing fixes. This can even be tied into your regular Agile retrospectives .","title":"Analyze All Failures"},{"location":"non-functional-requirements/scalability/","text":"Scalability Scalability is the capability of a system to handle larger volumes, or its potential to accommodate additional growth. For example, a system is considered scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added.
An example of this is a system that can handle a growing number of requests when more memory is added to it. Characteristics Elasticity: The system should be able to scale up or down based on demand, and be able to automatically provision or de-provision resources as needed. Latency: The system should be able to maintain low latency even under high load, and be able to handle a large number of concurrent requests without slowing down. Examples Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability.","title":"Scalability"},{"location":"non-functional-requirements/scalability/#scalability","text":"Scalability is the capability of a system to handle larger volumes, or its potential to accommodate additional growth. For example, a system is considered scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added. An example of this is a system that can handle a growing number of requests when more memory is added to it.","title":"Scalability"},{"location":"non-functional-requirements/scalability/#characteristics","text":"Elasticity: The system should be able to scale up or down based on demand, and be able to automatically provision or de-provision resources as needed. Latency: The system should be able to maintain low latency even under high load, and be able to handle a large number of concurrent requests without slowing down.","title":"Characteristics"},{"location":"non-functional-requirements/scalability/#examples","text":"Load Balancing: The application must be able to handle a minimum of 250 concurrent users and support load balancing across at least 3 servers to handle peak traffic. Database Scalability: The application's database must be able to handle at least 1 million records and support partitioning or sharding to ensure efficient storage and retrieval of data. Cloud-Based Infrastructure: The application must be deployed on cloud-based infrastructure that can handle at least 100,000 requests per hour, and be able to scale up or down to meet changing demand. Microservices Architecture: The application must be designed using a microservices architecture that allows for easy scaling of individual services, and be able to handle at least 500 requests per second. Caching: The application must be able to cache at least 10,000 records, with a cache hit rate of 95%, and support caching across multiple servers to ensure high availability.","title":"Examples"},{"location":"non-functional-requirements/usability/","text":"Usability Usability is a topic that is often used interchangeably with user experience (UX), but they are not the same thing. 
Usability is a subset of UX, focusing specifically on the ease of use and effectiveness of a product, i.e., it is the ease with which users can learn and use a product to achieve their goals. Usability is a key factor in determining the success of a product, as it directly impacts user satisfaction, productivity, and overall experience. A system that is difficult to use or understand can lead to frustration, errors, and ultimately, abandonment by users. Closely coupled with usability and UX is the concept of accessibility . Characteristics The main three characteristics of usability are: - Effectiveness: Users should be able to accomplish their goals with the product. - Efficiency: Users should be able to perform tasks quickly and with minimal effort. Oftentimes this is measured in terms of time on task or number of clicks. - Satisfaction: Users should find the product enjoyable and satisfying to use. Additional characteristics include: - Learnability: Users should be able to easily and quickly learn how to use the product. In other words, the system should be intuitive and require minimal training. - Memorability: Users should be able to remember how to use the product after a period of not using it. - Errors: Users should encounter a minimal number of errors when completing a task, and recover easily from any errors that do occur. - Simplicity: The system should be simple and straightforward to use, with minimal complexity and cognitive load. - Comprehensibility: Users should be able to understand the system and its features easily, with clear instructions and feedback. Implementations One way of implementing usability in a user interface is by basing your design decisions on usability testing results. Usability testing's goal is to identify any usability issues, gather feedback, and make improvements to the product. It can be conducted at various stages of the design and development process, from wireframes and prototypes to the final product. These evaluations can collect two key metrics: quantitative data and qualitative data . Quantitative data can be collected through observing the facts of what actually happened. Qualitative data can be collected through interviews, observations, and other methods that provide insights into user behavior and preferences. There are several methods for conducting usability testing, including, but not limited to: - Focus groups - Wireframes - Prototyping - Surveys/Questionnaires - Interviews - Think-aloud protocol Examples One example of usability in action is the design of a website. A website that is easy to navigate, with clear labels, intuitive menus, and a logical flow of information, is more likely to be successful than one that is cluttered, confusing, and difficult to use. The latter website is likely to have a low rate of user engagement, high bounce rates , and low conversion rates, as users will quickly become frustrated and abandon the site. Resources GeeksForGeeks: What is Usability? Usability.gov Human-computer Interaction (HCI) Jakob Nielsen's 10 Usability Heuristics for User Interface Design","title":"Usability"},{"location":"non-functional-requirements/usability/#usability","text":"Usability is a topic that is often used interchangeably with user experience (UX), but they are not the same thing. Usability is a subset of UX, focusing specifically on the ease of use and effectiveness of a product, i.e., it is the ease with which users can learn and use a product to achieve their goals. 
Usability is a key factor in determining the success of a product, as it directly impacts user satisfaction, productivity, and overall experience. A system that is difficult to use or understand can lead to frustration, errors, and ultimately, abandonment by users. Closely coupled with usability and UX is the concept of accessibility .","title":"Usability"},{"location":"non-functional-requirements/usability/#characteristics","text":"The main three characteristics of usability are: - Effectiveness: Users should be able to accomplish their goals with the product. - Efficiency: Users should be able to perform tasks quickly and with minimal effort. Oftentimes this is measured in terms of time on task or number of clicks. - Satisfaction: Users should find the product enjoyable and satisfying to use. Additional characteristics include: - Learnability: Users should be able to easily and quickly learn how to use the product. In other words, the system should be intuitive and require minimal training. - Memorability: Users should be able to remember how to use the product after a period of not using it. - Errors: Users should encounter a minimal number of errors when completing a task, and recover easily from any errors that do occur. - Simplicity: The system should be simple and straightforward to use, with minimal complexity and cognitive load. - Comprehensibility: Users should be able to understand the system and its features easily, with clear instructions and feedback.","title":"Characteristics"},{"location":"non-functional-requirements/usability/#implementations","text":"One way of implementing usability in a user interface is by basing your design decisions on usability testing results. Usability testing's goal is to identify any usability issues, gather feedback, and make improvements to the product. It can be conducted at various stages of the design and development process, from wireframes and prototypes to the final product. These evaluations can collect two key metrics: quantitative data and qualitative data . Quantitative data can be collected through observing the facts of what actually happened. Qualitative data can be collected through interviews, observations, and other methods that provide insights into user behavior and preferences. There are several methods for conducting usability testing, including, but not limited to: - Focus groups - Wireframes - Prototyping - Surveys/Questionnaires - Interviews - Think-aloud protocol","title":"Implementations"},{"location":"non-functional-requirements/usability/#examples","text":"One example of usability in action is the design of a website. A website that is easy to navigate, with clear labels, intuitive menus, and a logical flow of information, is more likely to be successful than one that is cluttered, confusing, and difficult to use. The latter website is likely to have a low rate of user engagement, high bounce rates , and low conversion rates, as users will quickly become frustrated and abandon the site.","title":"Examples"},{"location":"non-functional-requirements/usability/#resources","text":"GeeksForGeeks: What is Usability? Usability.gov Human-computer Interaction (HCI) Jakob Nielsen's 10 Usability Heuristics for User Interface Design","title":"Resources"},{"location":"observability/","text":"Observability Building observable systems enables development teams at ISE to measure how well the application is behaving. Observability serves the following goals: Provide holistic view of the application health . 
Help measure business performance for the customer. Measure operational performance of the system. Identify and diagnose failures to get to the problem fast. Pillars of Observability Logs Metrics Tracing Logs vs Metrics vs Traces Insights Dashboards and Reporting Tools, Patterns and Recommended Practices Tooling and Patterns Observability As Code Recommended Practices Diagnostics tools OpenTelemetry Facets of Observability Observability for Microservices Observability in Machine Learning Observability of CI/CD Pipelines Observability in Azure Databricks Recipes Resources Non-Functional Requirements Guidance","title":"Observability"},{"location":"observability/#observability","text":"Building observable systems enables development teams at ISE to measure how well the application is behaving. Observability serves the following goals: Provide holistic view of the application health . Help measure business performance for the customer. Measure operational performance of the system. Identify and diagnose failures to get to the problem fast.","title":"Observability"},{"location":"observability/#pillars-of-observability","text":"Logs Metrics Tracing Logs vs Metrics vs Traces","title":"Pillars of Observability"},{"location":"observability/#insights","text":"Dashboards and Reporting","title":"Insights"},{"location":"observability/#tools-patterns-and-recommended-practices","text":"Tooling and Patterns Observability As Code Recommended Practices Diagnostics tools OpenTelemetry","title":"Tools, Patterns and Recommended Practices"},{"location":"observability/#facets-of-observability","text":"Observability for Microservices Observability in Machine Learning Observability of CI/CD Pipelines Observability in Azure Databricks Recipes","title":"Facets of Observability"},{"location":"observability/#resources","text":"Non-Functional Requirements Guidance","title":"Resources"},{"location":"observability/alerting/","text":"Guidance for Alerting One of the goals of building highly observable systems is to provide valuable insight into the behavior of the application. Observable systems allow problems to be identified and surfaced through alerts before end users are impacted. Best Practices The foremost thing to do before creating alerts is to implement observability. Without monitoring systems in place, it becomes next to impossible to know what activities need to be monitored and when to alert the teams. Identify what the application's minimum viable service quality needs to be. It is not what you intend to deliver, but is acceptable for the customer. These Service Level Objectives (SLOs) are a metric for measurement of the application's performance. SLOs are defined with respect to the end users. The alerts must watch for visible impact to the user. For example, alerting on request rate, latency and errors. Use automated, scriptable tools to mimic end-to-end important code paths relatable to activities in the application. Create alert policies on user impacting events or metric rate of change. Alert fatigue is real. Engineers are recommended to pay attention to their monitoring system so that accurate alerts and thresholds can be defined. Establish a primary channel for alerts that need immediate attention and tag the right team/person(s) based on the nature of the incident. Not every single alert needs to be sent to the primary on-call channel. Establish a secondary channel for items that need to be looked into and do not affect the users yet. For example, storage that is nearing its capacity threshold.
These items will be what the engineering services will look to regularly to monitor the health of the system. Ensure to set up proper alerting for failures in dependent services like Redis cache, Service Bus etc. For example, if Redis cache is throwing 10 exceptions in last 60 secs, proper alerts are recommended to be created so that these failures are surfaced and action be taken. It is important to learn from each incident and continually improve the process. After every incident has been triaged, conduct a post mortem of the scenario . Scenarios and situations that were not initially considered will occur, and the post-mortem workflow is a great way to highlight that to improve the monitoring/alerting of the system. Configuring an alert to detect that incident scenario is a good idea to see if the event occurs again.","title":"Guidance for Alerting"},{"location":"observability/alerting/#guidance-for-alerting","text":"One of the goals of building highly observable systems is to provide valuable insight into the behavior of the application. Observable systems allow problems to be identified and surfaced through alerts before end users are impacted.","title":"Guidance for Alerting"},{"location":"observability/alerting/#best-practices","text":"The foremost thing to do before creating alerts is to implement observability. Without monitoring systems in place, it becomes next to impossible to know what activities need to be monitored and when to alert the teams. Identify what the application's minimum viable service quality needs to be. It is not what you intend to deliver, but is acceptable for the customer. These Service Level Objectives (SLOs) are a metric for measurement of the application's performance. SLOs are defined with respect to the end users. The alerts must watch for visible impact to the user. For example, alerting on request rate, latency and errors. Use automated, scriptable tools to mimic end-to-end important code paths relatable to activities in the application. Create alert polices on user impacting events or metric rate of change. Alert fatigue is real. Engineers are recommended to pay attention to their monitoring system so that accurate alerts and thresholds can be defined. Establish a primary channel for alerts that needs immediate attention and tag the right team/person(s) based on the nature of the incident. Not every single alert needs to be sent to the primary on-call channel. Establish a secondary channel for items that need to be looked into and does not affect the users, yet. For example, storage that nearing capacity threshold. These items will be what the engineering services will look to regularly to monitor the health of the system. Ensure to set up proper alerting for failures in dependent services like Redis cache, Service Bus etc. For example, if Redis cache is throwing 10 exceptions in last 60 secs, proper alerts are recommended to be created so that these failures are surfaced and action be taken. It is important to learn from each incident and continually improve the process. After every incident has been triaged, conduct a post mortem of the scenario . Scenarios and situations that were not initially considered will occur, and the post-mortem workflow is a great way to highlight that to improve the monitoring/alerting of the system. 
Configuring an alert to detect that incident scenario is a good idea to see if the event occurs again.","title":"Best Practices"},{"location":"observability/best-practices/","text":"Recommended Practices Correlation Id : Include unique identifier at the start of the interaction to tie down aggregated data from various system components and provide a holistic view. Read more guidelines about using correlation id . Ensure health of the services are monitored and provide insights into system's performance and behavior. Ensure dependent services are monitored properly. Errors and exceptions in dependent services like Redis cache, Service bus, etc. should be logged and alerted. Also, metrics related to dependent services should be captured and logged. - Additionally, failures in dependent services should be propagated up each level of the stack by the health check. Faults, crashes, and failures are logged as discrete events. This helps engineers identify problem area(s) during failures. Ensure logging configuration (eg: setting logging to \"verbose\") can be controlled without code changes. Ensure that metrics around latency and duration are collected and can be aggregated. Start small and add where there is customer impact. Avoiding metric fatigue is very crucial to collecting actionable data. It is important that every data that is collected contains relevant and rich context. Personally Identifiable Information or any other customer sensitive information should never be logged. Special attention should be paid to any local privacy data regulations and collected data must adhere to those. (ex: GDPR) Health checks : Appropriate health checks should added to determine if service is healthy and ready to serve traffic. On a kubernetes platform different types of probes e.g. Liveness, Readiness, Startup etc. can be used to determine health and readiness of the deployed service. Read more here to understand what to watch out for while designing and building an observable system.","title":"Recommended Practices"},{"location":"observability/best-practices/#recommended-practices","text":"Correlation Id : Include unique identifier at the start of the interaction to tie down aggregated data from various system components and provide a holistic view. Read more guidelines about using correlation id . Ensure health of the services are monitored and provide insights into system's performance and behavior. Ensure dependent services are monitored properly. Errors and exceptions in dependent services like Redis cache, Service bus, etc. should be logged and alerted. Also, metrics related to dependent services should be captured and logged. - Additionally, failures in dependent services should be propagated up each level of the stack by the health check. Faults, crashes, and failures are logged as discrete events. This helps engineers identify problem area(s) during failures. Ensure logging configuration (eg: setting logging to \"verbose\") can be controlled without code changes. Ensure that metrics around latency and duration are collected and can be aggregated. Start small and add where there is customer impact. Avoiding metric fatigue is very crucial to collecting actionable data. It is important that every data that is collected contains relevant and rich context. Personally Identifiable Information or any other customer sensitive information should never be logged. Special attention should be paid to any local privacy data regulations and collected data must adhere to those. 
(ex: GDPR) Health checks : Appropriate health checks should added to determine if service is healthy and ready to serve traffic. On a kubernetes platform different types of probes e.g. Liveness, Readiness, Startup etc. can be used to determine health and readiness of the deployed service. Read more here to understand what to watch out for while designing and building an observable system.","title":"Recommended Practices"},{"location":"observability/correlation-id/","text":"Correlation IDs The Need In a distributed system architecture (microservice architecture), it is highly difficult to understand a single end to end customer transaction flow through the various components. Here are some the general challenges - It becomes challenging to understand the end-to-end behavior of a client request entering the application. Aggregation: Consolidating logs from multiple components and making sense out of these logs is difficult, if not impossible. Cyclic dependencies on services, course of events and asynchronous requests are not easily deciphered. While troubleshooting a request, the diagnostic context of the logs are very important to get to the root of the problem. Solution A Correlation ID is a unique identifier that is added to the very first interaction (incoming request) to identify the context and is passed to all components that are involved in the transaction flow. Correlation ID becomes the glue that binds the transaction together and helps to draw an overall picture of events. Note: Before implementing your own Correlation ID, investigate if your telemetry tool of choice provides an auto-generated Correlation ID and that it serves the purposes of your application. For instance, Application Insights offers dependency auto-collection for some application frameworks Recommended Practices Assign each external request a Correlation ID that binds the message to a transaction. The Correlation ID for a transaction must be assigned as early as you can. Propagate Correlation ID to all downstream components/services. All components/services of the transaction use this Correlation ID in their logs. For an HTTP Request, Correlation ID is typically passed in the header. Add it to an outgoing response where possible. Based on the use case, there can be additional correlation IDs that may be needed. For instance, tracking logs based on both Session ID and User ID may be required. While adding multiple correlation ID, remember to propagate them through the components. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the \"Correlation-id\", called TraceId. Use Cases Log Correlation Log correlation is the ability to track disparate events through different parts of the application. Having a Correlation ID provides more context making it easy to build rules for reporting and analysis. Secondary Reporting/Observer Systems Using Correlation ID helps secondary systems to correlate data without application context. Some examples - generating metrics based on tracing data, integrating runtime/system diagnostics etc. For example, feeding AppInsights data and correlating it to infrastructure issues. 
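To show how the "assign early, propagate downstream, log everywhere" practice can look in application code, here is a minimal sketch (Python; the header name, the requests dependency and the logging setup are illustrative assumptions rather than a prescribed standard):

```python
import logging
import uuid

import requests  # assumption: the requests package is available for the outbound call

CORRELATION_HEADER = "x-correlation-id"  # hypothetical header name; use your team's convention

def handle_incoming_request(headers: dict) -> str:
    # Reuse the caller's correlation id if present, otherwise assign one as early as possible.
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    logging.info("processing request", extra={"correlation_id": correlation_id})
    return correlation_id

def call_downstream(url: str, correlation_id: str) -> requests.Response:
    # Propagate the same id so downstream logs can be joined into one transaction view.
    return requests.get(url, headers={CORRELATION_HEADER: correlation_id})
```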
Troubleshooting Errors For troubleshooting an errors, Correlation ID is a great starting point to trace the workflow of a transaction.","title":"Correlation IDs"},{"location":"observability/correlation-id/#correlation-ids","text":"","title":"Correlation IDs"},{"location":"observability/correlation-id/#the-need","text":"In a distributed system architecture (microservice architecture), it is highly difficult to understand a single end to end customer transaction flow through the various components. Here are some the general challenges - It becomes challenging to understand the end-to-end behavior of a client request entering the application. Aggregation: Consolidating logs from multiple components and making sense out of these logs is difficult, if not impossible. Cyclic dependencies on services, course of events and asynchronous requests are not easily deciphered. While troubleshooting a request, the diagnostic context of the logs are very important to get to the root of the problem.","title":"The Need"},{"location":"observability/correlation-id/#solution","text":"A Correlation ID is a unique identifier that is added to the very first interaction (incoming request) to identify the context and is passed to all components that are involved in the transaction flow. Correlation ID becomes the glue that binds the transaction together and helps to draw an overall picture of events. Note: Before implementing your own Correlation ID, investigate if your telemetry tool of choice provides an auto-generated Correlation ID and that it serves the purposes of your application. For instance, Application Insights offers dependency auto-collection for some application frameworks","title":"Solution"},{"location":"observability/correlation-id/#recommended-practices","text":"Assign each external request a Correlation ID that binds the message to a transaction. The Correlation ID for a transaction must be assigned as early as you can. Propagate Correlation ID to all downstream components/services. All components/services of the transaction use this Correlation ID in their logs. For an HTTP Request, Correlation ID is typically passed in the header. Add it to an outgoing response where possible. Based on the use case, there can be additional correlation IDs that may be needed. For instance, tracking logs based on both Session ID and User ID may be required. While adding multiple correlation ID, remember to propagate them through the components. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the \"Correlation-id\", called TraceId.","title":"Recommended Practices"},{"location":"observability/correlation-id/#use-cases","text":"","title":"Use Cases"},{"location":"observability/correlation-id/#log-correlation","text":"Log correlation is the ability to track disparate events through different parts of the application. Having a Correlation ID provides more context making it easy to build rules for reporting and analysis.","title":"Log Correlation"},{"location":"observability/correlation-id/#secondary-reportingobserver-systems","text":"Using Correlation ID helps secondary systems to correlate data without application context. Some examples - generating metrics based on tracing data, integrating runtime/system diagnostics etc. 
For example, feeding AppInsights data and correlating it to infrastructure issues.","title":"Secondary Reporting/Observer Systems"},{"location":"observability/correlation-id/#troubleshooting-errors","text":"For troubleshooting an error, Correlation ID is a great starting point to trace the workflow of a transaction.","title":"Troubleshooting Errors"},{"location":"observability/diagnostic-tools/","text":"Diagnostic tools Besides Logging , Tracing and Metrics , there are additional tools to help diagnose issues when applications do not behave as expected. In some scenarios, analyzing the memory consumption and drilling down into why a specific process takes longer than expected may require additional measures. In these cases, platform or programming language specific diagnostic tools come into play and are useful to debug a memory leak, profile the CPU usage, or the cause of delays in multi-threading. Profilers and Memory Analyzers There are two types of diagnostics tools you may want to use: profilers and memory analyzers. Profiling Profiling is a technique where you take small snapshots of all the threads in a running application to see the stack trace of each thread for a specified duration. This tool can help you identify where you are spending CPU time during the execution of your application. There are two main techniques to achieve this: CPU-Sampling and Instrumentation. CPU-Sampling is a non-invasive method which takes snapshots of all the stacks at a set interval. It is the most common technique for profiling and doesn't require any modification to your code. Instrumentation is the other technique where you insert a small piece of code at the beginning and end of each function which is going to signal back to the profiler about the time spent in the function, the function name, parameters and others. This way you modify the code of your running application. There are two effects to this: your code may run a little bit more slowly, but on the other hand you have a more accurate view of every function and class that has been executed so far in your application. When to use Sampling vs Instrumentation? Not all programming languages support instrumentation. Instrumentation is mostly supported for compiled languages like .NET and Java, and some languages interpreted at runtime like Python and Javascript. Keep in mind that enabling instrumentation can require modifying your build pipeline, i.e. by adding special parameters to the command line argument. You should normally start with Sampling because it doesn't require modifying your binaries, it doesn't affect your process performance, and can be quicker to start with. Once you have your profiling data, there are multiple ways to visualize this information depending on the format you saved it in. As an example for .NET (dotnet-trace), there are three available formats to save these traces: Chromium, NetTrace and SpeedScope. Select the output format depending on the tool you are going to use. SpeedScope is an online web application you can use to visualize and analyze traces, and you only need a modern browser. Be careful with online tools, as dumps/traces might contain confidential information that you don't want to share outside of your organization. Memory Analyzers Memory analyzers and memory dumps are another set of diagnostic tools you can use to identify issues in your process. Normally these types of tools take the whole memory the process is using at a point in time and save it in a file which can be analyzed.
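Full process dumps are taken with platform-level tools, but the same snapshot-and-compare idea can be illustrated inside a Python process with the standard-library tracemalloc module (a sketch; the timing and the suspected code path are assumptions):

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()   # e.g. taken shortly after start-up

# ... exercise the suspected code path under load for a while ...

current = tracemalloc.take_snapshot()    # e.g. taken again after 10-20 minutes of load

# Show the allocation sites that grew the most between the two snapshots.
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)
```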
When using these types of tools, you want to stress your process as much as possible to amplify whatever deficiency you may have in terms of memory management. The memory dump should then be taken when the process is in this stressed state. In some scenarios we recommend to take more than one memory dump during the reproduction of a problem. For example, if you suspect a memory leak and you are running a test for 30 min, it is useful to take at least 3 dumps at different intervals (i.e. 10, 20 & 30 min) to compare them with each other. There are multiple ways to take a memory dump depending the operating system you are using. Also, each operating system has it own debugger which is able to load this memory dump, and explore the state of the process at the time the memory dump was taken. The most common debuggers are: Windows - WinDbg and WinDgbNext (included in the Windows SDK), Visual Studio can also load a memory dump for a .NET Framework and .NET Core process Linux - GDB is the GNU Debugger Mac OS - LLDB Debugger There are a range of developer platform specific diagnostic tools which can be used: .NET Core diagnostic tools , GitHub repository Java diagnostic tools - version specific Python debugging and profiling - version specific Node.js Diagnostics working group Environment for Profiling To create an application profile as close to production as possible, the environment in which the application is intended to run in production has to be considered and it might be necessary to perform a snapshot of the application state under load . Diagnostics in Containers For monolithic applications, diagnostics tools can be installed and run on the VM hosting them. Most scalable applications are developed as microservices and have complex interactions which require to install the tools in the containers running the process or to leverage a sidecar container (see sidecar pattern ). Some platforms expose endpoints to interact with the application and return a dump. Resources .NET Core diagnostics in containers Experimental tool dotnet-monitor , What's new , GItHub repository Spring Boot actuator endpoints","title":"Diagnostic tools"},{"location":"observability/diagnostic-tools/#diagnostic-tools","text":"Besides Logging , Tracing and Metrics , there are additional tools to help diagnose issues when applications do not behave as expected. In some scenarios, analyzing the memory consumption and drilling down into why a specific process takes longer than expected may require additional measures. In these cases, platform or programming language specific diagnostic tools come into play and are useful to debug a memory leak, profile the CPU usage, or the cause of delays in multi-threading.","title":"Diagnostic tools"},{"location":"observability/diagnostic-tools/#profilers-and-memory-analyzers","text":"There are two types of diagnostics tools you may want to use: profilers and memory analyzers.","title":"Profilers and Memory Analyzers"},{"location":"observability/diagnostic-tools/#profiling","text":"Profiling is a technique where you take small snapshots of all the threads in a running application to see the stack trace of each thread for a specified duration. This tool can help you identify where you are spending CPU time during the execution of your application. There are two main techniques to achieve this: CPU-Sampling and Instrumentation. CPU-Sampling is a non-invasive method which takes snapshots of all the stacks at a set interval. 
It is the most common technique for profiling and doesn't require any modification to your code. Instrumentation is the other technique where you insert a small piece of code at the beginning and end of each function which is going to signal back to the profiler about the time spent in the function, the function name, parameters and others. This way you modify the code of your running application. There are two effects to this: your code may run a little bit more slowly, but on the other hand you have a more accurate view of every function and class that has been executed so far in your application.","title":"Profiling"},{"location":"observability/diagnostic-tools/#when-to-use-sampling-vs-instrumentation","text":"Not all programming languages support instrumentation. Instrumentation is mostly supported for compiled languages like .NET and Java, and some languages interpreted at runtime like Python and Javascript. Keep in mind that enabling instrumentation can require to modify your build pipeline, i.e. by adding special parameters to the command line argument. You should normally start with Sampling because it doesn't require to modify your binaries, it doesn't affect your process performance, and can be quicker to start with. Once you have your profiling data, there are multiple ways to visualize this information depending of the format you saved it. As an example for .NET (dotnet-trace), there are three available formats to save these traces: Chromium, NetTrace and SpeedScope. Select the output format depending on the tool you are going to use. SpeedScope is an online web application you can use to visualize and analyze traces, and you only need a modern browser. Be careful with online tools, as dumps/traces might contain confidential information that you don't want to share outside of your organization.","title":"When to use Sampling vs Instrumentation?"},{"location":"observability/diagnostic-tools/#memory-analyzers","text":"Memory analyzers and memory dumps are another set of diagnostic tools you can use to identify issues in your process. Normally these types of tools take the whole memory the process is using at a point in time and saves it in a file which can be analyzed. When using these types of tools, you want to stress your process as much as possible to amplify whatever deficiency you may have in terms of memory management. The memory dump should then be taken when the process is in this stressed state. In some scenarios we recommend to take more than one memory dump during the reproduction of a problem. For example, if you suspect a memory leak and you are running a test for 30 min, it is useful to take at least 3 dumps at different intervals (i.e. 10, 20 & 30 min) to compare them with each other. There are multiple ways to take a memory dump depending the operating system you are using. Also, each operating system has it own debugger which is able to load this memory dump, and explore the state of the process at the time the memory dump was taken. 
The most common debuggers are: Windows - WinDbg and WinDbgNext (included in the Windows SDK), Visual Studio can also load a memory dump for a .NET Framework and .NET Core process Linux - GDB is the GNU Debugger Mac OS - LLDB Debugger There are a range of developer platform specific diagnostic tools which can be used: .NET Core diagnostic tools , GitHub repository Java diagnostic tools - version specific Python debugging and profiling - version specific Node.js Diagnostics working group","title":"Memory Analyzers"},{"location":"observability/diagnostic-tools/#environment-for-profiling","text":"To create an application profile as close to production as possible, the environment in which the application is intended to run in production has to be considered and it might be necessary to perform a snapshot of the application state under load .","title":"Environment for Profiling"},{"location":"observability/diagnostic-tools/#diagnostics-in-containers","text":"For monolithic applications, diagnostics tools can be installed and run on the VM hosting them. Most scalable applications are developed as microservices and have complex interactions which require installing the tools in the containers running the process or leveraging a sidecar container (see sidecar pattern ). Some platforms expose endpoints to interact with the application and return a dump.","title":"Diagnostics in Containers"},{"location":"observability/diagnostic-tools/#resources","text":".NET Core diagnostics in containers Experimental tool dotnet-monitor , What's new , GitHub repository Spring Boot actuator endpoints","title":"Resources"},{"location":"observability/log-vs-metric-vs-trace/","text":"Logs vs Metrics vs Traces Overview Metrics The purpose of metrics is to inform observers about the health & operations regarding a component or system. A metric represents a point in time measure of a particular source, and data-wise tends to be very small. The compact size allows for efficient collection even at scale in large systems. Metrics also lend themselves very well to pre-aggregation within the component before collection, reducing computation cost for processing & storing large numbers of metric time series in a central system. Due to how efficiently metrics are processed & stored, they lend themselves very well to use in automated alerting, as metrics are an excellent source for the health data for all components in the system. Logs Log data inform observers about the discrete events that occurred within a component or a set of components. Just about every software component logs information about its activities over time. This rich data tends to be much larger than metric data and can cause processing issues, especially if components are logging too verbosely. Therefore, using log data to understand the health of an extensive system tends to be avoided and depends on metrics for that data. Once metric telemetry highlights potential problem sources, filtered log data for those sources can be used to understand what occurred. Traces Where logging provides an overview to a discrete, event-triggered log, tracing encompasses a much wider, continuous view of an application. The goal of tracing is to follow a program\u2019s flow and data progression. In many instances, tracing represents a single user\u2019s journey through an entire app stack. Its purpose isn\u2019t reactive, but instead focused on optimization. By tracing through a stack, developers can identify bottlenecks and focus on improving performance.
A distributed trace is defined as a collection of spans. A span is the smallest unit in a trace and represents a piece of the workflow in a distributed landscape. It can be an HTTP request, call to a database, or execution of a message from a queue. When a problem does occur, tracing allows you to see how you got there: Which function. The function\u2019s duration. Parameters passed. How deep into the function the user could get. Usage Guidance When to use metric or log data to track a particular piece of telemetry can be summarized with the following points: Use metrics to track the occurrence of an event, counting of items, the time taken to perform an action or to report the current value of a resource (CPU, memory, etc.) Use logs to track detailed information about an event also monitored by a metric, particularly errors, warnings or other exceptional situations. A trace provides visibility into how a request is processed across multiple services in a microservices environment. Every trace needs to have a unique identifier associated with it.","title":"Logs vs Metrics vs Traces"},{"location":"observability/log-vs-metric-vs-trace/#logs-vs-metrics-vs-traces","text":"","title":"Logs vs Metrics vs Traces"},{"location":"observability/log-vs-metric-vs-trace/#overview","text":"","title":"Overview"},{"location":"observability/log-vs-metric-vs-trace/#metrics","text":"The purpose of metrics is to inform observers about the health & operations regarding a component or system. A metric represents a point in time measure of a particular source, and data-wise tends to be very small. The compact size allows for efficient collection even at scale in large systems. Metrics also lend themselves very well to pre-aggregation within the component before collection, reducing computation cost for processing & storing large numbers of metric time series in a central system. Due to how efficiently metrics are processed & stored, it lends itself very well for use in automated alerting, as metrics are an excellent source for the health data for all components in the system.","title":"Metrics"},{"location":"observability/log-vs-metric-vs-trace/#logs","text":"Log data inform observers about the discrete events that occurred within a component or a set of components. Just about every software component log information about its activities over time. This rich data tends to be much larger than metric data and can cause processing issues, especially if components are logging too verbosely. Therefore, using log data to understand the health of an extensive system tends to be avoided and depends on metrics for that data. Once metric telemetry highlights potential problem sources, filtered log data for those sources can be used to understand what occurred.","title":"Logs"},{"location":"observability/log-vs-metric-vs-trace/#traces","text":"Where logging provides an overview to a discrete, event-triggered log, tracing encompasses a much wider, continuous view of an application. The goal of tracing is to following a program\u2019s flow and data progression. In many instances, tracing represents a single user\u2019s journey through an entire app stack. Its purpose isn\u2019t reactive, but instead focused on optimization. By tracing through a stack, developers can identify bottlenecks and focus on improving performance. A distributed trace is defined as a collection of spans. A span is the smallest unit in a trace and represents a piece of the workflow in a distributed landscape. 
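For illustration, creating spans manually with the OpenTelemetry API looks roughly like this (a Python sketch; the tracer name, span names and attribute are made-up examples, and it assumes the SDK and an exporter are configured elsewhere):

```python
from opentelemetry import trace

# Assumes the OpenTelemetry SDK and an exporter have been configured at start-up.
tracer = trace.get_tracer("checkout-service")  # hypothetical service/tracer name

def process_order(order_id: str) -> None:
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)  # enrich the span with request context
        with tracer.start_as_current_span("charge-payment"):
            pass  # the call to the payment dependency becomes a child span
```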
It can be an HTTP request, call to a database, or execution of a message from a queue. When a problem does occur, tracing allows you to see how you got there: Which function. The function\u2019s duration. Parameters passed. How deep into the function the user could get.","title":"Traces"},{"location":"observability/log-vs-metric-vs-trace/#usage-guidance","text":"When to use metric or log data to track a particular piece of telemetry can be summarized with the following points: Use metrics to track the occurrence of an event, counting of items, the time taken to perform an action or to report the current value of a resource (CPU, memory, etc.) Use logs to track detailed information about an event also monitored by a metric, particularly errors, warnings or other exceptional situations. A trace provides visibility into how a request is processed across multiple services in a microservices environment. Every trace needs to have a unique identifier associated with it.","title":"Usage Guidance"},{"location":"observability/logs-privacy/","text":"Guidance for Privacy Overview To ensure the privacy of your system users, as well as comply with several regulations like GDPR, some types of data shouldn\u2019t exist in logs. This includes customer's sensitive, Personal Identifiable Information (PII), and any other data that wasn't legally sanctioned. Recommended Practices Separate components and minimize the parts of the system that log sensitive data. Keep sensitive data out of URLs, since request URLs are typically logged by proxies and web servers. Avoid using PII data for system debugging as much as possible. For example, use ids instead of usernames. Use Structured Logging and include a deny-list for sensitive properties. Put an extra effort on spotting logging statements with sensitive data during code review, as it is common for reviewers to skip reading logging statements. This can be added as an additional checkbox if you're using Pull Request Templates. Include mechanisms to detect sensitive data in logs, on your organizational pipelines for QA or Automated Testing. Tools and Implementation Methods Use these tools and methods for sensitive data de-identification in logs. Application Insights Application Insights offers telemetry interception in some of the SDKs, that can be done by implementing the ITelemetryProcessor interface. ITelemetryProcessor processes the telemetry information before it is sent to Application Insights, and can be useful in many situations, such as filtering and modifications. Below is an example of intercepting 'trace' typed telemetry: using Microsoft.ApplicationInsights.DataContracts ; namespace Example { using Microsoft.ApplicationInsights.Channel ; using Microsoft.ApplicationInsights.Extensibility ; internal class RedactTelemetryInitializer : ITelemetryInitializer { public void Initialize ( ITelemetry telemetry ) { var requestTelemetry = telemetry as TraceTelemetry ; if ( requestTelemetry == null ) return ; # redact emails from the message parameter requestTelemetry . Message = Regex . Replace ( requestTelemetry . Message , @\"[^@\\s]+@[^@\\s]+\\.[^@\\s]+\" , \"[email removed]\" ); } } } Elastic Stack Elastic Stack (formerly \"ELK stack\") allows logs interception by Logstash's filter-plugins . Using some of the existing plugins, like 'mutate', 'alter' and 'prune' might be sufficient for most cases of deidentifying and redacting PIIs. For a more robust and customized use-case, a 'ruby' plugin can be used, executing arbitrary Ruby code. 
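The same redaction idea can also be applied in application code before logs ever leave the process; for example, a small Python logging filter mirroring the email regex used in the Application Insights snippet above (a sketch, not a complete PII strategy):

```python
import logging
import re

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

class RedactEmailsFilter(logging.Filter):
    """Redact email addresses from log messages before they are emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_PATTERN.sub("[email removed]", str(record.msg))
        return True  # keep the (now redacted) record

logger = logging.getLogger("app")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactEmailsFilter())
logger.warning("login failed for user@example.com")  # emits: login failed for [email removed]
```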
Filter plugins also exists in some Logstash alternatives, like Fluentd and Fluent Bit . Presidio Presidio offers data protection and anonymization API. It provides fast identification and anonymization modules for private entities in text. Presidio allows using predefined or custom PII recognizers, leveraging Named Entity Recognition, regular expressions, rule based logic and checksum with relevant context in multiple languages. It can be used alongside the log interception methods mentioned above to help and ensure sensitive data is properly managed and governed. Presidio is containerized for REST HTTP API and also can be installed as a python package, to be called from python code. Instead of handling the anonymization in the application code, both APIs can be used using external calls. Elastic Stack, for example, can handle PII redaction using the 'ruby' filter plugin to call Presidio in REST HTTP API, or by calling a python script consuming Presidio as a package: logstash.conf input { ... } filter { ruby { code => 'require \"open3\" message = event.get(\"message\") # Call a python script triggering Presidio analyzer and anonymizer, and printing the result. cmd = \"python /path/to/presidio/anonymization/script.py \\\" #{ message } \\\" \" # Fetch the script' s stdout stdin , stdout , stderr = Open3 . popen3 ( cmd ) # Override message with the anonymized text. event . set ( \"message\" , stdout . read ) filter_matched ( event ) ' } } output { ... }","title":"Guidance for Privacy"},{"location":"observability/logs-privacy/#guidance-for-privacy","text":"","title":"Guidance for Privacy"},{"location":"observability/logs-privacy/#overview","text":"To ensure the privacy of your system users, as well as comply with several regulations like GDPR, some types of data shouldn\u2019t exist in logs. This includes customer's sensitive, Personal Identifiable Information (PII), and any other data that wasn't legally sanctioned.","title":"Overview"},{"location":"observability/logs-privacy/#recommended-practices","text":"Separate components and minimize the parts of the system that log sensitive data. Keep sensitive data out of URLs, since request URLs are typically logged by proxies and web servers. Avoid using PII data for system debugging as much as possible. For example, use ids instead of usernames. Use Structured Logging and include a deny-list for sensitive properties. Put an extra effort on spotting logging statements with sensitive data during code review, as it is common for reviewers to skip reading logging statements. This can be added as an additional checkbox if you're using Pull Request Templates. Include mechanisms to detect sensitive data in logs, on your organizational pipelines for QA or Automated Testing.","title":"Recommended Practices"},{"location":"observability/logs-privacy/#tools-and-implementation-methods","text":"Use these tools and methods for sensitive data de-identification in logs.","title":"Tools and Implementation Methods"},{"location":"observability/logs-privacy/#application-insights","text":"Application Insights offers telemetry interception in some of the SDKs, that can be done by implementing the ITelemetryProcessor interface. ITelemetryProcessor processes the telemetry information before it is sent to Application Insights, and can be useful in many situations, such as filtering and modifications. 
Below is an example of intercepting 'trace' typed telemetry: using Microsoft.ApplicationInsights.DataContracts ; namespace Example { using Microsoft.ApplicationInsights.Channel ; using Microsoft.ApplicationInsights.Extensibility ; internal class RedactTelemetryInitializer : ITelemetryInitializer { public void Initialize ( ITelemetry telemetry ) { var requestTelemetry = telemetry as TraceTelemetry ; if ( requestTelemetry == null ) return ; # redact emails from the message parameter requestTelemetry . Message = Regex . Replace ( requestTelemetry . Message , @\"[^@\\s]+@[^@\\s]+\\.[^@\\s]+\" , \"[email removed]\" ); } } }","title":"Application Insights"},{"location":"observability/logs-privacy/#elastic-stack","text":"Elastic Stack (formerly \"ELK stack\") allows logs interception by Logstash's filter-plugins . Using some of the existing plugins, like 'mutate', 'alter' and 'prune' might be sufficient for most cases of deidentifying and redacting PIIs. For a more robust and customized use-case, a 'ruby' plugin can be used, executing arbitrary Ruby code. Filter plugins also exists in some Logstash alternatives, like Fluentd and Fluent Bit .","title":"Elastic Stack"},{"location":"observability/logs-privacy/#presidio","text":"Presidio offers data protection and anonymization API. It provides fast identification and anonymization modules for private entities in text. Presidio allows using predefined or custom PII recognizers, leveraging Named Entity Recognition, regular expressions, rule based logic and checksum with relevant context in multiple languages. It can be used alongside the log interception methods mentioned above to help and ensure sensitive data is properly managed and governed. Presidio is containerized for REST HTTP API and also can be installed as a python package, to be called from python code. Instead of handling the anonymization in the application code, both APIs can be used using external calls. Elastic Stack, for example, can handle PII redaction using the 'ruby' filter plugin to call Presidio in REST HTTP API, or by calling a python script consuming Presidio as a package: logstash.conf input { ... } filter { ruby { code => 'require \"open3\" message = event.get(\"message\") # Call a python script triggering Presidio analyzer and anonymizer, and printing the result. cmd = \"python /path/to/presidio/anonymization/script.py \\\" #{ message } \\\" \" # Fetch the script' s stdout stdin , stdout , stderr = Open3 . popen3 ( cmd ) # Override message with the anonymized text. event . set ( \"message\" , stdout . read ) filter_matched ( event ) ' } } output { ... }","title":"Presidio"},{"location":"observability/microservices/","text":"Observability in Microservices Microservices is a very popular software architecture, where the application is arranged as a collection of loosely coupled services. Some of those services can be written in different languages by different teams. Motivations We need to consider special cases when creating a microservice architecture from the perspective of observability. We want to capture the interactions when making requests between those microservices and correlate them. Imagine we have a microservice that accesses a database to retrieve some data as part of a request. This microservice is going to be called by someone else as part of an incoming http request or an internal process being executed. What happens if a problem occurs during the retrieval of the data (or the update of the data)? 
How can we associate, or correlate, that this particular call failed in the destination microservice? This is a common issue. When calling other microservices, depending on the technology stack we use, we can accidentally hide errors and exceptions that might happen on the other side. If we are using a simple REST interface, the other microservice can return a 500 HTTP status code and we don't have any idea what happened inside that microservice. More importantly, we don't have any way to associate our Correlation Id to whatever happens inside that microservice. Therefore, it is important to have a plan in place to be able to extend your traceability and monitoring efforts, especially when using a microservice architecture. How to Extend Your Tracing Information Between Microservices The W3C consortium is working on a Trace Context definition that can be applied when using HTTP as the protocol in a microservice architecture. But let's explain how we can implement this functionality in our software. The main idea behind this is to propagate the correlation information between HTTP requests so other pieces of software can read this information and correctly correlate telemetry across microservices. The way to propagate this information is to use HTTP Headers for the Correlation Id, parent Correlation Id, etc. When you are in the scope of an HTTP Request, your tracing system should already have created four properties that you can use to send across your microservices. RequestId:0HLQV2BC3VP2T:00000001, SpanId:da13aa3c6fd9c146, TraceId:f11a03e3f078414fa7c0a0ce568c8b5c, ParentId:5076c17d0a604244 This is an example of the four properties you can find which identify the current request. RequestId is the unique id that represents the current HTTP Request. SpanId is the default automatically generated span. You can have more than one Span that scopes different functionality inside your software. TraceId represents the id for the current log trace. ParentId is the parent span id, which in some cases can be the same or something different. Example Now we are going to explore an example with 3 microservices that call each other in a row. This image is the summary of what is needed in each microservice to propagate the trace-id from A to C. The root caller is A and that is why it doesn't have a parent-id, only a new trace-id. Next, A calls B using HTTP. To propagate the correlation information as part of the request, we are using two new headers based on the W3C Correlation specification, trace-id and parent-id. In this example, because A is the root caller, A only sends its own trace-id to microservice B. When microservice B receives the incoming HTTP request, it checks the contents of these two headers. It reads the content of the trace-id header and sets its own parent-id to this trace-id (as shown in the green rectangle inside B). In addition, it creates a new trace-id to signal that this is a new scope for the telemetry. During the execution of microservice B, it also calls microservice C and repeats the pattern. As part of the request it includes the two headers and propagates trace-id and parent-id as well. Finally, microservice C reads the value of the incoming trace-id and sets it as its own parent-id, but also creates a new trace-id that it will use to send telemetry about its own operations. Summary A number of Application Performance Monitoring (APM) technology products already support most of this Correlation Propagation. The most popular is OpenZipkin/B3-Propagation .
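To make the pattern described above concrete, here is a minimal, hypothetical sketch in Python (assuming the flask and requests libraries; the downstream URL is a placeholder, and the header names follow the simplified trace-id / parent-id convention used in this example rather than the full W3C traceparent format): # Hypothetical sketch of the A -> B -> C propagation described above.
import uuid
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route(\"/orders\")
def handle_order():
    # Use the caller's trace-id (if any) as this service's parent-id.
    parent_id = request.headers.get(\"trace-id\")
    # Create a new trace-id that scopes this service's own telemetry.
    trace_id = uuid.uuid4().hex
    # ... emit logs and telemetry tagged with trace_id and parent_id here ...
    # Propagate both headers to the next microservice in the chain.
    headers = {\"trace-id\": trace_id}
    if parent_id:
        headers[\"parent-id\"] = parent_id
    downstream = requests.get(\"http://service-c.example/items\", headers=headers)  # placeholder URL
    return downstream.text
In practice, prefer letting an SDK such as OpenTelemetry manage these headers automatically, as noted below.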
W3C already proposed a recommendation for the W3C Trace Context , where you can see what SDK and frameworks already support this functionality. It's important to correctly implement the propagation specially when there are different teams that used different technology stacks in the same project. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Observability in Microservices"},{"location":"observability/microservices/#observability-in-microservices","text":"Microservices is a very popular software architecture, where the application is arranged as a collection of loosely coupled services. Some of those services can be written in different languages by different teams.","title":"Observability in Microservices"},{"location":"observability/microservices/#motivations","text":"We need to consider special cases when creating a microservice architecture from the perspective of observability. We want to capture the interactions when making requests between those microservices and correlate them. Imagine we have a microservice that accesses a database to retrieve some data as part of a request. This microservice is going to be called by someone else as part of an incoming http request or an internal process being executed. What happens if a problem occurs during the retrieval of the data (or the update of the data)? How can we associate, or correlate, that this particular call failed in the destination microservice? This is a common issue. When calling other microservices, depending on the technology stack we use, we can accidentally hide errors and exceptions that might happen on the other side. If we are using a simple REST interface, the other microservice can return a 500 HTTP status code and we don't have any idea what happen inside that microservice. More important, we don't have any way to associate our Correlation Id to whatever happens inside that microservice. Therefore, is so important to have a plan in place to be able to extend your traceability and monitoring efforts, especially when using a microservice architecture.","title":"Motivations"},{"location":"observability/microservices/#how-to-extend-your-tracing-information-between-microservices","text":"The W3C consortium is working on a Trace Context definition that can be applied when using HTTP as the protocol in a microservice architecture. But let's explain how we can implement this functionality in our software. The main idea behind this is to propagate the correlation information between HTTP request so other pieces of software can read this information and correctly correlate telemetry across microservices. The way to propagate this information is to use HTTP Headers for the Correlation Id, parent Correlation Id, etc. When you are in the scope of a HTTP Request, your tracing system should already have created four properties that you can use to send across your microservices. RequestId:0HLQV2BC3VP2T:00000001, SpanId:da13aa3c6fd9c146, TraceId:f11a03e3f078414fa7c0a0ce568c8b5c, ParentId:5076c17d0a604244 This is an example of the four properties you can find which identify the current request. RequestId is the unique id that represent the current HTTP Request. SpanId is the default automatically generated span. 
You can have more than one Span that scope different functionality inside your software. TraceId represent the id for current log trace. ParentId is the parent span id, that in some case can be the same or something different.","title":"How to Extend Your Tracing Information Between Microservices"},{"location":"observability/microservices/#example","text":"Now we are going to explore an example with 3 microservices that calls to each other in a row. This image is the summary of what is needed in each microservice to propagate the trace-id from A to C. The root caller is A and that is why it doesn't have a parent-id, only have a new trace-id. Next, A calls B using HTTP. To propagate the correlation information as part of the request, we are using two new headers based on the W3C Correlation specification, trace-id and parent-id. In this example because A is the root caller, A only sends its own trace-id to microservice B. When microservice B receives the incoming HTTP request, it checks the contents of these two headers. It reads the content of the trace-id header and sets its own parent-id to this trace-id (as shown in the green rectangle inside's B). In addition, it creates a new trace-id to signal that is a new scope for the telemetry. During the execution of microservice B, it also calls microservice C and repeats the pattern. As part of the request it includes the two headers and propagates trace-id and parent-id as well. Finally, microservice C, reads the value for the incoming trace-id and sets as his own parent-id, but also creates a new trace-id that will use to send telemetry about his own operations.","title":"Example"},{"location":"observability/microservices/#summary","text":"A number of Application Monitoring (APM) technology products already supports most of this Correlation Propagation. The most popular is OpenZipkin/B3-Propagation . W3C already proposed a recommendation for the W3C Trace Context , where you can see what SDK and frameworks already support this functionality. It's important to correctly implement the propagation specially when there are different teams that used different technology stacks in the same project. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Summary"},{"location":"observability/ml-observability/","text":"Observability in Machine Learning Development process of software system with machine learning component is more complex than traditional software. We need to monitor changes and variations in three dimensions: the code, the model and the data. We can distinguish two stages of such system lifespan: experimentation and production that require different approaches to observability as discussed below: Model Experimentation and Tuning Experimentation is a process of finding suitable machine learning model and its parameters via training and evaluating such models with one or more datasets. When developing and tuning machine learning models, the data scientists are interested in observing and comparing selected performance metrics for various model parameters. They also need a reliable way to reproduce a training process, such that a given dataset and given parameters produces the same models. 
There are many model metric evaluation solutions available, both open source (like MLFlow) and proprietary (like Azure Machine Learning Service), and of which some serve different purposes. To capture model metrics, there are a.o. the following options available: Azure Machine Learning Service SDK Azure Machine Learning service provides an SDK for Python, R and C# to capture your evaluation metrics to an Azure Machine Learning service (AML) Experiment. Experiments are viewed in the AML dashboard. Reproducibility is achieved by storing code or notebook snapshot together with viewed metric. You can create versioned Datasets within Azure Machine Learning service. MLFlow (for Databricks) MLFlow is open source framework, and can be hosted on Azure Databricks as its remote tracking server (it currently is the only solution that offers first-party integration with Databricks). You can use the MLFlow SDK tracking component to capture your evaluation metrics or any parameter you would like and track it at experimentation board in Azure Databricks. Source code and dataset version are also saved with log snapshot to provide reproducibility. TensorBoard TensorBoard is a popular tool amongst data scientist to visualize specific metrics of Deep Learning runs, especially of TensorFlow runs. TensorBoard is not an MLOps tool like AML/MLFlow, and therefore does not offer extensive logging capabilities. It is meant to be transient; and can therefore be used as an addition to an end-to-end MLOps tool like AML, but not as a complete MLOps tool. Application Insights Application Insights can be used as an alternative sink to capture model metrics, and can therefore offer more extensive options as metrics can be transferred to e.g. a PowerBI dashboard. It also enables log querying. However, this solution means that a custom application needs to be written to send logs to AppInsights (using for example the OpenCensus Python SDK), which would mean extra effort of creating/maintaining custom code. An extensive comparison of the four tools can be found as follows: Azure ML MLFlow TensorBoard Application Insights Metrics support Values, images, matrices, logs Values, images, matrices and plots as files Metrics relevant to DL research phase Values, images, matrices, logs Customizabile Basic Basic Very basic High Metrics accessible AML portal, AML SDK MLFlow UI, Tracking service API Tensorboard UI, history object Application Insights Logs accessible Rolling logs written to .txt files in blob storage, accessible via blob or AML portal. Not query-able Rolling logs are not stored Rolling logs are not stored Application Insights in Azure Portal. Query-able with KQL Ease of use and set up Very straightforward, only one portal More moving parts due to remote tracking server A bit over process overhead. Also depending on ML framework More moving parts as a custom app needs to be maintained Shareability Across people with access to AML workspace Across people with access to remote tracking server Across people with access to same directory Across people with access to AppInsights Model in Production The trained model can be deployed to production as container. Azure Machine Learning service provides SDK to deploy model as Azure Container Instance and publishes REST endpoint. You can monitor it using microservice observability methods( for more details -refer to Recipes section). MLFLow is an alternative way to deploy ML model as a service. 
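As an illustration of capturing experimentation metrics with the MLflow tracking SDK discussed earlier, a minimal sketch (the parameter and metric names and values are made up, and an MLflow tracking server such as an Azure Databricks workspace is assumed to be configured): import mlflow

with mlflow.start_run(run_name=\"experiment-01\"):
    mlflow.log_param(\"learning_rate\", 0.01)    # hyperparameter under test
    mlflow.log_param(\"dataset_version\", \"v2\")  # aids reproducibility
    # ... train and evaluate the model here ...
    mlflow.log_metric(\"accuracy\", 0.93)        # evaluation metrics for this run
    mlflow.log_metric(\"f1_score\", 0.90)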
Training and Re-Training To automatically retrain the model you can use AML Pipelines or Azure Databricks. When re-training with AML Pipelines you can monitor information of each run, including the output, logs, and various metrics in the Azure portal experiment dashboard, or manually extract it using the AML SDK Model Performance Over Time: Data Drift We re-train machine learning models to improve their performance and make models better aligned with data changing over time. However, in some cases model performance may degrade. This may happen in case data change dramatically and do not exhibit the patterns we observed during model development anymore. This effect is called data drift. Azure Machine Learning Service has preview feature to observe and report data drift. This article describes it in detail. Data Versioning It is recommended practice to add version to all datasets. You can create a versioned Azure ML Dataset for this purpose, or manually version it if using other systems.","title":"Observability in Machine Learning"},{"location":"observability/ml-observability/#observability-in-machine-learning","text":"Development process of software system with machine learning component is more complex than traditional software. We need to monitor changes and variations in three dimensions: the code, the model and the data. We can distinguish two stages of such system lifespan: experimentation and production that require different approaches to observability as discussed below:","title":"Observability in Machine Learning"},{"location":"observability/ml-observability/#model-experimentation-and-tuning","text":"Experimentation is a process of finding suitable machine learning model and its parameters via training and evaluating such models with one or more datasets. When developing and tuning machine learning models, the data scientists are interested in observing and comparing selected performance metrics for various model parameters. They also need a reliable way to reproduce a training process, such that a given dataset and given parameters produces the same models. There are many model metric evaluation solutions available, both open source (like MLFlow) and proprietary (like Azure Machine Learning Service), and of which some serve different purposes. To capture model metrics, there are a.o. the following options available: Azure Machine Learning Service SDK Azure Machine Learning service provides an SDK for Python, R and C# to capture your evaluation metrics to an Azure Machine Learning service (AML) Experiment. Experiments are viewed in the AML dashboard. Reproducibility is achieved by storing code or notebook snapshot together with viewed metric. You can create versioned Datasets within Azure Machine Learning service. MLFlow (for Databricks) MLFlow is open source framework, and can be hosted on Azure Databricks as its remote tracking server (it currently is the only solution that offers first-party integration with Databricks). You can use the MLFlow SDK tracking component to capture your evaluation metrics or any parameter you would like and track it at experimentation board in Azure Databricks. Source code and dataset version are also saved with log snapshot to provide reproducibility. TensorBoard TensorBoard is a popular tool amongst data scientist to visualize specific metrics of Deep Learning runs, especially of TensorFlow runs. TensorBoard is not an MLOps tool like AML/MLFlow, and therefore does not offer extensive logging capabilities. 
It is meant to be transient; and can therefore be used as an addition to an end-to-end MLOps tool like AML, but not as a complete MLOps tool. Application Insights Application Insights can be used as an alternative sink to capture model metrics, and can therefore offer more extensive options as metrics can be transferred to e.g. a PowerBI dashboard. It also enables log querying. However, this solution means that a custom application needs to be written to send logs to AppInsights (using for example the OpenCensus Python SDK), which would mean extra effort of creating/maintaining custom code. An extensive comparison of the four tools can be found as follows: Azure ML MLFlow TensorBoard Application Insights Metrics support Values, images, matrices, logs Values, images, matrices and plots as files Metrics relevant to DL research phase Values, images, matrices, logs Customizabile Basic Basic Very basic High Metrics accessible AML portal, AML SDK MLFlow UI, Tracking service API Tensorboard UI, history object Application Insights Logs accessible Rolling logs written to .txt files in blob storage, accessible via blob or AML portal. Not query-able Rolling logs are not stored Rolling logs are not stored Application Insights in Azure Portal. Query-able with KQL Ease of use and set up Very straightforward, only one portal More moving parts due to remote tracking server A bit over process overhead. Also depending on ML framework More moving parts as a custom app needs to be maintained Shareability Across people with access to AML workspace Across people with access to remote tracking server Across people with access to same directory Across people with access to AppInsights","title":"Model Experimentation and Tuning"},{"location":"observability/ml-observability/#model-in-production","text":"The trained model can be deployed to production as container. Azure Machine Learning service provides SDK to deploy model as Azure Container Instance and publishes REST endpoint. You can monitor it using microservice observability methods( for more details -refer to Recipes section). MLFLow is an alternative way to deploy ML model as a service.","title":"Model in Production"},{"location":"observability/ml-observability/#training-and-re-training","text":"To automatically retrain the model you can use AML Pipelines or Azure Databricks. When re-training with AML Pipelines you can monitor information of each run, including the output, logs, and various metrics in the Azure portal experiment dashboard, or manually extract it using the AML SDK","title":"Training and Re-Training"},{"location":"observability/ml-observability/#model-performance-over-time-data-drift","text":"We re-train machine learning models to improve their performance and make models better aligned with data changing over time. However, in some cases model performance may degrade. This may happen in case data change dramatically and do not exhibit the patterns we observed during model development anymore. This effect is called data drift. Azure Machine Learning Service has preview feature to observe and report data drift. This article describes it in detail.","title":"Model Performance Over Time: Data Drift"},{"location":"observability/ml-observability/#data-versioning","text":"It is recommended practice to add version to all datasets. 
You can create a versioned Azure ML Dataset for this purpose, or manually version it if using other systems.","title":"Data Versioning"},{"location":"observability/observability-as-code/","text":"Observability as Code As much as possible, configuration and management of observability assets such as cloud resource provisioning, monitoring alerts and dashboards must be managed as code. Observability as Code is achieved using any one of Terraform / Ansible / ARM Templates Examples of Observability as Code Dashboards as Code - Monitoring Dashboards can be created as JSON or XML templates. This template is source control maintained and any changes to the dashboards can be reviewed. Automation can be built for enabling the dashboard. More about how to do this in Azure . Grafana dashboard can also be configured as code which eventually can be source-controlled to be used in automation and pipelines. Alerts as Code - Alerts can be created within Azure by using Terraform or ARM templates. Such alerts can be source-controlled and be deployed as part of pipelines (Azure DevOps pipelines, Jenkins, GitHub Actions etc.). Few references of how to do this are: Terraform Monitor Metric Alert . Alerts can also be created based on log analytics query and can be defined as code using Terraform Monitor Scheduled Query Rules Alert . Automating Log Analytics Queries - There are several use cases where automation of log analytics queries may be needed. Example, Automatic Report Generation, Running custom queries programmatically for analysis, debugging etc. For these use cases to work, log queries should be source-controlled and automation can be built using log analytics REST or azure cli . Why It makes configuration repeatable and automatable. It also avoids manual configuration of monitoring alerts and dashboards from scratch across environments. Configured dashboards help troubleshoot errors during integration and deployment (CI/CD) We can audit changes and roll them back if there are any issues. Identify actionable insights from the generated metrics data across all environments, not just production. Configuration and management of observability assets like alert threshold, duration, configuration values using IAC help us in avoiding configuration mistakes, errors or overlooks during deployment. When practicing observability as code, the changes can be reviewed by the team similar to other code contributions.","title":"Observability as Code"},{"location":"observability/observability-as-code/#observability-as-code","text":"As much as possible, configuration and management of observability assets such as cloud resource provisioning, monitoring alerts and dashboards must be managed as code. Observability as Code is achieved using any one of Terraform / Ansible / ARM Templates","title":"Observability as Code"},{"location":"observability/observability-as-code/#examples-of-observability-as-code","text":"Dashboards as Code - Monitoring Dashboards can be created as JSON or XML templates. This template is source control maintained and any changes to the dashboards can be reviewed. Automation can be built for enabling the dashboard. More about how to do this in Azure . Grafana dashboard can also be configured as code which eventually can be source-controlled to be used in automation and pipelines. Alerts as Code - Alerts can be created within Azure by using Terraform or ARM templates. Such alerts can be source-controlled and be deployed as part of pipelines (Azure DevOps pipelines, Jenkins, GitHub Actions etc.). 
Few references of how to do this are: Terraform Monitor Metric Alert . Alerts can also be created based on log analytics query and can be defined as code using Terraform Monitor Scheduled Query Rules Alert . Automating Log Analytics Queries - There are several use cases where automation of log analytics queries may be needed. Example, Automatic Report Generation, Running custom queries programmatically for analysis, debugging etc. For these use cases to work, log queries should be source-controlled and automation can be built using log analytics REST or azure cli .","title":"Examples of Observability as Code"},{"location":"observability/observability-as-code/#why","text":"It makes configuration repeatable and automatable. It also avoids manual configuration of monitoring alerts and dashboards from scratch across environments. Configured dashboards help troubleshoot errors during integration and deployment (CI/CD) We can audit changes and roll them back if there are any issues. Identify actionable insights from the generated metrics data across all environments, not just production. Configuration and management of observability assets like alert threshold, duration, configuration values using IAC help us in avoiding configuration mistakes, errors or overlooks during deployment. When practicing observability as code, the changes can be reviewed by the team similar to other code contributions.","title":"Why"},{"location":"observability/observability-databricks/","text":"Observability for Azure Databricks Overview Azure Databricks is an Apache Spark\u2013based analytics service that makes it easy to rapidly develop and deploy big data analytics. Monitoring and troubleshooting performance issues is critical when operating production Azure Databricks workloads. It is important to log adequate information from Azure Databricks so that it is helpful to monitor and troubleshoot performance issues. Spark is designed to run on a cluster - a cluster is a set of Virtual Machines (VMs). Spark can horizontally scale with bigger workloads needed more VMs. Azure Databricks can scale in and out as needed. Approaches to Observability Azure Diagnostic Logs Azure Diagnostic Logging is provided out-of-the-box by Azure Databricks, providing visibility into actions performed against DBFS, Clusters, Accounts, Jobs, Notebooks, SSH, Workspace, Secrets, SQL Permissions, and Instance Pools. These logs are enabled using Azure Portal or CLI and can be configured to be delivered to one of these Azure resources. Log Analytics Workspace Blob Storage Event Hub Cluster Event Logs Cluster Event logs provide a quick overview into important Cluster lifecycle events. The log are structured - Timestamp, Event Type and Details. Unfortunately, there is no native way to export logs to Log Analytics. Logs will have to be delivered to Log Analytics either using REST API or polled in the dbfs using Azure Functions. VM Performance Metrics (OMS) Log Analytics Agent provides insights into the performance counters from the Cluster VMs and helps to understand the Cluster Utilization patters. Leveraging Linux OMX Agent to onboard VMs into Log Analytics, helps provide insights into the VM metrics, performance, inventory and syslog metrics. It is important to note that Linux OMS Agent is not specific to Azure Databricks. Application Logging Of all the logs collected, this is perhaps the most important one. Spark Monitoring library collects metrics about the driver, executors, JVM, HDFS, cache shuffling, DAGs, and much more. 
This library provides helpful insights to fine-tune Spark jobs. It allows monitoring and tracing each layer within Spark workloads, including performance and resource usage on the host and JVM, as well as Spark metrics and application-level logging. The library also includes ready-made Grafana dashboards that is a great starting point for building Azure Databricks dashboard. Logs via REST API Azure Databricks also provides REST API support. If there's any specific log data that is required, this data can be collected using the REST API calls. NSG Flow Logs Network security group (NSG) flow logs is a feature of Azure Network Watcher that allows you to log information about IP traffic flowing through an NSG. Flow data is sent to Azure Storage accounts from where you can access it as well as export it to any visualization tool, SIEM, or IDS of your choice. This log information is not specific to NSG Flow logs. This data can be used to identify unknown or undesired traffic and monitor traffic levels and/or bandwidth consumption. This is possible only with VNET-injected workspaces. Platform Logs Platform logs can be used to review provisioning/de-provisioning operations. This can be used to review activity in Databricks managed resource group. It helps discover operations performed at subscription level (like provisioning of VM, Disk etc.) These logs can be enabled via Azure Monitor > Activity Logs and shipped to Log Analytics. Ganglia Metrics Ganglia metrics is a Cluster Utilization UI and is available on the Azure Databricks. It is great for viewing live metrics of interactive clusters. Ganglia metrics is available by default and takes snapshot of usage every 15 minutes. Historical metrics are stored as .png files, making it impossible to analyze data.","title":"Observability for Azure Databricks"},{"location":"observability/observability-databricks/#observability-for-azure-databricks","text":"","title":"Observability for Azure Databricks"},{"location":"observability/observability-databricks/#overview","text":"Azure Databricks is an Apache Spark\u2013based analytics service that makes it easy to rapidly develop and deploy big data analytics. Monitoring and troubleshooting performance issues is critical when operating production Azure Databricks workloads. It is important to log adequate information from Azure Databricks so that it is helpful to monitor and troubleshoot performance issues. Spark is designed to run on a cluster - a cluster is a set of Virtual Machines (VMs). Spark can horizontally scale with bigger workloads needed more VMs. Azure Databricks can scale in and out as needed.","title":"Overview"},{"location":"observability/observability-databricks/#approaches-to-observability","text":"","title":"Approaches to Observability"},{"location":"observability/observability-databricks/#azure-diagnostic-logs","text":"Azure Diagnostic Logging is provided out-of-the-box by Azure Databricks, providing visibility into actions performed against DBFS, Clusters, Accounts, Jobs, Notebooks, SSH, Workspace, Secrets, SQL Permissions, and Instance Pools. These logs are enabled using Azure Portal or CLI and can be configured to be delivered to one of these Azure resources. Log Analytics Workspace Blob Storage Event Hub","title":"Azure Diagnostic Logs"},{"location":"observability/observability-databricks/#cluster-event-logs","text":"Cluster Event logs provide a quick overview into important Cluster lifecycle events. The log are structured - Timestamp, Event Type and Details. 
Unfortunately, there is no native way to export logs to Log Analytics. Logs will have to be delivered to Log Analytics either using REST API or polled in the dbfs using Azure Functions.","title":"Cluster Event Logs"},{"location":"observability/observability-databricks/#vm-performance-metrics-oms","text":"Log Analytics Agent provides insights into the performance counters from the Cluster VMs and helps to understand the Cluster Utilization patters. Leveraging Linux OMX Agent to onboard VMs into Log Analytics, helps provide insights into the VM metrics, performance, inventory and syslog metrics. It is important to note that Linux OMS Agent is not specific to Azure Databricks.","title":"VM Performance Metrics (OMS)"},{"location":"observability/observability-databricks/#application-logging","text":"Of all the logs collected, this is perhaps the most important one. Spark Monitoring library collects metrics about the driver, executors, JVM, HDFS, cache shuffling, DAGs, and much more. This library provides helpful insights to fine-tune Spark jobs. It allows monitoring and tracing each layer within Spark workloads, including performance and resource usage on the host and JVM, as well as Spark metrics and application-level logging. The library also includes ready-made Grafana dashboards that is a great starting point for building Azure Databricks dashboard.","title":"Application Logging"},{"location":"observability/observability-databricks/#logs-via-rest-api","text":"Azure Databricks also provides REST API support. If there's any specific log data that is required, this data can be collected using the REST API calls.","title":"Logs via REST API"},{"location":"observability/observability-databricks/#nsg-flow-logs","text":"Network security group (NSG) flow logs is a feature of Azure Network Watcher that allows you to log information about IP traffic flowing through an NSG. Flow data is sent to Azure Storage accounts from where you can access it as well as export it to any visualization tool, SIEM, or IDS of your choice. This log information is not specific to NSG Flow logs. This data can be used to identify unknown or undesired traffic and monitor traffic levels and/or bandwidth consumption. This is possible only with VNET-injected workspaces.","title":"NSG Flow Logs"},{"location":"observability/observability-databricks/#platform-logs","text":"Platform logs can be used to review provisioning/de-provisioning operations. This can be used to review activity in Databricks managed resource group. It helps discover operations performed at subscription level (like provisioning of VM, Disk etc.) These logs can be enabled via Azure Monitor > Activity Logs and shipped to Log Analytics.","title":"Platform Logs"},{"location":"observability/observability-databricks/#ganglia-metrics","text":"Ganglia metrics is a Cluster Utilization UI and is available on the Azure Databricks. It is great for viewing live metrics of interactive clusters. Ganglia metrics is available by default and takes snapshot of usage every 15 minutes. Historical metrics are stored as .png files, making it impossible to analyze data.","title":"Ganglia Metrics"},{"location":"observability/observability-pipelines/","text":"Observability of CI/CD Pipelines With increasing complexity to delivery pipelines, it is very important to consider Observability in the context of build and release of applications. Benefits Having proper instrumentation during build time helps gain insights into the various stages of the build and release process. 
Helps developers understand where the pipeline performance bottlenecks are, based on the data collected. This helps in having data-driven conversations around identifying latency between jobs, performance issues, artifact upload/download times providing valuable insights into agents availability and capacity. Helps to identify trends in failures, thus allowing developers to quickly do root cause analysis. Helps to provide an organization-wide view of pipeline health to easily identify trends. Points to Consider It is important to identify the Key Performance Indicators (KPIs) for evaluating a successful CI/CD pipeline. Where needed, additional tracing can be added to better record KPI metrics. For example, adding pipeline build tags to identify a 'Release Candidate' vs. 'Non-Release Candidate' helps in evaluating the end-to-end release process timeline. Depending on the tooling used (Azure DevOps, Jenkins etc.,), basic reporting on the pipelines is available out-of-the-box. It is important to evaluate these reports against the KPIs to understand if a custom reporting solution for their pipelines is needed. If required, custom dashboards can be built using third-party tools like Grafana or Power BI Dashboards.","title":"Observability of CI/CD Pipelines"},{"location":"observability/observability-pipelines/#observability-of-cicd-pipelines","text":"With increasing complexity to delivery pipelines, it is very important to consider Observability in the context of build and release of applications.","title":"Observability of CI/CD Pipelines"},{"location":"observability/observability-pipelines/#benefits","text":"Having proper instrumentation during build time helps gain insights into the various stages of the build and release process. Helps developers understand where the pipeline performance bottlenecks are, based on the data collected. This helps in having data-driven conversations around identifying latency between jobs, performance issues, artifact upload/download times providing valuable insights into agents availability and capacity. Helps to identify trends in failures, thus allowing developers to quickly do root cause analysis. Helps to provide an organization-wide view of pipeline health to easily identify trends.","title":"Benefits"},{"location":"observability/observability-pipelines/#points-to-consider","text":"It is important to identify the Key Performance Indicators (KPIs) for evaluating a successful CI/CD pipeline. Where needed, additional tracing can be added to better record KPI metrics. For example, adding pipeline build tags to identify a 'Release Candidate' vs. 'Non-Release Candidate' helps in evaluating the end-to-end release process timeline. Depending on the tooling used (Azure DevOps, Jenkins etc.,), basic reporting on the pipelines is available out-of-the-box. It is important to evaluate these reports against the KPIs to understand if a custom reporting solution for their pipelines is needed. If required, custom dashboards can be built using third-party tools like Grafana or Power BI Dashboards.","title":"Points to Consider"},{"location":"observability/pitfalls/","text":"Things to Watch for when Building Observable Systems Observability as an Afterthought One of the design goals when building a system should be to enable monitoring of the system. This helps planning and thinking application availability, logging and metrics at the time of design and development. Observability also acts as a great debugging tool providing developers a bird's eye view of the system. 
By leaving instrumentation and logging of metrics towards the end, the development teams lose valuable insights during development. Metric Fatigue It is recommended to collect and measure what you need and not what you can . Don't attempt to monitor everything. If the data is not actionable, it is useless and becomes noise. On the contrary, it is sometimes very difficult to forecast every possible scenario that could go wrong. There must be a balance between collecting what is needed vs. logging every single activity in the system. A general rule of thumb is to follow these principles: rules that catch incidents must be simple, relevant and reliable; any data that is collected but not aggregated or alerted on must be reviewed to confirm it is still required. Context All data logged must contain rich context, which is useful for getting an overall view of the system and makes it easier to trace back errors/failures during troubleshooting. While logging data, care must also be taken to avoid data silos. Personally Identifiable Information As a general rule, do not log any customer-sensitive data or Personally Identifiable Information (PII). Ensure any pertinent privacy regulations are followed regarding PII (e.g. GDPR). Read more here on how to keep sensitive data out of logs.","title":"Things to Watch for when Building Observable Systems"},{"location":"observability/pitfalls/#things-to-watch-for-when-building-observable-systems","text":"","title":"Things to Watch for when Building Observable Systems"},{"location":"observability/pitfalls/#observability-as-an-afterthought","text":"One of the design goals when building a system should be to enable monitoring of the system. This helps in planning for and thinking about application availability, logging and metrics at design and development time. Observability also acts as a great debugging tool, providing developers with a bird's eye view of the system. By leaving instrumentation and logging of metrics towards the end, the development teams lose valuable insights during development.","title":"Observability as an Afterthought"},{"location":"observability/pitfalls/#metric-fatigue","text":"It is recommended to collect and measure what you need and not what you can . Don't attempt to monitor everything. If the data is not actionable, it is useless and becomes noise. On the contrary, it is sometimes very difficult to forecast every possible scenario that could go wrong. There must be a balance between collecting what is needed vs. logging every single activity in the system. A general rule of thumb is to follow these principles: rules that catch incidents must be simple, relevant and reliable; any data that is collected but not aggregated or alerted on must be reviewed to confirm it is still required.","title":"Metric Fatigue"},{"location":"observability/pitfalls/#context","text":"All data logged must contain rich context, which is useful for getting an overall view of the system and makes it easier to trace back errors/failures during troubleshooting. While logging data, care must also be taken to avoid data silos.","title":"Context"},{"location":"observability/pitfalls/#personally-identifiable-information","text":"As a general rule, do not log any customer-sensitive data or Personally Identifiable Information (PII). Ensure any pertinent privacy regulations are followed regarding PII (e.g. GDPR).
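For illustration, a minimal sketch of a deny-list approach using Python's standard logging module (the field names and handler setup are made up for the example): import logging

DENY_LIST = {\"email\", \"username\", \"ssn\"}  # sensitive keys to strip

class RedactSensitiveFields(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, \"context\", None)  # structured payload, if any
        if isinstance(context, dict):
            for key in DENY_LIST & context.keys():
                context[key] = \"[redacted]\"
        return True  # keep the record, only with redacted fields

logger = logging.getLogger(\"app\")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactSensitiveFields())
logger.info(\"user signed in\", extra={\"context\": {\"email\": \"a@b.com\", \"plan\": \"trial\"}})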
Read more here on how to keep sensitive data out of logs.","title":"Personally Identifiable Information"},{"location":"observability/profiling/","text":"Profiling Overview Profiling is a form of runtime analysis that measures various components of the runtime such as, memory allocation, garbage collection, threads and locks, call stacks, or frequency and duration of specific functions. It can be used to see which functions are the most costly in your binary, allowing you to focus your effort on removing the largest inefficiencies as quickly as possible. It can help you find deadlocks, memory leaks, or inefficient memory allocation, and help inform decisions around resource allocation (ie: CPU or RAM). How to Profile your Applications Profiling is somewhat language dependent, so start off by searching for \"profile $language\" (some common tools are listed below). Additionally, Linux Perf is a good fallback, since a lot of languages have bindings in C/C++. Profiling does incur some cost, as it requires inspecting the call stack, and sometimes pausing the application all together (ie: to trigger a full GC in Java). It is recommended to continuously profile your services, say for 10s every 10 minutes. Consider the cost when deciding on tuning these parameters. Different tools visualize profiles differently. Common CPU profiles might use a directed graph or a flame graph. Unfortunately, each profiler tool typically uses its own format for storing profiles, and comes with its own visualization. Tools (Java, Go, Python, Ruby, eBPF) Pyroscope continuous profiling out of the box. (Java and Go) Flame - profiling containers in Kubernetes (Java, Python, Go) Datadog Continuous profiler (Go) profefe , which builds pprof to provide continuous profiling (Java) Eclipse Memory Analyzer","title":"Profiling"},{"location":"observability/profiling/#profiling","text":"","title":"Profiling"},{"location":"observability/profiling/#overview","text":"Profiling is a form of runtime analysis that measures various components of the runtime such as, memory allocation, garbage collection, threads and locks, call stacks, or frequency and duration of specific functions. It can be used to see which functions are the most costly in your binary, allowing you to focus your effort on removing the largest inefficiencies as quickly as possible. It can help you find deadlocks, memory leaks, or inefficient memory allocation, and help inform decisions around resource allocation (ie: CPU or RAM).","title":"Overview"},{"location":"observability/profiling/#how-to-profile-your-applications","text":"Profiling is somewhat language dependent, so start off by searching for \"profile $language\" (some common tools are listed below). Additionally, Linux Perf is a good fallback, since a lot of languages have bindings in C/C++. Profiling does incur some cost, as it requires inspecting the call stack, and sometimes pausing the application all together (ie: to trigger a full GC in Java). It is recommended to continuously profile your services, say for 10s every 10 minutes. Consider the cost when deciding on tuning these parameters. Different tools visualize profiles differently. Common CPU profiles might use a directed graph or a flame graph. Unfortunately, each profiler tool typically uses its own format for storing profiles, and comes with its own visualization.","title":"How to Profile your Applications"},{"location":"observability/profiling/#tools","text":"(Java, Go, Python, Ruby, eBPF) Pyroscope continuous profiling out of the box. 
(Java and Go) Flame - profiling containers in Kubernetes (Java, Python, Go) Datadog Continuous profiler (Go) profefe , which builds pprof to provide continuous profiling (Java) Eclipse Memory Analyzer","title":"Tools"},{"location":"observability/recipes-observability/","text":"Recipes Application Insights/ASP.NET GitHub Repo , Article . Application Insights/ASP.NET Core with Distributed Trace Context Propagation to Kafka GitHub Repo . Example: OpenTelemetry Over a Message Oriented Architecture in Java with Jaeger, Prometheus and Azure Monitor GitHub Repo Example: Setup Azure Monitor Dashboards and Alerts with Terraform GitHub Repo On-premises Application Insights On-premise Application Insights is a service that is compatible with Azure App Insights, but stores the data in an in-house database like PostgreSQL or object storage like Azurite . On-premises Application Insights is useful as a drop-in replacement for Azure Application Insights in scenarios where a solution must be cloud deployable but must also support on-premises disconnected deployment scenarios. On-premises Application Insights is also useful for testing telemetry integration. Issues related to telemetry can be hard to catch since often these integrations are excluded from unit-test or integration test flows due to it being non-trivial to use a live Azure Application Insights resource for testing, e.g. managing the lifetime of the resource, having to ignore old telemetry for assertions, if a new resource is used it can take a while for the telemetry to show up, etc. The On-premise Application Insights service can be used to make it easier to integrate with an Azure Application Insights compatible API endpoint during local development or continuous integration without having to spin up a resource in Azure. Additionally, the service simplifies integration testing of asynchronous workflows such as web workers since integration tests can now be written to assert against telemetry logged to the service, e.g. assert that no exceptions were logged, assert that some number of events of a specific type were logged within a certain time-frame, etc. Azure DevOps Pipelines Reporting with Power BI The Azure DevOps Pipelines Report contains a Power BI template for monitoring project, pipeline, and pipeline run data from an Azure DevOps (AzDO) organization. This dashboard recipe provides observability for AzDO pipelines by displaying various metrics (i.e. average runtime, run outcome statistics, etc.) in a table. Additionally, the second page of the template visualizes pipeline success and failure trends using Power BI charts. Documentation and setup information can be found in the project README. Python Logger Class for Application Insights using OpenCensus The Azure SDK for Python contains an Azure Monitor Opentelemetry Distro client library for Python . You can view samples of how to use the library in this GitHub Repo . With this library you can easily collect traces, metrics, and logs. 
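A minimal usage sketch of the distro (assuming the azure-monitor-opentelemetry and opentelemetry-api packages are installed; the connection string value is a placeholder): import logging
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Wire logs, metrics and traces to Application Insights via the distro.
configure_azure_monitor(connection_string=\"InstrumentationKey=<your-key>\")

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span(\"sample-operation\"):
    logging.getLogger(__name__).warning(\"hello from inside a traced operation\")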
Java OpenTelemetry Examples This GitHub Repo contains a set of fully-functional, working examples of using the OpenTelemetry Java APIs and SDK.","title":"Recipes"},{"location":"observability/recipes-observability/#recipes","text":"","title":"Recipes"},{"location":"observability/recipes-observability/#application-insightsaspnet","text":"GitHub Repo , Article .","title":"Application Insights/ASP.NET"},{"location":"observability/recipes-observability/#application-insightsaspnet-core-with-distributed-trace-context-propagation-to-kafka","text":"GitHub Repo .","title":"Application Insights/ASP.NET Core with Distributed Trace Context Propagation to Kafka"},{"location":"observability/recipes-observability/#example-opentelemetry-over-a-message-oriented-architecture-in-java-with-jaeger-prometheus-and-azure-monitor","text":"GitHub Repo","title":"Example: OpenTelemetry Over a Message Oriented Architecture in Java with Jaeger, Prometheus and Azure Monitor"},{"location":"observability/recipes-observability/#example-setup-azure-monitor-dashboards-and-alerts-with-terraform","text":"GitHub Repo","title":"Example: Setup Azure Monitor Dashboards and Alerts with Terraform"},{"location":"observability/recipes-observability/#on-premises-application-insights","text":"On-premise Application Insights is a service that is compatible with Azure App Insights, but stores the data in an in-house database like PostgreSQL or object storage like Azurite . On-premises Application Insights is useful as a drop-in replacement for Azure Application Insights in scenarios where a solution must be cloud deployable but must also support on-premises disconnected deployment scenarios. On-premises Application Insights is also useful for testing telemetry integration. Issues related to telemetry can be hard to catch since often these integrations are excluded from unit-test or integration test flows due to it being non-trivial to use a live Azure Application Insights resource for testing, e.g. managing the lifetime of the resource, having to ignore old telemetry for assertions, if a new resource is used it can take a while for the telemetry to show up, etc. The On-premise Application Insights service can be used to make it easier to integrate with an Azure Application Insights compatible API endpoint during local development or continuous integration without having to spin up a resource in Azure. Additionally, the service simplifies integration testing of asynchronous workflows such as web workers since integration tests can now be written to assert against telemetry logged to the service, e.g. assert that no exceptions were logged, assert that some number of events of a specific type were logged within a certain time-frame, etc.","title":"On-premises Application Insights"},{"location":"observability/recipes-observability/#azure-devops-pipelines-reporting-with-power-bi","text":"The Azure DevOps Pipelines Report contains a Power BI template for monitoring project, pipeline, and pipeline run data from an Azure DevOps (AzDO) organization. This dashboard recipe provides observability for AzDO pipelines by displaying various metrics (i.e. average runtime, run outcome statistics, etc.) in a table. Additionally, the second page of the template visualizes pipeline success and failure trends using Power BI charts. 
Documentation and setup information can be found in the project README.","title":"Azure DevOps Pipelines Reporting with Power BI"},{"location":"observability/recipes-observability/#python-logger-class-for-application-insights-using-opencensus","text":"The Azure SDK for Python contains an Azure Monitor Opentelemetry Distro client library for Python . You can view samples of how to use the library in this GitHub Repo . With this library you can easily collect traces, metrics, and logs.","title":"Python Logger Class for Application Insights using OpenCensus"},{"location":"observability/recipes-observability/#java-opentelemetry-examples","text":"This GitHub Repo contains a set of fully-functional, working examples of using the OpenTelemetry Java APIs and SDK.","title":"Java OpenTelemetry Examples"},{"location":"observability/pillars/dashboard/","text":"Dashboard Overview Dashboard is a form of data visualization that provides \"at a glance\" view of Key Performance Indicators(KPIs) of observable system. Dashboard connects multiple data sources allowing creation of visual representation of data insights which otherwise are difficult to understand. Dashboard can be used to: show trends identify patterns(user, usage, search etc) measure efficiency easily identify data outliers and correlations view health state or performance of the system give an outlook of the KPI that is important to a business/process Best Practices Common questions to ask yourself when building dashboard would be: Where did my user spend most of their time at? What is my user searching? How do I better help my team with alerts and troubleshooting? Is my system healthy for the past one day/week/month/quarter? Here are principles to consider when building dashboards: Separate a dashboard in multiple sections for simplicity. Adding page jump or anchor(#section) is also a plus if applicable. Add multiple and simple charts. Build simple chart, have more of them rather than a complicated all in one chart. Identify goals or KPI measurement. Identifying goals or KPI helps in defining what needs to be achieved. Here are some examples - server downtime, mean time to address error, service level agreement. Ask questions that can help reach the defined goal or KPI. This may sound counter-intuitive, the more questions asked while constructing dashboard the better the outcome will be. Questions like location, internet service provider, time of day the users make requests to server would be a good start. Validate the questions. This is often done with stakeholders, sponsors, leads or project managers. Observe the dashboard that is built. Is the data reflecting what the stakeholders set out to answer? Always remember this process takes time. Building dashboard is easy, building an observable dashboard to show pattern is hard. Recommended Tools Azure Monitor Workbooks - Supporting markdown, Azure Workbooks is tightly integrated with Azure services making this highly customizable without extra tool. Create dashboard using log query - Dashboard can be created using log query on Log Analytics data. Building dashboards using Application Insights - Dashboards can be created using Application Insights as well. Power Bi - Power Bi is one of the easier tools to create dashboards from data sources and reports. Grafana - Getting started with Grafana. Grafana is a popular open source tool for dashboarding and visualization. Azure Monitor as Grafana data source - This provides a step by step integration of Azure Monitor to Grafana. 
Brief comparison of various tools Dashboard Samples and Recipes Azure Workbooks Performance analysis - A measurement on how the system performs. Workbook template available in gallery. Failure analysis - A report about system failure with details. Workbook template available in gallery. Application Performance Index( Apdex ) - This is a way to measure user satisfaction. It classifies performance into three zones based on a baseline performance threshold T. The template for Appdex is available in Azure Workbooks gallery as well. Application Insights User retention analysis User navigation patterns analysis User session analysis For other tools, these can be used as a reference to recreate if a template is not readily available. Grafana with Azure Monitor as Data Source Azure Kubernetes Service - Cluster & Namespace Metrics - Container Insights metrics for Kubernetes clusters. Cluster utilization, namespace utilization, Node cpu & memory, Node disk usage & disk io, node network & kubelet docker operation metrics Azure Kubernetes Service - Container Level & Pod Metrics - This contains Container level and Pod Metrics like CPU and Memory which are missing in the above dashboard. Summary In order to build an observable dashboard, the goal is to make use of collected metrics, logs, traces to give an insight on how the system performs, user behaves and identify patterns. There are a lot of tools and templates out there. Whichever the choice is, a good dashboard is always a dashboard that can help you answer questions about the system and user, keep track of the KPI and goal while also allowing informed business decisions to be made.","title":"Dashboard"},{"location":"observability/pillars/dashboard/#dashboard","text":"","title":"Dashboard"},{"location":"observability/pillars/dashboard/#overview","text":"Dashboard is a form of data visualization that provides \"at a glance\" view of Key Performance Indicators(KPIs) of observable system. Dashboard connects multiple data sources allowing creation of visual representation of data insights which otherwise are difficult to understand. Dashboard can be used to: show trends identify patterns(user, usage, search etc) measure efficiency easily identify data outliers and correlations view health state or performance of the system give an outlook of the KPI that is important to a business/process","title":"Overview"},{"location":"observability/pillars/dashboard/#best-practices","text":"Common questions to ask yourself when building dashboard would be: Where did my user spend most of their time at? What is my user searching? How do I better help my team with alerts and troubleshooting? Is my system healthy for the past one day/week/month/quarter? Here are principles to consider when building dashboards: Separate a dashboard in multiple sections for simplicity. Adding page jump or anchor(#section) is also a plus if applicable. Add multiple and simple charts. Build simple chart, have more of them rather than a complicated all in one chart. Identify goals or KPI measurement. Identifying goals or KPI helps in defining what needs to be achieved. Here are some examples - server downtime, mean time to address error, service level agreement. Ask questions that can help reach the defined goal or KPI. This may sound counter-intuitive, the more questions asked while constructing dashboard the better the outcome will be. Questions like location, internet service provider, time of day the users make requests to server would be a good start. Validate the questions. 
This is often done with stakeholders, sponsors, leads or project managers. Observe the dashboard that is built. Is the data reflecting what the stakeholders set out to answer? Always remember this process takes time. Building dashboard is easy, building an observable dashboard to show pattern is hard.","title":"Best Practices"},{"location":"observability/pillars/dashboard/#recommended-tools","text":"Azure Monitor Workbooks - Supporting markdown, Azure Workbooks is tightly integrated with Azure services making this highly customizable without extra tool. Create dashboard using log query - Dashboard can be created using log query on Log Analytics data. Building dashboards using Application Insights - Dashboards can be created using Application Insights as well. Power Bi - Power Bi is one of the easier tools to create dashboards from data sources and reports. Grafana - Getting started with Grafana. Grafana is a popular open source tool for dashboarding and visualization. Azure Monitor as Grafana data source - This provides a step by step integration of Azure Monitor to Grafana. Brief comparison of various tools","title":"Recommended Tools"},{"location":"observability/pillars/dashboard/#dashboard-samples-and-recipes","text":"","title":"Dashboard Samples and Recipes"},{"location":"observability/pillars/dashboard/#azure-workbooks","text":"Performance analysis - A measurement on how the system performs. Workbook template available in gallery. Failure analysis - A report about system failure with details. Workbook template available in gallery. Application Performance Index( Apdex ) - This is a way to measure user satisfaction. It classifies performance into three zones based on a baseline performance threshold T. The template for Appdex is available in Azure Workbooks gallery as well.","title":"Azure Workbooks"},{"location":"observability/pillars/dashboard/#application-insights","text":"User retention analysis User navigation patterns analysis User session analysis For other tools, these can be used as a reference to recreate if a template is not readily available.","title":"Application Insights"},{"location":"observability/pillars/dashboard/#grafana-with-azure-monitor-as-data-source","text":"Azure Kubernetes Service - Cluster & Namespace Metrics - Container Insights metrics for Kubernetes clusters. Cluster utilization, namespace utilization, Node cpu & memory, Node disk usage & disk io, node network & kubelet docker operation metrics Azure Kubernetes Service - Container Level & Pod Metrics - This contains Container level and Pod Metrics like CPU and Memory which are missing in the above dashboard.","title":"Grafana with Azure Monitor as Data Source"},{"location":"observability/pillars/dashboard/#summary","text":"In order to build an observable dashboard, the goal is to make use of collected metrics, logs, traces to give an insight on how the system performs, user behaves and identify patterns. There are a lot of tools and templates out there. Whichever the choice is, a good dashboard is always a dashboard that can help you answer questions about the system and user, keep track of the KPI and goal while also allowing informed business decisions to be made.","title":"Summary"},{"location":"observability/pillars/logging/","text":"Logging Overview Logs are discrete events with the goal of helping engineers identify problem area(s) during failures. Collection Methods When it comes to log collection methods, two of the standard techniques are a direct-write, or an agent-based approach. 
Directly written log events are handled in-process of the particular component, usually utilizing a provided library. Azure Monitor has direct send capabilities, but it's not recommended for serious/production use. This approach has some advantages: There is no external process to configure or monitor No log file management (rolling, expiring) to prevent out of disk space issues. The potential trade-offs of this approach: Potentially higher memory usage if the particular library is using a memory backed buffer. In the event of an extended service outage, log data may get dropped or truncated due to buffer constraints. Multiple component process logging will manage & emit logs individually, which can be more complex to manage for the outbound load. Agent-based log collection relies on an external process running on the host machine, with the particular component emitting log data stdout or file. Writing log data to stdout is the preferred practice when running applications within a container environment like Kubernetes. The container runtime redirects the output to files, which can then be processed by an agent. Azure Monitor , Grafana Loki Elastic's Logstash and Fluent Bit are examples of log shipping agents. There are several advantages when using an agent to collect & ship log files: Centralized configuration. Collecting multiple sources of data with a single process. Local pre-processing & filtering of log data before sending it to a central service. Utilizing disk space as a data buffer during a service disruption. This approach isn't without trade-offs: Required exclusive CPU & memory resources for the processing of log data. Persistent disk space for buffering. Best Practices Pay attention to logging levels. Logging too much will increase costs and decrease application throughput. Ensure logging configuration can be modified without code changes. Ideally, make it changeable without application restarts. If available, take advantage of logging levels per category allowing granular logging configuration. Check for log levels before logging, thus avoiding allocations and string manipulation costs. Ensure service versions are included in logs to be able to identify problematic releases. Log a raised exception only once. In your handlers, only catch expected exceptions that you can handle gracefully (even with a specific return code). If you want to log and rethrow, leave it to the top level exception handler. Do the minimal amount of cleanup work needed then throw to maintain the original stack trace. Don\u2019t log a warning or stack trace for expected exceptions (eg: properly expected 404, 403 HTTP statuses). Fine tune logging levels in production (>= warning for instance). During a new release the verbosity can be increased to facilitate bug identification. If using sampling, implement this at the service level rather than defining it in the logging system. This way we have control over what gets logged. An additional benefit is reduced number of roundtrips. Only include failures from health checks and non-business driven requests. Ensure a downstream system malfunction won't cause repetitive logs being stored. Don't reinvent the wheel, use existing tools to collect and analyze the data. Ensure personal identifiable information policies and restrictions are followed. Ensure errors and exceptions in dependent services are captured and logged. 
For example, if an application uses Redis cache, Service Bus or any other service, any errors/exceptions raised while accessing these services should be captured and logged. If there's Sufficient Log Data, is there a Need for Instrumenting Metrics? Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. Both have a valuable part to play in creating observable systems. Having Problems Identifying What to Log? At application startup : Unrecoverable errors from startup. Warnings if application still runnable, but not as expected (i.e. not providing blob connection string, thus resorting to local files. Another example is if there's a need to fail back to a secondary service or a known good state, because it didn\u2019t get an answer from a primary dependency.) Information about the service\u2019s state at startup (build #, configs loaded, etc.) Per incoming request : Basic information for each incoming request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload size, record counts, etc. (whatever you need to learn something from the aggregate data) Warning for any unexpected exceptions, caught only at the top controller/interceptor and logged with or alongside the request info, with stack trace. Return a 500. This code doesn\u2019t know what happened. Per outgoing request : Basic information for each outgoing request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload sizes, record counts returned, etc. Report perceived availability and latency of dependencies and including slicing/clustering data that could help with later analysis. Recommended Tools Azure Monitor - Umbrella of services including system metrics, log analytics and more. Grafana Loki - An open source log aggregation platform, built on the learnings from the Prometheus Community for highly efficient collection & storage of log data at scale. The Elastic Stack - An open source log analytics tech stack utilizing Logstash, Beats, Elastic search and Kibana. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Logging"},{"location":"observability/pillars/logging/#logging","text":"","title":"Logging"},{"location":"observability/pillars/logging/#overview","text":"Logs are discrete events with the goal of helping engineers identify problem area(s) during failures.","title":"Overview"},{"location":"observability/pillars/logging/#collection-methods","text":"When it comes to log collection methods, two of the standard techniques are a direct-write, or an agent-based approach. Directly written log events are handled in-process of the particular component, usually utilizing a provided library. Azure Monitor has direct send capabilities, but it's not recommended for serious/production use. This approach has some advantages: There is no external process to configure or monitor No log file management (rolling, expiring) to prevent out of disk space issues. The potential trade-offs of this approach: Potentially higher memory usage if the particular library is using a memory backed buffer. In the event of an extended service outage, log data may get dropped or truncated due to buffer constraints. 
Multiple component process logging will manage & emit logs individually, which can be more complex to manage for the outbound load. Agent-based log collection relies on an external process running on the host machine, with the particular component emitting log data stdout or file. Writing log data to stdout is the preferred practice when running applications within a container environment like Kubernetes. The container runtime redirects the output to files, which can then be processed by an agent. Azure Monitor , Grafana Loki Elastic's Logstash and Fluent Bit are examples of log shipping agents. There are several advantages when using an agent to collect & ship log files: Centralized configuration. Collecting multiple sources of data with a single process. Local pre-processing & filtering of log data before sending it to a central service. Utilizing disk space as a data buffer during a service disruption. This approach isn't without trade-offs: Required exclusive CPU & memory resources for the processing of log data. Persistent disk space for buffering.","title":"Collection Methods"},{"location":"observability/pillars/logging/#best-practices","text":"Pay attention to logging levels. Logging too much will increase costs and decrease application throughput. Ensure logging configuration can be modified without code changes. Ideally, make it changeable without application restarts. If available, take advantage of logging levels per category allowing granular logging configuration. Check for log levels before logging, thus avoiding allocations and string manipulation costs. Ensure service versions are included in logs to be able to identify problematic releases. Log a raised exception only once. In your handlers, only catch expected exceptions that you can handle gracefully (even with a specific return code). If you want to log and rethrow, leave it to the top level exception handler. Do the minimal amount of cleanup work needed then throw to maintain the original stack trace. Don\u2019t log a warning or stack trace for expected exceptions (eg: properly expected 404, 403 HTTP statuses). Fine tune logging levels in production (>= warning for instance). During a new release the verbosity can be increased to facilitate bug identification. If using sampling, implement this at the service level rather than defining it in the logging system. This way we have control over what gets logged. An additional benefit is reduced number of roundtrips. Only include failures from health checks and non-business driven requests. Ensure a downstream system malfunction won't cause repetitive logs being stored. Don't reinvent the wheel, use existing tools to collect and analyze the data. Ensure personal identifiable information policies and restrictions are followed. Ensure errors and exceptions in dependent services are captured and logged. For example, if an application uses Redis cache, Service Bus or any other service, any errors/exceptions raised while accessing these services should be captured and logged.","title":"Best Practices"},{"location":"observability/pillars/logging/#if-theres-sufficient-log-data-is-there-a-need-for-instrumenting-metrics","text":"Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. 
Both have a valuable part to play in creating observable systems.","title":"If there's Sufficient Log Data, is there a Need for Instrumenting Metrics?"},{"location":"observability/pillars/logging/#having-problems-identifying-what-to-log","text":"At application startup : Unrecoverable errors from startup. Warnings if application still runnable, but not as expected (i.e. not providing blob connection string, thus resorting to local files. Another example is if there's a need to fail back to a secondary service or a known good state, because it didn\u2019t get an answer from a primary dependency.) Information about the service\u2019s state at startup (build #, configs loaded, etc.) Per incoming request : Basic information for each incoming request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload size, record counts, etc. (whatever you need to learn something from the aggregate data) Warning for any unexpected exceptions, caught only at the top controller/interceptor and logged with or alongside the request info, with stack trace. Return a 500. This code doesn\u2019t know what happened. Per outgoing request : Basic information for each outgoing request: the url (scrubbed of any personally identifying data, a.k.a. PII), any user/tenant/request dimensions, response code returned, request-to-response latency, payload sizes, record counts returned, etc. Report perceived availability and latency of dependencies and including slicing/clustering data that could help with later analysis.","title":"Having Problems Identifying What to Log?"},{"location":"observability/pillars/logging/#recommended-tools","text":"Azure Monitor - Umbrella of services including system metrics, log analytics and more. Grafana Loki - An open source log aggregation platform, built on the learnings from the Prometheus Community for highly efficient collection & storage of log data at scale. The Elastic Stack - An open source log analytics tech stack utilizing Logstash, Beats, Elastic search and Kibana. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Recommended Tools"},{"location":"observability/pillars/metrics/","text":"Metrics Overview Metrics provide a near real-time stream of data, informing operators and stakeholders about the functions the system is performing as well as its health. Unlike logging and tracing, metric data tends to be more efficient to transmit and store. Collection Methods Metric collection approaches fall into two broad categories: push metrics & pull metrics. Push metrics means that the originating component sends the data to a remote service or agent. Azure Monitor and Etsy's statsd are examples of push metrics. Some strengths with push metrics include: Only require network egress to the remote target. Originating component controls the frequency of measurement. Simplified configuration as the component only needs to know the destination of where to send data. Some trade-offs with this approach: At scale, it is much more difficult to control data transmission rates, which can cause service throttling or dropping of values. Determining if every component, particularly in a dynamic scale environment, is healthy and sending data is difficult. In the case of pull metrics, each originating component publishes an endpoint for the metric agent to connect to and gather measurements. 
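A minimal sketch of such an endpoint, using the prometheus_client library (the metric name and port are arbitrary examples, not taken from this playbook), looks like this:

```python
# Minimal pull-metrics sketch using prometheus_client
# (metric name and port are arbitrary examples).
import time

from prometheus_client import Counter, start_http_server

REQUESTS_TOTAL = Counter("app_requests_total", "Total requests handled")

if __name__ == "__main__":
    # Expose a /metrics endpoint on port 8000 for a scraper to pull from.
    start_http_server(8000)
    while True:
        REQUESTS_TOTAL.inc()  # stand-in for real work being measured
        time.sleep(1)
```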
Prometheus and its ecosystem of tools are an example of pull style metrics. Benefits experienced using a pull metrics setup may involve: Singular configuration for determining what is measured and the frequency of measurement for the local environment. Every measurement target has a meta metric related to if the collection is successful or not, which can be used as a general health check. Support for routing, filtering and processing of metrics before sending them onto a globally central metrics store. Items of concern to some may include: Configuring & managing data sources can lead to a complex configuration. Prometheus has tooling to auto-discover and configure data sources in some environments, such as Kubernetes, but there are always exceptions to this, which lead to configuration complexity. Network configuration may add further complexity if firewalls and other ACLs need to be managed to allow connectivity. Best Practices When Should I use Metrics Instead of Logs? Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. Both have a valuable part to play in creating observable systems. What Should be Tracked? System critical measurements that relate to the application/machine health, which are usually excellent alert candidates. Work with your engineering and devops peers to identify the metrics, but they may include: CPU and memory utilization. Request rate. Queue length. Unexpected exception count. Dependent service metrics like response time for Redis cache, Sql server or Service bus. Important business-related measurements, which drive reporting to stakeholders. Consult with the various stakeholders of the component, but some examples may include: Jobs performed. User Session length. Games played. Site visits. Dimension Labels Modern metric systems today usually define a single time series metric as the combination of the name of the metric and its dictionary of dimension labels. Labels are an excellent way to distinguish one instance of a metric, from another while still allowing for aggregation and other operations to be performed on the set for analysis. Some common labels used in metrics may include: Container Name Host name Code Version Kubernetes cluster name Azure Region Note : Since dimension labels are used for aggregations and grouping operations, do not use unique strings or those with high cardinality as the value of a label. The value of the label is significantly diminished for reporting and in many cases has a negative performance hit on the metric system used to track it. Recommended Tools Azure Monitor - Umbrella of services including system metrics, log analytics and more. Prometheus - A real-time monitoring & alerting application. It's exposition format for exposing time-series is the basis for OpenMetrics's standard format. Thanos - Open source, highly available Prometheus setup with long term storage capabilities. Cortex - Horizontally scalable, highly available, multi-tenant, long term Prometheus. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Metrics"},{"location":"observability/pillars/metrics/#metrics","text":"","title":"Metrics"},{"location":"observability/pillars/metrics/#overview","text":"Metrics provide a near real-time stream of data, informing operators and stakeholders about the functions the system is performing as well as its health. 
Unlike logging and tracing, metric data tends to be more efficient to transmit and store.","title":"Overview"},{"location":"observability/pillars/metrics/#collection-methods","text":"Metric collection approaches fall into two broad categories: push metrics & pull metrics. Push metrics means that the originating component sends the data to a remote service or agent. Azure Monitor and Etsy's statsd are examples of push metrics. Some strengths with push metrics include: Only require network egress to the remote target. Originating component controls the frequency of measurement. Simplified configuration as the component only needs to know the destination of where to send data. Some trade-offs with this approach: At scale, it is much more difficult to control data transmission rates, which can cause service throttling or dropping of values. Determining if every component, particularly in a dynamic scale environment, is healthy and sending data is difficult. In the case of pull metrics, each originating component publishes an endpoint for the metric agent to connect to and gather measurements. Prometheus and its ecosystem of tools are an example of pull style metrics. Benefits experienced using a pull metrics setup may involve: Singular configuration for determining what is measured and the frequency of measurement for the local environment. Every measurement target has a meta metric related to if the collection is successful or not, which can be used as a general health check. Support for routing, filtering and processing of metrics before sending them onto a globally central metrics store. Items of concern to some may include: Configuring & managing data sources can lead to a complex configuration. Prometheus has tooling to auto-discover and configure data sources in some environments, such as Kubernetes, but there are always exceptions to this, which lead to configuration complexity. Network configuration may add further complexity if firewalls and other ACLs need to be managed to allow connectivity.","title":"Collection Methods"},{"location":"observability/pillars/metrics/#best-practices","text":"","title":"Best Practices"},{"location":"observability/pillars/metrics/#when-should-i-use-metrics-instead-of-logs","text":"Logs vs Metrics vs Traces covers some high level guidance on when to utilize metric data and when to use log data. Both have a valuable part to play in creating observable systems.","title":"When Should I use Metrics Instead of Logs?"},{"location":"observability/pillars/metrics/#what-should-be-tracked","text":"System critical measurements that relate to the application/machine health, which are usually excellent alert candidates. Work with your engineering and devops peers to identify the metrics, but they may include: CPU and memory utilization. Request rate. Queue length. Unexpected exception count. Dependent service metrics like response time for Redis cache, Sql server or Service bus. Important business-related measurements, which drive reporting to stakeholders. Consult with the various stakeholders of the component, but some examples may include: Jobs performed. User Session length. Games played. Site visits.","title":"What Should be Tracked?"},{"location":"observability/pillars/metrics/#dimension-labels","text":"Modern metric systems today usually define a single time series metric as the combination of the name of the metric and its dictionary of dimension labels. 
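As a hedged illustration with the prometheus_client library (the metric and label names are invented for the example), the metric name plus each distinct set of label values identifies one time series:

```python
# Illustrative only: one metric name, several label dimensions.
# Each distinct combination of label values becomes its own time series.
from prometheus_client import Counter

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests processed",
    ["region", "code_version", "status"],  # low-cardinality dimensions
)

# These two calls increment two different time series of the same metric.
HTTP_REQUESTS.labels(region="westeurope", code_version="1.4.2", status="200").inc()
HTTP_REQUESTS.labels(region="westeurope", code_version="1.4.2", status="500").inc()

# Avoid high-cardinality values (user IDs, request IDs) as label values.
```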
Labels are an excellent way to distinguish one instance of a metric, from another while still allowing for aggregation and other operations to be performed on the set for analysis. Some common labels used in metrics may include: Container Name Host name Code Version Kubernetes cluster name Azure Region Note : Since dimension labels are used for aggregations and grouping operations, do not use unique strings or those with high cardinality as the value of a label. The value of the label is significantly diminished for reporting and in many cases has a negative performance hit on the metric system used to track it.","title":"Dimension Labels"},{"location":"observability/pillars/metrics/#recommended-tools","text":"Azure Monitor - Umbrella of services including system metrics, log analytics and more. Prometheus - A real-time monitoring & alerting application. It's exposition format for exposing time-series is the basis for OpenMetrics's standard format. Thanos - Open source, highly available Prometheus setup with long term storage capabilities. Cortex - Horizontally scalable, highly available, multi-tenant, long term Prometheus. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources.","title":"Recommended Tools"},{"location":"observability/pillars/tracing/","text":"Tracing Overview Produces the information required to observe series of correlated operations in a distributed system. Once collected they show the path, measurements and faults in an end-to-end transaction. Best Practices Ensure that at least key business transactions are traced. Include in each trace necessary information to identify software releases (i.e. service name, version). This is important to correlate deployments and system degradation. Ensure dependencies are included in trace (databases, I/O). If costs are a concern use sampling, avoiding throwing away errors, unexpected behavior and critical information. Don't reinvent the wheel, use existing tools to collect and analyze the data. Ensure personal identifiable information policies and restrictions are followed. Recommended Tools Azure Monitor - Umbrella of services including system metrics, log analytics and more. Jaeger Tracing - Open source, end-to-end distributed tracing. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Tracing"},{"location":"observability/pillars/tracing/#tracing","text":"","title":"Tracing"},{"location":"observability/pillars/tracing/#overview","text":"Produces the information required to observe series of correlated operations in a distributed system. Once collected they show the path, measurements and faults in an end-to-end transaction.","title":"Overview"},{"location":"observability/pillars/tracing/#best-practices","text":"Ensure that at least key business transactions are traced. Include in each trace necessary information to identify software releases (i.e. service name, version). This is important to correlate deployments and system degradation. Ensure dependencies are included in trace (databases, I/O). 
If costs are a concern use sampling, avoiding throwing away errors, unexpected behavior and critical information. Don't reinvent the wheel, use existing tools to collect and analyze the data. Ensure personal identifiable information policies and restrictions are followed.","title":"Best Practices"},{"location":"observability/pillars/tracing/#recommended-tools","text":"Azure Monitor - Umbrella of services including system metrics, log analytics and more. Jaeger Tracing - Open source, end-to-end distributed tracing. Grafana - Open source dashboard & visualization tool. Supports Log, Metrics and Distributed tracing data sources. Consider using OpenTelemetry as it implements open-source cross-platform context propagation for end-to-end distributed transactions over heterogeneous components out-of-the-box. It takes care of automatically creating and managing the Trace Context object among a full stack of microservices implemented across different technical stacks.","title":"Recommended Tools"},{"location":"observability/tools/","text":"Tools and Patterns There are a number of modern tools to make systems observable. While identifying and/or creating tools that work for your system, here are a few things to consider to help guide the choices. Must be simple to integrate and easy to use. It must be possible to aggregate and visualize data. Tools must provide real-time data. Must be able to guide users to the problem area with suitable, adequate end-to-end context. Choices Loki OpenTelemetry Kubernetes Dashboards Prometheus Service Mesh Leveraging a Service Mesh that follows the Sidecar Pattern quickly sets up a go-to set of metrics, and traces (although traces need to be propagated from incoming requests to outgoing requests manually). A sidecar works by intercepting all incoming and outgoing traffic to your image. It then adds trace headers to each request and emits a standard set of logs and metrics. These metrics are extremely powerful for observability, allowing every service, whether client-side or server-side, to leverage a unified set of metrics, including: Latency Bytes Request Rate Error Rate In a microservice architecture, pinpointing the root cause of a spike in 500's can be non-trivial, but with the added observability from a sidecar you can quickly determine which service in your service mesh resulted in the spike in errors. Service Mesh's have a large surface area for configuration, and can seem like a daunting undertaking to deploy. However, most services (including Linkerd) offer a sane set of defaults, and can be deployed via the happy path to quickly land these observability wins.","title":"Tools and Patterns"},{"location":"observability/tools/#tools-and-patterns","text":"There are a number of modern tools to make systems observable. While identifying and/or creating tools that work for your system, here are a few things to consider to help guide the choices. Must be simple to integrate and easy to use. It must be possible to aggregate and visualize data. Tools must provide real-time data. 
Must be able to guide users to the problem area with suitable, adequate end-to-end context.","title":"Tools and Patterns"},{"location":"observability/tools/#choices","text":"Loki OpenTelemetry Kubernetes Dashboards Prometheus","title":"Choices"},{"location":"observability/tools/#service-mesh","text":"Leveraging a Service Mesh that follows the Sidecar Pattern quickly sets up a go-to set of metrics, and traces (although traces need to be propagated from incoming requests to outgoing requests manually). A sidecar works by intercepting all incoming and outgoing traffic to your image. It then adds trace headers to each request and emits a standard set of logs and metrics. These metrics are extremely powerful for observability, allowing every service, whether client-side or server-side, to leverage a unified set of metrics, including: Latency Bytes Request Rate Error Rate In a microservice architecture, pinpointing the root cause of a spike in 500's can be non-trivial, but with the added observability from a sidecar you can quickly determine which service in your service mesh resulted in the spike in errors. Service Mesh's have a large surface area for configuration, and can seem like a daunting undertaking to deploy. However, most services (including Linkerd) offer a sane set of defaults, and can be deployed via the happy path to quickly land these observability wins.","title":"Service Mesh"},{"location":"observability/tools/KubernetesDashboards/","text":"Kubernetes UI Dashboards This document covers the options and benefits of various Kubernetes UI Dashboards which are useful tools for monitoring and debugging your application on Kubernetes Clusters. It allows the management of applications running in the cluster, debug them and manage the cluster all through these dashboards. Overview and Background There are times when not all solutions can be run locally. This limitation could be due to a cloud service which does not offer a robust or efficient way to locally debug the environment. In these cases, it is necessary to use other tools which provide the capabilities to monitor your application with Kubernetes. Advantages and Use Cases Allows the ability to view, manage and monitor the operational aspects of the Kubernetes Cluster. Benefits of using a UI dashboard includes the following: see an overview of the cluster deploy applications onto the cluster troubleshoot applications running on the cluster view, create, modify, and delete Kubernetes resources view basic resource metrics including resource usage for Kubernetes objects view and access logs live view of the pods state (e.g. started, terminating, etc) Different dashboards may provide different functionalities, and the use case to choose a particular dashboard will depend on the requirements. For example, many dashboards provide a way to only monitor your applications on Kubernetes but do not provide a way to manage them. Open Source Dashboards There are currently several UI dashboards available to monitor your applications or manage them with Kubernetes. For example: Octant Prometheus and Grafana Kube Prometheus Stack Chart : provides an easy way to operate end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator. 
K8Dash kube-ops-view : a tool to visualize node occupancy & utilization Lens : Client side desktop tool Thanos and Cortex : Multi-cluster implementations Resources Alternatives to Kubernetes Dashboard","title":"Kubernetes UI Dashboards"},{"location":"observability/tools/KubernetesDashboards/#kubernetes-ui-dashboards","text":"This document covers the options and benefits of various Kubernetes UI Dashboards which are useful tools for monitoring and debugging your application on Kubernetes Clusters. It allows the management of applications running in the cluster, debug them and manage the cluster all through these dashboards.","title":"Kubernetes UI Dashboards"},{"location":"observability/tools/KubernetesDashboards/#overview-and-background","text":"There are times when not all solutions can be run locally. This limitation could be due to a cloud service which does not offer a robust or efficient way to locally debug the environment. In these cases, it is necessary to use other tools which provide the capabilities to monitor your application with Kubernetes.","title":"Overview and Background"},{"location":"observability/tools/KubernetesDashboards/#advantages-and-use-cases","text":"Allows the ability to view, manage and monitor the operational aspects of the Kubernetes Cluster. Benefits of using a UI dashboard includes the following: see an overview of the cluster deploy applications onto the cluster troubleshoot applications running on the cluster view, create, modify, and delete Kubernetes resources view basic resource metrics including resource usage for Kubernetes objects view and access logs live view of the pods state (e.g. started, terminating, etc) Different dashboards may provide different functionalities, and the use case to choose a particular dashboard will depend on the requirements. For example, many dashboards provide a way to only monitor your applications on Kubernetes but do not provide a way to manage them.","title":"Advantages and Use Cases"},{"location":"observability/tools/KubernetesDashboards/#open-source-dashboards","text":"There are currently several UI dashboards available to monitor your applications or manage them with Kubernetes. For example: Octant Prometheus and Grafana Kube Prometheus Stack Chart : provides an easy way to operate end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator. K8Dash kube-ops-view : a tool to visualize node occupancy & utilization Lens : Client side desktop tool Thanos and Cortex : Multi-cluster implementations","title":"Open Source Dashboards"},{"location":"observability/tools/KubernetesDashboards/#resources","text":"Alternatives to Kubernetes Dashboard","title":"Resources"},{"location":"observability/tools/OpenTelemetry/","text":"Open Telemetry Building observable systems enable one to measure how well or bad the application is behaving and WHY it is behaving either way. Adopting open-source standards related to implementing telemetry and tracing features built on top of the OpenTelemetry framework helps decouple vendor-specific implementations while maintaining an extensible, standard, and portable open-source solution. OpenTelemetry is an open-source observability standard that defines how to generate, collect and describe telemetry in distributed systems. 
OpenTelemetry also provides a single-point distribution of a set of APIs, SDKs, and instrumentation libraries that implements the open-source standard, which can collect, process, and orchestrate telemetry data (signals) like traces, metrics, and logs. It supports multiple popular languages (Java, .NET, Python, JavaScript, Golang, Erlang, etc.). Open telemetry follows a vendor-agnostic and standards-based approach for collecting and managing telemetry data. An important point to note is that OpenTelemetry does not have its own backend; all telemetry collected by OpenTelemetry Collector must be sent to a backend like Prometheus, Jaeger, Zipkin, Azure Monitor, etc. Open telemetry is also the 2nd most active CNCF project only after Kubernetes. The main two Problems OpenTelemetry solves are: First, vendor neutrality for tracing, monitoring, and logging APIs and second, out-of-the-box cross-platform context propagation implementation for end-to-end distributed tracing over heterogeneous components. Open Telemetry Core Concepts Open Telemetry Implementation Patterns A detailed explanation of OpenTelemetry concepts is out of the scope of this repo. There is plenty of available information about how the SDK and the automatic instrumentation are configured and how the Exporters, Tracers, Context, and Span's hierarchy work. See the Reference section for valuable OpenTelemetry resources. However, understanding the core implementation patterns will help you know what approach better fits the scenario you are trying to solve. These are three main patterns as follows: Automatic telemetry: Support for automatic instrumentation is available for some languages. OpenTelemetry automatic instrumentation (100% codeless) is typically done through library hooks or monkey-patching library code. Automatic instrumentation will intercept all interactions and dependencies and automatically send the telemetry to the configured exporters. More information about this concept can be found in the OpenTelemetry instrumentation doc . Manual tracing: This must be done by coding using the OpenTelemetry SDK, managing the tracer objects to obtain Spans, and forming instrumented OpenTelemetry Scopes to identify the code segments to be manually traced. Also, by using the @WithSpan annotations (method decorations in C# and Java ) to mark whole methods that will be automatically traced. Hybrid approach: Most Production-ready scenarios will require a mix of both techniques, using the automatic instrumentation to collect automatic telemetry and the OpenTelemetry SDK to identify code segments that are important to instrument manually. When considering production-ready scenarios, the hybrid approach is the way to go as it allows for a throughout cover over the whole solution. It provides automatic context propagation and events correlation out of the box. Collector The collector is a separate process that is designed to be a \u2018sink\u2019 for telemetry data emitted by many processes, which can then export that data to backend systems. The collector has two different deployment strategies \u2013 either running as an agent alongside a service or as a gateway which is a remote application. In general, using both is recommended: the agent would be deployed with your service and run as a separate process or in a sidecar; meanwhile, the collector would be deployed separately, as its own application running in a container or virtual machine. 
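A minimal sketch of the first hop in that chain is shown below, assuming the opentelemetry-sdk and OTLP exporter packages are installed and an agent or collector is listening on the default OTLP gRPC port 4317; the service and span names are arbitrary.

```python
# Minimal sketch: send spans over OTLP to a local agent/collector,
# which then forwards them to the configured backend(s).
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc are installed.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("place-order"):
    pass  # application work happens here

provider.force_flush()  # make sure buffered spans reach the agent before exit
```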
Each agent would forward telemetry data to the collector, which could then export it to a variety of backend systems such as Lightstep, Jaeger, or Prometheus. The agent can be also replaced with the automatic instrumentation if supported. The automatic instrumentation provides the collector capabilities of retrieving, processing and exporting the telemetry. Regardless of how you choose to instrument or deploy OpenTelemetry, exporters provide powerful options for reporting telemetry data. You can directly export from your service, you can proxy through the collector, or you can aggregate into standalone collectors \u2013 or even a mix of these. Instrumentation Libraries A library that enables observability for another library is called an instrumentation library. OpenTelemetry libraries are language specific, currently there is good support for Java, Python, Javascript, dotnet and golang. Support for automatic instrumentation is available for some libraries which make using OpenTelemetry easy and trivial. In case automatic instrumentation is not available, manual instrumentation can be configured by using the OpenTelemetry SDK. Integration of OpenTelemetry OpenTelemetry can be used to collect, process and export data into multiple backends, some popular integrations supported with OpenTelemetry are: Zipkin Prometheus Jaeger New Relic Azure Monitor AWS X-Ray Datadog Kafka Lightstep Splunk GCP Monitor Why use OpenTelemetry The main reason to use OpenTelemetry is that it offers an open-source standard for implementing distributed telemetry (context propagation) over heterogeneous systems. There is no need to reinvent the wheel to implement end-to-end business flow transactions monitoring when using OpenTelemetry. It enables tracing, metrics, and logging telemetry through a set of single-distribution multi-language libraries and tools that allow for a plug-and-play telemetry architecture that includes the concept of agents and collectors. Moreover, avoiding any proprietary lock down and achieving vendor-agnostic neutrality for tracing, monitoring, and logging APIs AND backends allow maximum portability and extensibility patterns. Another good reason to use OpenTelemetry would be whether the stack uses OpenCensus or OpenTracing. As OpenCensus and OpenTracing have carved the way for OpenTelemetry, it makes sense to introduce OpenTelemetry where OpenCensus or OpenTracing is used as it still has backward compatibility. Apart from adding custom attributes, sampling, collecting data for metrics and traces, OpenTelemetry is governed by specifications and backed up by big players in the Observability landscape like Microsoft, Splunk, AppDynamics, etc. OpenTelemetry will likely become a de-facto open-source standard for enabling metrics and tracing when all features become GA. Current Status of OpenTelemetry Project OpenTelemetry is a project which emerged from merging of OpenCensus and OpenTracing in 2019. Although OpenCensus and OpenTracing are frozen and no new features are being developed for them, OpenTelemetry has backward compatibility with OpenCensus and OpenTracing. Some features of OpenTelemetry are still in beta, feature support for different languages is being tracked here: Feature Status of OpenTelemetry . Status of OpenTelemetry project can be tracked here . From the website: Our goal is to provide a generally available, production quality release for the tracing data source across most OpenTelemetry components in the first half of 2021. 
Several components have already reached this milestone! We expect metrics to reach the same status in the second half of 2021 and are targeting logs in 2022. What to Watch Out for As OpenTelemetry is a very recent project (first GA version of some features released in 2020), many features are still in beta hence due diligence needs to be done before using such features in production. Also, OpenTelemetry supports many popular languages but features in all languages are not at par. Some languages offer more features as compared to other languages. It also needs to be called out as some features are not in GA, there may be some incompatibility issues with the tooling. That being said, OpenTelemetry is one of the most active projects of CNCF , so it is expected that many more features would reach GA soon. January 2022 UPDATE Apart from the logging specification and implementation that are still marked as draft or beta, all other specifications and implementations regarding tracing and metrics are marked as stable or feature-freeze. Many libraries are still on active development whatsoever, so thorough analysis has to be made depending on the language on a feature basis. Integration Options with Azure Monitor Using the Azure Monitor OpenTelemetry Exporter Library This scenario uses the OpenTelemetry SDK as the core instrumentation library. Basically this means you will instrument your application using the OpenTelemetry libraries, but you will additionally use the Azure Monitor OpenTelemetry Exporter and then added it as an additional exporter with the OpenTelemetry SDK. In this way, the OpenTelemetry traces your application creates will be pushed to your Azure Monitor Instance. Using the Application Insights Agent Jar File - Java Only Java OpenTelemetry instrumentation provides another way to integrate with Azure Monitor, by using Applications Insights Java Agent jar . When configuring this option, the Applications Insights Agent file is added when executing the application. The applicationinsights.json configuration file must be also be added as part of the applications artifacts. Pay close attention to the preview section, where the \"openTelemetryApiSupport\": true, property is set to true, enabling the agent to intercept OpenTelemetry telemetry created in the application code pushing it to Azure Monitor. OpenTelemetry Java Agent instrumentation supports many libraries and frameworks and application servers . Application Insights Java Agent enhances this list. Therefore, the main difference between running the OpenTelemetry Java Agent vs. the Application Insights Java Agent is demonstrated in the amount of traces getting logged in Azure Monitor. When running with Application Insights Java agent there's more telemetry getting pushed to Azure Monitor. On the other hand, when running the solution using the Application Insights agent mode, it is essential to highlight that nothing gets logged on Jaeger (or any other OpenTelemetry exporter). All traces will be pushed exclusively to Azure Monitor. However, both manual instrumentation done via the OpenTelemetry SDK and all automatic traces, dependencies, performance counters, and metrics being instrumented by the Application Insights agent are sent to Azure Monitor. Although there is a rich amount of additional data automatically instrumented by the Application Insights agent, it can be deduced that it is not necessarily OpenTelemetry compliant. Only the traces logged by the manual instrumentation using the OpenTelemetry SDK are. 
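To make the exporter-library option concrete, here is a hedged sketch in Python rather than Java, assuming the azure-monitor-opentelemetry-exporter package; the connection string is a placeholder. The Azure Monitor exporter is registered on the OpenTelemetry SDK like any other exporter, which is what enables the plug-and-play and multiple-exporter rows in the comparison below.

```python
# Hedged sketch of the "exporter library" option: the Azure Monitor exporter is
# registered on the OpenTelemetry SDK like any other exporter, so other exporters
# (console, OTLP, ...) can run side by side. The connection string is a placeholder.
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        AzureMonitorTraceExporter(
            connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000"
        )
    )
)
# A second exporter on the same provider: every span goes to both destinations.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```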
OpenTelemetry vs Application Insights Agents Compared Highlight OpenTelemetry Agent App Insights Agent Automatic Telemetry Y Y Manual OpenTelemetry Y Y Plug and Play Exports Y N Multiple Exports Y N Full Open Telemetry layout (decoupling agents, collectors and exporters) Y N Enriched out of the box telemetry N Y Unified telemetry backend N Y Summary As you may have guessed, there is no \"one size fits all\" approach when implementing OpenTelemetry with Azure Monitor as a backend. At the time of this writing, if you want to have the flexibility of having different OpenTelemetry backends, you should definitively go with the OpenTelemetry Agent, even though you'd sacrifice all automating tracing flowing to Azure Monitor. On the other hand, if you want to get the best of Azure Monitor and still want to instrument your code with the OpenTelemetry SDK, you should use the Application Insights Agent and manually instrument your code with the OpenTelemetry SDK to get the best of both worlds. Either way, instrumenting your code with OpenTelemetry seems the right approach as the ecosystem will only get bigger, better, and more robust. Advanced topics Use the Azure OpenTelemetry Tracing plugin library for Java to enable distributed tracing across Azure components through OpenTelemetry. Manual Trace Context Propagation The trace context is stored in Thread-local storage. When the application flow involves multiple threads (eg. multithreaded work-queue, asynchronous processing) then the traces won't get combined into one end-to-end trace chain with automatic context propagation . To achieve that you need to manually propagate the trace context ( example in Java ) by storing the trace headers along with the work-queue item. Telemetry Testing Mission critical telemetry data should be covered by testing. You can cover telemetry by tests by mocking the telemetry collector web server. In automated testing environment the telemetry instrumentation can be configured to use OTLP exporter and point the OTLP exporter endpoint to the collector web server. Using mocking servers libraries (eg. MockServer or WireMock) can help verify the telemetry data pushed to the collector. Resources OpenTelemetry Official Site Getting Started with dotnet and OpenTelemetry Using OpenTelemetry Collector OpenTelemetry Java SDK Manual Instrumentation OpenTelemetry Instrumentation Agent for Java Application Insights Java Agent Azure Monitor OpenTelemetry Exporter client library for Java Azure OpenTelemetry Tracing plugin library for Java Application Insights Agent's OpenTelemetry configuration","title":"Open Telemetry"},{"location":"observability/tools/OpenTelemetry/#open-telemetry","text":"Building observable systems enable one to measure how well or bad the application is behaving and WHY it is behaving either way. Adopting open-source standards related to implementing telemetry and tracing features built on top of the OpenTelemetry framework helps decouple vendor-specific implementations while maintaining an extensible, standard, and portable open-source solution. OpenTelemetry is an open-source observability standard that defines how to generate, collect and describe telemetry in distributed systems. OpenTelemetry also provides a single-point distribution of a set of APIs, SDKs, and instrumentation libraries that implements the open-source standard, which can collect, process, and orchestrate telemetry data (signals) like traces, metrics, and logs. 
It supports multiple popular languages (Java, .NET, Python, JavaScript, Golang, Erlang, etc.). Open telemetry follows a vendor-agnostic and standards-based approach for collecting and managing telemetry data. An important point to note is that OpenTelemetry does not have its own backend; all telemetry collected by OpenTelemetry Collector must be sent to a backend like Prometheus, Jaeger, Zipkin, Azure Monitor, etc. Open telemetry is also the 2nd most active CNCF project only after Kubernetes. The main two Problems OpenTelemetry solves are: First, vendor neutrality for tracing, monitoring, and logging APIs and second, out-of-the-box cross-platform context propagation implementation for end-to-end distributed tracing over heterogeneous components.","title":"Open Telemetry"},{"location":"observability/tools/OpenTelemetry/#open-telemetry-core-concepts","text":"","title":"Open Telemetry Core Concepts"},{"location":"observability/tools/OpenTelemetry/#open-telemetry-implementation-patterns","text":"A detailed explanation of OpenTelemetry concepts is out of the scope of this repo. There is plenty of available information about how the SDK and the automatic instrumentation are configured and how the Exporters, Tracers, Context, and Span's hierarchy work. See the Reference section for valuable OpenTelemetry resources. However, understanding the core implementation patterns will help you know what approach better fits the scenario you are trying to solve. These are three main patterns as follows: Automatic telemetry: Support for automatic instrumentation is available for some languages. OpenTelemetry automatic instrumentation (100% codeless) is typically done through library hooks or monkey-patching library code. Automatic instrumentation will intercept all interactions and dependencies and automatically send the telemetry to the configured exporters. More information about this concept can be found in the OpenTelemetry instrumentation doc . Manual tracing: This must be done by coding using the OpenTelemetry SDK, managing the tracer objects to obtain Spans, and forming instrumented OpenTelemetry Scopes to identify the code segments to be manually traced. Also, by using the @WithSpan annotations (method decorations in C# and Java ) to mark whole methods that will be automatically traced. Hybrid approach: Most Production-ready scenarios will require a mix of both techniques, using the automatic instrumentation to collect automatic telemetry and the OpenTelemetry SDK to identify code segments that are important to instrument manually. When considering production-ready scenarios, the hybrid approach is the way to go as it allows for a throughout cover over the whole solution. It provides automatic context propagation and events correlation out of the box.","title":"Open Telemetry Implementation Patterns"},{"location":"observability/tools/OpenTelemetry/#collector","text":"The collector is a separate process that is designed to be a \u2018sink\u2019 for telemetry data emitted by many processes, which can then export that data to backend systems. The collector has two different deployment strategies \u2013 either running as an agent alongside a service or as a gateway which is a remote application. In general, using both is recommended: the agent would be deployed with your service and run as a separate process or in a sidecar; meanwhile, the collector would be deployed separately, as its own application running in a container or virtual machine. 
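On the application side, sending telemetry to a locally running agent (or directly to a gateway collector) usually amounts to configuring an OTLP exporter. A minimal Python sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and the default OTLP gRPC port 4317, might look like this:

```python
# Illustrative sketch: export spans over OTLP to a collector agent running
# locally (e.g. as a sidecar). Assumes the default OTLP gRPC port 4317.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-order"):
    pass  # application work; the span is batched and shipped to the agent
```

The collector's own pipeline (receivers, processors, exporters) is configured separately in its YAML configuration file.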
Each agent would forward telemetry data to the collector, which could then export it to a variety of backend systems such as Lightstep, Jaeger, or Prometheus. The agent can be also replaced with the automatic instrumentation if supported. The automatic instrumentation provides the collector capabilities of retrieving, processing and exporting the telemetry. Regardless of how you choose to instrument or deploy OpenTelemetry, exporters provide powerful options for reporting telemetry data. You can directly export from your service, you can proxy through the collector, or you can aggregate into standalone collectors \u2013 or even a mix of these.","title":"Collector"},{"location":"observability/tools/OpenTelemetry/#instrumentation-libraries","text":"A library that enables observability for another library is called an instrumentation library. OpenTelemetry libraries are language specific, currently there is good support for Java, Python, Javascript, dotnet and golang. Support for automatic instrumentation is available for some libraries which make using OpenTelemetry easy and trivial. In case automatic instrumentation is not available, manual instrumentation can be configured by using the OpenTelemetry SDK.","title":"Instrumentation Libraries"},{"location":"observability/tools/OpenTelemetry/#integration-of-opentelemetry","text":"OpenTelemetry can be used to collect, process and export data into multiple backends, some popular integrations supported with OpenTelemetry are: Zipkin Prometheus Jaeger New Relic Azure Monitor AWS X-Ray Datadog Kafka Lightstep Splunk GCP Monitor","title":"Integration of OpenTelemetry"},{"location":"observability/tools/OpenTelemetry/#why-use-opentelemetry","text":"The main reason to use OpenTelemetry is that it offers an open-source standard for implementing distributed telemetry (context propagation) over heterogeneous systems. There is no need to reinvent the wheel to implement end-to-end business flow transactions monitoring when using OpenTelemetry. It enables tracing, metrics, and logging telemetry through a set of single-distribution multi-language libraries and tools that allow for a plug-and-play telemetry architecture that includes the concept of agents and collectors. Moreover, avoiding any proprietary lock down and achieving vendor-agnostic neutrality for tracing, monitoring, and logging APIs AND backends allow maximum portability and extensibility patterns. Another good reason to use OpenTelemetry would be whether the stack uses OpenCensus or OpenTracing. As OpenCensus and OpenTracing have carved the way for OpenTelemetry, it makes sense to introduce OpenTelemetry where OpenCensus or OpenTracing is used as it still has backward compatibility. Apart from adding custom attributes, sampling, collecting data for metrics and traces, OpenTelemetry is governed by specifications and backed up by big players in the Observability landscape like Microsoft, Splunk, AppDynamics, etc. OpenTelemetry will likely become a de-facto open-source standard for enabling metrics and tracing when all features become GA.","title":"Why use OpenTelemetry"},{"location":"observability/tools/OpenTelemetry/#current-status-of-opentelemetry-project","text":"OpenTelemetry is a project which emerged from merging of OpenCensus and OpenTracing in 2019. Although OpenCensus and OpenTracing are frozen and no new features are being developed for them, OpenTelemetry has backward compatibility with OpenCensus and OpenTracing. 
Some features of OpenTelemetry are still in beta, feature support for different languages is being tracked here: Feature Status of OpenTelemetry . Status of OpenTelemetry project can be tracked here . From the website: Our goal is to provide a generally available, production quality release for the tracing data source across most OpenTelemetry components in the first half of 2021. Several components have already reached this milestone! We expect metrics to reach the same status in the second half of 2021 and are targeting logs in 2022.","title":"Current Status of OpenTelemetry Project"},{"location":"observability/tools/OpenTelemetry/#what-to-watch-out-for","text":"As OpenTelemetry is a very recent project (first GA version of some features released in 2020), many features are still in beta hence due diligence needs to be done before using such features in production. Also, OpenTelemetry supports many popular languages but features in all languages are not at par. Some languages offer more features as compared to other languages. It also needs to be called out as some features are not in GA, there may be some incompatibility issues with the tooling. That being said, OpenTelemetry is one of the most active projects of CNCF , so it is expected that many more features would reach GA soon.","title":"What to Watch Out for"},{"location":"observability/tools/OpenTelemetry/#january-2022-update","text":"Apart from the logging specification and implementation that are still marked as draft or beta, all other specifications and implementations regarding tracing and metrics are marked as stable or feature-freeze. Many libraries are still on active development whatsoever, so thorough analysis has to be made depending on the language on a feature basis.","title":"January 2022 UPDATE"},{"location":"observability/tools/OpenTelemetry/#integration-options-with-azure-monitor","text":"","title":"Integration Options with Azure Monitor"},{"location":"observability/tools/OpenTelemetry/#using-the-azure-monitor-opentelemetry-exporter-library","text":"This scenario uses the OpenTelemetry SDK as the core instrumentation library. Basically this means you will instrument your application using the OpenTelemetry libraries, but you will additionally use the Azure Monitor OpenTelemetry Exporter and then added it as an additional exporter with the OpenTelemetry SDK. In this way, the OpenTelemetry traces your application creates will be pushed to your Azure Monitor Instance.","title":"Using the Azure Monitor OpenTelemetry Exporter Library"},{"location":"observability/tools/OpenTelemetry/#using-the-application-insights-agent-jar-file-java-only","text":"Java OpenTelemetry instrumentation provides another way to integrate with Azure Monitor, by using Applications Insights Java Agent jar . When configuring this option, the Applications Insights Agent file is added when executing the application. The applicationinsights.json configuration file must be also be added as part of the applications artifacts. Pay close attention to the preview section, where the \"openTelemetryApiSupport\": true, property is set to true, enabling the agent to intercept OpenTelemetry telemetry created in the application code pushing it to Azure Monitor. OpenTelemetry Java Agent instrumentation supports many libraries and frameworks and application servers . Application Insights Java Agent enhances this list. Therefore, the main difference between running the OpenTelemetry Java Agent vs. 
the Application Insights Java Agent is demonstrated in the amount of traces getting logged in Azure Monitor. When running with Application Insights Java agent there's more telemetry getting pushed to Azure Monitor. On the other hand, when running the solution using the Application Insights agent mode, it is essential to highlight that nothing gets logged on Jaeger (or any other OpenTelemetry exporter). All traces will be pushed exclusively to Azure Monitor. However, both manual instrumentation done via the OpenTelemetry SDK and all automatic traces, dependencies, performance counters, and metrics being instrumented by the Application Insights agent are sent to Azure Monitor. Although there is a rich amount of additional data automatically instrumented by the Application Insights agent, it can be deduced that it is not necessarily OpenTelemetry compliant. Only the traces logged by the manual instrumentation using the OpenTelemetry SDK are.","title":"Using the Application Insights Agent Jar File - Java Only"},{"location":"observability/tools/OpenTelemetry/#opentelemetry-vs-application-insights-agents-compared","text":"Highlight OpenTelemetry Agent App Insights Agent Automatic Telemetry Y Y Manual OpenTelemetry Y Y Plug and Play Exports Y N Multiple Exports Y N Full Open Telemetry layout (decoupling agents, collectors and exporters) Y N Enriched out of the box telemetry N Y Unified telemetry backend N Y","title":"OpenTelemetry vs Application Insights Agents Compared"},{"location":"observability/tools/OpenTelemetry/#summary","text":"As you may have guessed, there is no \"one size fits all\" approach when implementing OpenTelemetry with Azure Monitor as a backend. At the time of this writing, if you want to have the flexibility of having different OpenTelemetry backends, you should definitively go with the OpenTelemetry Agent, even though you'd sacrifice all automating tracing flowing to Azure Monitor. On the other hand, if you want to get the best of Azure Monitor and still want to instrument your code with the OpenTelemetry SDK, you should use the Application Insights Agent and manually instrument your code with the OpenTelemetry SDK to get the best of both worlds. Either way, instrumenting your code with OpenTelemetry seems the right approach as the ecosystem will only get bigger, better, and more robust.","title":"Summary"},{"location":"observability/tools/OpenTelemetry/#advanced-topics","text":"Use the Azure OpenTelemetry Tracing plugin library for Java to enable distributed tracing across Azure components through OpenTelemetry.","title":"Advanced topics"},{"location":"observability/tools/OpenTelemetry/#manual-trace-context-propagation","text":"The trace context is stored in Thread-local storage. When the application flow involves multiple threads (eg. multithreaded work-queue, asynchronous processing) then the traces won't get combined into one end-to-end trace chain with automatic context propagation . To achieve that you need to manually propagate the trace context ( example in Java ) by storing the trace headers along with the work-queue item.","title":"Manual Trace Context Propagation"},{"location":"observability/tools/OpenTelemetry/#telemetry-testing","text":"Mission critical telemetry data should be covered by testing. You can cover telemetry by tests by mocking the telemetry collector web server. In automated testing environment the telemetry instrumentation can be configured to use OTLP exporter and point the OTLP exporter endpoint to the collector web server. 
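As a rough Python sketch of that idea, the test below points the OTLP/HTTP exporter at a small in-process mock collector and asserts that span data arrived. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages; the port and endpoint path are placeholders for whatever your mock server exposes.

```python
# Illustrative test sketch: push spans to a mock "collector" and verify arrival.
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

received = []  # records every request the mock collector sees


class MockCollector(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        received.append((self.path, body))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep test output quiet
        pass


server = ThreadingHTTPServer(("localhost", 4318), MockCollector)
threading.Thread(target=server.serve_forever, daemon=True).start()

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

with trace.get_tracer("telemetry-test").start_as_current_span("unit-under-test"):
    pass  # exercise the instrumented code here

provider.force_flush()
assert any(path == "/v1/traces" for path, _ in received), "no spans reached the mock collector"
server.shutdown()
```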
Using mocking servers libraries (eg. MockServer or WireMock) can help verify the telemetry data pushed to the collector.","title":"Telemetry Testing"},{"location":"observability/tools/OpenTelemetry/#resources","text":"OpenTelemetry Official Site Getting Started with dotnet and OpenTelemetry Using OpenTelemetry Collector OpenTelemetry Java SDK Manual Instrumentation OpenTelemetry Instrumentation Agent for Java Application Insights Java Agent Azure Monitor OpenTelemetry Exporter client library for Java Azure OpenTelemetry Tracing plugin library for Java Application Insights Agent's OpenTelemetry configuration","title":"Resources"},{"location":"observability/tools/Prometheus/","text":"Prometheus Overview Originally built at SoundCloud, Prometheus is an open-source monitoring and alerting toolkit based on time series metrics data. It has become a de facto standard metrics solution in the Cloud Native world and widely used with Kubernetes. The core of Prometheus is a server that scrapes and stores metrics. There are other numerous optional features and components like an Alert-manager and client libraries for programming languages to extend the functionalities of Prometheus beyond the basics. The client libraries offer four metric types : Counter , Gauge , Histogram , and Summary . Why Prometheus? Prometheus is a time series database and allow for events or measurements to be tracked, monitored, and aggregated over time. Prometheus is a pull-based tool. One of the biggest advantages of Prometheus over other monitoring tools is that Prometheus actively scrapes targets in order to retrieve metrics from them. Prometheus also supports the push model for pushing metrics. Prometheus allows for control over how to scrape, and how often to scrape them. Through the Prometheus server, there can be multiple scrape configurations, allowing for multiple rates for different targets. Similar to Grafana , visualization for the time series can be directly done through the Prometheus Web UI. The Web UI provides the ability to easily filter and have an overview of what is taking place with your different targets. Prometheus provides a powerful functional query language called PromQL (Prometheus Query Language) that lets the user aggregate time series data in real time. Integration with Other Tools The Prometheus client libraries allow you to add instrumentation to your code and expose internal metrics via an HTTP endpoint. The official Prometheus client libraries currently are Go , Java or Scala , Python and Ruby . Unofficial third-party libraries include: .NET/C# , Node.js , and C++ . Prometheus' metrics format is supported by a wide array of tools and services including: Azure Monitor Stackdriver Datadog CloudWatch New Relic Flagger Grafana GitLab etc... There are numerous exporters which are used in exporting existing metrics from third-party databases, hardware, CI/CD tools, messaging systems, APIs and other monitoring systems. In addition to client libraries and exporters, there is a significant number of integration points for service discovery, remote storage, alerts and management. Resources Prometheus Docs Prometheus Best Practices Grafana with Prometheus","title":"Prometheus"},{"location":"observability/tools/Prometheus/#prometheus","text":"","title":"Prometheus"},{"location":"observability/tools/Prometheus/#overview","text":"Originally built at SoundCloud, Prometheus is an open-source monitoring and alerting toolkit based on time series metrics data. 
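For instrumentation, the sketch below uses the official Python client library to expose a counter and a histogram on an HTTP endpoint for the Prometheus server to scrape; the metric names and port are illustrative placeholders.

```python
# Minimal sketch of instrumenting a Python service with prometheus_client.
# Assumes: pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["outcome"])
LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")


@LATENCY.time()
def handle_request():
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(outcome="success").inc()


if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request()
```

The Prometheus server is then configured, on its side, to scrape this endpoint at the desired interval.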
It has become a de facto standard metrics solution in the Cloud Native world and widely used with Kubernetes. The core of Prometheus is a server that scrapes and stores metrics. There are other numerous optional features and components like an Alert-manager and client libraries for programming languages to extend the functionalities of Prometheus beyond the basics. The client libraries offer four metric types : Counter , Gauge , Histogram , and Summary .","title":"Overview"},{"location":"observability/tools/Prometheus/#why-prometheus","text":"Prometheus is a time series database and allow for events or measurements to be tracked, monitored, and aggregated over time. Prometheus is a pull-based tool. One of the biggest advantages of Prometheus over other monitoring tools is that Prometheus actively scrapes targets in order to retrieve metrics from them. Prometheus also supports the push model for pushing metrics. Prometheus allows for control over how to scrape, and how often to scrape them. Through the Prometheus server, there can be multiple scrape configurations, allowing for multiple rates for different targets. Similar to Grafana , visualization for the time series can be directly done through the Prometheus Web UI. The Web UI provides the ability to easily filter and have an overview of what is taking place with your different targets. Prometheus provides a powerful functional query language called PromQL (Prometheus Query Language) that lets the user aggregate time series data in real time.","title":"Why Prometheus?"},{"location":"observability/tools/Prometheus/#integration-with-other-tools","text":"The Prometheus client libraries allow you to add instrumentation to your code and expose internal metrics via an HTTP endpoint. The official Prometheus client libraries currently are Go , Java or Scala , Python and Ruby . Unofficial third-party libraries include: .NET/C# , Node.js , and C++ . Prometheus' metrics format is supported by a wide array of tools and services including: Azure Monitor Stackdriver Datadog CloudWatch New Relic Flagger Grafana GitLab etc... There are numerous exporters which are used in exporting existing metrics from third-party databases, hardware, CI/CD tools, messaging systems, APIs and other monitoring systems. In addition to client libraries and exporters, there is a significant number of integration points for service discovery, remote storage, alerts and management.","title":"Integration with Other Tools"},{"location":"observability/tools/Prometheus/#resources","text":"Prometheus Docs Prometheus Best Practices Grafana with Prometheus","title":"Resources"},{"location":"observability/tools/loki/","text":"Loki Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system, created by Grafana Labs inspired by the learnings from Prometheus. Loki is commonly referred as 'Prometheus, but for logs', which makes total sense. Both tools follow the same architecture, which is an agent collecting metrics in each of the components of the software system, a server which stores the logs and also the Grafana dashboard, which access the loki server to build its visualizations and queries. That being said, Loki has three main components: Promtail It is the agent portion of Loki. It can be used to grab logs from several places, like var/log/ for example. The configuration of the Promtail is a yaml file called config-promtail.yml . In this file, its described all the paths and log sources that will be aggregated on Loki Server. 
Loki Server Loki Server is responsible for receiving and storing all the logs received from all the different systems. The Loki Server is also responsible for the queries done on Grafana, for example. Grafana Dashboards Grafana Dashboards are responsible for creating the visualizations and performing queries. After all, it will be a web page that people with the right access can log into to see, query and create alerts for the aggregated logs. Why use Loki The main reason to use Loki instead of other log aggregation tools, is that Loki optimizes the necessary storage. It does that by following the same pattern as prometheus, which index the labels and make chunks of the log itself, using less space than just storing the raw logs. Resources Loki Official Site Inserting logs into Loki Adding Loki Source to Grafana Loki Best Practices","title":"Loki"},{"location":"observability/tools/loki/#loki","text":"Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system, created by Grafana Labs inspired by the learnings from Prometheus. Loki is commonly referred as 'Prometheus, but for logs', which makes total sense. Both tools follow the same architecture, which is an agent collecting metrics in each of the components of the software system, a server which stores the logs and also the Grafana dashboard, which access the loki server to build its visualizations and queries. That being said, Loki has three main components:","title":"Loki"},{"location":"observability/tools/loki/#promtail","text":"It is the agent portion of Loki. It can be used to grab logs from several places, like var/log/ for example. The configuration of the Promtail is a yaml file called config-promtail.yml . In this file, its described all the paths and log sources that will be aggregated on Loki Server.","title":"Promtail"},{"location":"observability/tools/loki/#loki-server","text":"Loki Server is responsible for receiving and storing all the logs received from all the different systems. The Loki Server is also responsible for the queries done on Grafana, for example.","title":"Loki Server"},{"location":"observability/tools/loki/#grafana-dashboards","text":"Grafana Dashboards are responsible for creating the visualizations and performing queries. After all, it will be a web page that people with the right access can log into to see, query and create alerts for the aggregated logs.","title":"Grafana Dashboards"},{"location":"observability/tools/loki/#why-use-loki","text":"The main reason to use Loki instead of other log aggregation tools, is that Loki optimizes the necessary storage. It does that by following the same pattern as prometheus, which index the labels and make chunks of the log itself, using less space than just storing the raw logs.","title":"Why use Loki"},{"location":"observability/tools/loki/#resources","text":"Loki Official Site Inserting logs into Loki Adding Loki Source to Grafana Loki Best Practices","title":"Resources"},{"location":"privacy/","text":"Privacy fundamentals This part of the engineering playbook focuses on privacy design guidelines and principles. Private data handling and protection requires both the proper design of software, systems and databases, as well as the implementation of organizational processes and procedures. In general, developers working on ISE projects should adhere to Microsoft's recommended standard practices and regulations on Privacy and Data Handling. 
The playbook currently contains two main parts: Privacy and Data : Best practices for properly handling sensitive and private data. Privacy frameworks : A list of frameworks which could be applied in private data scenarios.","title":"Privacy fundamentals"},{"location":"privacy/#privacy-fundamentals","text":"This part of the engineering playbook focuses on privacy design guidelines and principles. Private data handling and protection requires both the proper design of software, systems and databases, as well as the implementation of organizational processes and procedures. In general, developers working on ISE projects should adhere to Microsoft's recommended standard practices and regulations on Privacy and Data Handling. The playbook currently contains two main parts: Privacy and Data : Best practices for properly handling sensitive and private data. Privacy frameworks : A list of frameworks which could be applied in private data scenarios.","title":"Privacy fundamentals"},{"location":"privacy/data-handling/","text":"Privacy and Data Goal The goal of this section is to briefly describe best practices in privacy fundamentals for data heavy projects or portions of a project that may contain data. What it is not : This document is not a checklist for how customers or readers should handle data in their environment, and does not override Microsoft's or the customers' policies for data handling, data protection and information security. Introduction Microsoft runs on trust. Our customers trust ISE to adhere to the highest standards when handling their data. Protecting our customers' data is a joint responsibility between Microsoft and the customers; both have the responsibility to help projects follow the guidelines outlined on this page. Developers working on ISE projects should implement best practices and guidance on handling data throughout the project phases. This page is not meant to suggest how customers should handle data in their environment. It does not override : Microsoft's Information Security Policy Limited Data Protection Addendum Professional Services Data Protection Addendum 5 W's of Data Handling When working on an engagement it is important to address the following 5 W 's: Who \u2013 gets access to and with whom will we share the data and/or models developed with the data? What \u2013 data is shared with us and under what expectations and understanding. Customers need to be explicit about how the data they share applies to the overarching effort. The understanding shouldn't be vague and we shouldn't have access to broad set of data if not necessary. Where \u2013 will the data be stored and what legal jurisdiction will preside over that data. This is particularly important in countries like Germany, where different privacy laws apply but also important when it comes to responding to legal subpoenas for the data. When \u2013 will the access to data be provided and for how long? It is important to not leave straggling access to data once the engagement is completed, and define a priori the data retention policies. Why \u2013 have you given access to the data? This is particularly important to clarify the purpose and any restrictions on usage beyond the intended purpose. Please use the above guidelines to ensure the data is used only for intended purposes and thereby gain trust. It is important to be aware of data handling best practices and ensure the required clarity is provided to adhere to the above 5Ws. 
Handling Data in ISE Engagements Data should never leave customer-controlled environments and contractors and/or other members in the engagement should never have access to complete customer data sets but use limited customer data sets using the following prioritized approaches: Contractors or engagement partners do not work directly with production data, data will be copied before processing per the guidelines below. Always apply data minimization principles to minimize the blast radius of errors, only work with the minimal data set required to achieve the goals. Generate synthetic data to support engagement work. If synthetic data is not possible to achieve project goals, request anonymized data in which the likelihood that unique individuals can be re-identified is minimal. Select a suitably diverse, limited data set, again, follow the Principles of Data Minimization and attempt to work with the fewest rows possible to achieve the goals. Before work begins on data, ensure OS patches are up to date and permissions are properly set with no open internet access. Developers working on ISE projects will work with our customers to define the data needed for each engagement. If there is a need to access production data, ISE needs to review the need with their lead and work with the customer to put audits in place verifying what data was accessed. Production data must only be shared with approved members of the engagement team and must not be processed/transferred outside of the customer controlled environment. Customers should provide ISE with a copy of the requested data in a location managed by the customer. The customer should consider turning any logging capabilities on so they can clearly identify who has access and what they do with that access. ISE should notify the customer when they are done with the data and suggest the customer destroy copies of the data if they are no longer needed. Our Guiding Principles when Handling Data in an Engagement Never directly access production data. Explicitly state the intended purpose of data that can be used for engagement. Only share copies of the production data with the approved members of the engagement team. The entire team should work together to ensure that there are no dead copies of data. When the data is no longer needed, the team should promptly work to clean up engagement copies of data. Do not send any copies of the production data outside the customer-controlled environment. Only use the minimal subset of the data needed for the purpose of the engagement. Questions to Consider when Working with Data What data do we need? What is the legal basis for processing this data? If we are the processor based on contract obligation what is our responsibility listed in the contract? Does the contract need to be amended? How can we contain data proliferation? What security controls are in place to protect this data? What is the data breech protocol? How does this data benefit the data subject? What is the lifespan of this data? Do we need to keep this data linked to a data subject? Can we turn this data into Not in a Position to Identify (NPI) data to be used later on? How is the system architected so data subject rights can be fulfilled? (ex manually, automated) If personal data is involved have engaged privacy and legal teams for this project? 
Summary It is important to only pull in data that is needed for the problem at hand, when this is put in practice we find that we only maintain data that is adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed. This is particularly important for personal data. Once you have personal data there are many rules and regulations that apply, some examples of these might be HIPAA, GDPR, CCPA. The customer should be aware of and surface any applicable regulations that apply to their data. Furthermore the seven principles of privacy by design should be reviewed and considered when handling any type of sensitive data. Resources Microsoft Trust Center Tools for responsible AI - Protect Data Protection Resources FAQ and White Papers Microsoft Compliance Offerings Accountability Readiness Checklists Privacy by Design The 7 Foundational Principles","title":"Privacy and Data"},{"location":"privacy/data-handling/#privacy-and-data","text":"","title":"Privacy and Data"},{"location":"privacy/data-handling/#goal","text":"The goal of this section is to briefly describe best practices in privacy fundamentals for data heavy projects or portions of a project that may contain data. What it is not : This document is not a checklist for how customers or readers should handle data in their environment, and does not override Microsoft's or the customers' policies for data handling, data protection and information security.","title":"Goal"},{"location":"privacy/data-handling/#introduction","text":"Microsoft runs on trust. Our customers trust ISE to adhere to the highest standards when handling their data. Protecting our customers' data is a joint responsibility between Microsoft and the customers; both have the responsibility to help projects follow the guidelines outlined on this page. Developers working on ISE projects should implement best practices and guidance on handling data throughout the project phases. This page is not meant to suggest how customers should handle data in their environment. It does not override : Microsoft's Information Security Policy Limited Data Protection Addendum Professional Services Data Protection Addendum","title":"Introduction"},{"location":"privacy/data-handling/#5-ws-of-data-handling","text":"When working on an engagement it is important to address the following 5 W 's: Who \u2013 gets access to and with whom will we share the data and/or models developed with the data? What \u2013 data is shared with us and under what expectations and understanding. Customers need to be explicit about how the data they share applies to the overarching effort. The understanding shouldn't be vague and we shouldn't have access to broad set of data if not necessary. Where \u2013 will the data be stored and what legal jurisdiction will preside over that data. This is particularly important in countries like Germany, where different privacy laws apply but also important when it comes to responding to legal subpoenas for the data. When \u2013 will the access to data be provided and for how long? It is important to not leave straggling access to data once the engagement is completed, and define a priori the data retention policies. Why \u2013 have you given access to the data? This is particularly important to clarify the purpose and any restrictions on usage beyond the intended purpose. Please use the above guidelines to ensure the data is used only for intended purposes and thereby gain trust. 
It is important to be aware of data handling best practices and ensure the required clarity is provided to adhere to the above 5Ws.","title":"5 W's of Data Handling"},{"location":"privacy/data-handling/#handling-data-in-ise-engagements","text":"Data should never leave customer-controlled environments and contractors and/or other members in the engagement should never have access to complete customer data sets but use limited customer data sets using the following prioritized approaches: Contractors or engagement partners do not work directly with production data, data will be copied before processing per the guidelines below. Always apply data minimization principles to minimize the blast radius of errors, only work with the minimal data set required to achieve the goals. Generate synthetic data to support engagement work. If synthetic data is not possible to achieve project goals, request anonymized data in which the likelihood that unique individuals can be re-identified is minimal. Select a suitably diverse, limited data set, again, follow the Principles of Data Minimization and attempt to work with the fewest rows possible to achieve the goals. Before work begins on data, ensure OS patches are up to date and permissions are properly set with no open internet access. Developers working on ISE projects will work with our customers to define the data needed for each engagement. If there is a need to access production data, ISE needs to review the need with their lead and work with the customer to put audits in place verifying what data was accessed. Production data must only be shared with approved members of the engagement team and must not be processed/transferred outside of the customer controlled environment. Customers should provide ISE with a copy of the requested data in a location managed by the customer. The customer should consider turning any logging capabilities on so they can clearly identify who has access and what they do with that access. ISE should notify the customer when they are done with the data and suggest the customer destroy copies of the data if they are no longer needed.","title":"Handling Data in ISE Engagements"},{"location":"privacy/data-handling/#our-guiding-principles-when-handling-data-in-an-engagement","text":"Never directly access production data. Explicitly state the intended purpose of data that can be used for engagement. Only share copies of the production data with the approved members of the engagement team. The entire team should work together to ensure that there are no dead copies of data. When the data is no longer needed, the team should promptly work to clean up engagement copies of data. Do not send any copies of the production data outside the customer-controlled environment. Only use the minimal subset of the data needed for the purpose of the engagement.","title":"Our Guiding Principles when Handling Data in an Engagement"},{"location":"privacy/data-handling/#questions-to-consider-when-working-with-data","text":"What data do we need? What is the legal basis for processing this data? If we are the processor based on contract obligation what is our responsibility listed in the contract? Does the contract need to be amended? How can we contain data proliferation? What security controls are in place to protect this data? What is the data breech protocol? How does this data benefit the data subject? What is the lifespan of this data? Do we need to keep this data linked to a data subject? 
Can we turn this data into Not in a Position to Identify (NPI) data to be used later on? How is the system architected so data subject rights can be fulfilled? (ex manually, automated) If personal data is involved have engaged privacy and legal teams for this project?","title":"Questions to Consider when Working with Data"},{"location":"privacy/data-handling/#summary","text":"It is important to only pull in data that is needed for the problem at hand, when this is put in practice we find that we only maintain data that is adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed. This is particularly important for personal data. Once you have personal data there are many rules and regulations that apply, some examples of these might be HIPAA, GDPR, CCPA. The customer should be aware of and surface any applicable regulations that apply to their data. Furthermore the seven principles of privacy by design should be reviewed and considered when handling any type of sensitive data.","title":"Summary"},{"location":"privacy/data-handling/#resources","text":"Microsoft Trust Center Tools for responsible AI - Protect Data Protection Resources FAQ and White Papers Microsoft Compliance Offerings Accountability Readiness Checklists Privacy by Design The 7 Foundational Principles","title":"Resources"},{"location":"privacy/privacy-frameworks/","text":"Privacy Related frameworks The following tools/frameworks could be leveraged when data analysis or model development needs to take place on private data. Note that the use of such frameworks still requires the solution to adhere to privacy regulations and others, and additional safeguards should be applied. Typical Scenarios for Leveraging a Privacy Framework Sharing data or results while preserving data subjects' privacy Performing analysis or statistical modeling on private data Developing privacy preserving ML models and data pipelines Privacy Frameworks Protecting private data involves the entire data lifecycle, from acquisition, through storage, processing, analysis, modeling and usage in reports or machine learning models. Proper safeguards and restrictions should be applied in each of these phases. In this section we provide a non-exhaustive list of privacy frameworks which can be leveraged for protecting and preserving privacy. We focus on four main use cases in the data lifecycle: Obtaining non-sensitive data Establishing trusted research and modeling environments Creating privacy preserving data and ML pipelines Data loss prevention Obtaining Non-Sensitive Data In many scenarios, analysts, researchers and data scientists require access to a non-sensitive version or sample of the private data. In this section we focus on two approaches for obtaining non-sensitive data. Note: These two approaches do not guarantee that the outcome would not include private data, and additional measures should be applied. Data De-Identification De-identification is the process of applying a set of transformations to a dataset, in order to lower the risk of unintended disclosure of personal data. De-identification involves the removal or substitution of both direct identifiers (such as name, or social security number) or quasi-identifiers, which can be used for re-identification using additional external information. De-identification can be applied to different types of data, such as structured data, images and text. 
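As a concrete illustration, Presidio (one of the open-source tools listed below) can detect and replace PII entities in free text. The sketch assumes the presidio-analyzer and presidio-anonymizer packages, plus a spaCy English model, are installed.

```python
# Illustrative sketch of text de-identification with Presidio.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is David Johnson and my phone number is 212-555-5555."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")  # detect PII entities

anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text)  # e.g. "My name is <PERSON> and my phone number is <PHONE_NUMBER>."
```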
However, de-identification of non-structured data often involves statistical approaches which might result in undetected PII (Personal Identifiable Information) or non-private information being redacted or replaced. Here we outline several de-identification solutions available as open source: Solution Notes Presidio Presidio helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more in unstructured text and images. It's useful when high customization is required, for example to detect custom PII entities or languages. Link to repo , link to docs , link to demo . FHIR tools for anonymization FHIR Tools for Anonymization is an open-source project that helps anonymize healthcare FHIR data (FHIR=Fast Healthcare Interoperability Resources, a standard for exchanging Electric Health Records), on-premises or in the cloud, for secondary usage such as research, public health, and more. Link . Works with FHIR format (Stu3 and R4), allows different strategies for anonymization (date shift, crypto-hash, encrypt, substitute, perturb, generalize) ARX Anonymization using statistical models, specifically k-anonymity, \u2113-diversity, t-closeness and \u03b4-presence. Useful for validating the anonymization of aggregated data. Links: Repo , Website . Written in Java. k-Anonymity GitHub repo with examples on how to produce k-anonymous datasets. k-anonymity protects the privacy of individual persons by pooling their attributes into groups of at least k people. repo Synthetic Data Generation A synthetic dataset is a repository of data generated from actual data and has the same statistical properties as the real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. The potential benefit of such synthetic datasets is for sensitive applications \u2013 medical classifications or financial modelling, where getting hands on a high-quality labelled dataset is often prohibitive. When determining the best method for creating synthetic data, it is essential first to consider what type of synthetic data you aim to have. There are two broad categories to choose from, each with different benefits and drawbacks: Fully synthetic: This data does not contain any original data, which means that re-identification of any single unit is almost impossible, and all variables are still fully available. Partially synthetic: Only sensitive data is replaced with synthetic data, which requires a heavy dependency on the imputation model. This leads to decreased model dependence but does mean that some disclosure is possible due to the actual values within the dataset. Solution Notes Synthea Synthea was developed with numerous data sources collected on the internet, including US Census Bureau demographics, Centers for Disease Control and Prevention prevalence and incidence rates, and National Institutes of Health reports. The source code and disease models include annotations and citations for all data, statistics, and treatments. These models of diseases and treatments interact appropriately with the health record. PII dataset generator A synthetic data generator developed on top of Fake Name Generator which takes a text file with templates (e.g. my name is PERSON ) and creates a list of Input Samples which contain fake PII entities instead of placeholders. 
CheckList CheckList provides a framework for perturbation techniques to evaluate specific behavioral capabilities of NLP models systematically Mimesis Mimesis a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages. Faker Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you. Plaitpy The idea behind plait.py is that it should be easy to model fake data that has an interesting shape. Currently, many fake data generators model their data as a collection of IID variables; with plait.py we can stitch together those variables into a more coherent model. Trusted Research and Modeling Environments Trusted Research Environments Trusted Research Environments (TREs) enable organizations to create secure workspaces for analysts, data scientists and researchers who require access to sensitive data. TREs enforce a secure boundary around distinct workspaces to enable information governance controls. Each workspace is accessible by a set of authorized users, prevents the exfiltration of sensitive data, and has access to one or more datasets provided by the data platform. We highlight several alternatives for Trusted Research Environments: Solution Notes Azure Trusted Research Environment An Open Source TRE for Azure. Aridhia DRE Eyes-Off Machine Learning In certain situations, Data Scientists may need to train models on data they are not allowed to see. In these cases, an \"eyes-off\" approach is recommended. An eyes-off approach provides a data scientist with an environment in which scripts can be run on the data but direct access to samples is not allowed. When using Azure ML, tools such as the Identity Based Data Access can enable this scenario, alongside proper role assignment for users. During the processing within the eyes-off environment, only certain outputs (e.g. logs) are allowed to be extracted back to the user. For example, a user would be able to submit a script which trains a model and inspect the model's performance, but would not be able to see on which samples the model predicted the wrong output. In addition to the eyes-off environment, this approach usually entails providing access to an \"eyes-on\" dataset, which is a representative, cleansed, sample set of data for model design purposes. The Eyes-on dataset is often a de-identified subset of the private dataset, or a synthetic dataset generated based on the characteristics of the private dataset. Private Data Sharing Platforms Various tools and systems allow different parties to share data with 3rd parties while protecting private entities, and securely process data while reducing the likelihood of data exfiltration. These tools include Secure Multi Party Computation (SMPC) systems, Homomorphic Encryption systems, Confidential Computing , private data analysis frameworks such as PySift among others. Privacy Preserving Data Pipelines and ML Even when our data is secure, private entities can still be extracted when the data is consumed. Privacy preserving data pipelines and ML models focus on minimizing the risk of private data exfiltration during data querying or model predictions. 
Differential Privacy Differential privacy (DP) is a system that enables one to extract meaningful insights from datasets about subgroups of people, while also providing strong guarantees with regards to protecting any given individual's privacy. This is typically achieved by adding a small statistical noise to every individual's information, thereby introducing uncertainty in the data. However, the insights gleaned still accurately represent what we intend to learn about the population in the aggregate. This approach is known to be robust to re-identification attacks and data reconstruction by adversaries who possess auxiliary information. For a more comprehensive overview, check out Differential privacy: A primer for a non-technical audience . DP has been widely adopted in various scenarios such as learning from census data, user telemetry data analysis, audience engagement to advertisements, and health data insights where PII protection is of paramount importance. However, DP is less suitable for small datasets. Tools that implement DP include SmartNoise , Tensorflow Privacy among some others. Homomorphic Encryption Homomorphic Encryption (HE) is a form of encryption allowing one to perform calculations on encrypted data without decrypting it first. The result of the computation F is in an encrypted form, which on decrypting gives us the same result if computation F was done on raw unencrypted data. ( source ) Homomorphic Encryption frameworks: Solution Notes Microsoft SEAL Secure Cloud Storage and Computation, ML Modeling. A widely used open-source library from Microsoft that supports the BFV and the CKKS schemes. Palisade A widely used open-source library from a consortium of DARPA-funded defense contractors that supports multiple homomorphic encryption schemes such as BGV, BFV, CKKS, TFHE and FHEW, among others, with multiparty support. Link to repo PySift Private deep learning. PySyft decouples private data from model training, using Federated Learning, Differential Privacy, and Encrypted Computation (like Multi-Party Computation (MPC) and Homomorphic Encryption (HE)) within the main Deep Learning frameworks like PyTorch and TensorFlow. A list of additional OSS tools can be found here . Federated Learning Federated learning is a Machine Learning technique which allows the training of ML models in a decentralized way without having to share the actual data. Instead of sending data to the processing engine of the model, the approach is to distribute the model to the different data owners and perform training in a distributed fashion. Federated learning frameworks: Solution Notes TensorFlow Federated Learning OSS federated learning system built on top of TensorFlow FATE An OSS federated learning system with different options for deployment and different algorithms adapted for federated learning IBM Federated Learning A Python based federated learning framework focused on enterprise environments. Data Loss Prevention Organizations have sensitive information under their control such as financial data, proprietary data, credit card numbers, health records, or social security numbers. To help protect this sensitive data and reduce risk, they need a way to prevent their users from inappropriately sharing it with people who shouldn't have it. This practice is called data loss prevention (DLP) . Below we focus on two aspects of DLP: Sensitive data classification and Access management. 
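Returning briefly to the differential privacy techniques above: the core idea used by tools such as SmartNoise can be illustrated with a toy Laplace-mechanism sketch. This is for intuition only and is not a hardened DP implementation; the dataset and epsilon value are placeholders.

```python
# Illustrative-only sketch of the Laplace mechanism: release a count query
# with calibrated noise instead of the exact value.
import numpy as np

rng = np.random.default_rng()


def dp_count(values, epsilon: float) -> float:
    """Release a noisy count with epsilon-differential privacy."""
    sensitivity = 1.0  # adding/removing one person changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(values) + noise


ages = [34, 29, 41, 58, 23, 47, 36]  # toy private dataset
print("true count:", len(ages))
print("dp count  :", round(dp_count(ages, epsilon=0.5), 2))
```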
Sensitive Data Classification Sensitive data classification is an important aspect of DLP, as it allows organizations to track, monitor, secure and identify sensitive and private data. Furthermore, different sensitivity levels can be applied to different data items, facilitating proper governance and cataloging. There are typically four levels data classification levels: Public Internal Confidential Restricted / Highly confidential Tools for data classification on Azure: Solution Notes Microsoft Information Protection (MIP) A suite for DLP, sensitive data classification, cataloging and more. Azure Purview A unified data governance service, which includes the classification and cataloging of sensitive data. Azure Purview leverages the MIP technology for data classification and more. Data Discovery & Classification for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Basic capabilities for discovering, classifying, labeling, and reporting the sensitive data in Azure SQL and Synapse databases. Data Discovery & Classification for SQL Server Capabilities for discovering, classifying, labeling & reporting the sensitive data in SQL Server databases. Often, tools used for de-identification can also serve as sensitive data classifiers. Refer to de-identification tools for such tools. Additional resources: Example guidelines for data classification Learn about sensitivity levels Access Management Access control is an important component of privacy by design and falls into overall data lifecycle protection. Successful access control will restrict access only to authorized individuals that should have access to data. Once data is secure in an environment, it is important to review who should access this data and for what purpose. Access control may be audited with a comprehensive logging strategy which may include the integration of activity logs that can provide insight into operations performed on resources in a subscription. OWASP Access Control Cheat Sheet","title":"Privacy Related frameworks"},{"location":"privacy/privacy-frameworks/#privacy-related-frameworks","text":"The following tools/frameworks could be leveraged when data analysis or model development needs to take place on private data. Note that the use of such frameworks still requires the solution to adhere to privacy regulations and others, and additional safeguards should be applied.","title":"Privacy Related frameworks"},{"location":"privacy/privacy-frameworks/#typical-scenarios-for-leveraging-a-privacy-framework","text":"Sharing data or results while preserving data subjects' privacy Performing analysis or statistical modeling on private data Developing privacy preserving ML models and data pipelines","title":"Typical Scenarios for Leveraging a Privacy Framework"},{"location":"privacy/privacy-frameworks/#privacy-frameworks","text":"Protecting private data involves the entire data lifecycle, from acquisition, through storage, processing, analysis, modeling and usage in reports or machine learning models. Proper safeguards and restrictions should be applied in each of these phases. In this section we provide a non-exhaustive list of privacy frameworks which can be leveraged for protecting and preserving privacy. 
We focus on four main use cases in the data lifecycle: Obtaining non-sensitive data Establishing trusted research and modeling environments Creating privacy preserving data and ML pipelines Data loss prevention","title":"Privacy Frameworks"},{"location":"privacy/privacy-frameworks/#obtaining-non-sensitive-data","text":"In many scenarios, analysts, researchers and data scientists require access to a non-sensitive version or sample of the private data. In this section we focus on two approaches for obtaining non-sensitive data. Note: These two approaches do not guarantee that the outcome would not include private data, and additional measures should be applied.","title":"Obtaining Non-Sensitive Data"},{"location":"privacy/privacy-frameworks/#data-de-identification","text":"De-identification is the process of applying a set of transformations to a dataset, in order to lower the risk of unintended disclosure of personal data. De-identification involves the removal or substitution of both direct identifiers (such as name, or social security number) or quasi-identifiers, which can be used for re-identification using additional external information. De-identification can be applied to different types of data, such as structured data, images and text. However, de-identification of non-structured data often involves statistical approaches which might result in undetected PII (Personal Identifiable Information) or non-private information being redacted or replaced. Here we outline several de-identification solutions available as open source: Solution Notes Presidio Presidio helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more in unstructured text and images. It's useful when high customization is required, for example to detect custom PII entities or languages. Link to repo , link to docs , link to demo . FHIR tools for anonymization FHIR Tools for Anonymization is an open-source project that helps anonymize healthcare FHIR data (FHIR=Fast Healthcare Interoperability Resources, a standard for exchanging Electric Health Records), on-premises or in the cloud, for secondary usage such as research, public health, and more. Link . Works with FHIR format (Stu3 and R4), allows different strategies for anonymization (date shift, crypto-hash, encrypt, substitute, perturb, generalize) ARX Anonymization using statistical models, specifically k-anonymity, \u2113-diversity, t-closeness and \u03b4-presence. Useful for validating the anonymization of aggregated data. Links: Repo , Website . Written in Java. k-Anonymity GitHub repo with examples on how to produce k-anonymous datasets. k-anonymity protects the privacy of individual persons by pooling their attributes into groups of at least k people. repo","title":"Data De-Identification"},{"location":"privacy/privacy-frameworks/#synthetic-data-generation","text":"A synthetic dataset is a repository of data generated from actual data and has the same statistical properties as the real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. The potential benefit of such synthetic datasets is for sensitive applications \u2013 medical classifications or financial modelling, where getting hands on a high-quality labelled dataset is often prohibitive. 
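As a quick illustration of what such generators produce, the sketch below uses Faker (listed in the tool table later in this section) to emit fully synthetic records; the schema is an illustrative placeholder, not a real customer dataset.

```python
# Minimal sketch of generating fake, non-sensitive records with Faker.
# Assumes: pip install Faker
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible output for tests

synthetic_customers = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
    }
    for _ in range(5)
]

for row in synthetic_customers:
    print(row)
```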
When determining the best method for creating synthetic data, it is essential first to consider what type of synthetic data you aim to have. There are two broad categories to choose from, each with different benefits and drawbacks: Fully synthetic: This data does not contain any original data, which means that re-identification of any single unit is almost impossible, and all variables are still fully available. Partially synthetic: Only sensitive data is replaced with synthetic data, which requires a heavy dependency on the imputation model. This leads to decreased model dependence but does mean that some disclosure is possible due to the actual values within the dataset. Solution Notes Synthea Synthea was developed with numerous data sources collected on the internet, including US Census Bureau demographics, Centers for Disease Control and Prevention prevalence and incidence rates, and National Institutes of Health reports. The source code and disease models include annotations and citations for all data, statistics, and treatments. These models of diseases and treatments interact appropriately with the health record. PII dataset generator A synthetic data generator developed on top of Fake Name Generator which takes a text file with templates (e.g. my name is PERSON ) and creates a list of Input Samples which contain fake PII entities instead of placeholders. CheckList CheckList provides a framework for perturbation techniques to evaluate specific behavioral capabilities of NLP models systematically Mimesis Mimesis a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages. Faker Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you. Plaitpy The idea behind plait.py is that it should be easy to model fake data that has an interesting shape. Currently, many fake data generators model their data as a collection of IID variables; with plait.py we can stitch together those variables into a more coherent model.","title":"Synthetic Data Generation"},{"location":"privacy/privacy-frameworks/#trusted-research-and-modeling-environments","text":"","title":"Trusted Research and Modeling Environments"},{"location":"privacy/privacy-frameworks/#trusted-research-environments","text":"Trusted Research Environments (TREs) enable organizations to create secure workspaces for analysts, data scientists and researchers who require access to sensitive data. TREs enforce a secure boundary around distinct workspaces to enable information governance controls. Each workspace is accessible by a set of authorized users, prevents the exfiltration of sensitive data, and has access to one or more datasets provided by the data platform. We highlight several alternatives for Trusted Research Environments: Solution Notes Azure Trusted Research Environment An Open Source TRE for Azure. Aridhia DRE","title":"Trusted Research Environments"},{"location":"privacy/privacy-frameworks/#eyes-off-machine-learning","text":"In certain situations, Data Scientists may need to train models on data they are not allowed to see. In these cases, an \"eyes-off\" approach is recommended. An eyes-off approach provides a data scientist with an environment in which scripts can be run on the data but direct access to samples is not allowed. 
When using Azure ML, tools such as the Identity Based Data Access can enable this scenario, alongside proper role assignment for users. During the processing within the eyes-off environment, only certain outputs (e.g. logs) are allowed to be extracted back to the user. For example, a user would be able to submit a script which trains a model and inspect the model's performance, but would not be able to see on which samples the model predicted the wrong output. In addition to the eyes-off environment, this approach usually entails providing access to an \"eyes-on\" dataset, which is a representative, cleansed, sample set of data for model design purposes. The Eyes-on dataset is often a de-identified subset of the private dataset, or a synthetic dataset generated based on the characteristics of the private dataset.","title":"Eyes-Off Machine Learning"},{"location":"privacy/privacy-frameworks/#private-data-sharing-platforms","text":"Various tools and systems allow different parties to share data with 3rd parties while protecting private entities, and securely process data while reducing the likelihood of data exfiltration. These tools include Secure Multi Party Computation (SMPC) systems, Homomorphic Encryption systems, Confidential Computing , private data analysis frameworks such as PySift among others.","title":"Private Data Sharing Platforms"},{"location":"privacy/privacy-frameworks/#privacy-preserving-data-pipelines-and-ml","text":"Even when our data is secure, private entities can still be extracted when the data is consumed. Privacy preserving data pipelines and ML models focus on minimizing the risk of private data exfiltration during data querying or model predictions.","title":"Privacy Preserving Data Pipelines and ML"},{"location":"privacy/privacy-frameworks/#differential-privacy","text":"Differential privacy (DP) is a system that enables one to extract meaningful insights from datasets about subgroups of people, while also providing strong guarantees with regards to protecting any given individual's privacy. This is typically achieved by adding a small statistical noise to every individual's information, thereby introducing uncertainty in the data. However, the insights gleaned still accurately represent what we intend to learn about the population in the aggregate. This approach is known to be robust to re-identification attacks and data reconstruction by adversaries who possess auxiliary information. For a more comprehensive overview, check out Differential privacy: A primer for a non-technical audience . DP has been widely adopted in various scenarios such as learning from census data, user telemetry data analysis, audience engagement to advertisements, and health data insights where PII protection is of paramount importance. However, DP is less suitable for small datasets. Tools that implement DP include SmartNoise , Tensorflow Privacy among some others.","title":"Differential Privacy"},{"location":"privacy/privacy-frameworks/#homomorphic-encryption","text":"Homomorphic Encryption (HE) is a form of encryption allowing one to perform calculations on encrypted data without decrypting it first. The result of the computation F is in an encrypted form, which on decrypting gives us the same result if computation F was done on raw unencrypted data. ( source ) Homomorphic Encryption frameworks: Solution Notes Microsoft SEAL Secure Cloud Storage and Computation, ML Modeling. A widely used open-source library from Microsoft that supports the BFV and the CKKS schemes. 
Palisade A widely used open-source library from a consortium of DARPA-funded defense contractors that supports multiple homomorphic encryption schemes such as BGV, BFV, CKKS, TFHE and FHEW, among others, with multiparty support. Link to repo PySyft Private deep learning. PySyft decouples private data from model training, using Federated Learning, Differential Privacy, and Encrypted Computation (like Multi-Party Computation (MPC) and Homomorphic Encryption (HE)) within the main Deep Learning frameworks like PyTorch and TensorFlow. A list of additional OSS tools can be found here.","title":"Homomorphic Encryption"},{"location":"privacy/privacy-frameworks/#federated-learning","text":"Federated learning is a Machine Learning technique which allows the training of ML models in a decentralized way without having to share the actual data. Instead of sending data to the processing engine of the model, the approach is to distribute the model to the different data owners and perform training in a distributed fashion. Federated learning frameworks: Solution Notes TensorFlow Federated Learning OSS federated learning system built on top of TensorFlow FATE An OSS federated learning system with different options for deployment and different algorithms adapted for federated learning IBM Federated Learning A Python-based federated learning framework focused on enterprise environments.","title":"Federated Learning"},{"location":"privacy/privacy-frameworks/#data-loss-prevention","text":"Organizations have sensitive information under their control such as financial data, proprietary data, credit card numbers, health records, or social security numbers. To help protect this sensitive data and reduce risk, they need a way to prevent their users from inappropriately sharing it with people who shouldn't have it. This practice is called data loss prevention (DLP). Below we focus on two aspects of DLP: Sensitive data classification and Access management.","title":"Data Loss Prevention"},{"location":"privacy/privacy-frameworks/#sensitive-data-classification","text":"Sensitive data classification is an important aspect of DLP, as it allows organizations to track, monitor, secure and identify sensitive and private data. Furthermore, different sensitivity levels can be applied to different data items, facilitating proper governance and cataloging. There are typically four data classification levels: Public Internal Confidential Restricted / Highly confidential Tools for data classification on Azure: Solution Notes Microsoft Information Protection (MIP) A suite for DLP, sensitive data classification, cataloging and more. Azure Purview A unified data governance service, which includes the classification and cataloging of sensitive data. Azure Purview leverages the MIP technology for data classification and more. Data Discovery & Classification for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Basic capabilities for discovering, classifying, labeling, and reporting the sensitive data in Azure SQL and Synapse databases. Data Discovery & Classification for SQL Server Capabilities for discovering, classifying, labeling & reporting the sensitive data in SQL Server databases. Often, tools used for de-identification can also serve as sensitive data classifiers. Refer to de-identification tools for such tools.
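For example, here is a minimal sketch (assuming the presidio-analyzer and presidio-anonymizer packages are installed and an English NLP model is available) that uses Presidio both as a lightweight sensitive-data classifier, reporting which entity types a text contains, and as a de-identification step:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe at jane.doe@example.com or on 212-555-0100."

# "Classify": detect which sensitive entity types the text contains.
findings = analyzer.analyze(text=text, language="en")
for finding in findings:
    print(finding.entity_type, text[finding.start:finding.end], round(finding.score, 2))

# De-identify: replace the detected entities with placeholders.
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)  # e.g. "Contact <PERSON> at <EMAIL_ADDRESS> or on <PHONE_NUMBER>."
```

The detected entity types could then feed a labeling or cataloging process aligned with the classification levels described above.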
Additional resources: Example guidelines for data classification Learn about sensitivity levels","title":"Sensitive Data Classification"},{"location":"privacy/privacy-frameworks/#access-management","text":"Access control is an important component of privacy by design and falls into overall data lifecycle protection. Successful access control will restrict access only to authorized individuals that should have access to data. Once data is secure in an environment, it is important to review who should access this data and for what purpose. Access control may be audited with a comprehensive logging strategy which may include the integration of activity logs that can provide insight into operations performed on resources in a subscription. OWASP Access Control Cheat Sheet","title":"Access Management"},{"location":"security/","text":"Security Developers working on projects should adhere to industry-recommended standard practices for secure design and implementation of code. For the purposes of our customers, this means our engineers should understand the OWASP Top 10 Web Application Security Risks , as well as how to mitigate as many of them as possible, using the resources below. If you are looking for a fast way to get started evaluating your application or design, check out the \"Secure Coding Practices Quick Reference\" document below, which contains an itemized checklist of high-level concepts you can validate are being done properly. This checklist covers many common errors associated with the OWASP Top 10 list linked above, and should be the minimum amount of effort being put into security. Requesting Security Reviews When requesting a security review for your application, please make sure you have familiarized yourself with the Rules of Engagement . This will help you to prepare the application for testing, as well as understand the scope limits of the test. Quick Resources Secure Coding Practices Quick Reference Web Application Security Quick Reference Security Mindset/Creating a Security Program Quick Start Credential Scanning / Secret Detection Threat Modelling Azure DevOps Security Security Engineering DevSecOps Practices Azure DevOps Data Protection Overview Security and Identity in Azure DevOps Security Code Analysis DevSecOps Introduce security to your project at early stages. The DevSecOps section covers security practices, automation, tools and frameworks as part of the application CI. OWASP Cheat Sheets Note: OWASP is considered to be the gold-standard in computer security information. OWASP maintains an extensive series of cheat sheets which cover all the OWASP Top 10 and more. Below, many of the more relevant cheat sheets have been summarized. To view all the cheat sheets, check out their Cheat Sheet Index . Attack Surface Analysis Authorization Basics Content Security Policy (CSP) Cross-Site Request Forgery (CSRF) Prevention Cross-Site Scripting (XSS) Prevention Cryptographic Storage Deserialization Docker/Kubernetes (k8s) Security Input Validation Key Management OS Command Injection Defense Query Parameterization Examples Server-Side Request Forgery Prevention SQL Injection Prevention Unvalidated Redirects and Forwards Web Service Security XML Security Recommended Tools Check out the list of tools to help enable security in your projects. Note: Although some tools are agnostic, the below list is geared towards Cloud Native security, with a focus on Kubernetes. Vulnerability Scanning SonarCloud Integrates with Azure Devops with the click of a button. 
Snyk Trivy Cloudsploit Anchore Other tools from OWASP See why you should check for vulnerabilities at all layers of the stack , as well as a couple of other useful tips to reduce surface area for attacks. Runtime Security Falco Tracee Kubelinter May not fully qualify as runtime security, but helps ensure you're enabling best practices. Binary Authorization Binary authorization can happen both at the docker registry layer, and runtime (ie: via a K8s admission controller). The authorization check ensures that the image is signed by a trusted authority. This can occur for both (pre-approved) 3rd party images, and internal images. Taking this a step further the signing should occur only on images where all code has been reviewed and approved. Binary authorization can both reduce the impact of damage from a compromised hosting environment, and the damage from malicious insiders. Harbor Operator available Portieris Notary Note harbor leverages notary internally. TUF Other K8s Security OPA , Gatekeeper , and the Gatekeeper Library cert-manager for easy certificate provisioning and automatic rotation. Quickly enable mTLS between your microservices with Linkerd . Resources Non-Functional Requirements Guidance","title":"Security"},{"location":"security/#security","text":"Developers working on projects should adhere to industry-recommended standard practices for secure design and implementation of code. For the purposes of our customers, this means our engineers should understand the OWASP Top 10 Web Application Security Risks , as well as how to mitigate as many of them as possible, using the resources below. If you are looking for a fast way to get started evaluating your application or design, check out the \"Secure Coding Practices Quick Reference\" document below, which contains an itemized checklist of high-level concepts you can validate are being done properly. This checklist covers many common errors associated with the OWASP Top 10 list linked above, and should be the minimum amount of effort being put into security.","title":"Security"},{"location":"security/#requesting-security-reviews","text":"When requesting a security review for your application, please make sure you have familiarized yourself with the Rules of Engagement . This will help you to prepare the application for testing, as well as understand the scope limits of the test.","title":"Requesting Security Reviews"},{"location":"security/#quick-resources","text":"Secure Coding Practices Quick Reference Web Application Security Quick Reference Security Mindset/Creating a Security Program Quick Start Credential Scanning / Secret Detection Threat Modelling","title":"Quick Resources"},{"location":"security/#azure-devops-security","text":"Security Engineering DevSecOps Practices Azure DevOps Data Protection Overview Security and Identity in Azure DevOps Security Code Analysis","title":"Azure DevOps Security"},{"location":"security/#devsecops","text":"Introduce security to your project at early stages. The DevSecOps section covers security practices, automation, tools and frameworks as part of the application CI.","title":"DevSecOps"},{"location":"security/#owasp-cheat-sheets","text":"Note: OWASP is considered to be the gold-standard in computer security information. OWASP maintains an extensive series of cheat sheets which cover all the OWASP Top 10 and more. Below, many of the more relevant cheat sheets have been summarized. To view all the cheat sheets, check out their Cheat Sheet Index . 
Attack Surface Analysis Authorization Basics Content Security Policy (CSP) Cross-Site Request Forgery (CSRF) Prevention Cross-Site Scripting (XSS) Prevention Cryptographic Storage Deserialization Docker/Kubernetes (k8s) Security Input Validation Key Management OS Command Injection Defense Query Parameterization Examples Server-Side Request Forgery Prevention SQL Injection Prevention Unvalidated Redirects and Forwards Web Service Security XML Security","title":"OWASP Cheat Sheets"},{"location":"security/#recommended-tools","text":"Check out the list of tools to help enable security in your projects. Note: Although some tools are agnostic, the below list is geared towards Cloud Native security, with a focus on Kubernetes. Vulnerability Scanning SonarCloud Integrates with Azure Devops with the click of a button. Snyk Trivy Cloudsploit Anchore Other tools from OWASP See why you should check for vulnerabilities at all layers of the stack , as well as a couple of other useful tips to reduce surface area for attacks. Runtime Security Falco Tracee Kubelinter May not fully qualify as runtime security, but helps ensure you're enabling best practices. Binary Authorization Binary authorization can happen both at the docker registry layer, and runtime (ie: via a K8s admission controller). The authorization check ensures that the image is signed by a trusted authority. This can occur for both (pre-approved) 3rd party images, and internal images. Taking this a step further the signing should occur only on images where all code has been reviewed and approved. Binary authorization can both reduce the impact of damage from a compromised hosting environment, and the damage from malicious insiders. Harbor Operator available Portieris Notary Note harbor leverages notary internally. TUF Other K8s Security OPA , Gatekeeper , and the Gatekeeper Library cert-manager for easy certificate provisioning and automatic rotation. Quickly enable mTLS between your microservices with Linkerd .","title":"Recommended Tools"},{"location":"security/#resources","text":"Non-Functional Requirements Guidance","title":"Resources"},{"location":"security/rules-of-engagement/","text":"Application Security Analysis: Rules of Engagement When performing application security analysis, it is expected that the tester follow the Rules of Engagement as laid out below. This is to standardize the scope of application testing and provide a concrete awareness of what is considered \"out of scope\" for security analysis. Rules of Engagement - For Those Requesting Review Web Application Firewalls can be up and configured, but do not enable any automatic blocking. This can greatly slow down the person performing the test. Similarly, if a service is running on a virtual machine, ensure services such as fail2ban are disabled. You cannot make changes to the running application until the test is complete. This is to prevent accidentally breaking an otherwise valid attack in progress. Any review results are not considered as \"final\". A security review should always be performed by a security team orchestrated by the customer prior to moving an application into production. If a customer requires further assistance, they can engage Premier Support. Rules of Engagement - For Those Performing Tests Do not attempt to perform Denial-of-Service attacks or otherwise crash services. Heavy active scanning is tolerated (and is assumed to be somewhat of a load test) but deliberate takedowns are not permitted. Do not interact with human beings. 
Phishing credentials or other such client-side attacks are off-limits. Detailing XSS and similar attacks is encouraged as a part of the test, but do not leverage these against internal users or customers. Attack from a single point. Especially if the application is currently in the customer's hands, provide the IP address or hostname of the attacking host to avoid setting off alarms.","title":"Application Security Analysis: Rules of Engagement"},{"location":"security/rules-of-engagement/#application-security-analysis-rules-of-engagement","text":"When performing application security analysis, it is expected that the tester follow the Rules of Engagement as laid out below. This is to standardize the scope of application testing and provide a concrete awareness of what is considered \"out of scope\" for security analysis.","title":"Application Security Analysis: Rules of Engagement"},{"location":"security/rules-of-engagement/#rules-of-engagement-for-those-requesting-review","text":"Web Application Firewalls can be up and configured, but do not enable any automatic blocking. This can greatly slow down the person performing the test. Similarly, if a service is running on a virtual machine, ensure services such as fail2ban are disabled. You cannot make changes to the running application until the test is complete. This is to prevent accidentally breaking an otherwise valid attack in progress. Any review results are not considered as \"final\". A security review should always be performed by a security team orchestrated by the customer prior to moving an application into production. If a customer requires further assistance, they can engage Premier Support.","title":"Rules of Engagement - For Those Requesting Review"},{"location":"security/rules-of-engagement/#rules-of-engagement-for-those-performing-tests","text":"Do not attempt to perform Denial-of-Service attacks or otherwise crash services. Heavy active scanning is tolerated (and is assumed to be somewhat of a load test) but deliberate takedowns are not permitted. Do not interact with human beings. Phishing credentials or other such client-side attacks are off-limits. Detailing XSS and similar attacks is encouraged as a part of the test, but do not leverage these against internal users or customers. Attack from a single point. Especially if the application is currently in the customer's hands, provide the IP address or hostname of the attacking host to avoid setting off alarms.","title":"Rules of Engagement - For Those Performing Tests"},{"location":"security/threat-modelling-example/","text":"Threat Modelling Example This document covers the threat models for a sample project which takes video frames from video camera and process these frames on IoTEdge device and send them to Azure Cognitive Service to get the audio output. These models can be considered as reference template to show how we can construct threat modeling document. Each of the labeled entities in the figures below are accompanied by meta-information which describe the threats, recommended mitigations, and the associated security principle or goal . 
Architecture Diagram Assets Asset Entry Point Trust Level Azure Blob Storage Http End point Connection String Azure Monitor Http End Point Connection String Azure Cognitive Service Http End Point Connection String IoTEdge Module: M1 Http End Point Public Access (Local Area Network) IoTEdge Module: M2 Http End Point Public Access (Local Area Network) IoTEdge Module: M3 Http End Point Public Access (Local Area Network) IoTEdge Module: IoTEdgeMetricsCollector Http EndPoint Public Access (Local Area Network) Application Insights Http End Point Connection String Data Flow Diagram Client Browser makes requests to the M1 IoTEdge module. Browser and IoTEdge device are on same network, so browser directly hits the webapp URL. M1 IoTEdge module interacts with other two IoTEdge modules to render live stream from video device and display order scanning results via WebSockets. IoTEdge modules interact with Azure Cognitive service to get the translated text via OCR and audio stream via Text to Speech Service. IoTEdge modules send telemetry information to application insights. IoTEdge device is deployed with IoTEdge runtime which interacts with IoTEdge hub for deployments. IoTEdge module also sends some data to Azure storage which is required for debugging purpose. Cognitive service, application insights and Azure Storage are authenticated using connection strings which are stored in GitHub secrets and deployed using CI/CD pipelines. Threat List Assumptions Secrets like ACR credentials are stored in GitHub secrets store which are deployed to IoTEdge Device by CI/CD pipelines. However, CI/CD pipelines are out of scope. Threats Vector Threat Mitigation (1) Sniff Unencrypted data can be intercepted in transit Not Mitigated (2) Access to M1 IoT Edge Module Unauthorized Access to M1 IoT Edge Module Not Mitigated (3) Access to M2 IoT Edge Module Unauthorized Access to M2 IoT Edge Module Not Mitigated (4) Access to M3 IoT Edge Module Unauthorized Access to M3 IoT Edge Module Not Mitigated (5) Steal Storage Credentials Unauthorized Access to M2 IoTEdge Module where database secrets are used Not Mitigated (6) Denial Of Service Dos attack on all IoTEdge Modules since there is no Authentication Not Mitigated (7) Tampering with Log data Application Insights is connected via Connection String which is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the log data. Not Mitigated (8) Tampering with video camera device. Video camera path is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the video feed or use another video source or fake video stream. Not Mitigated (9) Spoofing Tampering Azure IoT Hub connection string is stored in .env file on IoTEdge Device. Once user gains access to the device, .env file can be read and attacker cause Dos attacks on IoTHub Not Mitigated (10) Denial of Service DDOS attack Azure Cognitive Service connection string is stored in .env file on IoTEdge Device. Once user gains access to the device, .env file can be read and attacker cause DoS attacks on Azure Cognitive Service Not Mitigated (11) Tampering with Storage Storage connection string is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper data on storage or read from the storage. Not Mitigated (12) Tampering with Storage Cognitive Service connection string is stored in .env file on the IoTEdge device. 
Once user gains access to the device, .env file can be read and attacker use cognitive service API's for his own purpose causing increase cost to use. Not Mitigated Threat Model Threat Properties Notable Threats # Principle Threat Mitigation 1 Authenticity Since channel from browser to IoTEdge Module is not authenticated, anyone can spoof it once gains access to WiFi network. Add authentication in all IoTEdge modules. 2 Confidentiality and Integrity As a result of the vulnerability of not encrypting data, plaintext data could be intercepted during transit via a man-in-the-middle (MitM) attack. Sensitive data could be exposed or tampered with to allow further exploits. All products and services must encrypt data in transit using approved cryptographic protocols and algorithms. Use TLS to encrypt all HTTP-based network traffic. Use other mechanisms, such as IPSec, to encrypt non-HTTP network traffic that contains customer or confidential data. Applies to data flow from browser to IoTEdge modules. 3 Confidentiality Data is a valuable target for most threat actors and attacking the data store directly, as opposed to stealing it during transit, allows data exfiltration at a much larger scale. In our scenario we are storing some data in Azure Blob containers. All customer or confidential data must be encrypted before being written to non-volatile storage media (encrypted at-rest) per the following requirements. Use approved algorithms. This includes AES-256, AES-192, or AES-128. Encryption must be enabled before writing data to storage. Applies to all data stores on the diagram. Azure Storage encrypt data at rest by default (AES-256). 4 Confidentiality Broken or non-existent authentication mechanisms may allow attackers to gain access to confidential information. All services within the Azure Trust Boundary must authenticate all incoming requests, including requests coming from the same network. Proper authorizations should also be applied to prevent unnecessary privileges. Whenever available, use Azure Managed Identities to authenticate services. Service Principals may be used if Managed Identities are not supported. External users or services may use UserName + Passwords, Tokens, Certificates or Connection Strings to authenticate, provided these are stored on Key Vault or any other vaulting solution. For authorization, use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Applies to Azure services like Azure IoTHub, Azure Cognitive Service, Azure Application Insights are authenticated using connection strings. 5 Confidentiality and Integrity A large attack surface, particularly those that are exposed on the internet, will increase the probability of a compromise Minimize the application attack surface by limiting publicly exposed services. Use strong network controls by using virtual networks, subnets and network security groups to protect against unsolicited traffic. Use Azure Private Endpoint for Azure Storage. Applies to Azure storage. 6 Confidentiality and Integrity Browser and IoTEdge device are connected over in store WIFI network Minimize the attack on WIFI network by using secure algorithm like WPA2. Applies to connection between browser and IoTEdge devices. 7 Integrity Exploitation of insufficient logging and monitoring is the bedrock of nearly every major incident. Attackers rely on the lack of monitoring and timely response to achieve their goals without being detected. 
Logging of critical application events must be performed to ensure that, should a security incident occur, incident response and root-cause analysis may be done. Steps must also be taken to ensure that logs are available and cannot be overwritten or destroyed through malicious or accidental occurrences. At a minimum, the following events should be logged. Login/logout events Privilege delegation events Security validation failures (e.g. input validation or authorization check failures) Application errors and system events Application and system start-ups and shut-downs, as well as logging initialization 6 Availability Exploitation of the public endpoint by malicious actors who aim to render the service unavailable to its intended users by interrupting the service normal activity, for instance by flooding the target service with requests until normal traffic is unable to be processed (Denial of Service) Application is accessed via web app deployed as one of the IoTEdge modules on the IoTEdge device. This app can be accessed by anyone in the local area network. Hence DDoS attacks are possible if the attacker gained access to local area network. All services deployed as IoTEdge modules must use authentication. Applies to services deployed on IoTEdge device 7 Integrity Tampering with data Data at rest, in Azure Storage must be encrypted on disk. Data at rest, in Azure can be protected further by Azure Advanced Threat Protection. Data at rest, in Azure Storage and Azure monitor workspace will use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Data in motion between services can be encrypted in TLS 1.2 Applies to data flow between IoTEdge modules and Azure Services. Security Principles Confidentiality refers to the objective of keeping data private or secret. In practice, it\u2019s about controlling access to data to prevent unauthorized disclosure. Integrity is about ensuring that data has not been tampered with and, therefore, can be trusted. It is correct, authentic, and reliable. Availability means that networks, systems, and applications are up and running. It ensures that authorized users have timely, reliable access to resources when they are needed.","title":"Threat Modelling Example"},{"location":"security/threat-modelling-example/#threat-modelling-example","text":"This document covers the threat models for a sample project which takes video frames from video camera and process these frames on IoTEdge device and send them to Azure Cognitive Service to get the audio output. These models can be considered as reference template to show how we can construct threat modeling document. 
Each of the labeled entities in the figures below are accompanied by meta-information which describe the threats, recommended mitigations, and the associated security principle or goal .","title":"Threat Modelling Example"},{"location":"security/threat-modelling-example/#architecture-diagram","text":"","title":"Architecture Diagram"},{"location":"security/threat-modelling-example/#assets","text":"Asset Entry Point Trust Level Azure Blob Storage Http End point Connection String Azure Monitor Http End Point Connection String Azure Cognitive Service Http End Point Connection String IoTEdge Module: M1 Http End Point Public Access (Local Area Network) IoTEdge Module: M2 Http End Point Public Access (Local Area Network) IoTEdge Module: M3 Http End Point Public Access (Local Area Network) IoTEdge Module: IoTEdgeMetricsCollector Http EndPoint Public Access (Local Area Network) Application Insights Http End Point Connection String","title":"Assets"},{"location":"security/threat-modelling-example/#data-flow-diagram","text":"Client Browser makes requests to the M1 IoTEdge module. Browser and IoTEdge device are on same network, so browser directly hits the webapp URL. M1 IoTEdge module interacts with other two IoTEdge modules to render live stream from video device and display order scanning results via WebSockets. IoTEdge modules interact with Azure Cognitive service to get the translated text via OCR and audio stream via Text to Speech Service. IoTEdge modules send telemetry information to application insights. IoTEdge device is deployed with IoTEdge runtime which interacts with IoTEdge hub for deployments. IoTEdge module also sends some data to Azure storage which is required for debugging purpose. Cognitive service, application insights and Azure Storage are authenticated using connection strings which are stored in GitHub secrets and deployed using CI/CD pipelines.","title":"Data Flow Diagram"},{"location":"security/threat-modelling-example/#threat-list","text":"","title":"Threat List"},{"location":"security/threat-modelling-example/#assumptions","text":"Secrets like ACR credentials are stored in GitHub secrets store which are deployed to IoTEdge Device by CI/CD pipelines. However, CI/CD pipelines are out of scope.","title":"Assumptions"},{"location":"security/threat-modelling-example/#threats","text":"Vector Threat Mitigation (1) Sniff Unencrypted data can be intercepted in transit Not Mitigated (2) Access to M1 IoT Edge Module Unauthorized Access to M1 IoT Edge Module Not Mitigated (3) Access to M2 IoT Edge Module Unauthorized Access to M2 IoT Edge Module Not Mitigated (4) Access to M3 IoT Edge Module Unauthorized Access to M3 IoT Edge Module Not Mitigated (5) Steal Storage Credentials Unauthorized Access to M2 IoTEdge Module where database secrets are used Not Mitigated (6) Denial Of Service Dos attack on all IoTEdge Modules since there is no Authentication Not Mitigated (7) Tampering with Log data Application Insights is connected via Connection String which is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the log data. Not Mitigated (8) Tampering with video camera device. Video camera path is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper the video feed or use another video source or fake video stream. Not Mitigated (9) Spoofing Tampering Azure IoT Hub connection string is stored in .env file on IoTEdge Device. 
Once user gains access to the device, .env file can be read and attacker cause Dos attacks on IoTHub Not Mitigated (10) Denial of Service DDOS attack Azure Cognitive Service connection string is stored in .env file on IoTEdge Device. Once user gains access to the device, .env file can be read and attacker cause DoS attacks on Azure Cognitive Service Not Mitigated (11) Tampering with Storage Storage connection string is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker can tamper data on storage or read from the storage. Not Mitigated (12) Tampering with Storage Cognitive Service connection string is stored in .env file on the IoTEdge device. Once user gains access to the device, .env file can be read and attacker use cognitive service API's for his own purpose causing increase cost to use. Not Mitigated","title":"Threats"},{"location":"security/threat-modelling-example/#threat-model","text":"","title":"Threat Model"},{"location":"security/threat-modelling-example/#threat-properties","text":"Notable Threats # Principle Threat Mitigation 1 Authenticity Since channel from browser to IoTEdge Module is not authenticated, anyone can spoof it once gains access to WiFi network. Add authentication in all IoTEdge modules. 2 Confidentiality and Integrity As a result of the vulnerability of not encrypting data, plaintext data could be intercepted during transit via a man-in-the-middle (MitM) attack. Sensitive data could be exposed or tampered with to allow further exploits. All products and services must encrypt data in transit using approved cryptographic protocols and algorithms. Use TLS to encrypt all HTTP-based network traffic. Use other mechanisms, such as IPSec, to encrypt non-HTTP network traffic that contains customer or confidential data. Applies to data flow from browser to IoTEdge modules. 3 Confidentiality Data is a valuable target for most threat actors and attacking the data store directly, as opposed to stealing it during transit, allows data exfiltration at a much larger scale. In our scenario we are storing some data in Azure Blob containers. All customer or confidential data must be encrypted before being written to non-volatile storage media (encrypted at-rest) per the following requirements. Use approved algorithms. This includes AES-256, AES-192, or AES-128. Encryption must be enabled before writing data to storage. Applies to all data stores on the diagram. Azure Storage encrypt data at rest by default (AES-256). 4 Confidentiality Broken or non-existent authentication mechanisms may allow attackers to gain access to confidential information. All services within the Azure Trust Boundary must authenticate all incoming requests, including requests coming from the same network. Proper authorizations should also be applied to prevent unnecessary privileges. Whenever available, use Azure Managed Identities to authenticate services. Service Principals may be used if Managed Identities are not supported. External users or services may use UserName + Passwords, Tokens, Certificates or Connection Strings to authenticate, provided these are stored on Key Vault or any other vaulting solution. For authorization, use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Applies to Azure services like Azure IoTHub, Azure Cognitive Service, Azure Application Insights are authenticated using connection strings. 
5 Confidentiality and Integrity A large attack surface, particularly those that are exposed on the internet, will increase the probability of a compromise Minimize the application attack surface by limiting publicly exposed services. Use strong network controls by using virtual networks, subnets and network security groups to protect against unsolicited traffic. Use Azure Private Endpoint for Azure Storage. Applies to Azure storage. 6 Confidentiality and Integrity Browser and IoTEdge device are connected over in store WIFI network Minimize the attack on WIFI network by using secure algorithm like WPA2. Applies to connection between browser and IoTEdge devices. 7 Integrity Exploitation of insufficient logging and monitoring is the bedrock of nearly every major incident. Attackers rely on the lack of monitoring and timely response to achieve their goals without being detected. Logging of critical application events must be performed to ensure that, should a security incident occur, incident response and root-cause analysis may be done. Steps must also be taken to ensure that logs are available and cannot be overwritten or destroyed through malicious or accidental occurrences. At a minimum, the following events should be logged. Login/logout events Privilege delegation events Security validation failures (e.g. input validation or authorization check failures) Application errors and system events Application and system start-ups and shut-downs, as well as logging initialization 6 Availability Exploitation of the public endpoint by malicious actors who aim to render the service unavailable to its intended users by interrupting the service normal activity, for instance by flooding the target service with requests until normal traffic is unable to be processed (Denial of Service) Application is accessed via web app deployed as one of the IoTEdge modules on the IoTEdge device. This app can be accessed by anyone in the local area network. Hence DDoS attacks are possible if the attacker gained access to local area network. All services deployed as IoTEdge modules must use authentication. Applies to services deployed on IoTEdge device 7 Integrity Tampering with data Data at rest, in Azure Storage must be encrypted on disk. Data at rest, in Azure can be protected further by Azure Advanced Threat Protection. Data at rest, in Azure Storage and Azure monitor workspace will use Azure RBAC to segregate duties and grant only the least amount of access to perform an action at a particular scope. Data in motion between services can be encrypted in TLS 1.2 Applies to data flow between IoTEdge modules and Azure Services.","title":"Threat Properties"},{"location":"security/threat-modelling-example/#security-principles","text":"Confidentiality refers to the objective of keeping data private or secret. In practice, it\u2019s about controlling access to data to prevent unauthorized disclosure. Integrity is about ensuring that data has not been tampered with and, therefore, can be trusted. It is correct, authentic, and reliable. Availability means that networks, systems, and applications are up and running. It ensures that authorized users have timely, reliable access to resources when they are needed.","title":"Security Principles"},{"location":"security/threat-modelling/","text":"Threat Modeling Threat modeling is an effective way to help secure your systems, applications, networks, and services. 
It's a systematic approach that identifies potential threats and recommendations to help reduce risk and meet security objectives earlier in the development lifecycle. Threat Modeling Phases Diagram Capture all requirements for your system and create a data-flow diagram Identify Apply a threat-modeling framework to the data-flow diagram and find potential security issues. Here we can use STRIDE framework to identify the threats. Mitigate Decide how to approach each issue with the appropriate combination of security controls. Validate Verify requirements are met, issues are found, and security controls are implemented. Example of these phases is covered in the threat modelling example. More details about these phases can be found at Threat Modeling Security Fundamentals. Threat Modeling Example Here is an example of a threat modeling document which talks about the architecture and different phases involved in the threat modeling. This document can be used as reference template for creating threat modeling documents. Resources Threat Modeling Microsoft Threat Modeling Tool STRIDE (Threat modeling framework)","title":"Threat Modeling"},{"location":"security/threat-modelling/#threat-modeling","text":"Threat modeling is an effective way to help secure your systems, applications, networks, and services. It's a systematic approach that identifies potential threats and recommendations to help reduce risk and meet security objectives earlier in the development lifecycle.","title":"Threat Modeling"},{"location":"security/threat-modelling/#threat-modeling-phases","text":"Diagram Capture all requirements for your system and create a data-flow diagram Identify Apply a threat-modeling framework to the data-flow diagram and find potential security issues. Here we can use STRIDE framework to identify the threats. Mitigate Decide how to approach each issue with the appropriate combination of security controls. Validate Verify requirements are met, issues are found, and security controls are implemented. Example of these phases is covered in the threat modelling example. More details about these phases can be found at Threat Modeling Security Fundamentals.","title":"Threat Modeling Phases"},{"location":"security/threat-modelling/#threat-modeling-example","text":"Here is an example of a threat modeling document which talks about the architecture and different phases involved in the threat modeling. This document can be used as reference template for creating threat modeling documents.","title":"Threat Modeling Example"},{"location":"security/threat-modelling/#resources","text":"Threat Modeling Microsoft Threat Modeling Tool STRIDE (Threat modeling framework)","title":"Resources"},{"location":"source-control/","text":"Source Control There are many options when working with Source Control. In ISE we use AzureDevOps for private repositories and GitHub for public repositories. Goal Following industry best practice to work in geo-distributed teams which encourage contributions from all across ISE as well as the broader OSS community Improve code quality by enforcing reviews before merging into main branches Improve traceability of features and fixes through a clean commit history General Guidance Consistency is important, so agree to the approach as a team before starting to code. Treat this as a design decision, so include a design proposal and review, in the same way as you would document all design decisions (see Working Agreements and Design Reviews ). 
Creating a New Repository When creating a new repository, the team should at least do the following Agree on the branch , release and merge strategy Define the merge strategy ( linear or non-linear ) Lock the default branch and merge using pull requests (PRs) Agree on branch naming (e.g. user/your_alias/feature_name ) Establish branch/PR policies For public repositories the default branch should contain the following files: LICENSE README.md contributing.md Contributing to an Existing Repository When working on an existing project, git clone the repository and ensure you understand the team's branch, merge and release strategy (e.g. through the projects CONTRIBUTING.md file ). Mixed DevOps Environments For most engagements having a single hosted DevOps environment (i.e. Azure DevOps) is the preferred path but there are times when a mixed DevOps environment (i.e. Azure DevOps for Agile/Work item tracking & GitHub for Source Control) is needed due to customer requirements. When working in a mixed environment: Manually tag PR's in work items Ensure that the scope of work items / tasks align with PR's Resources Git --local-branching-on-the-cheap Azure DevOps ISE Git details details on how to use Git as part of a ISE project. GitHub - Removing sensitive data from a repository How Git Works Pluralsight course Mastering Git Pluralsight course","title":"Source Control"},{"location":"source-control/#source-control","text":"There are many options when working with Source Control. In ISE we use AzureDevOps for private repositories and GitHub for public repositories.","title":"Source Control"},{"location":"source-control/#goal","text":"Following industry best practice to work in geo-distributed teams which encourage contributions from all across ISE as well as the broader OSS community Improve code quality by enforcing reviews before merging into main branches Improve traceability of features and fixes through a clean commit history","title":"Goal"},{"location":"source-control/#general-guidance","text":"Consistency is important, so agree to the approach as a team before starting to code. Treat this as a design decision, so include a design proposal and review, in the same way as you would document all design decisions (see Working Agreements and Design Reviews ).","title":"General Guidance"},{"location":"source-control/#creating-a-new-repository","text":"When creating a new repository, the team should at least do the following Agree on the branch , release and merge strategy Define the merge strategy ( linear or non-linear ) Lock the default branch and merge using pull requests (PRs) Agree on branch naming (e.g. user/your_alias/feature_name ) Establish branch/PR policies For public repositories the default branch should contain the following files: LICENSE README.md contributing.md","title":"Creating a New Repository"},{"location":"source-control/#contributing-to-an-existing-repository","text":"When working on an existing project, git clone the repository and ensure you understand the team's branch, merge and release strategy (e.g. through the projects CONTRIBUTING.md file ).","title":"Contributing to an Existing Repository"},{"location":"source-control/#mixed-devops-environments","text":"For most engagements having a single hosted DevOps environment (i.e. Azure DevOps) is the preferred path but there are times when a mixed DevOps environment (i.e. Azure DevOps for Agile/Work item tracking & GitHub for Source Control) is needed due to customer requirements. 
When working in a mixed environment: Manually tag PRs in work items Ensure that the scope of work items / tasks aligns with PRs","title":"Mixed DevOps Environments"},{"location":"source-control/#resources","text":"Git --local-branching-on-the-cheap Azure DevOps ISE Git details details on how to use Git as part of an ISE project. GitHub - Removing sensitive data from a repository How Git Works Pluralsight course Mastering Git Pluralsight course","title":"Resources"},{"location":"source-control/component-versioning/","text":"Component Versioning Goal Larger applications consist of multiple components that reference each other and rely on compatibility of the interfaces/contracts of the components. To achieve the goal of loosely coupled applications, each component should be versioned independently hence allowing developers to detect breaking changes or seamless updates just by looking at the version number. Version Numbers and Versioning Schemes For developers or other components to detect breaking changes the version number of a component is important. There are different version numbering schemes, e.g. major.minor[.build[.revision]] or major.minor[.maintenance[.build]] . Upon build / CI, these version numbers are generated. During CD / release, components are pushed to a component repository such as NuGet, NPM or Docker Hub, where a history of the different versions is kept. With each build, the version number is incremented at the last digit. Updating the major / minor version indicates changes of the API / interfaces / contracts: Major Version: A breaking change Minor Version: A backwards-compatible minor change Build / Revision: No API change, just a different build. Semantic Versioning Semantic Versioning is a versioning scheme specifying how to interpret the different version numbers. The most common format is major.minor.patch . The version number is incremented based on the following rules: Major version when you make incompatible API changes, Minor version when you add functionality in a backwards-compatible manner, and Patch version when you make backwards-compatible bug fixes. Examples of semver version numbers: 1.0.0-alpha.1 : +1 commit after the alpha release of 1.0.0 2.1.0-beta : 2.1.0 in beta branch 2.4.2 : 2.4.2 release A common practice is to determine the version number during the build process. For this, the source control repository is utilized to determine the version number automatically. The GitVersion tool uses the git history to generate a repeatable and unique version number based on the number of commits since the last major or minor release, commit messages, tags and branch names Version updates happen through: Commit messages or tags for Major / Minor / Revision updates. When using commit messages a convention such as Conventional Commits is recommended (see Git Guidance - Commit Message Structure ) Branch names (e.g. develop, release/..) for Alpha / Beta / RC Otherwise: Number of commits (+12, ...) Semantic Versioning Within a Monorepo A monorepo, short for \"monolithic repository\", is a software development practice where multiple related projects, components, or modules are stored within a single version-controlled repository as opposed to maintaining them in separate repositories. Challenges with Versioning in a Monorepo Structure Versioning in a monorepo involves making decisions about how to assign version numbers to different projects and components contained within the repository.
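Before turning to the monorepo case, the bump rules above can be made concrete with a small sketch; this illustrates the Semantic Versioning rules only (it is not GitVersion's actual algorithm) and ignores pre-release suffixes such as -alpha.1 for brevity:

```python
def bump_version(current: str, change: str) -> str:
    """Increment a 'major.minor.patch' version string for a given change type."""
    core = current.split("-")[0]  # pre-release suffixes (e.g. '-alpha.1') are ignored here
    major, minor, patch = (int(part) for part in core.split("."))
    if change == "major":  # incompatible API change
        return f"{major + 1}.0.0"
    if change == "minor":  # backwards-compatible new functionality
        return f"{major}.{minor + 1}.0"
    if change == "patch":  # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump_version("2.4.2", "patch"))  # -> 2.4.3
print(bump_version("2.4.2", "minor"))  # -> 2.5.0
print(bump_version("2.4.2", "major"))  # -> 3.0.0
```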
Assigning a single version number to all projects in a monorepo can lead to frequent version increments if changes in one project don't match the significance of changes in another. This might be excessive if some projects undergo rapid development while others evolve more slowly. Ideally, we would want each project within the monorepo to have its own version number. Changes in one project shouldn't necessarily trigger version changes in others. This strategy allows projects to evolve at their own pace, without forcing all projects to adopt the same version number. It aligns well with the differing release cadences of distinct projects. semantic-release Package for Versioning semantic-release simplifies the entire process of releasing a package, which encompasses tasks such as identifying the upcoming version number, producing release notes, and distributing the package. This process severs the direct link between human sentiments and version identifiers. Instead, it rigorously adheres to the Semantic Versioning standards and effectively conveys the significance of alterations to end users. semantic-release relies on commit messages to assess how codebase changes impact consumers. By adhering to structured conventions for commit messages, semantic-release autonomously identifies the subsequent semantic version, compiles a changelog, and releases the software. Angular Commit Message Conventions serve as the default for semantic-release . However, the configuration options of the @semantic-release/commit-analyzer and @semantic-release/release-notes-generator plugins, including presets, can be adjusted to modify the commit message format. The table below shows which commit message gets you which release type when semantic-release runs (using the default configuration): Commit message Release type fix(pencil): stop graphite breaking when too much pressure applied Patch Fix Release feat(pencil): add 'graphiteWidth' option Minor Feature Release perf(pencil): remove graphiteWidth option BREAKING CHANGE: The graphiteWidth option has been removed. The default graphite width of 10mm is always used for performance reasons. Major Breaking Release (Note that the BREAKING CHANGE: token must be in the footer of the commit) The inherent setup of semantic-release presumes a direct correspondence between a GitHub repository and a package. Hence changes anywhere in the project result in a version upgrade for the project. The semantic-release-monorepo tool permits the utilization of semantic-release within a solitary GitHub repository that encompasses numerous packages. Instead of attributing all commits to a single package, commits are assigned to packages based on the files that a commit touched. If a commit touches a file in or below a package's root, it will be considered for that package's next release. A single commit can belong to multiple packages and may trigger the release of multiple packages. In order to avoid version collisions, generated git tags are namespaced using the given package's name: <package-name> - <version> . semantic-release Configurations semantic-release \u2019s options, mode and plugins can be set via either: A .releaserc file, written in YAML or JSON, with optional extensions: .yaml/.yml/.json/.js/.cjs A release.config.(js|cjs) file that exports an object A release key in the project's package.json file Here is an example .releaserc file which contains the configuration for: 1. git tags for the releases from different types of branches 2. 
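The commit-message rules in the table above can be approximated with a short sketch; this is a simplification for illustration, not the actual @semantic-release/commit-analyzer implementation:

```python
import re


def release_type(commit_message):
    """Approximate the default semantic-release rules for a single commit message."""
    header, _, rest = commit_message.partition("\n")
    if "BREAKING CHANGE:" in rest:
        return "major"  # breaking change noted in the commit footer
    match = re.match(r"^(\w+)(\([^)]*\))?:", header)
    if not match:
        return None  # commits that don't follow the convention trigger no release
    kind = match.group(1)
    if kind == "feat":
        return "minor"
    if kind in ("fix", "perf"):
        return "patch"
    return None  # e.g. docs, chore, refactor


print(release_type("fix(pencil): stop graphite breaking when too much pressure applied"))  # patch
print(release_type("feat(pencil): add 'graphiteWidth' option"))  # minor
print(release_type("perf(pencil): remove graphiteWidth option\n\nBREAKING CHANGE: option removed"))  # major
```

In a real pipeline, semantic-release applies the most significant release type found across all commits since the last release to determine the next version.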
Any plugins required, list of supported plugins can be found here . In this file semantic-release-monorepo plugin is extended. { \"ci\" : true , \"repositoryUrl\" : \"your repository url\" , \"branches\" : [ \"master\" , { \"name\" : \"feature/*\" , \"prerelease\" : \"beta-${name.replace(/\\\\//g, '-').replace(/_/g, '-')}\" }, { \"name\" : \"[a-zA-Z0-9_]+/[a-zA-Z0-9-_]+\" , \"prerelease\" : \"dev-${name.replace(/\\\\//g, '-').replace(/_/g, '--')}\" } ], \"plugins\" : [ \"@semantic-release/commit-analyzer\" , \"@semantic-release/release-notes-generator\" , [ \"@semantic-release/exec\" , { \"verifyReleaseCmd\" : \"echo ${nextRelease.name} > .VERSION\" } ], \"semantic-release-ado\" ], \"extends\" : \"semantic-release-monorepo\" } Resources GitVersion Semantic Versioning Versioning in C# semantic-release semantic-release-monorepo","title":"Component Versioning"},{"location":"source-control/component-versioning/#component-versioning","text":"","title":"Component Versioning"},{"location":"source-control/component-versioning/#goal","text":"Larger applications consist of multiple components that reference each other and rely on compatibility of the interfaces/contracts of the components. To achieve the goal of loosely coupled applications, each component should be versioned independently hence allowing developers to detect breaking changes or seamless updates just by looking at the version number.","title":"Goal"},{"location":"source-control/component-versioning/#version-numbers-and-versioning-schemes","text":"For developers or other components to detect breaking changes the version number of a component is important. There is different versioning number schemes, e.g. major.minor[.build[.revision]] or major.minor[.maintenance[.build]] . Upon build / CI these version numbers are being generated. During CD / release components are pushed to a component repository such as Nuget, NPM, Docker Hub where a history of different versions is being kept. Each build the version number is incremented at the last digit. Updating the major / minor version indicates changes of the API / interfaces / contracts: Major Version: A breaking change Minor Version: A backwards-compatible minor change Build / Revision: No API change, just a different build.","title":"Version Numbers and Versioning Schemes"},{"location":"source-control/component-versioning/#semantic-versioning","text":"Semantic Versioning is a versioning scheme specifying how to interpret the different version numbers. The most common format is major.minor.patch . The version number is incremented based on the following rules: Major version when you make incompatible API changes, Minor version when you add functionality in a backwards-compatible manner, and Patch version when you make backwards-compatible bug fixes. Examples of semver version numbers: 1.0.0-alpha.1 : +1 commit after the alpha release of 1.0.0 2.1.0-beta : 2.1.0 in beta branch 2.4.2 : 2.4.2 release A common practice is to determine the version number during the build process. For this the source control repository is utilized to determine the version number automatically based the source code repository. The GitVersion tool uses the git history to generate repeatable and unique version number based on number of commits since last major or minor release commit messages tags branch names Version updates happen through: Commit messages or tags for Major / Minor / Revision updates. 
When using commit messages a convention such as Conventional Commits is recommended (see Git Guidance - Commit Message Structure ) Branch names (e.g. develop, release/..) for Alpha / Beta / RC Otherwise: Number of commits (+12, ...)","title":"Semantic Versioning"},{"location":"source-control/component-versioning/#semantic-versioning-within-a-monorepo","text":"A monorepo, short for \"monolithic repository\", is a software development practice where multiple related projects, components, or modules are stored within a single version-controlled repository as opposed to maintaining them in separate repositories.","title":"Semantic Versioning Within a Monorepo"},{"location":"source-control/component-versioning/#challenges-with-versioning-in-a-monorepo-structure","text":"Versioning in a monorepo involves making decisions about how to assign version numbers to different projects and components contained within the repository. Assigning a single version number to all projects in a monorepo can lead to frequent version increments if changes in one project don't match the significance of changes in another. This might be excessive if some projects undergo rapid development while others evolve more slowly. Ideally, we would want each project within the monorepo to have its own version number. Changes in one project shouldn't necessarily trigger version changes in others. This strategy allows projects to evolve at their own pace, without forcing all projects to adopt the same version number. It aligns well with the differing release cadences of distinct projects.","title":"Challenges with Versioning in a Monorepo Structure"},{"location":"source-control/component-versioning/#semantic-release-package-for-versioning","text":"semantic-release simplifies the entire process of releasing a package, which encompasses tasks such as identifying the upcoming version number, producing release notes, and distributing the package. This process severs the direct link between human sentiments and version identifiers. Instead, it rigorously adheres to the Semantic Versioning standards and effectively conveys the significance of alterations to end users. semantic-release relies on commit messages to assess how codebase changes impact consumers. By adhering to structured conventions for commit messages, semantic-release autonomously identifies the subsequent semantic version, compiles a changelog, and releases the software. Angular Commit Message Conventions serve as the default for semantic-release . However, the configuration options of the @semantic-release/commit-analyzer and @semantic-release/release-notes-generator plugins, including presets, can be adjusted to modify the commit message format. The table below shows which commit message gets you which release type when semantic-release runs (using the default configuration): Commit message Release type fix(pencil): stop graphite breaking when too much pressure applied Patch Fix Release feat(pencil): add 'graphiteWidth' option Minor Feature Release perf(pencil): remove graphiteWidth option BREAKING CHANGE: The graphiteWidth option has been removed. The default graphite width of 10mm is always used for performance reasons. Major Breaking Release (Note that the BREAKING CHANGE: token must be in the footer of the commit) The inherent setup of semantic-release presumes a direct correspondence between a GitHub repository and a package. Hence changes anywhere in the project result in a version upgrade for the project. 
The semantic-release-monorepo tool permits the utilization of semantic-release within a solitary GitHub repository that encompasses numerous packages. Instead of attributing all commits to a single package, commits are assigned to packages based on the files that a commit touched. If a commit touches a file in or below a package's root, it will be considered for that package's next release. A single commit can belong to multiple packages and may trigger the release of multiple packages. In order to avoid version collisions, generated git tags are namespaced using the given package's name: <package-name> - <version> .","title":"semantic-release Package for Versioning"},{"location":"source-control/component-versioning/#semantic-release-configurations","text":"semantic-release \u2019s options, mode and plugins can be set via either: A .releaserc file, written in YAML or JSON, with optional extensions: .yaml/.yml/.json/.js/.cjs A release.config.(js|cjs) file that exports an object A release key in the project's package.json file Here is an example .releaserc file which contains the configuration for: 1. git tags for the releases from different types of branches 2. Any plugins required, list of supported plugins can be found here . In this file semantic-release-monorepo plugin is extended. { \"ci\" : true , \"repositoryUrl\" : \"your repository url\" , \"branches\" : [ \"master\" , { \"name\" : \"feature/*\" , \"prerelease\" : \"beta-${name.replace(/\\\\//g, '-').replace(/_/g, '-')}\" }, { \"name\" : \"[a-zA-Z0-9_]+/[a-zA-Z0-9-_]+\" , \"prerelease\" : \"dev-${name.replace(/\\\\//g, '-').replace(/_/g, '--')}\" } ], \"plugins\" : [ \"@semantic-release/commit-analyzer\" , \"@semantic-release/release-notes-generator\" , [ \"@semantic-release/exec\" , { \"verifyReleaseCmd\" : \"echo ${nextRelease.name} > .VERSION\" } ], \"semantic-release-ado\" ], \"extends\" : \"semantic-release-monorepo\" }","title":"semantic-release Configurations"},{"location":"source-control/component-versioning/#resources","text":"GitVersion Semantic Versioning Versioning in C# semantic-release semantic-release-monorepo","title":"Resources"},{"location":"source-control/merge-strategies/","text":"Merge Strategies Agree if you want a linear or non-linear commit history. There are pros and cons to both approaches: Pro linear: Avoid messy git history, use linear history Con linear: Why you should stop using Git rebase Approach for Non-Linear Commit History Merging topic into main A---B---C topic / \\ D---E---F---G---H main git fetch origin git checkout main git merge topic Two Approaches to Achieve a Linear Commit History Rebase Topic Branch Before Merging into Main Before merging topic into main , we rebase topic with the main branch: A---B---C topic / \\ D---E---F-----------G---H main git checkout main git pull git checkout topic git rebase origin/main Create a PR topic --> main in Azure DevOps and approve using the squash merge option Rebase Topic Branch Before Squash Merge into Main Squash merging is a merge option that allows you to condense the Git history of topic branches when you complete a pull request. Instead of adding each commit on topic to the history of main , a squash merge takes all the file changes and adds them to a single new commit on main . 
A---B---C topic / D---E---F-----------G---H main Create a PR topic --> main in Azure DevOps and approve using the squash merge option","title":"Merge Strategies"},{"location":"source-control/merge-strategies/#merge-strategies","text":"Agree if you want a linear or non-linear commit history. There are pros and cons to both approaches: Pro linear: Avoid messy git history, use linear history Con linear: Why you should stop using Git rebase","title":"Merge Strategies"},{"location":"source-control/merge-strategies/#approach-for-non-linear-commit-history","text":"Merging topic into main A---B---C topic / \\ D---E---F---G---H main git fetch origin git checkout main git merge topic","title":"Approach for Non-Linear Commit History"},{"location":"source-control/merge-strategies/#two-approaches-to-achieve-a-linear-commit-history","text":"","title":"Two Approaches to Achieve a Linear Commit History"},{"location":"source-control/merge-strategies/#rebase-topic-branch-before-merging-into-main","text":"Before merging topic into main , we rebase topic with the main branch: A---B---C topic / \\ D---E---F-----------G---H main git checkout main git pull git checkout topic git rebase origin/main Create a PR topic --> main in Azure DevOps and approve using the squash merge option","title":"Rebase Topic Branch Before Merging into Main"},{"location":"source-control/merge-strategies/#rebase-topic-branch-before-squash-merge-into-main","text":"Squash merging is a merge option that allows you to condense the Git history of topic branches when you complete a pull request. Instead of adding each commit on topic to the history of main , a squash merge takes all the file changes and adds them to a single new commit on main . A---B---C topic / D---E---F-----------G---H main Create a PR topic --> main in Azure DevOps and approve using the squash merge option","title":"Rebase Topic Branch Before Squash Merge into Main"},{"location":"source-control/naming-branches/","text":"Naming Branches When contributing to existing projects, look for and stick with the agreed branch naming convention. In open source projects this information is typically found in the contributing instructions, often in a file named CONTRIBUTING.md . In the beginning of a new project the team agrees on the project conventions including the branch naming strategy. Here's an example of a branch naming convention: <user alias>/ [ feature/bug/hotfix ] /<work item ID>_<title> Which could translate to something as follows: dickinson/feature/271_add_more_cowbell The example above is just that - an example. The team can choose to omit or add parts. Choosing a branch convention can depend on the development model (e.g. trunk-based development ), versioning model, tools used in managing source control, matter of taste etc. Focus on simplicity and reducing ambiguity; a good branch naming strategy allows the team to understand the purpose and ownership of each branch in the repository.","title":"Naming Branches"},{"location":"source-control/naming-branches/#naming-branches","text":"When contributing to existing projects, look for and stick with the agreed branch naming convention. In open source projects this information is typically found in the contributing instructions, often in a file named CONTRIBUTING.md . In the beginning of a new project the team agrees on the project conventions including the branch naming strategy. 
Here's an example of a branch naming convention: <user alias>/ [ feature/bug/hotfix ] /<work item ID>_<title> Which could translate to something as follows: dickinson/feature/271_add_more_cowbell The example above is just that - an example. The team can choose to omit or add parts. Choosing a branch convention can depend on the development model (e.g. trunk-based development ), versioning model, tools used in managing source control, matter of taste etc. Focus on simplicity and reducing ambiguity; a good branch naming strategy allows the team to understand the purpose and ownership of each branch in the repository.","title":"Naming Branches"},{"location":"source-control/secrets-management/","text":"Working with Secrets in Source Control The best way to avoid leaking secrets is to store them in local/private files and exclude these from git tracking with a .gitignore file. E.g. the following pattern will exclude all files with the extension .private.config : # remove private configuration *.private.config For more details on proper management of credentials and secrets in source control, and handling an accidental commit of secrets to source control, please refer to the Secrets Management document which has further information, split by language as well. As an extra security measure, apply credential scanning in your CI/CD pipeline.","title":"Working with Secrets in Source Control"},{"location":"source-control/secrets-management/#working-with-secrets-in-source-control","text":"The best way to avoid leaking secrets is to store them in local/private files and exclude these from git tracking with a .gitignore file. E.g. the following pattern will exclude all files with the extension .private.config : # remove private configuration *.private.config For more details on proper management of credentials and secrets in source control, and handling an accidental commit of secrets to source control, please refer to the Secrets Management document which has further information, split by language as well. As an extra security measure, apply credential scanning in your CI/CD pipeline.","title":"Working with Secrets in Source Control"},{"location":"source-control/git-guidance/","text":"Git Guidance What is Git? Git is a distributed version control system. This means that - unlike SVN or CVS - it doesn't use a central server to synchronize. Instead, every participant has a local copy of the source-code, and the attached history that is kept in sync by comparing commit hashes (SHA hashes of changes between each git commit command) making up the latest version (called HEAD ). For example: repo 1 : A -> B -> C -> D -> HEAD repo 2 : A -> B -> HEAD repo 3 : X -> Y -> Z -> HEAD repo 4 : A -> J -> HEAD Since they share a common history, repo 1 and repo 2 can be synchronized fairly easily, repo 4 may be able to synchronize as well, but it's going to have to add a commit (J, and maybe a merge commit) to repo 1. Repo 3 cannot be easily synchronized with the others. Everything related to these commits is stored in a local .git directory in the root of the repository. In other words, by using Git you are simply creating immutable file histories that uniquely identify the current state and therefore allow sharing whatever comes after. It's a Merkle tree . Be sure to run git help after Git installation to find really in-depth explanations of everything. Installation Git is a tool set that must be installed. Install Git and follow the First-Time Git Setup . 
A recommended installation is the Git Lens extension for Visual Studio Code . Visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more. You can use these commands as well to configure your Git for Visual Studio Code as an editor for merge conflicts and diff tool. git config --global user.name [ YOUR FIRST AND LAST NAME ] git config --global user.email [ YOUR E-MAIL ADDRESS ] git config --global merge.tool vscode git config --global mergetool.vscode.cmd \"code --wait $MERGED \" git config --global diff.tool vscode git config --global difftool.vscode.cmd \"code --wait --diff $LOCAL $REMOTE \" Basic Workflow A basic Git workflow is as follows; you can find more information on the specific steps below. # pull the latest changes git pull # start a new feature branch based on the develop branch git checkout -b feature/123-add-git-instructions develop # edit some files # add and commit the files git add <file> git commit -m \"add basic instructions\" # edit some files # add and commit the files git add <file> git commit -m \"add more advanced instructions\" # check your changes git status # push the branch to the remote repository git push --set-upstream origin feature/123-add-git-instructions Cloning Whenever you want to make a change to a repository, you need to first clone it. Cloning a repository pulls down a full copy of all the repository data, so that you can work on it locally. This copy includes all versions of every file and folder for the project. git clone https://github.com/username/repo-name You only need to clone the repository the first time. Before any subsequent branches you can sync any changes from the remote repository using git pull . Branching To avoid adding code that has not been peer reviewed to the main branch (ex. develop ) we typically work in feature branches, and merge these back to the main trunk with a Pull Request. It's even the case that often the main or develop branch of a repository are locked so that you can't make changes without a Pull Request. Therefore, it is useful to create a separate branch for your local/feature work, so that you can work and track your changes in this branch. Pull the latest changes and create a new branch for your work based on the trunk (in this case develop ). git pull git checkout -b feature/feature-name develop At any point, you can move between the branches with git checkout <branch> as long as you have committed or stashed your work. If you forget the name of your branch use git branch --all to list all branches. Committing To avoid losing work, it is good to commit often in small chunks. This allows you to revert only the last changes if you discover a problem and also neatly explains exactly what changes were made and why. Make changes to your branch Check what files were changed > git status On branch feature/271-basic-commit-info Changes not staged for commit: ( use \"git add <file>...\" to update what will be committed ) ( use \"git restore <file>...\" to discard changes in working directory ) modified: source-control/git-guidance/README.md Track the files you wish to include in the commit. 
To track all modified files: git add --all Or to track only specific files: git add source-control/git-guidance/README.md Commit the changes to your local branch with a descriptive commit message git commit -m \"add basic git instructions\" Pushing When you are done working, push your changes to a branch in the remote repository using: git push The first time you push, you first need to set an upstream branch as follows. After the first push, the --set-upstream parameter and branch name are not needed anymore. git push --set-upstream origin feature/feature-name Once the feature branch is pushed to the remote repository, it is visible to anyone with access to the code. Merging We encourage the use of Pull Request to merge code to the main repository to make sure that all code in the final product is code reviewed The Pull Request (PR) process in Azure DevOps , GitHub and other similar tools make it easy both to start a PR, review a PR and merge a PR. Merge Conflicts If multiple people make changes to the same files, you may need to resolve any conflicts that have occurred before you can merge. # check out the develop branch and get the latest changes git checkout develop git pull # check out your branch git checkout <your branch> # merge the develop branch into your branch git merge develop # if merge conflicts occur, above command will fail with a message telling you that there are conflicts to be solved # find which files need to be resolved git status You can start an interactive process that will show which files have conflicts. Sometimes you removed a file, where it was changed in dev. Or you made changes to some lines in a file where another developer made changes as well. If you went through the installation steps mentioned before, Visual Studio Code is set up as merge tool. You can also use a merge tool like kdiff3 . When editing conflicts occur, the process will automatically open Visual Studio Code where the conflicting parts are highlighted in green and blue, and you have make a choice: Accept your changes (current) Accept the changes from dev branch (incoming) Accept them both and fix the code (probably needed) Here are lines that are either unchanged from the common ancestor, or cleanly resolved because only one side changed. <<<<<<< yours:sample.txt Conflict resolution is hard; let's go shopping. ======= Git makes conflict resolution easy. >>>>>>> theirs:sample.txt And here is another line that is cleanly resolved or unmodified When this process is completed, make sure you test the result by executing build, checks, test to validate this merged result. # conclude the merge git merge --continue # verify that everything went ok git log # push the changes to the remote branch git push If no other conflicts appear, the PR can now be merged, and your branch deleted. Use squash to reduce your changes into a single commit, so the commit history can be within an acceptable size. Stashing Changes git stash is super handy if you have un-committed changes in your working directory, but you want to work on a different branch. You can run git stash , save the un-committed work, and revert to the HEAD commit. 
You can retrieve the saved changes by running git stash pop : git stash \u2026 git stash pop Or you can move the current state into a new branch: git stash branch <new_branch_to_save_changes> Recovering Lost Commits If you \"lost\" a commit that you want to return to, for example to revert a git rebase where your commits got squashed, you can use git reflog to find the commit: git reflog Then you can use the reflog reference ( HEAD@{} ) to reset to a specific commit before the rebase: git reset HEAD@ { 2 } Commit Best Practices A commit combines changes into a logical unit. Adding a descriptive commit message can aid in comprehending the code changes and understanding the rationale behind the modifications. Consider the following when making your commits: Make small commits. This makes changes easier to review, and if we need to revert a commit, we lose less work. Consider splitting the commit into separate commits with git add -p if it includes more than one logical change or bug fix. Don't mix whitespace changes with functional code changes. It is hard to determine if the line has a functional change or only removes a whitespace, so functional changes may go unnoticed. Commit complete and well tested code. Never commit incomplete code, get in the habit of testing your code before committing. Write good commit messages. Why is it necessary? It may fix a bug, add a feature, improve performance, or just be a change for the sake of correctness What effects does this change have? In addition to the obvious ones, this may include benchmarks, side effects etc. You can specify the default git editor, which allows you to write your commit messages using your favorite editor. The following command makes Visual Studio Code your default git editor: git config --global core.editor \"code --wait\" Commit Message Structure The essential parts of a commit message are: subject line: a short description of the commit, maximum 50 characters long body (optional): a longer description of the commit, wrapped at 72 characters, separated from the subject line by a blank line You are free to structure commit messages; however, git commands like git log utilize above structure. Therefore, it can be helpful to follow a convention within your team and to utilize git best. For example, Conventional Commits is a lightweight convention that complements SemVer , by describing the features, fixes, and breaking changes made in commit messages. See Component Versioning for more information on versioning. For more information on commit message conventions, see: A Note About Git Commit Messages Conventional Commits Git commit best practices How to Write a Git Commit Message How to Write Better Git Commit Messages Information in commit messages On commit messages Managing Remotes A local git repository can have one or more backing remote repositories. You can list the remote repositories using git remote - by default, the remote repository you cloned from will be called origin > git remote -v origin https://github.com/microsoft/code-with-engineering-playbook.git ( fetch ) origin https://github.com/microsoft/code-with-engineering-playbook.git ( push ) Working with Forks You can set multiple remotes. This is useful for example if you want to work with a forked version of the repository. For more info on how to set upstream remotes and syncing repositories when working with forks see GitHub's Working with forks documentation . 
Updating the Remote if a Repository Changes Names If the repository is changed in some way, for example a name change, or if you want to switch between HTTPS and SSH you need to update the remote # list the existing remotes > git remote -v origin https://hostname/username/repository-name.git ( fetch ) origin https://hostname/username/repository-name.git ( push ) # change the remote url git remote set-url origin https://hostname/username/new-repository-name.git # verify that the remote URL has changed > git remote -v origin https://hostname/username/new-repository-name.git ( fetch ) origin https://hostname/username/new-repository-name.git ( push ) Rolling Back Changes Reverting and Deleting Commits To \"undo\" a commit, run the following two commands: git revert and git reset . git revert creates a new commit that undoes commits while git reset allows deleting commits entirely from the commit history. If you have committed secrets/keys, git reset will remove them from the commit history! To delete the latest commit use HEAD~ : git reset --hard HEAD~1 To delete commits back to a specific commit, use the respective commit id: git reset --hard <sha1-commit-id> after you deleted the unwanted commits, push using force : git push origin HEAD --force Interactive rebase for undoing commits: git rebase -i HEAD~N The above command will open an interactive session in an editor (for example vim) with the last N commits sorted from oldest to newest. To undo a commit, delete the corresponding line of the commit and save the file. Git will rewrite the commits in the order listed in the file and because one (or many) commits were deleted, the commit will no longer be part of the history. Running rebase will locally modify the history, after this one can use force to push the changes to remote without the deleted commit. Using Submodules Submodules can be useful in more complex deployment and/or development scenarios Adding a submodule to your repo git submodule add -b master <your_submodule> Initialize and pull a repo with submodules: git submodule init git submodule update --init --remote git submodule foreach git checkout master git submodule foreach git pull origin Working with Images, Video and Other Binary Content Avoid committing frequently changed binary files, such as large images, video or compiled code to your git repository. Binary content is not diffed like text content, so cloning or pulling from the repository may pull each revision of the binary file. One solution to this problem is Git LFS (Git Large File Storage) - an open source Git extension for versioning large files. You can find more information on Git LFS in the Git LFS and VFS document . Working with Large Repositories When working with a very large repository of which you don't require all the files, you can use VFS for Git - an open source Git extension that virtualize the file system beneath your Git repository, so that you seem to work in a regular working directory but while VFS for Git only downloads objects as they are needed. You can find more information on VFS for Git in the Git LFS and VFS document . Tools Visual Studio Code is a cross-platform powerful source code editor with built in git commands. Within Visual Studio Code editor you can review diffs, stage changes, make commits, pull and push to your git repositories. You can refer to Visual Studio Code Git Support for documentation. Use a shell/terminal to work with Git commands instead of relying on GUI clients . 
If you're working on Windows, posh-git is a great PowerShell environment for Git. Another option is to use Git bash for Windows . On Linux/Mac, install git and use your favorite shell/terminal.","title":"Git Guidance"},{"location":"source-control/git-guidance/#git-guidance","text":"","title":"Git Guidance"},{"location":"source-control/git-guidance/#what-is-git","text":"Git is a distributed version control system. This means that - unlike SVN or CVS - it doesn't use a central server to synchronize. Instead, every participant has a local copy of the source-code, and the attached history that is kept in sync by comparing commit hashes (SHA hashes of changes between each git commit command) making up the latest version (called HEAD ). For example: repo 1 : A -> B -> C -> D -> HEAD repo 2 : A -> B -> HEAD repo 3 : X -> Y -> Z -> HEAD repo 4 : A -> J -> HEAD Since they share a common history, repo 1 and repo 2 can be synchronized fairly easily, repo 4 may be able to synchronize as well, but it's going to have to add a commit (J, and maybe a merge commit) to repo 1. Repo 3 cannot be easily synchronized with the others. Everything related to these commits is stored in a local .git directory in the root of the repository. In other words, by using Git you are simply creating immutable file histories that uniquely identify the current state and therefore allow sharing whatever comes after. It's a Merkle tree . Be sure to run git help after Git installation to find really in-depth explanations of everything.","title":"What is Git?"},{"location":"source-control/git-guidance/#installation","text":"Git is a tool set that must be installed. Install Git and follow the First-Time Git Setup . A recommended installation is the Git Lens extension for Visual Studio Code . Visualize code authorship at a glance via Git blame annotations and code lens, seamlessly navigate and explore Git repositories, gain valuable insights via powerful comparison commands, and so much more. You can use these commands as well to configure your Git for Visual Studio Code as an editor for merge conflicts and diff tool. git config --global user.name [ YOUR FIRST AND LAST NAME ] git config --global user.email [ YOUR E-MAIL ADDRESS ] git config --global merge.tool vscode git config --global mergetool.vscode.cmd \"code --wait $MERGED \" git config --global diff.tool vscode git config --global difftool.vscode.cmd \"code --wait --diff $LOCAL $REMOTE \"","title":"Installation"},{"location":"source-control/git-guidance/#basic-workflow","text":"A basic Git workflow is as follows; you can find more information on the specific steps below. # pull the latest changes git pull # start a new feature branch based on the develop branch git checkout -b feature/123-add-git-instructions develop # edit some files # add and commit the files git add <file> git commit -m \"add basic instructions\" # edit some files # add and commit the files git add <file> git commit -m \"add more advanced instructions\" # check your changes git status # push the branch to the remote repository git push --set-upstream origin feature/123-add-git-instructions","title":"Basic Workflow"},{"location":"source-control/git-guidance/#cloning","text":"Whenever you want to make a change to a repository, you need to first clone it. Cloning a repository pulls down a full copy of all the repository data, so that you can work on it locally. This copy includes all versions of every file and folder for the project. 
git clone https://github.com/username/repo-name You only need to clone the repository the first time. Before any subsequent branches you can sync any changes from the remote repository using git pull .","title":"Cloning"},{"location":"source-control/git-guidance/#branching","text":"To avoid adding code that has not been peer reviewed to the main branch (ex. develop ) we typically work in feature branches, and merge these back to the main trunk with a Pull Request. It's even the case that often the main or develop branch of a repository are locked so that you can't make changes without a Pull Request. Therefore, it is useful to create a separate branch for your local/feature work, so that you can work and track your changes in this branch. Pull the latest changes and create a new branch for your work based on the trunk (in this case develop ). git pull git checkout -b feature/feature-name develop At any point, you can move between the branches with git checkout <branch> as long as you have committed or stashed your work. If you forget the name of your branch use git branch --all to list all branches.","title":"Branching"},{"location":"source-control/git-guidance/#committing","text":"To avoid losing work, it is good to commit often in small chunks. This allows you to revert only the last changes if you discover a problem and also neatly explains exactly what changes were made and why. Make changes to your branch Check what files were changed > git status On branch feature/271-basic-commit-info Changes not staged for commit: ( use \"git add <file>...\" to update what will be committed ) ( use \"git restore <file>...\" to discard changes in working directory ) modified: source-control/git-guidance/README.md Track the files you wish to include in the commit. To track all modified files: git add --all Or to track only specific files: git add source-control/git-guidance/README.md Commit the changes to your local branch with a descriptive commit message git commit -m \"add basic git instructions\"","title":"Committing"},{"location":"source-control/git-guidance/#pushing","text":"When you are done working, push your changes to a branch in the remote repository using: git push The first time you push, you first need to set an upstream branch as follows. After the first push, the --set-upstream parameter and branch name are not needed anymore. git push --set-upstream origin feature/feature-name Once the feature branch is pushed to the remote repository, it is visible to anyone with access to the code.","title":"Pushing"},{"location":"source-control/git-guidance/#merging","text":"We encourage the use of Pull Request to merge code to the main repository to make sure that all code in the final product is code reviewed The Pull Request (PR) process in Azure DevOps , GitHub and other similar tools make it easy both to start a PR, review a PR and merge a PR.","title":"Merging"},{"location":"source-control/git-guidance/#merge-conflicts","text":"If multiple people make changes to the same files, you may need to resolve any conflicts that have occurred before you can merge. 
# check out the develop branch and get the latest changes git checkout develop git pull # check out your branch git checkout <your branch> # merge the develop branch into your branch git merge develop # if merge conflicts occur, above command will fail with a message telling you that there are conflicts to be solved # find which files need to be resolved git status You can start an interactive process that will show which files have conflicts. Sometimes you removed a file, where it was changed in dev. Or you made changes to some lines in a file where another developer made changes as well. If you went through the installation steps mentioned before, Visual Studio Code is set up as the merge tool. You can also use a merge tool like kdiff3 . When editing conflicts occur, the process will automatically open Visual Studio Code where the conflicting parts are highlighted in green and blue, and you have to make a choice: Accept your changes (current) Accept the changes from dev branch (incoming) Accept them both and fix the code (probably needed) Here are lines that are either unchanged from the common ancestor, or cleanly resolved because only one side changed. <<<<<<< yours:sample.txt Conflict resolution is hard; let's go shopping. ======= Git makes conflict resolution easy. >>>>>>> theirs:sample.txt And here is another line that is cleanly resolved or unmodified When this process is completed, make sure you test the result by executing builds, checks, and tests to validate this merged result. # conclude the merge git merge --continue # verify that everything went ok git log # push the changes to the remote branch git push If no other conflicts appear, the PR can now be merged, and your branch deleted. Use squash to reduce your changes into a single commit, so the commit history can be within an acceptable size.","title":"Merge Conflicts"},{"location":"source-control/git-guidance/#stashing-changes","text":"git stash is super handy if you have un-committed changes in your working directory, but you want to work on a different branch. You can run git stash , save the un-committed work, and revert to the HEAD commit. You can retrieve the saved changes by running git stash pop : git stash \u2026 git stash pop Or you can move the current state into a new branch: git stash branch <new_branch_to_save_changes>","title":"Stashing Changes"},{"location":"source-control/git-guidance/#recovering-lost-commits","text":"If you \"lost\" a commit that you want to return to, for example to revert a git rebase where your commits got squashed, you can use git reflog to find the commit: git reflog Then you can use the reflog reference ( HEAD@{} ) to reset to a specific commit before the rebase: git reset HEAD@ { 2 }","title":"Recovering Lost Commits"},{"location":"source-control/git-guidance/#commit-best-practices","text":"A commit combines changes into a logical unit. Adding a descriptive commit message can aid in comprehending the code changes and understanding the rationale behind the modifications. Consider the following when making your commits: Make small commits. This makes changes easier to review, and if we need to revert a commit, we lose less work. Consider splitting the commit into separate commits with git add -p if it includes more than one logical change or bug fix. Don't mix whitespace changes with functional code changes. It is hard to determine if the line has a functional change or only removes a whitespace, so functional changes may go unnoticed. Commit complete and well tested code. 
Never commit incomplete code, get in the habit of testing your code before committing. Write good commit messages. Why is it necessary? It may fix a bug, add a feature, improve performance, or just be a change for the sake of correctness What effects does this change have? In addition to the obvious ones, this may include benchmarks, side effects etc. You can specify the default git editor, which allows you to write your commit messages using your favorite editor. The following command makes Visual Studio Code your default git editor: git config --global core.editor \"code --wait\"","title":"Commit Best Practices"},{"location":"source-control/git-guidance/#commit-message-structure","text":"The essential parts of a commit message are: subject line: a short description of the commit, maximum 50 characters long body (optional): a longer description of the commit, wrapped at 72 characters, separated from the subject line by a blank line You are free to structure commit messages; however, git commands like git log utilize above structure. Therefore, it can be helpful to follow a convention within your team and to utilize git best. For example, Conventional Commits is a lightweight convention that complements SemVer , by describing the features, fixes, and breaking changes made in commit messages. See Component Versioning for more information on versioning. For more information on commit message conventions, see: A Note About Git Commit Messages Conventional Commits Git commit best practices How to Write a Git Commit Message How to Write Better Git Commit Messages Information in commit messages On commit messages","title":"Commit Message Structure"},{"location":"source-control/git-guidance/#managing-remotes","text":"A local git repository can have one or more backing remote repositories. You can list the remote repositories using git remote - by default, the remote repository you cloned from will be called origin > git remote -v origin https://github.com/microsoft/code-with-engineering-playbook.git ( fetch ) origin https://github.com/microsoft/code-with-engineering-playbook.git ( push )","title":"Managing Remotes"},{"location":"source-control/git-guidance/#working-with-forks","text":"You can set multiple remotes. This is useful for example if you want to work with a forked version of the repository. For more info on how to set upstream remotes and syncing repositories when working with forks see GitHub's Working with forks documentation .","title":"Working with Forks"},{"location":"source-control/git-guidance/#updating-the-remote-if-a-repository-changes-names","text":"If the repository is changed in some way, for example a name change, or if you want to switch between HTTPS and SSH you need to update the remote # list the existing remotes > git remote -v origin https://hostname/username/repository-name.git ( fetch ) origin https://hostname/username/repository-name.git ( push ) # change the remote url git remote set-url origin https://hostname/username/new-repository-name.git # verify that the remote URL has changed > git remote -v origin https://hostname/username/new-repository-name.git ( fetch ) origin https://hostname/username/new-repository-name.git ( push )","title":"Updating the Remote if a Repository Changes Names"},{"location":"source-control/git-guidance/#rolling-back-changes","text":"","title":"Rolling Back Changes"},{"location":"source-control/git-guidance/#reverting-and-deleting-commits","text":"To \"undo\" a commit, run the following two commands: git revert and git reset . 
git revert creates a new commit that undoes commits while git reset allows deleting commits entirely from the commit history. If you have committed secrets/keys, git reset will remove them from the commit history! To delete the latest commit use HEAD~ : git reset --hard HEAD~1 To delete commits back to a specific commit, use the respective commit id: git reset --hard <sha1-commit-id> after you deleted the unwanted commits, push using force : git push origin HEAD --force Interactive rebase for undoing commits: git rebase -i HEAD~N The above command will open an interactive session in an editor (for example vim) with the last N commits sorted from oldest to newest. To undo a commit, delete the corresponding line of the commit and save the file. Git will rewrite the commits in the order listed in the file and because one (or many) commits were deleted, the commit will no longer be part of the history. Running rebase will locally modify the history, after this one can use force to push the changes to remote without the deleted commit.","title":"Reverting and Deleting Commits"},{"location":"source-control/git-guidance/#using-submodules","text":"Submodules can be useful in more complex deployment and/or development scenarios Adding a submodule to your repo git submodule add -b master <your_submodule> Initialize and pull a repo with submodules: git submodule init git submodule update --init --remote git submodule foreach git checkout master git submodule foreach git pull origin","title":"Using Submodules"},{"location":"source-control/git-guidance/#working-with-images-video-and-other-binary-content","text":"Avoid committing frequently changed binary files, such as large images, video or compiled code to your git repository. Binary content is not diffed like text content, so cloning or pulling from the repository may pull each revision of the binary file. One solution to this problem is Git LFS (Git Large File Storage) - an open source Git extension for versioning large files. You can find more information on Git LFS in the Git LFS and VFS document .","title":"Working with Images, Video and Other Binary Content"},{"location":"source-control/git-guidance/#working-with-large-repositories","text":"When working with a very large repository of which you don't require all the files, you can use VFS for Git - an open source Git extension that virtualize the file system beneath your Git repository, so that you seem to work in a regular working directory but while VFS for Git only downloads objects as they are needed. You can find more information on VFS for Git in the Git LFS and VFS document .","title":"Working with Large Repositories"},{"location":"source-control/git-guidance/#tools","text":"Visual Studio Code is a cross-platform powerful source code editor with built in git commands. Within Visual Studio Code editor you can review diffs, stage changes, make commits, pull and push to your git repositories. You can refer to Visual Studio Code Git Support for documentation. Use a shell/terminal to work with Git commands instead of relying on GUI clients . If you're working on Windows, posh-git is a great PowerShell environment for Git. Another option is to use Git bash for Windows . On Linux/Mac, install git and use your favorite shell/terminal.","title":"Tools"},{"location":"source-control/git-guidance/git-lfs-and-vfs/","text":"Using Git LFS and VFS for Git Introduction Git LFS and VFS for Git are solutions for using Git with (large) binary files and large source trees. 
Git LFS Git is very good at keeping track of changes in text-based files like code, but it is not that good at tracking binary files. For instance, if you store a Photoshop image file (PSD) in a repository, with every change, the complete file is stored again in the history. This can make the history of the Git repo very large, which makes a clone of the repository more and more time-consuming. A solution to work with binary files is using Git LFS (or Git Large File System). This is an extension to Git and must be installed separately, and it can only be used with a repository platform that supports LFS. GitHub.com and Azure DevOps for instance are platforms that have support for LFS. The way it works, in short, is that a placeholder file is stored in the repo with information for the LFS system. It looks something like this: version https://git-lfs.github.com/spec/v1 oid a747cfbbef63fc0a3f5ffca332ae486ee7bf77c1d1b9b2de02e261ef97d085fe size 4923023 The actual file is stored in separate storage. This way Git will track changes in this placeholder file, not the large file. The combination of using Git and Git LFS will hide this from the developer though. You will just work with the repository and files as before. When working with these large files yourself, you'll still see the Git history grow on your own machine, as Git will still start tracking these large files locally, but when you clone the repo, the history is actually pretty small. So it's beneficial for others not working directly on the large files. Pros of Git LFS Uses the end to end Git workflow for all files Git LFS supports file locking to avoid conflicts for undiffable assets Git LFS is fully supported in Azure DevOps Services Cons of Git LFS Everyone who contributes to the repository needs to install Git LFS If not set up properly: Binary files committed through Git LFS are not visible as Git will only download the data describing the large file Committing large binaries will push the full binary to the repository Git cannot merge the changes from two different versions of a binary file; file locking mitigates this Azure Repos do not support using SSH for repositories with Git LFS tracked files - for more information see the Git LFS authentication documentation Installation and use of Git LFS Go to https://git-lfs.github.com and download and install the setup from there. For every repository you want to use LFS, you have to go through these steps: Setup LFS for the repo: git lfs install Indicate which files have to be considered as large files (or binary files). As an example, to consider all Photoshop files to be large: git lfs track \"*.psd\" There are more fine-grained ways to indicate files in a folder and more. See the Git LFS Documentation . With these commands a .gitattributes file is created which contains these settings and must be part of the repository. From here on you just use the standard Git commands to work in the repository. 
Common LFS Commands Install Git LFS git lfs install # windows sudo apt-get git-lfs # linux See the Git LFS installation instructions for installation on other systems Track .mp4 files with Git LFS git lfs track '*.mp4' Update the .gitattributes file listing the files and patterns to track *.mp4 filter = lfs diff = lfs merge = lfs -text docs/images/* filter = lfs diff = lfs merge = lfs -text List all patterns tracked git lfs track List all files tracked git lfs ls-files Download files to your working directory git lfs pull git lfs pull --include = \"path/to/file\" VFS for Git Imagine a large repository containing multiple projects, ex. one per feature. As a developer you may only be working on some features, and thus you don't want to download all the projects in the repo. By default, with Git however, cloning the repository means you will download all files/projects. VFS for Git (or Virtual File System for Git) solves this problem, as it will only download what you need to your local machine, but if you look in the file system, e.g. with Windows Explorer, it will show all the folders and files including the correct file sizes. The Git platform must support GVFS to make this work. GitHub.com and Azure DevOps both support this out of the box. Installation and use of VFS for Git Microsoft create VFS for Git and made it open source. It can be found at https://github.com/microsoft/VFSForGit . It's only available for Windows. The necessary installers can be found at https://github.com/Microsoft/VFSForGit/releases On the releases page you'll find two important downloads: Git 2.28.0.0 installer, which is a requirement for running VFS for Git. This is not the same as the standard Git for Windows install! SetupGVFS installer. Download those files and install them on your machine. To be able to use VFS for Git for a repository, a .gitattributes file needs to be added to the repo with this line in it: * -text To clone a repository to your machine using VFS for Git you use gvfs instead of git like so: gvfs clone [ URL ] [ dir ] Once this is done, you have a folder which contains a src folder which contains the contents of the repository. This is done because of a practice to put all outputs of build systems outside this tree. This makes it easier to manage .gitignore files and to keep Git performant with lots of files. For working with the repository you just use Git commands as before. To remove a VFS for Git repository from your machine, make sure the VFS process is stopped and execute this command from the main folder: gvfs unmount This will stop the process and unregister it, after that you can safely remove the folder. Resources Git LFS getting started Git LFS manual Git LFS on Azure Repos","title":"Using Git LFS and VFS for Git Introduction"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#using-git-lfs-and-vfs-for-git-introduction","text":"Git LFS and VFS for Git are solutions for using Git with (large) binary files and large source trees.","title":"Using Git LFS and VFS for Git Introduction"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#git-lfs","text":"Git is very good and keeping track of changes in text-based files like code, but it is not that good at tracking binary files. For instance, if you store a Photoshop image file (PSD) in a repository, with every change, the complete file is stored again in the history. This can make the history of the Git repo very large, which makes a clone of the repository more and more time-consuming. 
A solution to work with binary files is using Git LFS (or Git Large File System). This is an extension to Git and must be installed separately, and it can only be used with a repository platform that supports LFS. GitHub.com and Azure DevOps for instance are platforms that have support for LFS. The way it works in short, is that a placeholder file is stored in the repo with information for the LFS system. It looks something like this: version https://git-lfs.github.com/spec/v1 oid a747cfbbef63fc0a3f5ffca332ae486ee7bf77c1d1b9b2de02e261ef97d085fe size 4923023 The actual file is stored in a separate storage. This way Git will track changes in this placeholder file, not the large file. The combination of using Git and Git LFS will hide this from the developer though. You will just work with the repository and files as before. When working with these large files yourself, you'll still see the Git history grown on your own machine, as Git will still start tracking these large files locally, but when you clone the repo, the history is actually pretty small. So it's beneficial for others not working directly on the large files.","title":"Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#pros-of-git-lfs","text":"Uses the end to end Git workflow for all files Git LFS supports file locking to avoid conflicts for undiffable assets Git LFS is fully supported in Azure DevOps Services","title":"Pros of Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#cons-of-git-lfs","text":"Everyone who contributes to the repository needs to install Git LFS If not set up properly: Binary files committed through Git LFS are not visible as Git will only download the data describing the large file Committing large binaries will push the full binary to the repository Git cannot merge the changes from two different versions of a binary file; file locking mitigates this Azure Repos do not support using SSH for repositories with Git LFS tracked files - for more information see the Git LFS authentication documentation","title":"Cons of Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#installation-and-use-of-git-lfs","text":"Go to https://git-lfs.github.com and download and install the setup from there. For every repository you want to use LFS, you have to go through these steps: Setup LFS for the repo: git lfs install Indicate which files have to be considered as large files (or binary files). As an example, to consider all Photoshop files to be large: git lfs track \"*.psd\" There are more fine-grained ways to indicate files in a folder and more. See the Git LFS Documentation . With these commands a .gitattribute file is created which contains these settings and must be part of the repository. From here on you just use the standard Git commands to work in the repository. 
The rest will be handled by Git and Git LFS.","title":"Installation and use of Git LFS"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#common-lfs-commands","text":"Install Git LFS git lfs install # windows sudo apt-get install git-lfs # linux See the Git LFS installation instructions for other systems Track .mp4 files with Git LFS git lfs track '*.mp4' Update the .gitattributes file listing the files and patterns to track *.mp4 filter=lfs diff=lfs merge=lfs -text docs/images/* filter=lfs diff=lfs merge=lfs -text List all patterns tracked git lfs track List all files tracked git lfs ls-files Download files to your working directory git lfs pull git lfs pull --include=\"path/to/file\"","title":"Common LFS Commands"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#vfs-for-git","text":"Imagine a large repository containing multiple projects, e.g. one per feature. As a developer you may only be working on some features, and thus you don't want to download all the projects in the repo. By default, however, cloning the repository with Git means you will download all files/projects. VFS for Git (or Virtual File System for Git) solves this problem, as it will only download what you need to your local machine, but if you look in the file system, e.g. with Windows Explorer, it will show all the folders and files including the correct file sizes. The Git platform must support GVFS to make this work. GitHub.com and Azure DevOps both support this out of the box.","title":"VFS for Git"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#installation-and-use-of-vfs-for-git","text":"Microsoft created VFS for Git and made it open source. It can be found at https://github.com/microsoft/VFSForGit . It's only available for Windows. The necessary installers can be found at https://github.com/Microsoft/VFSForGit/releases On the releases page you'll find two important downloads: Git 2.28.0.0 installer, which is a requirement for running VFS for Git. This is not the same as the standard Git for Windows install! SetupGVFS installer. Download those files and install them on your machine. To be able to use VFS for Git for a repository, a .gitattributes file needs to be added to the repo with this line in it: * -text To clone a repository to your machine using VFS for Git, you use gvfs instead of git, like so: gvfs clone [ URL ] [ dir ] Once this is done, you have a folder containing a src folder, which holds the contents of the repository. This follows the practice of keeping all build system outputs outside the source tree. This makes it easier to manage .gitignore files and to keep Git performant with lots of files. For working with the repository you just use Git commands as before.
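As a sketch (the organization, repository name, and local path here are hypothetical placeholders), cloning a repository and starting to work in it could look like this: gvfs clone https://dev.azure.com/contoso/large-repo/_git/large-repo C:/src/large-repo cd C:/src/large-repo/src git status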
To remove a VFS for Git repository from your machine, make sure the VFS process is stopped and execute this command from the main folder: gvfs unmount This will stop the process and unregister it, after that you can safely remove the folder.","title":"Installation and use of VFS for Git"},{"location":"source-control/git-guidance/git-lfs-and-vfs/#resources","text":"Git LFS getting started Git LFS manual Git LFS on Azure Repos","title":"Resources"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index ec978c81d..a46c325a3 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,1202 +2,1202 @@ <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/ISE/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/engineering-fundamentals-checklist/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/the-first-week-of-an-ise-project/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/continuous-delivery/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/continuous-integration/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/azure-devops-service-connection-security/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/dependency-and-container-scanning/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/evaluate-open-source-software/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/penetration-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/secrets-management/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/secrets-management/credential_scanning/</loc> - <lastmod>2024-09-17</lastmod> + 
<lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/secrets-management/secrets_rotation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/secrets-management/static-code-analysis/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets-ado/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/dev-sec-ops/secrets-management/recipes/detect-secrets/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/gitops/deploying-with-gitops/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/gitops/github-workflows/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/gitops/secret-management/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/gitops/secret-management/azure-devops-secret-management-per-branch/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/gitops/secret-management/secret-rotation-in-pods/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/cd-on-low-code-solutions/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/ci-pipeline-for-better-documentation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/ci-with-jupyter-notebooks/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/inclusive-linting/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/reusing-devcontainers-within-a-pipeline/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/github-actions/runtime-variables/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/terraform/save-output-to-variable-group/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/terraform/share-common-variables-naming-conventions/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/CI-CD/recipes/terraform/terraform-structure-guidelines/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/UI-UX/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/UI-UX/recommended-technologies/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/backlog-management/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/ceremonies/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/roles/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/backlog-management/external-feedback/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/backlog-management/minimal-slices/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/backlog-management/risk-management/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/collaboration/add-pairing-field-azure-devops-cards/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/collaboration/pair-programming-tools/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/collaboration/social-question/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/collaboration/teaming-up/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/collaboration/virtual-collaboration/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/collaboration/why-collaboration/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/effective-organization/delivery-plan/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/advanced-topics/effective-organization/scrum-of-scrums/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/team-agreements/definition-of-done/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/team-agreements/definition-of-ready/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/team-agreements/team-manifesto/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/agile-development/team-agreements/working-agreement/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/cdc-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/e2e-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/e2e-testing/testing-comparison/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/e2e-testing/testing-methods/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/e2e-testing/recipes/gauge-framework/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/e2e-testing/recipes/postman-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/fault-injection-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/integration-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/performance-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/performance-testing/iterative-perf-test-template/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/performance-testing/load-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/shadow-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/smoke-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/synthetic-monitoring-tests/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/tech-specific-samples/building-containers-with-azure-devops/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/tech-specific-samples/blobstorage-unit-tests/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/templates/case-study-template/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/templates/test-type-template/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/ui-testing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/ui-testing/teams-tests/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/unit-testing/</loc> - <lastmod>2024-09-17</lastmod> + 
<lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/unit-testing/authoring-example/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/unit-testing/custom-connector/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/unit-testing/mocking/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/unit-testing/tdd-example/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/automated-testing/unit-testing/why-unit-tests/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/faq/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/inclusion-in-code-review/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/pull-request-template/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/pull-requests/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/tools/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/evidence-and-measures/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/process-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/process-guidance/author-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/process-guidance/reviewer-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/azure-pipelines-yaml/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/bash/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/csharp/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/go/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/java/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/javascript-and-typescript/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/markdown/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/python/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/code-reviews/recipes/terraform/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/exception-handling/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/readme/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/cloud-resource-design-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/data-heavy-design-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/distributed-system-design-reference/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/network-architecture-guidance-for-azure/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/network-architecture-guidance-for-hybrid/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/non-functional-requirements-capture-guide/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/object-oriented-design-reference/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-patterns/rest-api-design-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/doc/decision-log/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/doc/adr/0001-record-architecture-decisions/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/doc/adr/0002-app-level-logging/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/examples/memory/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/examples/memory/decision-log/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/examples/memory/Architecture/Data-Model/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/examples/memory/Deployment/Environments/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/decision-log/examples/memory/trade-studies/gitops/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/async-design-reviews/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/engagement-process/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/engineering-feasibility-spikes/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/high-level-design-recipe/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/milestone-epic-design-review-recipe/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/preferred-diagram-tooling/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/technical-spike/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/templates/feature-story-design-review/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/templates/milestone-epic-design-review/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/templates/template-task-design-review/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/recipes/templates/template-technical-spike/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/trade-studies/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/design-reviews/trade-studies/template/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/diagram-types/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/diagram-types/class-diagrams/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/diagram-types/component-diagrams/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/design/diagram-types/deployment-diagrams/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/diagram-types/sequence-diagrams/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/sustainability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/sustainability/sustainable-action-disclaimers/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/design/sustainability/sustainable-engineering-principles/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/client-app-inner-loop/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/copilots/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/cross-platform-tasks/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/devcontainers-getting-started/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/devcontainers-going-further/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/execute-local-pipeline-with-docker/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/fake-services-inner-loop/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/onboarding-guide-template/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/developer-experience/toggle-vnet-dev-environment/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/best-practices/automation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/best-practices/establish-and-manage/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/best-practices/good-documentation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/guidance/code/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/guidance/engineering-feedback/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/guidance/project-and-repositories/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/guidance/pull-requests/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/guidance/rest-apis/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/guidance/work-items/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/recipes/deploy-docfx-azure-website/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/recipes/static-website-with-mkdocs/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/recipes/sync-wiki-between-repos/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/recipes/using-docfx-and-tools/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/tools/automation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/tools/integrations/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/tools/languages/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/documentation/tools/wikis/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/engineering-feedback/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/engineering-feedback/feedback-examples/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/engineering-feedback/feedback-faq/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/engineering-feedback/feedback-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/agile-development-considerations-for-ml-projects/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/data-exploration/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/envisioning-and-problem-formulation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/envisioning-summary-template/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/feasibility-studies/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/ml-fundamentals-checklist/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/ml-model-checklist/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/model-experimentation/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/profiling-ml-and-mlops-code/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/proposed-ml-process/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/responsible-ai/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/testing-data-science-and-mlops-code/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/machine-learning/tpm-considerations-for-ml-projects/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/accessibility/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/availability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/capacity/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/compliance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/data-integrity/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/disaster-recovery/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/internationalization/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/interoperability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/maintainability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/performance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/portability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/reliability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/scalability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/non-functional-requirements/usability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/alerting/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/best-practices/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/correlation-id/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/diagnostic-tools/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/log-vs-metric-vs-trace/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/logs-privacy/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/microservices/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/ml-observability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/observability-as-code/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/observability-databricks/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/observability-pipelines/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/pitfalls/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/profiling/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/recipes-observability/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/pillars/dashboard/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> 
<url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/pillars/logging/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/pillars/metrics/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/pillars/tracing/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/tools/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/tools/KubernetesDashboards/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/tools/OpenTelemetry/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/tools/Prometheus/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/observability/tools/loki/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/privacy/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/privacy/data-handling/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/privacy/privacy-frameworks/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/security/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/security/rules-of-engagement/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/security/threat-modelling-example/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/security/threat-modelling/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/source-control/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/source-control/component-versioning/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> 
<loc>https://microsoft.github.io/code-with-engineering-playbook/source-control/merge-strategies/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/source-control/naming-branches/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/source-control/secrets-management/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/source-control/git-guidance/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> <url> <loc>https://microsoft.github.io/code-with-engineering-playbook/source-control/git-guidance/git-lfs-and-vfs/</loc> - <lastmod>2024-09-17</lastmod> + <lastmod>2024-09-27</lastmod> <changefreq>daily</changefreq> </url> </urlset> \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 01a50b6c2..9a762bf04 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ