Exercise 5 — Advanced Patterns and Capstone

Topics: $expr, $let, set operators, multi-stage pipeline design

Reading before you start: 14 - Variable and Reference Expressions, 15 - Set Operators

Database: aggregation_lab — see Setup - Database


Part A — Reference Expressions: $expr and $let

Task A1

Find all projects where actualCost is greater than half of estimatedCost.

Use $expr inside $match to compare the two fields:

$expr: { $gt: ["$actualCost", { $divide: ["$estimatedCost", 2] }] }

Task A2

Find all employees where the number of skills is greater than the number of ratings.

Use $expr with $gt and $size on both arrays.

Task A3

Use $let to define a variable totalFunding as estimatedCost for each project, then compute fundingPerTeamMember as totalFunding / teamSize.

$addFields: {
  fundingPerTeamMember: {
    $let: {
      vars: {
        totalFunding: "$estimatedCost",
        teamSize: { $size: "$team" }
      },
      in: { $divide: ["$$totalFunding", "$$teamSize"] }
    }
  }
}

Project title, fundingPerTeamMember.

Task A4

Use $let to compute a “weighted score” for each employee:

  • Define avgRat as the average of ratings
  • Define skillCount as the size of skills array
  • Compute the result as (avgRat * 10) + (skillCount * 3)

Project name and weightedScore.

Task A5

Find all tasks where hours is greater than the average hours across all TODO tasks (which is about 12.5).

Approach:

  1. First, compute the average in a separate mental step (or hardcode it)
  2. Then use $match with $expr and $gt to filter

Alternatively, challenge yourself: can you solve this with a single pipeline? (Hint: $group with _id: null first, then $lookup back.)


Part B — Set Operators

Task B1

The departments collection has a tags array. Write a pipeline that computes the intersection of tags between the “Engineering” and “Product” departments.

Approach:

  1. Match only those two departments
  2. Group them with _id: null, pushing all tags arrays
  3. Use $setIntersection on the two arrays

Or use a simpler approach: match Engineering, add a $lookup for Product, then compute the intersection.

Task B2

For each department, compute the set difference between its tags and the array ["tech", "product"]. This shows which tags are unique to each department (not in the common set).

$addFields: {
  uniqueTags: {
    $setDifference: ["$tags", ["tech", "product"]]
  }
}

Task B3

For each department, check whether its tags are a subset of ["tech", "product", "core", "management"].

Use $setIsSubset.

Task B4

Compute the union of all skills across all employees in the Engineering department.

  1. Match employees in Engineering (department_id: ObjectId("bbb000000000000000000001"))
  2. Unwind skills
  3. Group with _id: null, using $addToSet to collect unique skills

Alternatively, use $setUnion by grouping the skill arrays and reducing them.


Part C — Capstone Challenge

Time budget: This section is designed to take about 90 minutes. Plan your pipelines on paper before typing.

Capstone 1: Company Dashboard Report

Build a single pipeline that produces a company overview report with the following information for each department:

  • departmentName: the name of the department
  • location: where the department is located
  • employeeCount: number of employees
  • activeCount: number of active employees
  • averageSalary: average salary
  • topSkills: a deduplicated list of all skills across all employees in the department (you’ll need $unwind on skills followed by $addToSet in $group)
  • averageRating: the average of all individual ratings across all employees in the department (hint: $unwind ratings before grouping)

Save the result to company_dashboard.

Requirements:

  • Start from the employees collection
  • Join with departments using $lookup
  • The final output should have one document per department
  • Think carefully about the order of $unwind and $group operations

Capstone 2: Project Risk Assessment

Build a pipeline that identifies at-risk projects. A project is “at-risk” if:

  • Its status is not "COMPLETED"
  • Its budget utilization (actualCost / estimatedCost) is above 70%
  • OR it has fewer than 50% of milestones completed

The output should include:

  • title
  • status, priority
  • budgetUtilization (percentage)
  • completionRate (percentage of milestones completed)
  • riskLevel: use $switch to assign "CRITICAL" if both conditions are met, "HIGH" if budget > 70%, "MEDIUM" if completion < 50%, "LOW" otherwise
  • teamSize
  • teamMembers: use $lookup to resolve the team member ObjectIds to employee names

Sort by riskLevel (you may need to assign numeric values and sort by those).

Capstone 3: Customer Lifetime Value

Build a pipeline that computes each customer’s lifetime value and profile:

  • customerName, customerType, loyaltyTier, city
  • totalOrders: number of orders
  • totalSpent: sum of all order totals (each order’s total = sum of price * quantity for all items)
  • averageOrderValue: totalSpent / totalOrders
  • productsPurchased: deduplicated list of all product names they’ve ordered
  • spendingTier: use $switch"VIP" if totalSpent > 2000, "REGULAR" if > 500, "NEW" otherwise

Save to customer_lifetime_value.


Part D — Open Design Challenge

Task D1

Design your own analysis. Pick any combination of the collections in the database and design a pipeline that answers a business question you find interesting.

Requirements:

  • Must use at least 5 different stages
  • Must use at least 2 different expression operators
  • Must include a comment at the top explaining what question you’re answering

Some ideas for inspiration:

  • “Which employees work on the most projects?”
  • “What is the average task load per person per project?”
  • “Which services produce the most log entries per hour?”
  • “Do business customers spend more per order than private customers?”
  • “Is there a correlation between team size and budget utilization?”

Present your pipeline and explain the result.


Final Reflection:

  • Look back at your Exercise 1 pipelines. How much more complex can you build now?
  • Which operator or concept was the hardest to understand? What helped it click?
  • When would you use $expr vs. a regular $match condition?
  • How would you approach designing a pipeline for a new problem from scratch? What’s your process?