Low level tool design ex.pdf

System Design Exercise: Building a High-Performance Duplicate File Scanner

Objective

Design a high-performance, recursive file scanning tool that identifies duplicate files across one or more filesystems containing millions or even billions of files. The focus is on designing an efficient and scalable solution. You are expected to think carefully about concurrency, memory constraints, disk I/O, and correctness. The tool should not rely on file names but determine duplication based on file content.

Functional Requirements

1. Recursively scan one or more root directories and identify sets of duplicate files (files with identical content).
2. Should work across extremely large filesystems (millions to billions of files).
3. Must not rely on file names or paths for determining uniqueness.
4. Output should group identical files together by their full path.

Performance and Efficiency Requirements

● The tool must scale to handle terabytes of data and avoid unnecessary memory usage.
● Must use efficient file comparison strategies to avoid reading entire files unnecessarily.
● Consider parallelizing the scan using both:
  ○ Multithreading (for I/O concurrency)
  ○ Multiprocessing (to utilize multiple CPU cores for hashing)
● Design should support early filtering techniques (e.g., file size checks) to reduce work.
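One possible reading of the early-filtering and "avoid reading entire files" requirements is a staged comparison: group by size first, then by a hash of a short prefix, and only then by a hash of the full content. The sketch below is illustrative only and is not part of the exercise. The names `file_digest`, `group_by` and `find_duplicates`, the choice of SHA-256, the 4 KiB prefix and the 1 MiB chunk size are all assumptions. It also keeps every path in memory, so it would not scale to billions of files as written.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1 << 20  # 1 MiB read size (assumed value, not from the exercise)


def file_digest(path, limit=None):
    """SHA-256 of a file read in chunks; only the first `limit` bytes if set."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            block = f.read(n)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.digest()


def group_by(paths, key):
    """Bucket paths by key(path) and keep only buckets with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        try:
            groups[key(p)].append(p)
        except OSError:
            continue  # file vanished or is unreadable: skip it
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(roots, prefix_bytes=4096):
    """Return lists of paths whose content hashes match (hypothetical helper)."""
    paths = []
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                p = os.path.join(dirpath, name)
                # Regular files only; symlinks are skipped.
                if os.path.isfile(p) and not os.path.islink(p):
                    paths.append(p)

    result = []
    # Stage 1: size check, no file content is read.
    for size_group in group_by(paths, os.path.getsize):
        # Stage 2: hash of the first prefix_bytes only.
        prefix_key = lambda p: file_digest(p, prefix_bytes)
        for prefix_group in group_by(size_group, prefix_key):
            # Stage 3: full-content hash for the remaining candidates.
            result.extend(group_by(prefix_group, file_digest))
    return [sorted(g) for g in result]
```

Each stage only sees files that survived the previous one, so a file with a unique size is never opened. Matching digests are treated as identical content here; a stricter design could add a final byte-by-byte comparison, trading speed for certainty.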

Write a short technical design document (1–2 pages) covering:

1. High-level architecture of the tool (e.g., scanner, hasher, scheduler).
2. Concurrency model and how you avoid bottlenecks.
3. Step-by-step workflow of how duplicate files are detected.
4. Tradeoffs you considered (e.g., CPU vs I/O, memory usage, speed vs accuracy).
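As a starting point for the concurrency-model item, the sketch below shows one way a hasher stage might fan work out to a pool while keeping the number of in-flight tasks bounded, so that a very large path stream does not pile up in memory. It is a minimal illustration under stated assumptions, not a reference answer: `sha256_file`, `hash_many` and the `max_in_flight` default are hypothetical names and values.

```python
import hashlib
from concurrent.futures import FIRST_COMPLETED, as_completed, wait


def sha256_file(path, chunk_size=1 << 20):
    """Return (path, hex digest), or (path, None) if the file cannot be read."""
    h = hashlib.sha256()
    try:
        with open(path, "rb") as f:
            while block := f.read(chunk_size):
                h.update(block)
    except OSError:
        return path, None
    return path, h.hexdigest()


def hash_many(paths, executor, max_in_flight=64):
    """Yield (path, digest) pairs with at most max_in_flight tasks pending.

    `executor` is any concurrent.futures.Executor; results arrive in
    completion order, not submission order.
    """
    pending = set()
    for path in paths:
        pending.add(executor.submit(sha256_file, path))
        if len(pending) >= max_in_flight:
            # Backpressure: block until at least one task finishes.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                yield fut.result()
    # Drain whatever is still running.
    for fut in as_completed(pending):
        yield fut.result()
```

A usage example would be `with ThreadPoolExecutor(max_workers=8) as pool: results = dict(hash_many(paths, pool))`. Because `sha256_file` is a top-level function, a `ProcessPoolExecutor` could be passed instead to spread hashing across cores. Which pool performs better depends on whether the workload is bound by disk I/O or by CPU, and is worth measuring rather than assuming.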
