Adding Ability to Store and Query Downloaded Packages
Organization: AboutCode
Project: ScanCode.io
Mentors: Philippe Ombredanne, Ayan Sinha Mahapatra
Overview
ScanCode.io currently stores scanned packages on disk without a centralized index. As a result, the same package can be stored more than once, stored data is tied to a single project, and data can be lost when inputs are deleted. This project adds structured package storage and querying to ScanCode.io so that packages can be indexed, reused across projects, and preserved reliably.
Implementation
The project addresses these limitations of ScanCode.io’s unstructured package storage through the following key components and steps:
Storage System Development:
Created a DownloadStore abstract base class in archiving.py to define the interface for managing package content and metadata storage.
Built the LocalFilesystemProvider class to store downloads on the local filesystem, using a SHA256-based nested directory structure.
Implemented methods for storing (put), retrieving (get), listing (list), and searching (find) downloads, with metadata saved in origin-<hash>.json files.
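The storage design described above can be illustrated with a minimal sketch. It is not the code from archiving.py: the class and method names follow the description, but the method signatures, the Download record, the two-level directory split, the "content" filename, and the choice to hash the URL and filename for the origin file are all assumptions made for illustration.

```python
import hashlib
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Download:
    """One stored download plus its origin metadata (illustrative fields)."""
    sha256: str
    download_url: str
    download_date: str
    filename: str
    path: str


class DownloadStore(ABC):
    """Interface sketch; the signatures here are assumptions."""

    @abstractmethod
    def put(self, content, download_url, download_date, filename):
        """Store content and its origin metadata; return a Download."""

    @abstractmethod
    def get(self, sha256):
        """Return a Download for this checksum, or None if absent."""

    @abstractmethod
    def list(self):
        """Return every stored Download."""

    @abstractmethod
    def find(self, **filters):
        """Return Downloads whose fields equal all given filters."""


class LocalFilesystemProvider(DownloadStore):
    def __init__(self, root_path):
        self.root_path = Path(root_path)

    def _content_dir(self, sha256):
        # Assumed layout: <root>/<first 2 hex>/<next 2 hex>/<remaining hex>/
        return self.root_path / sha256[:2] / sha256[2:4] / sha256[4:]

    def _load(self, origin_path):
        # Rebuild the checksum from the three nested directory names.
        content_dir = origin_path.parent
        sha256 = (
            content_dir.parent.parent.name
            + content_dir.parent.name
            + content_dir.name
        )
        origin = json.loads(origin_path.read_text())
        return Download(sha256=sha256, path=str(content_dir / "content"), **origin)

    def put(self, content, download_url, download_date, filename):
        sha256 = hashlib.sha256(content).hexdigest()
        content_dir = self._content_dir(sha256)
        content_dir.mkdir(parents=True, exist_ok=True)
        content_path = content_dir / "content"
        if not content_path.exists():  # identical content is written only once
            content_path.write_bytes(content)
        # Assumption: an origin is identified by its URL and filename.
        origin_key = f"{download_url}|{filename}".encode()
        origin_hash = hashlib.sha256(origin_key).hexdigest()
        origin_path = content_dir / f"origin-{origin_hash}.json"
        if origin_path.exists():
            raise ValueError(f"Origin already recorded for {download_url}")
        origin = {
            "download_url": download_url,
            "download_date": download_date,
            "filename": filename,
        }
        origin_path.write_text(json.dumps(origin, indent=2))
        return self._load(origin_path)

    def get(self, sha256):
        content_dir = self._content_dir(sha256)
        if not content_dir.is_dir():
            return None
        origins = sorted(content_dir.glob("origin-*.json"))
        return self._load(origins[0]) if origins else None

    def list(self):
        pattern = "*/*/*/origin-*.json"
        return [self._load(p) for p in sorted(self.root_path.glob(pattern))]

    def find(self, **filters):
        return [
            d for d in self.list()
            if all(getattr(d, key) == value for key, value in filters.items())
        ]
```

Because the directory is derived from the content checksum, two downloads with identical bytes share one content file, while each distinct origin gets its own origin-<hash>.json record next to it. In this sketch, storing the same origin twice raises a ValueError.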
Integration with ScanCode.io:
Updated pipelines/__init__.py to incorporate the archiving system into ScanCode.io’s pipeline workflow, ensuring downloaded packages are stored during execution.
Revised input.py to process package download inputs, passing content, download_url, download_date, and filename to the archiving system.
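As a rough illustration of the hand-off described above, the sketch below reads a downloaded file and passes content, download_url, download_date, and filename to a store. The function name archive_downloaded_input and the InMemoryStore stand-in are hypothetical and exist only to make the example runnable; the actual code in input.py and the pipeline may be organized differently.

```python
from datetime import datetime, timezone
from pathlib import Path


class InMemoryStore:
    """Hypothetical stand-in for the archiving store, for illustration only."""

    def __init__(self):
        self.records = []

    def put(self, content, download_url, download_date, filename):
        record = {
            "size": len(content),
            "download_url": download_url,
            "download_date": download_date,
            "filename": filename,
        }
        self.records.append(record)
        return record


def archive_downloaded_input(store, file_path, download_url):
    """Pass a downloaded file and its origin details to the store."""
    file_path = Path(file_path)
    return store.put(
        content=file_path.read_bytes(),
        download_url=download_url,
        # Assumption: the download date is recorded as an ISO 8601 UTC timestamp.
        download_date=datetime.now(timezone.utc).isoformat(),
        filename=file_path.name,
    )
```

Keeping the hand-off in one small function means a pipeline step only needs the path and URL of each fetched input, and the store decides where and how the bytes are kept.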
User Interface Enhancements:
Modified the project resource view to display stored package information, including download URLs and dates.
Validation and Testing:
Wrote unit tests in test_archiving.py to verify LocalFilesystemProvider functionality (put, get, list, find), testing normal cases, edge cases (e.g., empty files), and errors (e.g., duplicate origins).
Linked Pull Requests
| Sr. No | Name | Link |
|---|---|---|
| 1 | Add download archiving system | |
| 2 | Support local package storage | |
Pre-GSoC Work
Here are some PRs submitted before GSoC:
Links
Project Idea: GSoC 2025 Idea
GSoC Project Page: GSoC 2025
Proposal: Project Proposal
Future Work
Future enhancements include implementing a web UI for the LocalFilesystemProvider so that packages can be uploaded, searched, listed, and retrieved in ScanCode.io. This work involves Django views, templates, and URL routes, backed by comprehensive testing. In addition, an external cloud storage option (e.g., AWS S3) built on the DownloadStore interface alongside the local filesystem would provide scalable, remote storage.
Closing Note
During GSoC 2025, my mentors and I held weekly meetings to discuss progress, challenges, and next steps. I am deeply grateful to my mentors for their guidance and support, which greatly enriched my learning experience.