Approximately similar file detection

Project: Approximately similar file detection in DeltaCode

Arnav Mandal <arnav.mandal1234@gmail.com>

Project Overview

DeltaCode is a tool for comparing two scans and reporting the differences between them. It takes as input JSON files produced by ScanCode Toolkit. At present it compares files only exactly, that is, by comparing their hash values. DeltaCode outputs a JSON or CSV file with details of the comparison, such as the delta score and delta count. The goal of this project is to make the deltas more useful by distinguishing files that are mostly the same (quasi or near duplicates) from files that are completely different. With this project, DeltaCode can detect approximately similar files in a directory.
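To illustrate why exact comparison alone is limiting, here is a minimal sketch. It assumes SHA-1 as the hash (the text above only says "hash value"), and the helper names `sha1_hex` and `exactly_equal` are hypothetical, not DeltaCode functions. A one-comment change is enough to make two otherwise identical files look completely different.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


def exactly_equal(old: bytes, new: bytes) -> bool:
    """Exact comparison: two contents match only if their hashes match."""
    return sha1_hex(old) == sha1_hex(new)


old = b"def add(a, b):\n    return a + b\n"
new = b"def add(a, b):\n    return a + b  # sum\n"

print(exactly_equal(old, old))  # True
print(exactly_equal(old, new))  # False, although only a comment was added
```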

Requirements of the project

  • Given two files scanned with ScanCode Toolkit, the new near-duplicate detection should return the distance between them.

  • The code should integrate seamlessly with ScanCode Toolkit and be highly configurable by the maintainers.

  • The strictness of near-duplicate matching should be adjustable through a threshold variable.
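The distance and threshold requirements above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes fingerprints are equal-length hexadecimal strings and that the distance is the Hamming distance (number of differing bits). The names `hamming_distance` and `is_near_duplicate` and the default threshold of 8 are hypothetical.

```python
def hamming_distance(fp_a: str, fp_b: str) -> int:
    """Count the differing bits between two equal-length hex fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # XOR leaves a 1 bit wherever the two fingerprints disagree.
    return bin(int(fp_a, 16) ^ int(fp_b, 16)).count("1")


def is_near_duplicate(fp_a: str, fp_b: str, threshold: int = 8) -> bool:
    """Treat two files as near duplicates if their distance is within the threshold.

    The default threshold is an assumed value chosen for illustration.
    """
    return hamming_distance(fp_a, fp_b) <= threshold


print(hamming_distance("00ff", "00fe"))    # 1
print(hamming_distance("00ff", "ff00"))    # 16
print(is_near_duplicate("00ff", "00fe"))   # True
print(is_near_duplicate("00ff", "ff00"))   # False
```

A lower threshold makes matching stricter; a threshold of 0 reduces it to exact fingerprint equality.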

The Project

  • Addition of a new fingerprint plugin to ScanCode Toolkit.

  • Implementation and integration of the fingerprint generation algorithm in the ScanCode Toolkit codebase.

  • Implementation of an algorithm in the DeltaCode codebase that computes the distance between files and processes the results further.

  • Integration of the fingerprint field in the JSON file, so that deltas can be compared and assigned appropriate scores.

  • Updates to existing unit tests and addition of new unit tests in both ScanCode Toolkit and DeltaCode.
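The fingerprint generation step listed above can be sketched like this. The list does not name the algorithm, so this example assumes a simhash-style scheme over whitespace-separated tokens with a 64-bit fingerprint. The function name `simhash_fingerprint`, the tokenization, and the use of MD5 as the per-token hash are all assumptions made for illustration and may differ from what the plugin actually does.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def simhash_fingerprint(text: str) -> str:
    """Return a 64-bit simhash-style fingerprint of the text as a hex string."""
    counts = [0] * BITS
    for token in text.split():
        # Hash each token and keep the first 8 bytes (64 bits) of the digest.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:BITS // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return format(fingerprint, "016x")


a = simhash_fingerprint("the quick brown fox jumps over the lazy dog")
b = simhash_fingerprint("the quick brown fox jumped over the lazy dog")
print(a)
print(b)
```

Because every token contributes to every bit, texts that share most of their tokens tend to produce fingerprints that differ in only a few bits, which is what makes a bitwise distance between fingerprints meaningful.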

I have completed all the tasks that were in the scope of this GSoC project.

Pull Requests