Google Summer of Code 2017
Welcome to AboutCode! This year AboutCode is a mentoring Organization for the Google Summer of Code 2017 edition.
AboutCode is a project to uncover data … about software code:
where does it come from?
what is its license? copyright?
is it secure, maintained, well coded?
All these are questions that are important to find answers to when there are million of free and open source software components available on the web.
Where software comes from and what is its license should be a problem of the past, such that everyone can safely consume more free and open source software. Come and join us to make it so!
Our tools are used to help detect and report the origin and license of source code, packages and binaries, as well as discover software and package dependencies, track vulnerabilities, bugs and other important software component attributes.
Contact
Subscribe to the mailing list at https://lists.sourceforge.net/lists/listinfo/aboutcode-discuss and introduce yourself and start the discussion! The mailing list is usually the better option to avoid timezone gaps.
The list archive have also plenty of interesting information. Someone may have asked your question before. Search and browse the archives at https://sourceforge.net/p/aboutcode/mailman/aboutcode-discuss/ !
For short chats, you can also join the #aboutcode IRC channel on Freenode or the Gitter channel at https://gitter.im/aboutcode-org/discuss
For personal issues, you can contact the org admin directly: @pombredanne and pombredanne@gmail.com
Please ask questions the smart way: http://www.catb.org/~esr/faqs/smart-questions.html
Technology
Discovering the origin of code is a vast topic. We primarily use Python for this and some C/C++ and JavaScript, but we are open to using any other language within reason.
Our domain includes text analysis and processing (for instance for copyrights and licenses), parsing (for package manifest formats), binary analysis (to detect the origin and license of binaries, which source code they come from, etc) as well as web based tools and APIs (to expose the tools and libraries as web services).
About your project application
We expect your application to be in the range of 1000 words. Anything less than that will probably not contain enough information for us to determine whether you are the right person for the job. Your proposal should contain at least the following information, plus anything you think is relevant:
Your name
Title of your proposal
Abstract of your proposal
Detailed description of your idea including explanation on why is it innovative and what it will contribute
hint: explain your data structures and the main processing flows in details.
Description of previous work, existing solutions (links to prototypes, bibliography are more than welcome)
Mention the details of your academic studies, any previous work, internships
Relevant skills that will help you to achieve the goal (programming languages, frameworks)?
Any previous open-source projects (or even previous GSoC) you have contributed to and links.
Do you plan to have any other commitments during GSoC that may affect your work? Any vacations/holidays? Will you be available full time to work on your project? (Hint: do not bother applying if this is not a serious full time commitment)
Subscribe to the mailing list at https://lists.sourceforge.net/lists/listinfo/aboutcode-discuss or join the #aboutcode IRC channel on Freenode and introduce yourself and start the discussion!
You need to understand something about open source licensing or package managers or code and binaries static analysis. The best way to demonstrate your capability would be to submit a small patch ahead of the project selection for an existing issue or a new issue.
Project ideas
ScanCode live scan server :
This project is to use ScanCode as a library in a web and REST API application that allows you to scan code on demand by entering a URL and then store the scan results. It could also be made available as a Travis or Github integration to scan on commit with webhooks. Bonus feature is to scan based on a received tweet of similar IRC or IM integration.
URLS :
Mentors :
@steven-esser https://github.com/steven-esser
@tdruez https://github.com/tdruez
Package security vulnerability data feed (and scanner) :
The end goal for this project is to build on existing projects to match packages identified by ScanCode to existing vulnerability alerts. This is not trivial as there are several gaps in the CVE data and how they relate to packages as they are detected by ScanCode or else. This is a green field project.
The key points to tackle are:
create the tools to build a free and open source structured and curate security feed
the aggregation of packages vulnerabilities feeds in a common and structured model (CVE, distro trackers, etc),
the aggregation of additional security data (CWE, CPE, and more) in that model,
the correlation of the different data items, creating accurate relationships and matching of actual package identifiers to vulnerabilities,
an architecture for community curation of vulnerabilities, correlation and enhancement of the data.
as a side bonus, build the tools in ScanCode to match detected packages to this feed. Note there is no FOSS tool and DB that does all of this today (only proprietary solutions such as vfeed or vulndb).
Some Related URLS for other projects in the same realm :
and many more including Linux distro feeds
Mentors :
@steven-esser https://github.com/steven-esser
@JonoYang https://github.com/JonoYang
@pombredanne https://github.com/pombredanne
Port the Python license expression library to JScript and prepare and publish an NPM package:
Use automated code translation (for JS) for the port. Add license expression support to AboutCodeMgr with this library. As a bonus, create a web server app and API service to parse and normalize ScanCode and SPDX license expressions either in Python or JavaScript.
URLS :
Mentors :
@JonoYang https://github.com/JonoYang
@steven-esser https://github.com/steven-esser
MatchCode :
Create a system for matching code using checksums and fingerprints against a repository of checksums and fingerprints. Create a basic repository for storing these fingerprints and expose a basic API. Create a client that can collect fingerprints on code and get matches using API calls to this repository or package manager APIs (Maven, Pypi, etc), or search engines APIs such as searchcode.com, debsources, or Github or Bitbucket commit hash searches/API or the SoftwareHeritage.org API.
URLS :
https://github.com/nexB/scancode-toolkit-contrib for samecode fingerprints drafts.
https://github.com/nexB/scancode-toolkit for commoncode hashes
Mentors :
@pombredanne https://github.com/pombredanne
ScanCode scan deduction :
The goal of this project is to take existing scan and match results and infer summaries and deduction at a higher level, such as the licensing of a whole directory tree.
URLS :
Mentors :
@pombredanne https://github.com/pombredanne
@JonoYang https://github.com/JonoYang
DeltaCode :
A new tool to help determine at a high level if the licensing for two codebases or versions of code has changed, and if so how. This is NOT a generic diff tool that identifies all codebase differences, rather it focuses on changes in licensing based on differences between ScanCode files.
Mentor :
@steven-esser https://github.com/steven-esser
License and copyright detection benchmark :
Compare ScanCode runtimes with Fossology, licensee, LicenseFinder, license-check, ninka, slic, LiD and others. This project is to create a comprehensive test suite and a benchmark for several FOSS open source license and copyright detection engines, establish mappings between the different conventions they use for license identification and evaluate and publish the results of detection accuracy and precision.
Note that this not about the speed of scanning: the performance and time taken is accessory and a nice to have result only. What matters is the accuracy of the license detection:
is the right license detected and how correct is this detection?
when a license is detected is the correct exact text matched and returned?
So what is needed is a (large) test set of files.
Then establishing a ground truth for reference e.g. detecting then reviewing manually possibly with scancode to set up the baseline that will be used to compare all the scanners.
Then run the other tools and scancode to see how well they perform and of course establish a mapping of license identifiers: each tool may use different license ids so we need to map these to the ids used in the test baseline (e.g. the scancode license keys): all this has to be built, possibly reusing some or all of the scancode tests and lacing in all the tests from the other tools and adding more ass needed.
Mentors :
@mjherzog https://github.com/mjherzog
@pombredanne https://github.com/pombredanne
Improved copyright parsing in ScanCode :
by keeping track of line numbers and offsets where copyrights are found. This would likely require either replacing or enhancing NLTK which is used as a natural language parser to add support for tracking where a copyright has been detected in a scanned text.
URLS :
Mentor :
@JonoYang https://github.com/JonoYang
Support full JSON and ABCD formats in AttributeCode
URLS :
Mentor :
@chinyeungli https://github.com/chinyeungli
Transparent archive extraction in ScanCode :
ScanCode archive extraction is currently done with a separate command line invocation. The goal of this project is to integrate archive extraction transparently into the ScanCode scan loop.
URLS :
Mentor :
@pombredanne https://github.com/pombredanne
Automated docker and VM images static package analysis :
to determine which packages are installed in Docker layers for RPMs, Debian or Alpine Linux. This is for the conan Docker image analysis tool.
URLS :
Mentor :
@pombredanne https://github.com/pombredanne
Plugin architecture for ScanCode :
Create ScanCode plugins for outputs to multiple formats (CSV, JSON, SPDX, Debian Copyright)
Static analysis of binaries for build tracing in TraceCode :
TraceCode does system call tracing. The goal of this project is to do the same using symbol, debug symbol or string matching to accomplish something similar,
URLS :
https://github.com/nexB/tracecode-build for the existing non-static tool
https://github.com/nexB/scancode-toolkit-contrib for the work in progress on binaries/symbols parsers/extractors
Mentor :
@pombredanne https://github.com/pombredanne
Better support tracing the lifecycle of file descriptors in TraceCode build :
TraceCode does system call tracing. The goal of this project is to improve the way we track open/close file descriptors in the trace to reconstruct the life of a file.
URLS :
Mentor :
@pombredanne https://github.com/pombredanne
Create Debian and RPM packages for ScanCode, AttributeCode and TraceCode.
Consider also including an AppImage.org package. If you think this may not fill in a full three months project, consider also adding some extras such as submitting the packages to Debian and Fedora.
AboutCode Manager test suite and Ci :
Create an extensive test suite for the Electron app and setup the CI to run unit, integration and smoke tests on Ci for Windows, Linux and Mac.
URLS :
Mentors :
@jdaguil https://github.com/jdaguil
@pombredanne https://github.com/pombredanne
DependentCode :
Create a tool for mostly universal package dependencies resolution.
URLS :
Mentors :
@pombredanne https://github.com/pombredanne
FetchCode :
Create a tool for mostly universal package and code download from VCS, web, ftp, etc.
Mentors :
@pombredanne https://github.com/pombredanne