Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Working version, needs improvement [WIP] #39

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

mariabg
Copy link

@mariabg mariabg commented Jul 17, 2018

Hi 👋I am currently working on this issue and I would like to discuss three questions I have in order to know which may be the best approach. Following the schema presented in the issue, I already coded the first step, along with the structure of the whole implementation. Please let me know if anything should be change from what I pushed so far.
Question 1: is there a simpler way than the following one to implement the parsing of the comments in the header? I thought about mapping comment syntaxes for languages (language as key, next to regex for different type of comments, both line and block). This seams to me a bit convoluted and you may have a more straight forward implementation in mind. Also, this selects every comment in the code, not just the header ones.
Question 2 is about the heuristics to split the license information from the descriptive comments. Should this be done through a fuzzy search of license keywords like “this file”, “contributor”, “you may”, “you may not” within every line? Were you thinking about more complex heuristics, and if so, which ones and where can I read about them?
Question 3: once these two separations are done, should I create a querying function for the comments or can globalLicenseDatabase().QueryLicenseText() be used for this purpose?

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
candidates := [][]byte{}
langs := []string{}
for _, file := range files {
lang, safe := enry.GetLanguageByExtension(file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use GetLanguage which is more accurate.

@vmarkovtsev
Copy link
Collaborator

vmarkovtsev commented Jul 17, 2018

Hi @mariabg

So cool that you've decided to help!

Regarding (1), your approach makes much sense. We need to define the mapping between the languages and the functions which extract comments within the first 1024 bytes (supposing that we cover 99% of cases by this, the other 1% is too crazy to support). They can be regexps, can be something else. I guess regexps.

Regarding (2), yeah this is not obvious. We need to take 50-100 examples and see how different heuristics work on them, then pick the best. I used some simple scans for e.g. "---" or "\n\n\n" to split licenses inside a file, some of those rules may work here too. The good news is that there is almost always the single license in a source file.

(3) is better to create a separate function, say QuerySourceFile(). We can further extend the command-line tool to optionally output the full report for all the files. Bonus points for generating an SPDX document from it.

Finally, there should be the separate database for license headers extracted from SPDX github.

Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Copy link
Author

@mariabg mariabg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @vmarkovtsev, I've just pushed the first fully working version of my code, and I left a couple of comments on it. Also, could you explain this "there should be the separate database for license headers extracted from SPDX github" a bit further? I don't understand where/how I should implement this 😅.

Let me know your thoughts on it, cheers!

}
append(db.QueryLicenseText(string(text)))
if len(candidates) == 0 {
// TO DO: split license-comments from description-comments.
Copy link
Author

@mariabg mariabg Jul 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As all of the files in the project are scanned for licenses I though about doing this split here for performance's sake, what do you think? Should it be done before to improve % on the output? If the license is clear, like on this project, is is found without problem.

licensedb/internal/investigation.go Outdated Show resolved Hide resolved
@mariabg mariabg changed the title First steps + some questions on this issue [WIP] Working version, needs improvement [WIP] Jul 30, 2018
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Signed-off-by: María Benavente <mariabenaventeg@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants