Detect gibberish copyright #2402 #4610
Open
JonoYang wants to merge 7 commits into develop from 2402-detect-gibberish-copyright
70d2083 Explore use of nostril to check copyrights #2402 (JonoYang)
1e9b432 Check for nonsense line in is_candidate #2402 (JonoYang)
7e6317f Use 2-char markov chain gibberish detector #2402 (JonoYang)
21b7450 Handle paths with pathlib.Path #2402 (JonoYang)
f3fd656 Add basic test for gibberish detector #2402 (JonoYang)
8b10843 Mark several tests with expected failures #2402 (JonoYang)
84a4449 Add ABOUT file for gibberish.py #2402 (JonoYang)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| zxcvwerjasc | ||
| nmnjcviburili,<> | ||
| zxcvnadtruqe | ||
| ertrjiloifdfyyoiu | ||
| grty iuewdiivjh |
Large diffs are not rendered by default.
Binary file not shown.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,100 @@ | ||
| Copyright (c) All Rights Reserved. Hair Plus Trading Co., Inc. | ||
| South Baylo University Copyright (c) All Right Reserved. | ||
| Created by shazron on 11-06-15. Copyright 2011 . All rights reserved. | ||
| Copyright (c) All Rights Reserved 2014-2019 New Avenue Foundation. | ||
| 'Copyright 2017 AllThingsTalk' | ||
| Copyright (C) All Rights Are Reserved. Chungjungwon. Iotacoffee.Com 2011 | ||
| copyright(c) All rights reserved localism,Inc. | ||
| Crown Copyright C All rights reserved. | ||
| copyright(c) All rights reserved istyle Inc. | ||
| [assembly: AssemblyCopyright(""Copyright © 2013"")] | ||
| <span>Copyright (C) All Rights Reserved </span> <span>2007-2020版权所有: 镇江日报社 </span> | ||
| Copyright (c) - All Rights Reserved - PROAIM Medical. | ||
| Copyright (c), ALL Consulting, 2008 | ||
| Created by Samvel Khalatyan, May 28, 2013 Copyright 2013, All rights reserved | ||
| Iotacoffee.Com 2011 Copyright (C) All Rights Are Reserved. | ||
| Copyright (C) All Rights Reserved, Lei Connection Inc. | ||
| Copyright(c) All Saints Episcopal Church, Fort Worth, 2011, church based at 3290 Lackland Road,, Fort Worth, TX 76116 | ||
| * Created by claudio beatrice on 2/21/10. Copyright 2010. All rights reserved. | ||
| Copyright(c) All rights reserved by Minds, Japan Council for Quality Health Care. | ||
| Copyright (C) All Rights Reserved by Leh. www.leh.jp | ||
| Copyright (C) All rights Reserved by 株式会社 朝日住宅社 | ||
| /* For iOS video I/O | ||
| * by Eduard Feicho on 29/07/12 | ||
| * Copyright 2012. All rights reserved. | ||
| // Copyright (c) 2002-2010, Industrial Light & Magic, a division of Lucas | ||
| // Digital Ltd. LLC | ||
| // | ||
| // All rights reserved. | ||
| Copyright (c) 2006, Industrial Light & Magic, a division of Lucasfilm | ||
| Entertainment Company Ltd. Portions contributed and copyright held by | ||
| others as indicated. All rights reserved. | ||
| copyright__ = 'Copyright 2017 AllThingsTalk' | ||
| Copyright EAVISE | ||
| UCL are copyrighted software distributed | ||
| Foursquare © 2019 | ||
| Copyright (C) 2019, by Djilani CARDINEAU. | ||
| # Copyright michimani All rights reserved. | ||
| Copyright(c) All Rights Reserved by Chinese Service Center for Scholarly Exchange | ||
| Copyright(c) All right reserved SSC. Ltd. | ||
| Third party copyrights are property of their respective owners. | ||
| Copyright (c) All Rights Reserved by the District Export Council of Georgia. | ||
| //COPYRIGHT | ||
| // | ||
| //All contributions by the University of California: | ||
| //Copyright (c) 2014, The Regents of the University of California (Regents) | ||
| //All rights reserved. | ||
| // | ||
| //All other contributions: | ||
| //Copyright (c) 2014, the respective contributors | ||
| //All rights reserved. | ||
| // | ||
| //Caffe uses a shared copyright model: each contributor holds copyright over | ||
| //their contributions to Caffe. The project versioning records all such | ||
| //contribution and copyright details. If a contributor wants to further mark | ||
| //their specific copyright on a particular contribution, they should indicate | ||
| //their copyright solely in the commit message of the change when it is | ||
| //committed. | ||
| // | ||
| //LICENSE | ||
| Copyright (C) 2013 Opensim Ltd. | ||
| #COPYRIGHT | ||
| # | ||
| #All contributions by the University of California: | ||
| #Copyright (c) 2014, 2015, The Regents of the University of California (Regents) | ||
| #All rights reserved. | ||
| # | ||
| #All other contributions: | ||
| #Copyright (c) 2014, 2015, the respective contributors | ||
| #All rights reserved. | ||
| LICENSE: Copyright 2016, All Rights Reserved | ||
| (a)Download original face detection dataset -> (b)Convert annotation to the PASCAL VOC format -> (c)Create LMDB database with images + annotations for training | ||
| (c) Copyright CNRI, All Rights Reserved. NO WARRANTY. | ||
| Copyright (C), 2001-2011, Acme Tech. Co. Ltd. | ||
| * libtiff/{tif_dirinfo.c, tif_dir.h, tif_dir.c, tif_print.c}: Make | ||
| DocumentName, Artist, HostComputer, ImageDescription, Make, Model, | ||
| Copyright, DateTime, PageName, TextureFormat, TextureWrapModes and | ||
| TargetPrinter tags custom. | ||
| COPYRIGHT (C) All About, Inc. All Rights Reserved. | ||
| Copyright 2019, All Rights Reserved. # Author: Pine <[email protected]> | ||
| * For iOS video I/O | ||
| * by Eduard Feicho on 29/07/12 | ||
| * by Alexander Shishkov on 17/07/13 | ||
| * Copyright 2012. All rights reserved. | ||
| COPYRIGHT(C) ALL JAPAN PRO-WRESTLING Co., Ltd. | ||
| :copyright: Copyright (c) Joe Joyce and contributors, 2016-2019. | ||
| Copyright 2014 uh-sem-blee, Co. | ||
| Copyright (c) 2016 the Authors | ||
| // Copyright (C) 2013, OpenCV Foundation, all rights reserved. | ||
| // Third party copyrights are property of their respective owners. | ||
| * For iOS video I/O | ||
| * by Xiaochao Yang on 06/15/11 modified from | ||
| * cap_qtkit.mm for Nicholas Butko for Mac OS version. | ||
| * Copyright 2011. All rights reserved. | ||
| Copyright (c) All the Raige Dog Salon. All Rights Reserved. | ||
| [assembly: AssemblyCopyright(""Copyright © 2014"")] | ||
| <a href="http://www.enox.biz/">Copyright (C) All rights Reserved by 株式会社エノックス</a> | ||
| 2008 Nuance Communications | ||
| Copyright 2008 TJ | ||
| Scilab (c)INRIA-ENPC | ||
| Copyright (c) 2006, FUJITA Yuji |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| about_resource: gibberish.py | ||
| name: Gibberish-Detector | ||
| download_url: https://raw.githubusercontent.com/yapus/gibberish/01637fe1fda827529ca76b8d6fee2de9100719f1/gibberish/gibberish.py | ||
| homepage_url: https://github.com/yapus/gibberish/ | ||
| authors: | | ||
| Rob Renaud | ||
| Iakov Pustilnik | ||
| owner: Iakov Pustilnik | ||
| license_expression: mit | ||
| license_file: gibberish.LICENSE | ||
| copyright: Copyright (c) 2015 Rob Renaud | ||
| notes: gibberish.py is a reorganization of the code at | ||
| https://github.com/rrenaud/Gibberish-Detector into an object-oriented class. | ||
| gibberish.py was originally a Python script written by Rob Renaud as a solution | ||
| to a Stack Overflow question (https://stackoverflow.com/a/6298193). The original | ||
| repo at https://github.com/rrenaud/Gibberish-Detector has been cloned many times | ||
| and has been expanded upon by multiple authors. This instance of gibberish.py | ||
| was maintained by Iakov Pustilnik. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| The MIT License (MIT) | ||
| | ||
| Copyright (c) 2015 Rob Renaud | ||
| | ||
| Permission is hereby granted, free of charge, to any person obtaining a copy | ||
| of this software and associated documentation files (the "Software"), to deal | ||
| in the Software without restriction, including without limitation the rights | ||
| to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
| copies of the Software, and to permit persons to whom the Software is | ||
| furnished to do so, subject to the following conditions: | ||
| | ||
| The above copyright notice and this permission notice shall be included in | ||
| all copies or substantial portions of the Software. | ||
| | ||
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
| FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
| LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
| OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
| THE SOFTWARE. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| #!/usr/bin/python | ||
| # | ||
| # From: https://raw.githubusercontent.com/yapus/gibberish/01637fe1fda827529ca76b8d6fee2de9100719f1/gibberish/gibberish.py | ||
| # | ||
| # 12Jun2017 Petr Janata - added srcfile and outfile | ||
| # 17Jun2017 Petr Janata - expanded set of accepted characters to include digits and hyphen | ||
| # | ||
| # which is based on: | ||
| # https://raw.githubusercontent.com/rrenaud/Gibberish-Detector/aa1d4e4555362b3dada97ebe6ecc23a84fc470fe/gib_detect_train.py | ||
| # | ||
| | ||
| import math | ||
| import pickle | ||
| from pathlib import Path | ||
| | ||
| data_dir = Path(__file__).parent / 'data' / 'gibberish' | ||
| model_path = data_dir / 'gib_model.pki' | ||
| big_file_path = data_dir / 'big.txt' | ||
| good_file_path = data_dir / 'good.txt' | ||
| bad_file_path = data_dir / 'bad.txt' | ||
| | ||
| accepted_chars = 'abcdefghijklmnopqrstuvwxyz0123456789- ' | ||
| pos = dict([(char, idx) for idx, char in enumerate(accepted_chars)]) | ||
| | ||
| | ||
| class Gibberish(object): | ||
| def __init__(self): | ||
| if model_path.exists(): | ||
| self.load_persisted_model() | ||
| else: | ||
| self.train() | ||
| | ||
| def persist_model(self): | ||
| with open(model_path, 'wb') as f: | ||
| pickle.dump(vars(self), f) | ||
| | ||
| def load_persisted_model(self): | ||
| with open(model_path, 'rb') as f: | ||
| persisted_model = pickle.load(f) | ||
| | ||
| for key, value in persisted_model.items(): | ||
| setattr(self, key, value) | ||
| | ||
| def normalize(self, line): | ||
| """ Return only the subset of chars from accepted_chars. | ||
| This helps keep the model relatively small by ignoring punctuation, | ||
| infrequent symbols, etc. """ | ||
| return [c.lower() for c in line if c.lower() in accepted_chars] | ||
| | ||
| def ngram(self, n, l): | ||
| """ Return all n grams from l after normalizing """ | ||
| filtered = self.normalize(l) | ||
| for start in range(0, len(filtered) - n + 1): | ||
| yield ''.join(filtered[start:start + n]) | ||
| | ||
| def avg_transition_prob(self, l, log_prob_mat): | ||
| """ Return the average transition prob from l through log_prob_mat. """ | ||
| log_prob = 0.0 | ||
| transition_ct = 0 | ||
| for a, b in self.ngram(2, l): | ||
| log_prob += log_prob_mat[pos[a]][pos[b]] | ||
| transition_ct += 1 | ||
| # The exponentiation translates from log probs to probs. | ||
| return math.exp(log_prob / (transition_ct or 1)) | ||
| | ||
| def train(self, bigfile=big_file_path, goodfile=good_file_path, | ||
| badfile=bad_file_path): | ||
| """ Write a simple model as a pickle file """ | ||
| k = len(accepted_chars) | ||
| # Assume we have seen 10 of each character pair. This acts as a kind of | ||
| # prior or smoothing factor. This way, if we see a character transition | ||
| # live that we've never observed in the past, we won't assume the entire | ||
| # string has 0 probability. | ||
| counts = [[10 for i in range(k)] for i in range(k)] | ||
| | ||
| # Count transitions from big text file, taken | ||
| # from http://norvig.com/spell-correct.html | ||
| for line in open(bigfile): | ||
| for a, b in self.ngram(2, line): | ||
| counts[pos[a]][pos[b]] += 1 | ||
| | ||
| # Normalize the counts so that they become log probabilities. | ||
| # We use log probabilities rather than straight probabilities to avoid | ||
| # numeric underflow issues with long texts. | ||
| # This contains a justification: | ||
| # http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/ | ||
| for i, row in enumerate(counts): | ||
| s = float(sum(row)) | ||
| for j in range(len(row)): | ||
| row[j] = math.log(row[j] / s) | ||
| | ||
| # Find the probability of generating a few arbitrarily chosen good and | ||
| # bad phrases. | ||
| good_probs = [self.avg_transition_prob(l, counts) for l in open(goodfile)] | ||
| bad_probs = [self.avg_transition_prob(l, counts) for l in open(badfile)] | ||
| | ||
| # Assert that we actually are capable of detecting the junk. | ||
| assert min(good_probs) > max(bad_probs) | ||
| | ||
| # And pick a threshold halfway between the worst good and best bad inputs. | ||
| thresh = (min(good_probs) + max(bad_probs)) / 2 | ||
| self.mat = counts | ||
| self.thresh = thresh | ||
| self.persist_model() | ||
| | ||
| def detect_gibberish(self, text): | ||
| text = ''.join(self.normalize(text)) | ||
| return self.avg_transition_prob(text, self.mat) < self.thresh | ||
| | ||
| def percent_gibberish(self, text): | ||
| text = ''.join(self.normalize(text)) | ||
| text = text.strip() | ||
| words = text.split(' ') | ||
| if len(words) == 0: | ||
| return 0 | ||
| | ||
| gibberish_count = 0 | ||
| for word in words: | ||
| if self.detect_gibberish(word): | ||
| gibberish_count += 1 | ||
| | ||
| return float(gibberish_count) / float(len(words)) | ||
| | ||
| def gibberish_pct(self, text): | ||
| text = ''.join(self.normalize(text)) | ||
| return self.avg_transition_prob(text, self.mat) | ||
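
For reviewers who want to see the idea behind `gibberish.py` in isolation, here is a minimal, self-contained sketch of a 2-character Markov chain scorer. It reuses the accepted character set from the diff, but the toy corpus, the `prior=1` default, and the helper names `bigrams`, `train`, and `is_gibberish` are assumptions made for illustration. The vendored module trains on `big.txt`, uses a prior of 10, and persists its model with pickle.

```python
import math

# Same alphabet as the vendored module; every other character is dropped.
ACCEPTED_CHARS = 'abcdefghijklmnopqrstuvwxyz0123456789- '
POS = {char: idx for idx, char in enumerate(ACCEPTED_CHARS)}


def bigrams(text):
    """Return consecutive character pairs after lowercasing and filtering."""
    chars = [c for c in text.lower() if c in POS]
    return zip(chars, chars[1:])


def train(corpus, prior=1):
    """Return a matrix of log transition probabilities with additive smoothing."""
    k = len(ACCEPTED_CHARS)
    counts = [[prior] * k for _ in range(k)]
    for a, b in bigrams(corpus):
        counts[POS[a]][POS[b]] += 1
    return [[math.log(c / sum(row)) for c in row] for row in counts]


def avg_transition_prob(text, log_probs):
    """Geometric mean of the transition probabilities along the text."""
    log_prob, transitions = 0.0, 0
    for a, b in bigrams(text):
        log_prob += log_probs[POS[a]][POS[b]]
        transitions += 1
    return math.exp(log_prob / (transitions or 1))


# Toy corpus: an assumption for this sketch, far smaller than big.txt.
CORPUS = ('copyright all rights reserved by the respective contributors and '
          'the regents of the university of california ') * 20

model = train(CORPUS)
good = avg_transition_prob('All rights reserved', model)
bad = avg_transition_prob('zxcvwerjasc', model)
# Midpoint between one known-good and one known-bad sample.
threshold = (good + bad) / 2


def is_gibberish(text):
    """True when the text scores below the calibrated threshold."""
    return avg_transition_prob(text, model) < threshold
```

With a corpus this small the threshold is only meaningful for strings that resemble the calibration samples. A realistic threshold needs a large training text and several good and bad examples, which is what `train()` in the diff does.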
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -9,3 +9,6 @@ authors_summary: | ||
| count: 1 | ||
| - value: Nikos Mavrogiannopoulos <[email protected]> | ||
| count: 1 | ||
| expected_failures: | ||
| - authors | ||
| - authors_summary | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -17,3 +17,7 @@ authors: | ||
| - Juergen Seifert <[email protected]> | ||
| - Juergen Seifert <[email protected]> | ||
| - Juergen Seifert <[email protected]> | ||
| expected_failures: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - (c) (c) 2AICAA3SSY | ||
| holders: | ||
| - 2AICAA3SSY | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - U1e (c) IjAx | ||
| holders: | ||
| - U1e IjAx | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - Xz eaaeuyATNRU (c) Ijr | ||
| holders: | ||
| - Xz eaaeuyATNRU Ijr | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - (c) cc.fr | ||
| holders: | ||
| - cc.fr | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - (c) Oo2 UOY | ||
| holders: | ||
| - Oo2 UOY | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - I. (c) Uao | ||
| holders: | ||
| - I. Uao | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - (c) UOSSOO-O (c) | ||
| holders: | ||
| - UOSSOO-O | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -2,7 +2,3 @@ what: | ||
| - copyrights | ||
| - holders | ||
| - authors | ||
| copyrights: | ||
| - (c) Cj d Dj | ||
| holders: | ||
| - Cj d Dj | ||
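
The hunks above drop expected detections such as `(c) Oo2 UOY` and `U1e (c) IjAx` from the test data. As a rough illustration of how a word-level check could reject such statements, the sketch below strips copyright markers and measures the fraction of remaining words that a predicate flags. All names here (`holder_words`, `gibberish_fraction`, `filter_copyrights`, `max_fraction`) are hypothetical and are not scancode-toolkit APIs. The predicate is passed in so that any detector, for example `Gibberish().detect_gibberish`, could be plugged in.

```python
import re

# Hypothetical helpers for illustration only; not part of this PR.
_MARKER = re.compile(r'\(c\)|©|copyright', re.IGNORECASE)


def holder_words(statement):
    """Strip copyright markers and split what is left into words."""
    return _MARKER.sub(' ', statement).split()


def gibberish_fraction(statement, is_gibberish_word):
    """Fraction of non-marker words that the predicate flags as gibberish."""
    words = holder_words(statement)
    if not words:
        return 0.0
    flagged = sum(1 for word in words if is_gibberish_word(word))
    return flagged / len(words)


def filter_copyrights(statements, is_gibberish_word, max_fraction=0.5):
    """Keep statements whose gibberish fraction does not exceed max_fraction."""
    return [
        s for s in statements
        if gibberish_fraction(s, is_gibberish_word) <= max_fraction
    ]


# Stand-in predicate: a fixed set of junk tokens, NOT a real detector.
JUNK = {'oo2', 'uoy', 'u1e', 'ijax'}


def stub_detector(word):
    return word.lower() in JUNK


kept = filter_copyrights(
    ['(c) Oo2 UOY', 'Copyright (c) 2006, FUJITA Yuji'],
    stub_detector,
)
```

The 0.5 cutoff is arbitrary. A real integration would need tuning against the copyright test suite, since short holder names and acronyms are easy to misclassify.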
I think this is fine provenance research!
The original from @rrenaud at https://github.com/rrenaud/Gibberish-Detector references an SO answer.
The SO author is the same person as the GH author: https://stackoverflow.com/users/286449/rob-neuhaus
So this settles the original license as MIT, per @rrenaud's choice.
Then we have this chain of forks and derivations to document:
It would be nice, and the right thing to do, to keep the credits to each author for this chain of forks and refinements. I guess we could either:
(The license has stayed MIT all the way, so this is about credits, not the license itself.)
Happy to help with this, if guided as to what to do.