Message ID | 20240821203154.175838-1-mmichels@redhat.com |
---|---|
State | Deferred |
Headers | show |
Series | [ovs-dev] checkpatch: Add check for non-inclusive language. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | warning | apply and check: warning |
ovsrobot/github-robot-_Build_and_Test | fail | github build: failed |
ovsrobot/github-robot-_ovn-kubernetes | fail | github build: failed |
On 8/21/24 22:31, Mark Michelson wrote: > This takes the approach of hard-coding the words in a text file, rather > than trying to use a dynamic list of words. The inclusive naming project > provides a way to download their word list as JSON, but this is avoided > here for a few reasons: > > * Only the base form of words are provided. We would need to analyze the > part of speech and try to generate the other forms of the words. > * Some entries are more "example" than anything. For example, they > provide "blackhat-whitehat" as a single entry, even though you're much > more likely to come across the individual words, rather than this > specific hyphenated variety. Similarly, "whitelist" is provided in the > word list but "blacklist" is not. > > If it turns out that the word list updates frequently, then it may be > worth moving to a more dynamic approach, at the expense of accuracy. For > now, this seems like a nice approach. > > We do not consider this an error in checkpatch, but a warning. There are > some cases where non-inclusive words are acceptable, (such as > ovs_abort(), or referring to the "master" branch of a third-party repo). > This is why the warning only suggests to consider an alternative. Hi, Mark. I'm not sure this a right thing to do. The point of avoiding non-inclusive language is to not have instances of it in the repo. But this change explicitly adds all the non-inclusive words to the repo and that IMO contradicts the goal. My 2c. Best regards, Ilya Maximets.
On 8/26/24 08:43, Ilya Maximets wrote: > On 8/21/24 22:31, Mark Michelson wrote: >> This takes the approach of hard-coding the words in a text file, rather >> than trying to use a dynamic list of words. The inclusive naming project >> provides a way to download their word list as JSON, but this is avoided >> here for a few reasons: >> >> * Only the base form of words are provided. We would need to analyze the >> part of speech and try to generate the other forms of the words. >> * Some entries are more "example" than anything. For example, they >> provide "blackhat-whitehat" as a single entry, even though you're much >> more likely to come across the individual words, rather than this >> specific hyphenated variety. Similarly, "whitelist" is provided in the >> word list but "blacklist" is not. >> >> If it turns out that the word list updates frequently, then it may be >> worth moving to a more dynamic approach, at the expense of accuracy. For >> now, this seems like a nice approach. >> >> We do not consider this an error in checkpatch, but a warning. There are >> some cases where non-inclusive words are acceptable, (such as >> ovs_abort(), or referring to the "master" branch of a third-party repo). >> This is why the warning only suggests to consider an alternative. > > Hi, Mark. I'm not sure this a right thing to do. The point of avoiding > non-inclusive language is to not have instances of it in the repo. But > this change explicitly adds all the non-inclusive words to the repo and > that IMO contradicts the goal. > > My 2c. > > Best regards, Ilya Maximets. > Your point is 100% valid. As a follow-up, do you have any ideas for how we can add automated checks for non-inclusive language without listing offending words in our repo? I mentioned in the commit message the shortcomings of attempting a dynamic approach, but maybe there's some middle ground between fully-dynamic and a static list of words that I'm not thinking of. Thanks, Mark Michelson
Mark Michelson <mmichels@redhat.com> writes: > On 8/26/24 08:43, Ilya Maximets wrote: >> On 8/21/24 22:31, Mark Michelson wrote: >>> This takes the approach of hard-coding the words in a text file, rather >>> than trying to use a dynamic list of words. The inclusive naming project >>> provides a way to download their word list as JSON, but this is avoided >>> here for a few reasons: >>> >>> * Only the base form of words are provided. We would need to analyze the >>> part of speech and try to generate the other forms of the words. >>> * Some entries are more "example" than anything. For example, they >>> provide "blackhat-whitehat" as a single entry, even though you're much >>> more likely to come across the individual words, rather than this >>> specific hyphenated variety. Similarly, "whitelist" is provided in the >>> word list but "blacklist" is not. >>> >>> If it turns out that the word list updates frequently, then it may be >>> worth moving to a more dynamic approach, at the expense of accuracy. For >>> now, this seems like a nice approach. >>> >>> We do not consider this an error in checkpatch, but a warning. There are >>> some cases where non-inclusive words are acceptable, (such as >>> ovs_abort(), or referring to the "master" branch of a third-party repo). >>> This is why the warning only suggests to consider an alternative. >> Hi, Mark. I'm not sure this a right thing to do. The point of >> avoiding >> non-inclusive language is to not have instances of it in the repo. But >> this change explicitly adds all the non-inclusive words to the repo and >> that IMO contradicts the goal. >> My 2c. >> Best regards, Ilya Maximets. >> > > Your point is 100% valid. As a follow-up, do you have any ideas for > how we can add automated checks for non-inclusive language without > listing offending words in our repo? I mentioned in the commit message > the shortcomings of attempting a dynamic approach, but maybe there's > some middle ground between fully-dynamic and a static list of words > that I'm not thinking of. Unfortunately, I haven't found any prior art for this. What I mean is, I don't see any projects that have such an inclusive language check being done automatically. There have been some attempts but it doesn't seem that anything got adopted anywhere. Maybe it would be better if we could ask the inclusive language place to publish a list that is suitable for programs like codespell, checkpatch.py, enchant, etc. That might make adoption easier for us since we could pull the list dynamically and not have such language be a part of a repository. After all, if there are a list of words we wish to avoid, we need to know the list. And if we're trying to avoid keeping them in the repository, I think it stands to reason they have to exist somewhere. And having an automated means of pulling and evaluating them makes a lot of sense. An alternative might be to include the option to farm the patch text to an AI model and ask whether it seems to avoid non-inclusive terminology. That is a dynamic approach, but may flag words which aren't part of the inclusive naming list (whether that would be a positive change would need to be evaluated). > Thanks, > Mark Michelson > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff --git a/tests/checkpatch.at b/tests/checkpatch.at index 6ac0e51f3..79c02b229 100755 --- a/tests/checkpatch.at +++ b/tests/checkpatch.at @@ -642,5 +642,43 @@ try_checkpatch \ #8 FILE: tests/something.at:1: C_H_E_C_K([as gw1 ovs-ofctl dump-flows br-int table=42 | grep "dl_dst=00:00:02:01:02:04" | wc -l], [0], [[1]]) " +AT_CLEANUP + +AT_SETUP([checkpatch - non-inclusive words]) +# This test does not extensively test every single word in the list of +# non-inclusive words. + +# Try a simple exact match with a single word +try_checkpatch \ + "COMMON_PATCH_HEADER + +/* Let's be sure this change doesn't cripple our performance */ + " \ + "WARNING: Non-inclusive word 'cripple' found. Consider replacing. + #8 FILE: A.c:1: + /* Let's be sure this change doesn't cripple our performance */ +" +# Put more than one word on the line, and use different forms of the word. +try_checkpatch \ + "COMMON_PATCH_HEADER + +/* The grandfathers are hallucinating again */ + " \ + "WARNING: Non-inclusive word 'grandfathers' found. Consider replacing. + WARNING: Non-inclusive word 'hallucinating' found. Consider replacing. + #8 FILE: A.c:1: + /* The grandfathers are hallucinating again */ +" + +# And finally, make sure punctuation, etc. don't interfere. +try_checkpatch \ + "COMMON_PATCH_HEADER + +/* Set up master/slave tribe, but don't abort! */ + " \ + "WARNING: Non-inclusive word 'abort' found. Consider replacing. + WARNING: Non-inclusive word 'master' found. Consider replacing. + WARNING: Non-inclusive word 'slave' found. Consider replacing. + WARNING: Non-inclusive word 'tribe' found. Consider replacing. + #8 FILE: A.c:1: + /* Set up master/slave tribe, but don't abort! */ +" AT_CLEANUP diff --git a/utilities/checkpatch.py b/utilities/checkpatch.py index 35204daa2..9a06cf0a1 100755 --- a/utilities/checkpatch.py +++ b/utilities/checkpatch.py @@ -19,6 +19,8 @@ import getopt import os import re import sys +import functools +from pathlib import Path RETURN_CHECK_INITIAL_STATE = 0 RETURN_CHECK_STATE_WITH_RETURN = 1 @@ -582,6 +584,32 @@ def empty_return_with_brace(line): return False +@functools.cache +def load_excluded_words(): + parent_dir = Path(__file__).parent + with open(parent_dir / "excluded_word_list.txt", "r") as f: + return [line.strip() for line in f] + + +def contains_non_inclusive_words(line): + # This returns true if a word is found that falls afoul of our inclusive + # language guidelines. The list of words is sourced from the Tier 1, Tier 2, + # and Tier 3 word lists from https://inclusivenaming.org/word-lists/ . + + excluded_words = load_excluded_words() + + problem_found = False + for word in excluded_words: + match = re.search(rf'\b{word}\b', line, flags=re.IGNORECASE) + if match: + print_warning( + f"Non-inclusive word '{word}' found. Consider replacing." + ) + problem_found = True + + return problem_found + + file_checks = [ {'regex': __regex_added_doc_rst, 'check': check_new_docs_index}, @@ -668,6 +696,8 @@ checks = [ lambda: print_warning("Use of hardcoded table=<NUMBER> or" " resubmit=(,<NUMBER>) is discouraged in tests." " Consider using MACRO instead.")}, + {'regex': None, 'match_name': None, + 'check': lambda x: contains_non_inclusive_words(x)}, ] diff --git a/utilities/excluded_word_list.txt b/utilities/excluded_word_list.txt new file mode 100644 index 000000000..7a2ce4e09 --- /dev/null +++ b/utilities/excluded_word_list.txt @@ -0,0 +1,59 @@ +abort +aborts +aborting +aborted +abortion +blackhat +blackhats +whitehat +whitehats +cripple +cripples +crippling +cripplingly +crippled +grandfather +grandfathers +grandfathered +grandfathering +master +masters +slave +slaves +slaved +slaving +slavery +slavish +slavishly +tribe +tribes +tribal +tribally +whitelist +whitelists +whitelisted +whitelisting +blacklist +blacklists +blacklisted +blacklisting +sanity-check +sanity-checks +sanity-checked +sanity-checking +blast-radius +blast-radii +hallucinate +hallucinates +hallucinated +hallucinating +hallucination +man-hour +man-hours +man-in-the-middle +men-in-the-middle +segregate +segregates +segregated +segregating +segregation
This takes the approach of hard-coding the words in a text file, rather than trying to use a dynamic list of words. The inclusive naming project provides a way to download their word list as JSON, but this is avoided here for a few reasons: * Only the base form of words are provided. We would need to analyze the part of speech and try to generate the other forms of the words. * Some entries are more "example" than anything. For example, they provide "blackhat-whitehat" as a single entry, even though you're much more likely to come across the individual words, rather than this specific hyphenated variety. Similarly, "whitelist" is provided in the word list but "blacklist" is not. If it turns out that the word list updates frequently, then it may be worth moving to a more dynamic approach, at the expense of accuracy. For now, this seems like a nice approach. We do not consider this an error in checkpatch, but a warning. There are some cases where non-inclusive words are acceptable, (such as ovs_abort(), or referring to the "master" branch of a third-party repo). This is why the warning only suggests to consider an alternative. On a side note, running this patch through the updated checkpatch.py is pretty funny. Signed-off-by: Mark Michelson <mmichels@redhat.com> --- tests/checkpatch.at | 38 ++++++++++++++++++++ utilities/checkpatch.py | 30 ++++++++++++++++ utilities/excluded_word_list.txt | 59 ++++++++++++++++++++++++++++++++ 3 files changed, 127 insertions(+) create mode 100644 utilities/excluded_word_list.txt