diff mbox series

[ovs-dev] checkpatch: Add check for non-inclusive language.

Message ID 20240821203154.175838-1-mmichels@redhat.com
State Deferred
Headers show
Series [ovs-dev] checkpatch: Add check for non-inclusive language. | expand

Checks

Context Check Description
ovsrobot/apply-robot warning apply and check: warning
ovsrobot/github-robot-_Build_and_Test fail github build: failed
ovsrobot/github-robot-_ovn-kubernetes fail github build: failed

Commit Message

Mark Michelson Aug. 21, 2024, 8:31 p.m. UTC
This takes the approach of hard-coding the words in a text file, rather
than trying to use a dynamic list of words. The inclusive naming project
provides a way to download their word list as JSON, but this is avoided
here for a few reasons:

* Only the base form of words are provided. We would need to analyze the
  part of speech and try to generate the other forms of the words.
* Some entries are more "example" than anything. For example, they
  provide "blackhat-whitehat" as a single entry, even though you're much
  more likely to come across the individual words, rather than this
  specific hyphenated variety. Similarly, "whitelist" is provided in the
  word list but "blacklist" is not.

If it turns out that the word list updates frequently, then it may be
worth moving to a more dynamic approach, at the expense of accuracy. For
now, this seems like a nice approach.

We do not consider this an error in checkpatch, but a warning. There are
some cases where non-inclusive words are acceptable, (such as
ovs_abort(), or referring to the "master" branch of a third-party repo).
This is why the warning only suggests to consider an alternative.

On a side note, running this patch through the updated checkpatch.py is
pretty funny.

Signed-off-by: Mark Michelson <mmichels@redhat.com>
---
 tests/checkpatch.at              | 38 ++++++++++++++++++++
 utilities/checkpatch.py          | 30 ++++++++++++++++
 utilities/excluded_word_list.txt | 59 ++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+)
 create mode 100644 utilities/excluded_word_list.txt

Comments

Ilya Maximets Aug. 26, 2024, 12:43 p.m. UTC | #1
On 8/21/24 22:31, Mark Michelson wrote:
> This takes the approach of hard-coding the words in a text file, rather
> than trying to use a dynamic list of words. The inclusive naming project
> provides a way to download their word list as JSON, but this is avoided
> here for a few reasons:
> 
> * Only the base form of words are provided. We would need to analyze the
>   part of speech and try to generate the other forms of the words.
> * Some entries are more "example" than anything. For example, they
>   provide "blackhat-whitehat" as a single entry, even though you're much
>   more likely to come across the individual words, rather than this
>   specific hyphenated variety. Similarly, "whitelist" is provided in the
>   word list but "blacklist" is not.
> 
> If it turns out that the word list updates frequently, then it may be
> worth moving to a more dynamic approach, at the expense of accuracy. For
> now, this seems like a nice approach.
> 
> We do not consider this an error in checkpatch, but a warning. There are
> some cases where non-inclusive words are acceptable, (such as
> ovs_abort(), or referring to the "master" branch of a third-party repo).
> This is why the warning only suggests to consider an alternative.

Hi, Mark.  I'm not sure this a right thing to do.  The point of avoiding
non-inclusive language is to not have instances of it in the repo.  But
this change explicitly adds all the non-inclusive words to the repo and
that IMO contradicts the goal.

My 2c.

Best regards, Ilya Maximets.
Mark Michelson Aug. 26, 2024, 6:55 p.m. UTC | #2
On 8/26/24 08:43, Ilya Maximets wrote:
> On 8/21/24 22:31, Mark Michelson wrote:
>> This takes the approach of hard-coding the words in a text file, rather
>> than trying to use a dynamic list of words. The inclusive naming project
>> provides a way to download their word list as JSON, but this is avoided
>> here for a few reasons:
>>
>> * Only the base form of words are provided. We would need to analyze the
>>    part of speech and try to generate the other forms of the words.
>> * Some entries are more "example" than anything. For example, they
>>    provide "blackhat-whitehat" as a single entry, even though you're much
>>    more likely to come across the individual words, rather than this
>>    specific hyphenated variety. Similarly, "whitelist" is provided in the
>>    word list but "blacklist" is not.
>>
>> If it turns out that the word list updates frequently, then it may be
>> worth moving to a more dynamic approach, at the expense of accuracy. For
>> now, this seems like a nice approach.
>>
>> We do not consider this an error in checkpatch, but a warning. There are
>> some cases where non-inclusive words are acceptable, (such as
>> ovs_abort(), or referring to the "master" branch of a third-party repo).
>> This is why the warning only suggests to consider an alternative.
> 
> Hi, Mark.  I'm not sure this a right thing to do.  The point of avoiding
> non-inclusive language is to not have instances of it in the repo.  But
> this change explicitly adds all the non-inclusive words to the repo and
> that IMO contradicts the goal.
> 
> My 2c.
> 
> Best regards, Ilya Maximets.
> 

Your point is 100% valid.  As a follow-up, do you have any ideas for how 
we can add automated checks for non-inclusive language without listing 
offending words in our repo? I mentioned in the commit message the 
shortcomings of attempting a dynamic approach, but maybe there's some 
middle ground between fully-dynamic and a static list of words that I'm 
not thinking of.

Thanks,
Mark Michelson
Aaron Conole Aug. 28, 2024, 2:45 p.m. UTC | #3
Mark Michelson <mmichels@redhat.com> writes:

> On 8/26/24 08:43, Ilya Maximets wrote:
>> On 8/21/24 22:31, Mark Michelson wrote:
>>> This takes the approach of hard-coding the words in a text file, rather
>>> than trying to use a dynamic list of words. The inclusive naming project
>>> provides a way to download their word list as JSON, but this is avoided
>>> here for a few reasons:
>>>
>>> * Only the base form of words are provided. We would need to analyze the
>>>    part of speech and try to generate the other forms of the words.
>>> * Some entries are more "example" than anything. For example, they
>>>    provide "blackhat-whitehat" as a single entry, even though you're much
>>>    more likely to come across the individual words, rather than this
>>>    specific hyphenated variety. Similarly, "whitelist" is provided in the
>>>    word list but "blacklist" is not.
>>>
>>> If it turns out that the word list updates frequently, then it may be
>>> worth moving to a more dynamic approach, at the expense of accuracy. For
>>> now, this seems like a nice approach.
>>>
>>> We do not consider this an error in checkpatch, but a warning. There are
>>> some cases where non-inclusive words are acceptable, (such as
>>> ovs_abort(), or referring to the "master" branch of a third-party repo).
>>> This is why the warning only suggests to consider an alternative.
>> Hi, Mark.  I'm not sure this a right thing to do.  The point of
>> avoiding
>> non-inclusive language is to not have instances of it in the repo.  But
>> this change explicitly adds all the non-inclusive words to the repo and
>> that IMO contradicts the goal.
>> My 2c.
>> Best regards, Ilya Maximets.
>> 
>
> Your point is 100% valid.  As a follow-up, do you have any ideas for
> how we can add automated checks for non-inclusive language without
> listing offending words in our repo? I mentioned in the commit message
> the shortcomings of attempting a dynamic approach, but maybe there's
> some middle ground between fully-dynamic and a static list of words
> that I'm not thinking of.

Unfortunately, I haven't found any prior art for this.  What I mean is,
I don't see any projects that have such an inclusive language check
being done automatically.  There have been some attempts but it doesn't
seem that anything got adopted anywhere.

Maybe it would be better if we could ask the inclusive language place to
publish a list that is suitable for programs like codespell,
checkpatch.py, enchant, etc.  That might make adoption easier for us
since we could pull the list dynamically and not have such language be a
part of a repository.  After all, if there are a list of words we wish
to avoid, we need to know the list.  And if we're trying to avoid
keeping them in the repository, I think it stands to reason they have to
exist somewhere.  And having an automated means of pulling and
evaluating them makes a lot of sense.

An alternative might be to include the option to farm the patch text to
an AI model and ask whether it seems to avoid non-inclusive terminology.
That is a dynamic approach, but may flag words which aren't part of the
inclusive naming list (whether that would be a positive change would
need to be evaluated).

> Thanks,
> Mark Michelson
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
diff mbox series

Patch

diff --git a/tests/checkpatch.at b/tests/checkpatch.at
index 6ac0e51f3..79c02b229 100755
--- a/tests/checkpatch.at
+++ b/tests/checkpatch.at
@@ -642,5 +642,43 @@  try_checkpatch \
     #8 FILE: tests/something.at:1:
     C_H_E_C_K([as gw1 ovs-ofctl dump-flows br-int table=42 | grep "dl_dst=00:00:02:01:02:04" | wc -l], [0], [[1]])
 "
+AT_CLEANUP
+
+AT_SETUP([checkpatch - non-inclusive words])
+# This test does not extensively test every single word in the list of
+# non-inclusive words.
+
+# Try a simple exact match with a single word
+try_checkpatch \
+   "COMMON_PATCH_HEADER
+    +/* Let's be sure this change doesn't cripple our performance */
+    " \
+    "WARNING: Non-inclusive word 'cripple' found. Consider replacing.
+    #8 FILE: A.c:1:
+    /* Let's be sure this change doesn't cripple our performance */
+"
 
+# Put more than one word on the line, and use different forms of the word.
+try_checkpatch \
+   "COMMON_PATCH_HEADER
+    +/* The grandfathers are hallucinating again */
+    " \
+    "WARNING: Non-inclusive word 'grandfathers' found. Consider replacing.
+    WARNING: Non-inclusive word 'hallucinating' found. Consider replacing.
+    #8 FILE: A.c:1:
+    /* The grandfathers are hallucinating again */
+"
+
+# And finally, make sure punctuation, etc. don't interfere.
+try_checkpatch \
+   "COMMON_PATCH_HEADER
+    +/* Set up master/slave tribe, but don't abort! */
+    " \
+    "WARNING: Non-inclusive word 'abort' found. Consider replacing.
+    WARNING: Non-inclusive word 'master' found. Consider replacing.
+    WARNING: Non-inclusive word 'slave' found. Consider replacing.
+    WARNING: Non-inclusive word 'tribe' found. Consider replacing.
+    #8 FILE: A.c:1:
+    /* Set up master/slave tribe, but don't abort! */
+"
 AT_CLEANUP
diff --git a/utilities/checkpatch.py b/utilities/checkpatch.py
index 35204daa2..9a06cf0a1 100755
--- a/utilities/checkpatch.py
+++ b/utilities/checkpatch.py
@@ -19,6 +19,8 @@  import getopt
 import os
 import re
 import sys
+import functools
+from pathlib import Path
 
 RETURN_CHECK_INITIAL_STATE = 0
 RETURN_CHECK_STATE_WITH_RETURN = 1
@@ -582,6 +584,32 @@  def empty_return_with_brace(line):
     return False
 
 
+@functools.cache
+def load_excluded_words():
+    parent_dir = Path(__file__).parent
+    with open(parent_dir / "excluded_word_list.txt", "r") as f:
+        return [line.strip() for line in f]
+
+
+def contains_non_inclusive_words(line):
+    # This returns true if a word is found that falls afoul of our inclusive
+    # language guidelines. The list of words is sourced from the Tier 1, Tier 2,
+    # and Tier 3 word lists from https://inclusivenaming.org/word-lists/ .
+
+    excluded_words = load_excluded_words()
+
+    problem_found = False
+    for word in excluded_words:
+        match = re.search(rf'\b{word}\b', line, flags=re.IGNORECASE)
+        if match:
+            print_warning(
+                f"Non-inclusive word '{word}' found. Consider replacing."
+            )
+            problem_found = True
+
+    return problem_found
+
+
 file_checks = [
         {'regex': __regex_added_doc_rst,
          'check': check_new_docs_index},
@@ -668,6 +696,8 @@  checks = [
          lambda: print_warning("Use of hardcoded table=<NUMBER> or"
                                " resubmit=(,<NUMBER>) is discouraged in tests."
                                " Consider using MACRO instead.")},
+    {'regex': None, 'match_name': None,
+     'check': lambda x: contains_non_inclusive_words(x)},
 ]
 
 
diff --git a/utilities/excluded_word_list.txt b/utilities/excluded_word_list.txt
new file mode 100644
index 000000000..7a2ce4e09
--- /dev/null
+++ b/utilities/excluded_word_list.txt
@@ -0,0 +1,59 @@ 
+abort
+aborts
+aborting
+aborted
+abortion
+blackhat
+blackhats
+whitehat
+whitehats
+cripple
+cripples
+crippling
+cripplingly
+crippled
+grandfather
+grandfathers
+grandfathered
+grandfathering
+master
+masters
+slave
+slaves
+slaved
+slaving
+slavery
+slavish
+slavishly
+tribe
+tribes
+tribal
+tribally
+whitelist
+whitelists
+whitelisted
+whitelisting
+blacklist
+blacklists
+blacklisted
+blacklisting
+sanity-check
+sanity-checks
+sanity-checked
+sanity-checking
+blast-radius
+blast-radii
+hallucinate
+hallucinates
+hallucinated
+hallucinating
+hallucination
+man-hour
+man-hours
+man-in-the-middle
+men-in-the-middle
+segregate
+segregates
+segregated
+segregating
+segregation