Message ID | 20200220182723.52615-1-titouan.christophe@railnova.eu |
---|---|
State | Accepted |
Series | [1/1] support/scripts/pkg-stats: iterate over CVEs in streaming |
Hi Titouan,

On Thu, 20 Feb 2020 at 19:27, Titouan Christophe
(<titouan.christophe@railnova.eu>) wrote:
>
> The NVD files that are used to build the list of CVEs affecting
> Buildroot packages are quite large (a few hundreds MB of json),
> and cause the pkg-stats scripts to have a huge memory footprint
> (a few GB with Python 2.7).
>
> However, because we only need to iterate on CVE items one by one,
> we can process them in streaming (ie decoding one CVE at a time
> from the JSON representation). Because the json module from the
> python standard library does not support such a mode of operation,
> we switch to the third-party package ijson, which is compatible
> with both Python 2 and Python3.
>
> To run the script with these modifications, one should install
> the ijson python package. This can be done with pip:
> `pip install ijson`. On Debian based distributions, this can
> also be done with the apt package manager:
> `apt install python-ijson`.
>
> Signed-off-by: Titouan Christophe <titouan.christophe@railnova.eu>
> ---
>  support/scripts/pkg-stats | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
> index c113cf9606..7721d98459 100755
> --- a/support/scripts/pkg-stats
> +++ b/support/scripts/pkg-stats
> @@ -25,6 +25,7 @@ import re
>  import subprocess
>  import requests  # URL checking
>  import json
> +import ijson
>  import certifi
>  import distutils.version
>  import time
> @@ -231,11 +232,11 @@ class CVE:
>          for year in range(NVD_START_YEAR, datetime.datetime.now().year + 1):
>              filename = CVE.download_nvd_year(nvd_dir, year)
>              try:
> -                content = json.load(gzip.GzipFile(filename))
> +                content = ijson.items(gzip.GzipFile(filename), 'CVE_Items.item')
>              except:
>                  print("ERROR: cannot read %s. Please remove the file then rerun this script" % filename)
>                  raise
> -            for cve in content["CVE_Items"]:
> +            for cve in content:
>                  yield cls(cve['cve'])
>
>      def each_product(self):

This is _way_ better. In my test run observing top output, resident
memory stayed around 50 MB.

Reviewed-by: Thomas De Schampheleire <thomas.de_schampheleire@nokia.com>
Tested-by: Thomas De Schampheleire <thomas.de_schampheleire@nokia.com>
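For anyone who wants to repeat this kind of measurement without watching top, here is a minimal sketch that reports the peak resident set size of the current Python process. It is only an illustration, not part of the patch: the helper name `peak_rss_mb` is made up, it relies on the Unix-only `resource` module from the standard library, and it reports the peak value rather than the live figure that top shows.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


if __name__ == "__main__":
    print("peak RSS: %.1f MB" % peak_rss_mb())
```

Calling such a helper at the end of a run, before and after applying the patch, should give numbers comparable to the ones quoted in this thread.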
>>>>> "Thomas" == Thomas De Schampheleire <patrickdepinguin+buildroot@gmail.com> writes:

> Hi Titouan,
> On Thu, 20 Feb 2020 at 19:27, Titouan Christophe
> (<titouan.christophe@railnova.eu>) wrote:
>>
>> The NVD files that are used to build the list of CVEs affecting
>> Buildroot packages are quite large (a few hundreds MB of json),
>> and cause the pkg-stats scripts to have a huge memory footprint
>> (a few GB with Python 2.7).
>>
>> However, because we only need to iterate on CVE items one by one,
>> we can process them in streaming (ie decoding one CVE at a time
>> from the JSON representation). Because the json module from the
>> python standard library does not support such a mode of operation,
>> we switch to the third-party package ijson, which is compatible
>> with both Python 2 and Python3.
>>
>> To run the script with these modifications, one should install
>> the ijson python package. This can be done with pip:
>> `pip install ijson`. On Debian based distributions, this can
>> also be done with the apt package manager:
>> `apt install python-ijson`.
>>
>> Signed-off-by: Titouan Christophe <titouan.christophe@railnova.eu>

> This is _way_ better. In my test run observing top output, resident
> memory stayed around 50 MB.

Nice!

> Reviewed-by: Thomas De Schampheleire <thomas.de_schampheleire@nokia.com>
> Tested-by: Thomas De Schampheleire <thomas.de_schampheleire@nokia.com>

Committed, thanks.
diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
index c113cf9606..7721d98459 100755
--- a/support/scripts/pkg-stats
+++ b/support/scripts/pkg-stats
@@ -25,6 +25,7 @@ import re
 import subprocess
 import requests  # URL checking
 import json
+import ijson
 import certifi
 import distutils.version
 import time
@@ -231,11 +232,11 @@ class CVE:
         for year in range(NVD_START_YEAR, datetime.datetime.now().year + 1):
             filename = CVE.download_nvd_year(nvd_dir, year)
             try:
-                content = json.load(gzip.GzipFile(filename))
+                content = ijson.items(gzip.GzipFile(filename), 'CVE_Items.item')
             except:
                 print("ERROR: cannot read %s. Please remove the file then rerun this script" % filename)
                 raise
-            for cve in content["CVE_Items"]:
+            for cve in content:
                 yield cls(cve['cve'])
 
     def each_product(self):
The NVD files that are used to build the list of CVEs affecting
Buildroot packages are quite large (a few hundred MB of JSON),
and cause the pkg-stats script to have a huge memory footprint
(a few GB with Python 2.7).

However, because we only need to iterate over CVE items one by one,
we can process them in a streaming fashion (i.e. decoding one CVE at
a time from the JSON representation). Because the json module from the
Python standard library does not support such a mode of operation,
we switch to the third-party package ijson, which is compatible
with both Python 2 and Python 3.

To run the script with these modifications, one should install
the ijson Python package. This can be done with pip:
`pip install ijson`. On Debian-based distributions, this can
also be done with the apt package manager:
`apt install python-ijson`.

Signed-off-by: Titouan Christophe <titouan.christophe@railnova.eu>
---
 support/scripts/pkg-stats | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
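To illustrate the streaming approach described in the commit message, here is a small self-contained sketch, not the pkg-stats code itself. It iterates over the `CVE_Items` array of a gzip-compressed feed with `ijson.items()` and the `'CVE_Items.item'` prefix used in the patch. The helper names (`iter_cve_items`, `make_sample_feed`, `sample_ids`), the in-memory sample feed, the CVE IDs and the `CVE_data_meta` layout inside each item are assumptions made for the example. The fallback to `json.load()` when ijson is not installed is also an addition of this sketch; the patch imports ijson unconditionally.

```python
import gzip
import io
import json

try:
    import ijson  # third-party: pip install ijson
except ImportError:
    ijson = None  # the sketch then falls back to the non-streaming stdlib decoder


def iter_cve_items(fileobj):
    """Yield the entries of the top-level "CVE_Items" array one by one."""
    if ijson is not None:
        # 'CVE_Items.item' selects each element of the CVE_Items array;
        # ijson decodes the elements lazily while the file is being read.
        for item in ijson.items(fileobj, 'CVE_Items.item'):
            yield item
    else:
        # Without ijson the whole document is decoded into memory first.
        for item in json.load(fileobj)["CVE_Items"]:
            yield item


def make_sample_feed():
    """Build a tiny gzip-compressed feed in memory (made-up data)."""
    feed = {"CVE_Items": [
        {"cve": {"CVE_data_meta": {"ID": "CVE-0000-0001"}}},
        {"cve": {"CVE_data_meta": {"ID": "CVE-0000-0002"}}},
    ]}
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(json.dumps(feed).encode("utf-8"))
    buf.seek(0)
    return buf


def sample_ids():
    """Collect the IDs of the sample feed by iterating item by item."""
    with gzip.GzipFile(fileobj=make_sample_feed()) as gz:
        return [item["cve"]["CVE_data_meta"]["ID"] for item in iter_cve_items(gz)]


if __name__ == "__main__":
    for cve_id in sample_ids():
        print(cve_id)
```

With ijson available, only one item needs to be held in memory at a time, which is where the reduced footprint should come from. The fallback path still loads the whole document and is there only so the sketch runs without the extra dependency.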