API Reference#

Working with streams#

A naive program buffers all inputs into memory before writing any outputs. OCDS files can be very large, and loading them into memory can exhaust all available memory. The command-line interface therefore reads inputs and writes outputs progressively or one-at-a-time (that is, it “streams”), as much as possible. Streaming writes outputs faster and requires less memory than buffering.

Output#

Several library methods return dictionaries with generators as values, which can’t be serialized using the json module without extra work. Use the ocdskit.util.json_dumps(), ocdskit.util.json_dump() and ocdskit.util.iterencode() methods instead.

Input#

The command-line interface uses ijson to iteratively parse the JSON inputs with a read buffer of 64 kB.

To start, this uses the same amount of memory as json.load(f):

import ijson

with open(filename) as f:
    for item in ijson.items(f, ''):
        # do stuff

If you are working with release packages or record packages and only need the releases or records, set the prefix argument ('' above) as described in ijson’s documentation. Instead of loading the entire package into memory, this instead loads each release or record one-at-a-time. For example:

for item in ijson.items(f, 'releases.item'):
for item in ijson.items(f, 'records.item'):

The prefix argument is also relevant if you are working with files that embed OCDS data. For example:

for item in ijson.items(f, 'results.item'):

If you are parsing concatenated JSON, add multiple_values=True. For example:

for item in ijson.items(f, '', multiple_values=True):