Working with streams
A naive program buffers all inputs into memory before writing any outputs. OCDS files can be very large, and loading them into memory can exhaust all available memory. The command-line interface therefore reads inputs and writes outputs progressively or one-at-a-time (that is, it “streams”), as much as possible. Streaming writes outputs faster and requires less memory than buffering.
Several library methods return dictionaries with generators as values, which can’t be serialized using the json module without extra work. Use ocdskit.util.iterencode() instead.
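To illustrate why generators need extra work, here is a minimal sketch that uses only the standard library. The releases() generator and the materialize() hook are hypothetical names made up for this example, not part of OCDS Kit, and the hook shown here buffers each generator into a list, so it gives up streaming for that value.

```python
import json
import types


def releases():
    # Hypothetical stand-in for a generator returned by a library method.
    yield {"ocid": "ocds-213czf-1"}
    yield {"ocid": "ocds-213czf-2"}


def materialize(obj):
    # Fallback that json calls for objects it can't serialize itself.
    if isinstance(obj, types.GeneratorType):
        return list(obj)  # buffers this generator's items in memory
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


try:
    json.dumps({"releases": releases()})
    failed = False
except TypeError:
    failed = True  # the json module rejects generators by default

# With the fallback hook, the same structure serializes.
text = json.dumps({"releases": releases()}, default=materialize)
```

This is only one possible form of the "extra work"; it is shown to make the problem concrete, not as a replacement for the library function.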
The command-line interface uses ijson to iteratively parse the JSON inputs with a read buffer of 64 kB.
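To start, this uses the same amount of memory as the following, in which the empty prefix yields the entire top-level JSON value as a single item:

import ijson

# Open in binary mode, since ijson parses bytes.
with open(filename, 'rb') as f:
    for item in ijson.items(f, ''):
        pass  # do stuff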
If you are working with release packages or record packages and only need the releases or records, set the prefix argument ('' above) as described in ijson’s documentation. Instead of loading the entire package into memory, this loads each release or record one at a time. For example:
for item in ijson.items(f, 'releases.item'):
for item in ijson.items(f, 'records.item'):
The prefix argument is also relevant if you are working with files that embed OCDS data. For example:
for item in ijson.items(f, 'results.item'):
If you are parsing concatenated JSON, add multiple_values=True. For example:
for item in ijson.items(f, '', multiple_values=True):