
Debugging an OHI Pipeline Component

Last Updated On: 2025-09-01 04:31:53 -0400

This morning I did my usual OHI Pipeline dance and discovered a problem, so I wrote this document to illustrate my debugging process. Normally I would have grabbed a co-worker and pair programmed on the problem in order to maximize the Knowledge Transfer, but that seems to NOT be a thing, so we will focus on the written aspect of Knowledge Transfer and see what can be captured.

NOTE 1: All of this is based on the ohi_sqs_pipeline repository, which is the OHI Data Pipeline that Scott created. Given that William has announced his plan to replace my work entirely, and given his dim view of that work, it is possible this is irrelevant; but in the event that my work survives, this may be useful.

NOTE 2: Everything I’ve done in the ohi_sqs_pipeline repo operates on consistent principles, and reading this should give you a good feel for how any part of the pipeline works. And while it is said that “A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines.” (Ralph Waldo Emerson), I don’t believe that consistency in how a software code base works is foolish; it is vital to understanding how to do anything.

Step 0: Check out Scott’s Work

Do a git clone i.e.

git clone git@github.com:adl-tech/ohi_sqs_pipeline.git

into a working directory.

Note: I’m a software engineer; all counters start from 0 (at least if we’re not using Julia; what were they thinking …).

Step 1: Get the Shell Scripts

A key part of my work is contained in bin directories beneath each “pipeline component”. The bin directories hold shell scripts that set key environment variables, such as the AWS security keys needed to make anything run. Because these shell scripts contain security keys, they can’t be checked into GitHub without violating 12-factor principles. All my code, including the bin directories, is always deployed to Jenkins and can be found in this path:

/home/ubuntu/ohi_sqs_pipeline

and the bin directories can be copied down into your actual checkout directory.

Here’s an example of how to do this (I’m giving this example because it is relevant to this document):

cd _your_checked_out_source_code_dir
cd cpu/utilities/list_queues_with_counts
mkdir bin
cd bin
scp -p -r -i ~/.ssh/test_instance.pem ubuntu@18.190.74.229:/home/ubuntu/ohi_sqs_pipeline/cpu/utilities/list_queues_with_counts/bin/* .

You are going to need to copy down the bin directories for each of these paths (a loop that does all of them in one pass is sketched after the list):

cpu/utilities/list_queues_with_counts
cpu/utilities/list_redis_keys_with_counts
cpu/utilities/list_s3_buckets_with_counts
(the root directory of the checkout)
cpu/copy_to_s3
cpu/data_streamer_to_sqs
cpu/expert_invective
cpu/expert_perspective
cpu/expert_sentiment
cpu/expert_should_analyze
cpu/json_normalization
cpu/sqs_to_redis
gpu/expert_anti_semitism    
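
If you’d rather not repeat the scp dance a dozen times, here is a minimal sketch of a loop that does it in one pass. It assumes you run it from the root of your checkout and that the Jenkins host (18.190.74.229) and key (~/.ssh/test_instance.pem) are the same as in the example above:

REMOTE=ubuntu@18.190.74.229
BASE=/home/ubuntu/ohi_sqs_pipeline
for dir in \
    cpu/utilities/list_queues_with_counts \
    cpu/utilities/list_redis_keys_with_counts \
    cpu/utilities/list_s3_buckets_with_counts \
    cpu/copy_to_s3 \
    cpu/data_streamer_to_sqs \
    cpu/expert_invective \
    cpu/expert_perspective \
    cpu/expert_sentiment \
    cpu/expert_should_analyze \
    cpu/json_normalization \
    cpu/sqs_to_redis \
    gpu/expert_anti_semitism
do
    # mirror the manual steps: make bin, then pull its contents down from Jenkins
    mkdir -p "$dir/bin"
    scp -p -r -i ~/.ssh/test_instance.pem "$REMOTE:$BASE/$dir/bin/*" "$dir/bin/"
done
# the root directory of the checkout has its own bin directory as well
mkdir -p bin
scp -p -r -i ~/.ssh/test_instance.pem "$REMOTE:$BASE/bin/*" bin/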

Note 1: The .gitignore file prevents these from being checked in, so there’s no issue with them being copied down. This is a one-time setup step that you’re going to need to do.

Note 2: The only reason the shell scripts aren’t checked in is to avoid violating the 12-factor principles. Shell scripting and environment variables are my single weakest technical skill, and I strongly wish these could simply be checked in. I’m aware that this is an annoying task. I asked William for help on this issue and he declined to provide it, arguing that it wasn’t necessary (I suspect his plan to replace my code base entirely was already afoot in his mind even then).

Step 2: Knowing That There is a Problem

The first question is: how do you know that there’s a problem? Well, we have tooling as part of the OHI. That tooling is the set of programs listed in the cpu/utilities directory:

cpu/utilities/list_queues_with_counts
cpu/utilities/list_redis_keys_with_counts
cpu/utilities/list_s3_buckets_with_counts

The tool we are going to want to execute is:

cpu/utilities/list_queues_with_counts

And you do this with bin/run, so you are going to want to do:

cd _your_checked_out_source_code_dir
cd cpu/utilities/list_queues_with_counts
bin/run

but if this is your first time through you are going to need to do this first:

pip install -r requirements.txt

to get the tool’s Python dependencies installed.

Sidebar: I am a firm believer, for developers, in command line tooling as an efficient and highly navigable way to do almost anything, because it is both documentable and repeatable. But in our modern age of cloud computing, where environment variables need to be set and everything has flags that define how it runs, I find that process just plain annoying, so I have wrapped all parts of the OHI with, generally, three basic commands:

bin/run -- run the tool / pipeline component for development / debugging purposes
bin/docker_build -- build the Docker component for the tool / pipeline component
bin/docker_run -- run the Docker component
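
For illustration only, a bin/run wrapper generally looks something like the sketch below. This is NOT the real script -- the real ones carry live AWS keys and per-component flags, which is exactly why they aren’t in git -- and the main.py entry point here is a placeholder:

#!/usr/bin/env bash
# Hypothetical sketch of a bin/run wrapper; the actual scripts live only on Jenkins.
export AWS_ACCESS_KEY_ID="<redacted>"        # real value lives only in the deployed script
export AWS_SECRET_ACCESS_KEY="<redacted>"    # ditto
export AWS_DEFAULT_REGION="us-east-2"        # the region used elsewhere in this document

cd "$(dirname "$0")/.." || exit 1            # run from the component directory, not bin/
python3 main.py "$@"                         # placeholder entry point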

The tool list_queues_with_counts should, if all things are correct, execute and display output like this:

❯ bin/run
*********************************************************************
List Queues Utility starting up
Copyright (C) 2020, The AntiDefamation League
*********************************************************************


queue_name = ohi-datastreamer -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-json-normalization -- BATCHED items in queue: 5346135 - Approximate number of tweets = 545,305,770
queue_name = ohi-should-analyze -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-invective -- BATCHED items in queue: 2519470 - Approximate number of tweets = 256,985,940
queue_name = ohi-perspective -- BATCHED items in queue: 2724168 - Approximate number of tweets = 277,865,136
queue_name = ohi-sentiment -- BATCHED items in queue: 367832 - Approximate number of tweets = 37,518,864
queue_name = ohi-anti-semitism -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-copy-to-s3 -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-sqs-to-redis -- BATCHED items in queue: 0 - Approximate number of tweets = 0

Please note that the number of items per queue tends to go up
as data flows through the system because the JSON documents get larger
with each pass and the size of each batch is fixed

The ohi_sqs_pipeline is just that – a data flow pipeline where SQS is the underlying pipe and data flows from queue to queue.
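
Because SQS is the pipe, you can also spot-check a single queue without the utility. Here’s a hedged AWS CLI sketch; it assumes the CLI on your machine is configured with the same credentials the bin scripts export:

aws sqs get-queue-attributes \
    --queue-url "$(aws sqs get-queue-url --queue-name ohi-invective --query QueueUrl --output text)" \
    --attribute-names ApproximateNumberOfMessages
# returns the same approximate "BATCHED items in queue" figure the tool reports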

Now let’s run it again, just a few seconds later:

❯ bin/run
*********************************************************************
List Queues Utility starting up
Copyright (C) 2020, The AntiDefamation League
*********************************************************************


queue_name = ohi-datastreamer -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-json-normalization -- BATCHED items in queue: 5343575 - Approximate number of tweets = 545,044,650
queue_name = ohi-should-analyze -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-invective -- BATCHED items in queue: 2518569 - Approximate number of tweets = 256,894,038
queue_name = ohi-perspective -- BATCHED items in queue: 2723419 - Approximate number of tweets = 277,788,738
queue_name = ohi-sentiment -- BATCHED items in queue: 367307 - Approximate number of tweets = 37,465,314
queue_name = ohi-anti-semitism -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-copy-to-s3 -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-sqs-to-redis -- BATCHED items in queue: 0 - Approximate number of tweets = 0

Please note that the number of items per queue tends to go up
as data flows through the system because the JSON documents get larger
with each pass and the size of each batch is fixed

What you can see here, between invocations of the tool, is movement. Specifically, the first run showed:

queue_name = ohi-invective -- BATCHED items in queue: 2519470 - Approximate number of tweets = 256,985,940

to

queue_name = ohi-invective -- BATCHED items in queue: 2518569 - Approximate number of tweets = 256,894,038

That’s the pipeline in action: items should be continuously flowing from SQS queue to SQS queue.
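
If you’d rather not re-type bin/run by hand while watching for movement, watch (standard Linux tooling, not part of this repo) will re-run it on an interval:

watch -n 30 bin/run    # re-run the queue listing every 30 seconds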

But Where’s the Problem, Scott?

The first problem that we see is this:

queue_name = ohi-datastreamer -- BATCHED items in queue: 0 - Approximate number of tweets = 0

The ohi-datastreamer queue is where the datastreamer JSON files land so they can be processed by json_normalization. If we don’t have inputs into the pipeline, then that’s a problem.

SSH’ing into the Right Server

I know that SSH is old school, antiquated, etc., but the OHI is still very much a system under development, and this is how I tend to debug things – I look at them in production and figure out what’s going wrong. Here’s that process:

cd _your_checked_out_source_code_dir
bin/ssh_jenkins

Now you’re going to need the right private IP address to get into this box. So you’re going to want to do this:

  1. Go to https://console.aws.amazon.com and do the happy credentials dance; aren’t passwords fun; we have so many to manage!!!
  2. Go to https://us-east-2.console.aws.amazon.com/ec2/v2/home?region=us-east-2#Instances:sort=tag:Name
  3. In the filter bar, set the cost_type tag and then set its value to variable (the boxes that make up an OHI cluster are all tagged as a variable cost since they could go away at any time; hence they aren’t a fixed cost but a variable cost).
  4. Check the box next to the name: ohi-sqs_data_streamer_to_sqs
  5. Find the Private IP setting below and copy the ip address. Just now I got the value: 172.31.8.6
  6. Back on the terminal where you are logged into the Jenkins box, do a history search for ssh (history | grep ssh) to get the exact command, or copy and paste this, substituting the private IP you just copied:
ssh -o "StrictHostKeyChecking no" -i ~/.ssh/test_instance.pem ubuntu@172.31.8.6

And that should get you into the box.
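
As an aside, if you have the AWS CLI configured locally, the console clicking in steps 1 through 5 can be collapsed into one command. This is a sketch that assumes the instance Name tag shown above and the us-east-2 region:

aws ec2 describe-instances \
    --region us-east-2 \
    --filters "Name=tag:Name,Values=ohi-sqs_data_streamer_to_sqs" "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].PrivateIpAddress" \
    --output text
# prints the private IP to paste into the ssh command above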

Now you want to change into the source code directory:

cd ~/ohi_sqs_pipeline
cd cpu/data_streamer_to_sqs

All OHI pipeline components run as SystemD services, so you want to start by looking at the service itself and seeing if it’s running:

ubuntu on ip-172-31-8-6 in ohi_sqs_pipeline/cpu/data_streamer_to_sqs
❯ sudo systemctl status service.service
● service.service - OHI Data Streamer to SQS Service
   Loaded: loaded (/etc/systemd/system/service.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2020-07-30 07:22:00 UTC; 57min ago
  Process: 25141 ExecStart=/bin/bash -lc /home/ubuntu/ohi_sqs_pipeline/cpu/data_streamer_to_sqs/systemd/service.sh (code=exited, status=125)
 Main PID: 25141 (code=exited, status=125)

Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Service hold-off time over, scheduling restart.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Scheduled restart job, restart counter is at 5.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: Stopped OHI Data Streamer to SQS Service.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Start request repeated too quickly.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Failed with result 'exit-code'.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: Failed to start OHI Data Streamer to SQS Service.

What this tells us is that the service isn’t running. While that isn’t terribly helpful on its own, it is your first troubleshooting step. My next step would be to run bin/docker_run and see what happens:

❯ bin/docker_run
Unable to find image 'data_streamer_to_sqs:latest' locally
docker: Error response from daemon: pull access denied for data_streamer_to_sqs, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
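
Before going any further, you can confirm the image really is missing locally with plain Docker (nothing repo-specific here):

docker images data_streamer_to_sqs
# an empty table (just the header row) means the image was never built on this box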

And now, Grasshopper, we are starting to close in on the problem. Perhaps the Docker build that runs as part of deployment failed:

❯ bin/docker_build
Sending build context to Docker daemon  139.3kB
Step 1/13 : FROM python:3.7.3-stretch
 ---> 34a518642c76
Step 2/13 : WORKDIR /usr/src/app
 ---> Using cache
 ---> 3ccf07c83f92
Step 3/13 : COPY requirements.txt ./
 ---> e14ae25dbaeb
Step 4/13 : RUN pip3 install --no-cache-dir -r requirements.txt
 ---> Running in 111df77ae2b7
Collecting iso8601 (from -r requirements.txt (line 7))
  Downloading https://files.pythonhosted.org/packages/ef/57/7162609dab394d38bbc7077b7ba0a6f10fb09d8b7701ea56fa1edc0c4345/iso8601-0.1.12-py2.py3-none-any.whl
Collecting boto3 (from -r requirements.txt (line 8))
  Downloading https://files.pythonhosted.org/packages/bd/83/22bc643490012047408bfeec8422c79ba54ecc089e70c946cf1686e15084/boto3-1.14.31-py2.py3-none-any.whl (129kB)
Collecting ujson (from -r requirements.txt (line 9))
  Downloading https://files.pythonhosted.org/packages/82/f2/12ca7bfd7879f8ed1b53104f2a6751a7722d63b12951c91c61ff433e5170/ujson-3.0.0-cp37-cp37m-manylinux1_x86_64.whl (176kB)
Collecting redis (from -r requirements.txt (line 10))
  Downloading https://files.pythonhosted.org/packages/a7/7c/24fb0511df653cf1a5d938d8f5d19802a88cef255706fdda242ff97e91b7/redis-3.5.3-py2.py3-none-any.whl (72kB)
Collecting requests (from -r requirements.txt (line 11))
  Downloading https://files.pythonhosted.org/packages/45/1e/0c169c6a5381e241ba7404532c16a21d86ab872c9bed8bdcd4c423954103/requests-2.24.0-py2.py3-none-any.whl (61kB)
Collecting pyyaml (from -r requirements.txt (line 12))
  Downloading https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz (269kB)
Collecting zstandard (from -r requirements.txt (line 13))
  Downloading https://files.pythonhosted.org/packages/e3/ab/c003ac7407a4b7683894bb028ed1197ef890a4f705c340be22f2a951599b/zstandard-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl (2.2MB)
Collecting tarfile (from -r requirements.txt (line 14))
  ERROR: Could not find a version that satisfies the requirement tarfile (from -r requirements.txt (line 14)) (from versions: none)
ERROR: No matching distribution found for tarfile (from -r requirements.txt (line 14))
WARNING: You are using pip version 19.1.1, however version 20.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The command '/bin/sh -c pip3 install --no-cache-dir -r requirements.txt' returned a non-zero code: 1

Now let’s look at the requirements.txt file:

❯ cat requirements.txt
# Execute this with:
# pip3 install -r requirements.txt

#
# These will likely come out before deploy; used for investigation / trial
#
iso8601
boto3
ujson
redis
requests
pyyaml
zstandard
tarfile

What the log above tells us is that the last thing installed successfully was zstandard and the failure is on tarfile.

A bit of Python digging illustrated that the problem here is …drum roll please… me. Apparently tarfile is a Python builtin, like os, and doesn’t need to be in requirements.txt. Sigh. My last bit of major work on the OHI was rewriting data_streamer_to_sqs from Ruby to Python and I, clearly stupidly, got this wrong by dropping tarfile into requirements.txt. And then I compounded the error by deploying it without ever trying bin/docker_build. The lesson for today’s work is clear: run bin/docker_build before every deploy.
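
A one-liner makes the point that there is nothing for pip to install:

python3 -c "import tarfile; print(tarfile.__file__)"
# prints something like /usr/lib/python3.7/tarfile.py -- it ships with the interpreter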

Once this was fixed, simply by removing tarfile from the deployed requirements.txt and then running bin/docker_build, I ran:

bin/docker_run

to verify the fix. Here’s what the output looks like on 7/30/20:

ubuntu on ip-172-31-8-6 in ohi_sqs_pipeline/cpu/data_streamer_to_sqs
❯ bin/docker_run
*********************************************************************
ADL OHI DATA STREAMER TO SQS starting up
About to process data from sqs: None and write to ohi-datastreamer
Copyright (C) 2020, The AntiDefamation League
*********************************************************************




...Reading from input bucket ohi-incoming-archive



  In process
    In make_destination_dir
    In should_process
      datastreamer/2020/06-26/datastreamer-2020-06-26t00-50-gqbfsvgu.tar.zst
    In download_s3_file
    In decompress_zstandard_to_folder
    In decompress_tar_to_folder
    in strip_header_from_data_streamer_files
      1593132766000000000-0023.json
        in send_json_str_to_sqs
      in set_file_state_to_processing
          Hit threshold -- Number of bytes in sqs_container = 246374; number of items in sqs_container = 100
          Hit threshold -- Number of bytes in sqs_container = 248983; number of items in sqs_container = 100
      in set_file_state_to_processed
      1593132821508600075-0050.json
        in send_json_str_to_sqs
      in set_file_state_to_processing
          Hit threshold -- Number of bytes in sqs_container = 244159; number of items in sqs_container = 101
          Hit threshold -- Number of bytes in sqs_container = 250304; number of items in sqs_container = 103
          Hit threshold -- Number of bytes in sqs_container = 247571; number of items in sqs_container = 99
      in set_file_state_to_processed
      1593132766000000000-0051.json
        in send_json_str_to_sqs

Confirming This Fix at the SystemD Level

The final step is to stop the running process with CTRL+C and then restart the SystemD service.

This is a three-step process: check the service status, start the service, and then check the status again to confirm it is running.

Here’s what that looks like:

❯ sudo systemctl status service.service
● service.service - OHI Data Streamer to SQS Service
   Loaded: loaded (/etc/systemd/system/service.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2020-07-30 07:22:00 UTC; 1h 30min ago
  Process: 25141 ExecStart=/bin/bash -lc /home/ubuntu/ohi_sqs_pipeline/cpu/data_streamer_to_sqs/systemd/service.sh (code=exited, status=125)
 Main PID: 25141 (code=exited, status=125)

Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Service hold-off time over, scheduling restart.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Scheduled restart job, restart counter is at 5.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: Stopped OHI Data Streamer to SQS Service.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Start request repeated too quickly.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: service.service: Failed with result 'exit-code'.
Jul 30 07:22:00 ip-172-31-8-6 systemd[1]: Failed to start OHI Data Streamer to SQS Service.

ubuntu on ip-172-31-8-6 in ohi_sqs_pipeline/cpu/data_streamer_to_sqs
❯ sudo systemctl start service.service

ubuntu on ip-172-31-8-6 in ohi_sqs_pipeline/cpu/data_streamer_to_sqs
❯ sudo systemctl status service.service
● service.service - OHI Data Streamer to SQS Service
   Loaded: loaded (/etc/systemd/system/service.service; disabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-07-30 08:52:31 UTC; 6s ago
 Main PID: 27316 (service.sh)
    Tasks: 10 (limit: 4915)
   CGroup: /system.slice/service.service
           ├─27316 /bin/bash /home/ubuntu/ohi_sqs_pipeline/cpu/data_streamer_to_sqs/systemd/service.sh
           └─27345 docker run -e RAILS_ENV=production -e PROGRAM_NAME=data_streamer_to_sqs -e
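
If you want to keep an eye on the service’s output while it settles in, journalctl will tail it (standard systemd tooling, not specific to this repo):

sudo journalctl -u service.service -f    # CTRL+C stops the tail; the service keeps running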

Running the list_queues_with_counts Tool Again

In closing, we need to run the list_queues_with_counts tool again and see whether things have changed:

❯ bin/run
*********************************************************************
List Queues Utility starting up
Copyright (C) 2020, The AntiDefamation League
*********************************************************************


queue_name = ohi-datastreamer -- BATCHED items in queue: 1316 - Approximate number of tweets = 134,232
queue_name = ohi-json-normalization -- BATCHED items in queue: 5254474 - Approximate number of tweets = 535,956,348
queue_name = ohi-should-analyze -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-invective -- BATCHED items in queue: 2486773 - Approximate number of tweets = 253,650,846
queue_name = ohi-perspective -- BATCHED items in queue: 2696264 - Approximate number of tweets = 275,018,928
queue_name = ohi-sentiment -- BATCHED items in queue: 348246 - Approximate number of tweets = 35,521,092
queue_name = ohi-anti-semitism -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-copy-to-s3 -- BATCHED items in queue: 0 - Approximate number of tweets = 0
queue_name = ohi-sqs-to-redis -- BATCHED items in queue: 0 - Approximate number of tweets = 0

Please note that the number of items per queue tends to go up
as data flows through the system because the JSON documents get larger
with each pass and the size of each batch is fixed

Closing Note - This isn’t a Problem

Please note that this:

queue_name = ohi-sqs-to-redis -- BATCHED items in queue: 0 - Approximate number of tweets = 0

isn’t a problem, because sqs_to_redis doesn’t put anything into a downstream SQS queue. It is the final stage in the pipeline, and its output doesn’t go to an SQS queue at all; it goes to Redis.