Yes, so, just like in my example, the first round of conversation is nearly useless. Let’s see if the second iteration is any better (it does turn to named pipes):
ME (without triple quotes, but here I am including them for better formatting):
Unfortunately, this is concatenating the two downloads and providing them to cmd on its stdin
```python
# Run the command with the output of the two gunzip processes as inputs
cmd = subprocess.Popen(["cmd", "-1", "-", "-2", "-"],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE)
```
What we’d like to happen instead is to process those two downloads by cmd in parallel.
GPT-4:
To process two downloads in parallel and provide them as separate inputs to a command using Python’s `subprocess` module, you can employ named pipes (FIFOs). Named pipes can be created using the `os.mkfifo` method, which allows two independent data streams for the two downloads.
Here’s a revised version of the script that uses named pipes for parallel processing:
```python
import subprocess
import os
import threading

def stream_to_fifo(fifo_name, command):
    with subprocess.Popen(command, stdout=subprocess.PIPE) as process:
        with open(fifo_name, 'wb', 0) as fifo:
            for line in process.stdout:
                fifo.write(line)

def run_pipeline(path1, path2, pathOut):
    fifo1 = "/tmp/fifo1"
    fifo2 = "/tmp/fifo2"

    # Create named pipes
    os.mkfifo(fifo1)
    os.mkfifo(fifo2)

    # Start the download and gunzip processes, streaming to FIFOs
    threading.Thread(target=stream_to_fifo, args=(fifo1, ["aws", "s3", "cp", path1, "-", "|", "gunzip"])).start()
    threading.Thread(target=stream_to_fifo, args=(fifo2, ["aws", "s3", "cp", path2, "-", "|", "gunzip"])).start()

    # Run the command with the FIFOs as inputs
    with open(fifo1, 'rb') as input1, open(fifo2, 'rb') as input2:
        cmd = subprocess.Popen(["cmd", "-1", "-", "-2", "-"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

        # Create threads to avoid blocking
        def stream_to_cmd(input_stream):
            for line in input_stream:
                cmd.stdin.write(line)
            input_stream.close()

        threading.Thread(target=stream_to_cmd, args=(input1,)).start()
        threading.Thread(target=stream_to_cmd, args=(input2,)).start()

        # Close cmd's stdin after the inputs are done
        cmd.stdin.close()

        # Pipe cmd's output to gzip and then upload
        gzip = subprocess.Popen(["gzip"], stdin=cmd.stdout, stdout=subprocess.PIPE)
        subprocess.run(["aws", "s3", "cp", "-", pathOut], stdin=gzip.stdout)

    # Cleanup
    os.remove(fifo1)
    os.remove(fifo2)

# Example usage
run_pipeline("s3://path/to/file1", "s3://path/to/file2", "s3://path/to/output")
```
This script creates two named pipes and starts two threads to handle the downloads and the decompression. The outputs of these processes are directed to the named pipes. The main command reads from these pipes. The use of threads ensures that the reading and writing to the command’s stdin do not block each other.
This approach is more complex than the initial script but allows the parallel processing of the two data streams as required. As before, error handling and cleanup are simplified for demonstration purposes and should be robustly implemented for production use.
Unfortunately, while this starts off looking promising, it’s missing key steps: it’s still trying to stream the input files to `cmd`’s standard input, instead of providing `cmd` with two separate filenames which, when read, stream their respective inputs.
Let’s do another iteration with your comment as the prompt:
GPT-4:
To correctly handle the situation where `cmd` expects filenames as arguments and those files are actually streams of data from subprocesses, you can use the `/dev/fd` filesystem on Unix-like systems. This special filesystem allows you to treat file descriptors as file paths, which is essentially what Bash’s process substitution does.
Here’s how you can modify the Python script to achieve this:
1. Use subprocesses to handle the download and decompression, just like before.
2. Instead of writing to named pipes, pass the file descriptors of these subprocesses directly to `cmd`.
Here’s the revised script:
```python
import subprocess
import os

def run_pipeline(path1, path2, pathOut):
    # Start downloading the first file and pipe it to gunzip
    dl1 = subprocess.Popen(["aws", "s3", "cp", path1, "-"], stdout=subprocess.PIPE)
    gunzip1 = subprocess.Popen(["gunzip"], stdin=dl1.stdout, stdout=subprocess.PIPE)

    # Start downloading the second file and pipe it to gunzip
    dl2 = subprocess.Popen(["aws", "s3", "cp", path2, "-"], stdout=subprocess.PIPE)
    gunzip2 = subprocess.Popen(["gunzip"], stdin=dl2.stdout, stdout=subprocess.PIPE)

    # Wait for the pipes to be ready
    dl1.stdout.close()
    dl2.stdout.close()

    # Run the command with file descriptors
    cmd = subprocess.Popen(["cmd", "-1", f"/dev/fd/{gunzip1.stdout.fileno()}", "-2", f"/dev/fd/{gunzip2.stdout.fileno()}"], stdout=subprocess.PIPE)

    # Pipe cmd's output to gzip and then upload
    gzip = subprocess.Popen(["gzip"], stdin=cmd.stdout, stdout=subprocess.PIPE)
    subprocess.run(["aws", "s3", "cp", "-", pathOut], stdin=gzip.stdout)

    # Cleanup
    gunzip1.stdout.close()
    gunzip2.stdout.close()
    gzip.stdout.close()

# Example usage
run_pipeline("s3://path/to/file1", "s3://path/to/file2", "s3://path/to/output")
```
In this script:
* `gunzip1.stdout.fileno()` and `gunzip2.stdout.fileno()` give the file descriptors of the gunzip processes.
* These descriptors are converted to file paths using `/dev/fd/`.
* `cmd` is then invoked with these file paths as arguments, making it read from the streams.
This approach replicates the behavior of Bash’s process substitution in Python, allowing `cmd` to read from the streams as if they were files. Note that this method is specific to Unix-like systems that support the `/dev/fd` filesystem.
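One caveat worth flagging about this script, should it actually be run: `subprocess.Popen` closes inherited file descriptors in the child by default (`close_fds=True`), so `cmd` would most likely not be able to open the `/dev/fd/...` paths unless the pipe descriptors are explicitly kept open for it, e.g. via `pass_fds`. A minimal, untested sketch of that adjustment, reusing the variable names from the script above (only the `cmd` invocation changes):

```python
# Untested sketch: pass the gunzip pipe descriptors through to the child,
# so that the /dev/fd/N paths remain valid inside cmd (close_fds=True is
# the default and would otherwise close them).
fd1 = gunzip1.stdout.fileno()
fd2 = gunzip2.stdout.fileno()
cmd = subprocess.Popen(
    ["cmd", "-1", f"/dev/fd/{fd1}", "-2", f"/dev/fd/{fd2}"],
    stdout=subprocess.PIPE,
    pass_fds=(fd1, fd2),
)
```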
Thanks! I haven’t tested this, but reading through it seems plausible. This is the first GPT response I’ve seen where it “understood” that the command needed its inputs as two file arguments.
(Though all of my playing was with free 3.5.)

Great!

Yes, it often requires a conversation, taking a few turns to discuss the situation and create enough context for it...
The difference between GPT-4 and GPT-3.5 is rather drastic. GPT-4 might still be the only “next generation” system available for public use in the world today (although, to be fair, I have not explored new Anthropic models), and I think it is the first system which often seems to actually demonstrate “true understanding”, whatever the nature of that phenomenon might be.
(But, of course, if one would really like an API like `pipeline = subprocess.Pipeline()`, then perhaps it is possible to implement a `Pipeline` class which would do exactly what you describe in the example.
Or, perhaps, a Pythonic wrapper which has a reasonable user-facing syntax, but converts it into the shell call above, would be even better...)
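To illustrate that second idea, here is a very rough, untested sketch of such a wrapper; the function name and signature are invented for illustration, and it simply delegates the actual process substitution to Bash:

```python
import subprocess

def run_substituted(cmd_template, input_cmds, output_cmd=None):
    """Hypothetical wrapper: format each input pipeline as a Bash process
    substitution <(...) and run the whole command line via bash -c."""
    substitutions = [f"<({src})" for src in input_cmds]
    bash_cmd = cmd_template.format(*substitutions)
    if output_cmd:
        bash_cmd += f" | {output_cmd}"
    # Process substitution is a bashism, so invoke bash explicitly.
    return subprocess.run(["bash", "-o", "pipefail", "-c", bash_cmd], check=True)

# Hypothetical usage, mirroring the pipeline discussed above:
run_substituted(
    "cmd -1 {} -2 {}",
    [
        "aws s3 cp s3://path/to/file1 - | gunzip",
        "aws s3 cp s3://path/to/file2 - | gunzip",
    ],
    output_cmd="gzip | aws s3 cp - s3://path/to/output",
)
```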