15-418 Final Project | Tim Brooks and Suhaas Reddy
For our final deliverable, we implemented a parallel approximation of seam carving on still images. Reducing our original seam carving plans down to images, we first implemented a dynamic programming-based content-aware resizing algorithm for still images. Upon realizing the limitations of the inherently sequential DP dependencies, we devised an approximation of seam carving through a greedy approach to the problem, providing similar and, in some cases, better, results when compared to the true min-seam algorithm. The approximation relieved data dependencies to enable much faster runtimes.
Note that we changed our project idea due to infeasibility halfway through the project timeline. More about or old project and why we changed can be found at our initial webpage.
The standard seam carving algorithm uses extremely simple data structures. Essentially, because of the dynamic programming approach, the algorithm uses the gradient of the current row of pixels, along with the weights of the seams running to each of the pixels in the previous row, to calculate the minimum seam to each pixel in the current row by extending one of the three adjacent min-seams above each pixel to that pixel. In order to accomplish this, the algorithm maintains an energy gradient for the entire image, based on the differences in RGB values between neighboring pixels, a matrix of offsets (-1,0, or 1) to traverse back up minimum seams to their start, and a previous and current sum array, to store the sums of the weights of the min-seams leading to the current and previous row of the image.
To find the initial energy gradient for the image, we simply compare pixels to one another an calculate a difference score to store at every index of the energy matrix. To avoid edge cases in energy updates and improve performance, we avoid allowing the seam to reach the edge of the image by setting the energy on the left and right sides to U_CHAR_MAX. To update this data structure as we traverse down the image, we look at the 3 seam weight sum values above each pixel in the current row, up to the left, directly up, and up to the right, and determine which is minimal. Then, we update the index matrix to reflect our choice, -1 for left, 0 for up, 1 for right, and add the current pixel's energy gradient to the selected seam's summed energy to get the new sum for that x coordinate. We continue this process down the image to find minimum seams leading to every pixel in the bottom row of the image. At the end of one iteration from the top to bottom of the image, we can iterate across the sum array to determine the minimum-weight seam's ending pixel, and trace that seam back up the image to find the entire seam and remove it. We then repair the image and energy values along the hole, and repeat the process until we have removed the desired number of seams. This allows us to output a smaller image than the input, which still retains the most salient information from the initial photo.
We predicted that finding the initial energy matrix and calculating the new sums for each row of the image would stand to benefit the most from parallelism, but in the end, spawning threads to calculate the energy map actually hurt performance, possibly due to the overhead of spawning the threads in the first place. Therefore, we were only able to see benefits from parallelizing the loop across each row of the image, which also has a great deal of locality to exploit because neighboring pixel values will already be in the cache of a thread iterating across a section of the image. As described earlier, iterating down the rows of the image in the dynamic programming algorithm is inherently sequential, and these dependencies severely limit the parallelism we can exploit in the DP algorithm. We are also limited by the sequentiality of iterating back up the matrix to find, remove, and repair the seam calculated by each step.
In tackling this problem, we considered and tested implementations on both CPUs and GPUs. We found that with our initial DP implementation, the problem was faster on a CPU, and therefore proceeded on a CPU for our other implementations. While the improved performance of a GPU is extremely evident on problems that can be broken down into blocks, such as convolutions or the rendering in our second assignment, our seam carving algorithm suffers from requiring the entire image, or much larger blocks of the image, in memory. On a CPU, we therefore greatly benefit from L3 cache hits, and have minimal transfer of data. However, on the GPU, the performance is inhibited by accessing and moving data which does not fit on shared memory, as well as the overhead of synchronizing between many kernel calls. On each iteration, we would need to transfer data back to the CPU to repair the image, or use a single GPU thread, which severely underutilized GPU compute power. An algorithm for seam carving that is more suitable for the benefits of a GPU could be explored, but for the extent of our project, we selected a CPU and used the Xeon Phi 5110p’s on the Latedays cluster for their high number of execution contexts, large caches, and our familiarity with the Latedays system. We selected OpenMP as our primary means of using threads to parallelize the application for ease of tweaking parameters relative to other systems. We elected to first focus on parallelizing with threads rather than SIMD, though we believe that the application could definitely be improved via vector instructions as well.
Once we had implemented the dynamic programming algorithm described in previous sections, we realized that the inherent sequentiality of a dynamic programming approach we significantly bottlenecking our performance. Observe the runtime chart of the dynamic programming algorithm on 1080x1920 pixel images:
Adding threads actually increased runtime of the algorithm because the overhead of spawning and joining threads actually overshadowed the increased comput. In some cases, half of our wall clock execution time was spent in the get_seam() function, which iterates down the rows of the image finding a seam through dynamic programming. We were able to improve performance slightly by dividing processing of each row among a few threads, but soon hit a ceiling where the overhead of managing threads overwhelmed the benefit, and achieved less than a 1.2x speedup in either case. To counter this, we devised an approximation of the original algorithm which significantly reduced sequential dependencies.
Our initial approximation attempt involved dividing the image into horizontal strips and running DP seam carving separately on the strips. This separated dependencies between strips of the image, but produced extremely undesirable results; the salient parts of the image could vary drastically between chunks, causing seams to be removed from different vertical sections of each strip. This meant strips did not need to (and usually didn't) line up.
Our next attempt at an approximation involved a greedy approximation. Rather than accumulating seam sums and having each pixel look upwards to determine which seam to extend, we had seams look downwards to search for the lowest energy pixel among its three down-neighbors. This meant that seams no longer needed to produce global minimum weight seams, but generally at least a few seams produced via this method have reasonably low energies, and the approach completely eliminates vertical dependencies between entire rows of the image. Each seam can extend itself completely independently of the others. This means we no longer need to sequentially process the rows of the image to find minimum seams.
We then took an extra step to combine this approximation with the previous chunking idea. In the image below, the horizontal line represents the pixels from which we start growing our seams. By growing one seam upwards and one downwards across the image, we reduce the degradation in seam quality caused by using local minima, and also decrease data dependencies, allowing the top and bottom half of each seam to be calculated independently.
This dual approximation produced quite good results which maintained most of the important parts of the image. In some cases, we even think the results look better than the original DP algorithm, as below:
To measure the performance of our approximation, we used speedup as measured by omp_get_wtime() calls at the start and end of our algorithm's execution, which provided us with the wall clock runtime of our algorithms, independent of multithreading provided by OpenMP. To benchmark our performance, we used 1080x1920 and 2848x4288 images of landscapes because they provided clear distinct objects that we would like to be preserved by the algorithm and approximation. It would be nice to test our algorithm and approximation on very large images with large thread counts, but we didn't do so mainly for time constraint reasons. The baseline runtime of the algorithm already jumped to over 250 seconds on the larger image above, but in future exploration, we would definitely like to optimize our approximation to work well with varying image sizes. Using two relatively differently sized images did show us a little bit of how our current approximation holds up over varying image sizes. Below we can see the runtimes of the original and greedy approximations with varying thread counts.
Using 16 threads in our approximation provided over a 4x speedup relative to the original dynamic programming algorithm. Recall that attempting to parallelize the DP algorithm actually introduced more overhead than the time it saved, so this represents a speedup relative to the most parallel version of the original algorithm we could hope to produce through multithreading alone. In the first chart, you may notice that performance of the greedy algorithm tended to flatten at around 8 to 16 threads. Based on some basic wall clock time calls, we spent a significant portion of the wall clock time of each seam removal in the remove_seam function, which sequentially removes the minimum cost seam and repairs the pixels/energy matrices. However, we did not use a true profiler on latedays, so we are unable to verify this for certain. For the type of approximation we performed, we are very confident that a CPU was the correct choice, but other faster approximations involving breaking the image into smaller sections could warrant implementation on a GPU.
We were inspired to pursue seam carving largely by the 15-210 assignment on dynamic programming, and gained our basic understanding of the DP algorithm from that assignment. Additionally, gained more specific information about the algorithm from this video from SIGGRAPH 2007 as well as the original paper by Shai Avidan and Ariel Shamir.
Suhaas - Majority of writeups/presentation, gridcut/maxflow temporal seam carving initial implementation, some debugging basic CPU implementation of 2d seam carving, basic tiling approximation, debugging greedy approximation, benchmarking
Tim - debugging gridcut/maxflow temporal seam carving, editing test videos for initial temporal seam carving project, majority of initial CPU and GPU code for 2d seam carving, greedy approximation implementation for CPU