giovedì 12 marzo 2015

Fork Join to transform list of objects

I'm starting this new blog with a little framework that is aimed to simplify the processing of lists in Java.

I'm quite sure in your career you have faced the following use case: given a list of objects you have to transform it in a new list of objects, where each object is obtained through a transformation (complicated or simple) of the original objects.

The simple way to resolve this problem is to iterate through the original list, transform each object and add it to a new list. It is something like this:


List<Object> newList = new ArrayList<Object>();
for (Object originalObject : originalList) {
    newList.add(transform(originalObject));


It is simple, isn't it?

The problem may arise when you have a huge amount of objects.

Keep in mind that we're not talking about big data, so the 'huge' word should be intended as thousand of objects, a sizing that is too little to start thinking about a complicated map/reduce algorithm but, at the same time, it has impact on performances if the process if executed sequentially like showed in the code snippet above.

The solution I developed is based on the Fork/Join pool introduced in Java 7. Probably with Java 8, with its support for lambdas, there is a more elegant and coincise solution, but our target JVM at the moment is JDK 7, so I focused on this platform.

This is the generic list transformer.



import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RecursiveTask;

public class ListTransformer<T, S> extends RecursiveTask<List<T>> {

private static final long serialVersionUID = 1L;

static final int LIMIT = 1000;

private List<S> sources;
private TransformFunction<T, S> function;

public ListTransformer ( List<S> sources, TransformFunction<T, S> function ) {

super ();
this.sources = sources;
this.function = function;
}

@Override
protected List<T> compute () {

if ( sources.size() < LIMIT ) {
List<T> preparedEntities = new ArrayList<T> ();
for ( S source : sources ) {
preparedEntities.add ( function.transform( source ) );
}
return preparedEntities;
}
else {

int divider = sources.size () / 2;

ListTransformer<T, S> curLeft = new ListTransformer<T, S> ( sources.subList ( 0, divider ), function );

ListTransformer<T, S> curRight = new ListTransformer<T, S> ( sources.subList ( divider, sources.size() ), function );

curRight.fork ();
List<T> leftResult = curLeft.compute ();
List<T> rightResult = curRight.join ();
leftResult.addAll ( rightResult );

return leftResult;
}
}

public interface TransformFunction<T, S> {

T transform (S source);
}
}


The class is parameterized: S should be intended as the source object, T as the transformed object. The process is quite simple. If the size of the processed list is above a reasonable limit, we create two new subtasks, firing the first and wait for the completion of the second, which can trigger internally other tasks until we get a list of the desired size to apply the transformation.

Let's see it in action. Suppose you have to deal with the trivial task of transforming a list of Integer into a list of String. This is a test case that shows how to implement this task with a Fork / Join approach

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;

import org.junit.Assert;
import org.junit.Test;


public class ListTransformerTest {


@Test
public void testList() {
List<Integer> source = new ArrayList<Integer>();
for (int i = 0; i < 5000; i ++) {
source.add(i);
}

ForkJoinPool pool = new ForkJoinPool ();

List<String> transform = pool.invoke ( new ListTransformer<String, Integer> ( source, 
new ListTransformer.TransformFunction<String, Integer>() {

@Override
public String transform ( Integer source ) {

return source.toString ();
}

   } ) );

for (int i = 0; i < 5000; i ++) {
Assert.assertEquals( "" + i, transform.get ( i ) );
}
}
}


The test checks that the order of the new list is what we're expecting. This is important, because I don't think you want that the new list contains objects in random order :-)

Now it's time to demonstrate that this is not only an exercise but it can actually give you much better performances. I created a simple JMH benchmark (in a future post I will talk about this great framework) to test the performances of the Fork / Join approach compared to the traditional sequential approach. The tests were conducted on a MacBook Pro late 2014 model (core I7 2.4 GHz, 16 GB Ram)

Benchmark                              (size)   Mode  Cnt      Score  Error 
ListTransformerBenchmark.testNew         1000  thrpt    5  26182,426 ±  899,082  ops/s
ListTransformerBenchmark.testNew         2000  thrpt    5  16690,305 ± 2213,340  ops/s
ListTransformerBenchmark.testNew         5000  thrpt    5   9268,967 ±  257,888  ops/s
ListTransformerBenchmark.testNew        50000  thrpt    5   1356,710 ±   81,285  ops/s
ListTransformerBenchmark.testStandard    1000  thrpt    5  32834,898 ±  870,725  ops/s
ListTransformerBenchmark.testStandard    2000  thrpt    5  15759,069 ±  884,292  ops/s
ListTransformerBenchmark.testStandard    5000  thrpt    5   6198,490 ±  253,743  ops/s

ListTransformerBenchmark.testStandard   50000  thrpt    5    561,933 ±   27,503  ops/s

The testNew, as you may guess, is the method that use the Fork/Join transformation, while the testStandard method use a simple loop.

The results show that for a small list, the traditional approach is faster, probably because the overhead necessary to create the ForkJoin pool is heavier than the transformation process itself. When the list reaches a size of some thousand of objects, the multithreaded approach shines. We can see that for a list of 5000 objects, the score reached for the Fork/Join transformation is 9228 operation per second, while the sequential process stops at 6199 operations per second. A nice gain of performance I think.

For bigger lists, the Fork/Join approach shows better performances. For a list of 50000 objects, the multithreaded transformer grants a huge gain of performance (the 240%).

If you're in need of extreme optimization on your codebase, I think that this is a valuable solution.

Cheers




Nessun commento:

Posta un commento