Banking Industry: Innovate or Perish

With the advent of digitization, the banking industry is undergoing a massive change: a disruption in the way traditional banks do business. Driven by technological shifts and changes in the regulatory environment, emerging FinTech players are giving incumbents a good run for their money. New players, business ideas, platforms, and models keep emerging because the technological landscape is constantly evolving.

New, innovative technology entrants are reaching customers effectively while most banks are still struggling to stay relevant. As customers lean toward a seamless digital banking experience, traditional banking practices are gradually fading into oblivion.

Companies like LendingClub, Square, Uber, Mint, Alibaba, etc., are disrupting the traditional model of banking. LendingClub, for example, provides peer-to-peer loans at lower interest rates, Uber provides auto financing to drivers and Alibaba provides the largest online payment platform.

Digital technologies are reshaping the industry in several ways.

First, financial transactions are moving from traditional channels to digital channels (mobile and web), changing the way customers interact with banks and FinTech companies.

Second, banks are getting unbundled. Lending platforms and digital payments are growing phenomenally, reducing reliance on banks for making payments or borrowing money.

Third, the speed of financial transactions has increased dramatically. Transactions such as remittances that used to take days can now be completed within a few seconds using blockchain technology.

And last, robo-advisors that use complex computer algorithms are poised to replace human financial advisors, providing unbiased financial advice at a much lower cost.

Why Must Banks Reinvent Themselves?

Retail banks are facing difficult times. They may soon disappear as technology players catch up fast and challenge their very existence. Tech startups are building advanced business models that are difficult for banks to imitate at such a rapid pace.

It is becoming imperative for incumbent banks to reinvent themselves digitally; otherwise, they may have a hard time building relationships with the next generation of consumers, and they risk losing market share and becoming obsolete.

This tipping point, on the other hand, provides a huge opportunity for growth to banks, provided they relentlessly innovate to offer better online experiences to the customers with product offerings tailored to their needs.

Banks should undertake digital transformation initiatives so that they can move as quickly as startups do. First, banks need to adopt an agile mindset to build digital solutions for their customers faster. By promoting a lean learning culture, they can test and refine the customer experience and iterate on products faster, resulting in a better product-market fit.

Second, they need to extend their existing technology capabilities and go fully digital. They may be using technology for workflows and internal operations, but most of them still send and receive documentation by paper or fax for account opening and for loan and credit card applications.

The alternative is to let customers do everything on their smart devices, from opening an account and paying bills to investing in their preferred financial instrument, without ever stepping inside a physical branch.

FinTechs and Banks Need to Collaborate

The primary goal of FinTechs was to solve the banking sector’s problems by providing a seamless digital customer experience, streamlining operational processes, and offering customers innovative products and services in lending, payments, wealth management, and remittance. Collaboration between banks and FinTechs can bring immense value to both parties.

FinTechs are technology focused. They are not constrained by any bulky legacy banking system. They can add agility to ideas/technology implementations at banks and can help increase their product and service offerings.

Banks, by contrast, are risk-averse and usually move slowly because of the highly regulated environment. However, they can offer FinTechs their industry expertise (operations, legal, risk) and access to their banking systems infrastructure.

Though banks have been reluctant, some of them have been offering APIs that let FinTechs provide a wide range of products and services to their customers. Moreover, regulations such as the revised Payment Services Directive (PSD2) are compelling banks to open up to third parties and bring more transparency to the system.

Overall, banks are bringing the barriers down, allowing FinTechs to enter and giving them access to a huge customer base that would otherwise take them much longer to build.

The Future of the Industry

The banking industry will remain highly competitive, with an increasing number of FinTech entrants bringing creative banking solutions to customers and earning profits. It is quite possible that some traditional banks won’t be able to adapt to this changing industry. Unfortunately, this might either lead to the collapse of a few banks or force them to radically change their business models.

The collaboration between banks and FinTechs will be the key to shaping the future of the financial services ecosystem. Industry players who relentlessly innovate to provide a frictionless digital customer experience while delivering personalized products and services faster will survive longer in this constantly evolving landscape.

About Xebia

Xebia, a niche agile software development and digital consulting firm, has teamed up with highly qualified partners to implement digital banking solutions for banks and financial organizations globally. We have a Center of Excellence (COE) dedicated to providing state-of-the-art Omni-Channel Digital Banking Solutions. Our mission is to help financial clients succeed in their digital transformation so that they can not only survive, but also thrive in today’s highly competitive landscape.



Reusing Rails Scopes

In this post, I would like to discuss one of the least explored methods of ActiveRecord:
merge. I came across this method after working with Rails for about two years.
It mainly focuses on merging scopes across associations. Confused? Let’s head straight to an example.

Let’s say I have orders, line items and products as follows:

class Order < ApplicationRecord
  has_many :line_items
  has_many :products, through: :line_items
end

class LineItem < ApplicationRecord
  belongs_to :order
  belongs_to :product
end

class Product < ApplicationRecord
  scope :popular, -> { where("products.published = ? and products.bought_count > ?", true, 200) }
end

Now I would like to find all the orders with popular products. Normally we would approach the problem as follows:

  Order.joins(:products).where("products.published = ? and products.bought_count > ?", true, 200)

or, wrapped in a scope on Order:

class Order < ApplicationRecord
  has_many :line_items
  has_many :products, through: :line_items

  scope :with_popular_products, -> { joins(:products).where("products.published = ? and products.bought_count > ?", true, 200) }
end

The resulting query is

  SELECT "orders".*
  FROM "orders"
  INNER JOIN "line_items" ON "line_items"."order_id" = "orders"."id"
  INNER JOIN "products" ON "products"."id" = "line_items"."product_id"
  WHERE (products.published = 't' and products.bought_count > 200)

Here, the issues are

  1. Our code does not abide by the DRY principle.
  2. If the business logic for a popular product changes, we have to remember to change it in every place it is used.
  3. The popular product logic should have nothing to do with our Order model. It should stay encapsulated within the product.

How does merge come to the rescue?

Let’s see for ourselves how merge can help us get rid of the above issues. We can rewrite the above scope as

scope :with_popular_products, -> { joins(:products).merge(Product.popular) }

which results in

  SELECT "orders".*
  FROM "orders"
  INNER JOIN "line_items" ON "line_items"."order_id" = "orders"."id"
  INNER JOIN "products" ON "products"."id" = "line_items"."product_id"
  WHERE (products.published = 't' and products.bought_count > 200)

It fires the same query as above, but the code is much cleaner and avoids duplication.

Here we can see that the product logic remains within the product and we can reuse it whenever and wherever required.

There are some more ways of using merge

  1. Performing a join with multiple where conditions across tables

    Let’s suppose we need to find the delivered orders whose products are published.
    There are two ways to do it:

    Order.joins(:products).where(status: :delivered, products: { published: true })

    Order.where(status: :delivered).joins(:products).merge(Product.where(published: true))

    Both result in the same query:

    SELECT "orders".*
    FROM "orders"
    INNER JOIN "line_items" ON "line_items"."order_id" = "orders"."id"
    INNER JOIN "products" ON "products"."id" = "line_items"."product_id"
    WHERE ("orders"."status" = 'delivered' and products.published = 't')

    But what if my Product model has a different table name, say deals?

    class Product < ApplicationRecord
      scope :popular, -> { where("deals.published = ? and deals.bought_count > ?", true, 200) }
      def self.table_name
        "deals"
      end
    end

    With the first approach, I’ll have to change my where query to match the table name, and I’ll always have to keep such table name mappings in mind:

    Order.joins(:products).where(status: :delivered, deals: { published: true })

    But do you know what happens to the query that uses merge?
    Voila! No change needed! merge builds the condition from the Product model, so it picks up the table name automatically.

  2. Merging two results

    Imagine you have a website with various technical course videos, and which videos a user can watch depends on the user’s access level.
    For example, a guest user can view, say, just the first video of each course; a user with an account can view some free courses; and a user with a subscription can view the paid courses too.

    Suppose we’re using CanCan to manage the abilities of the user. To find the videos accessible by a user we can use

    Video.accessible_by(current_ability)

    where current_ability tells me the access rights of the user (guest/free/subscription).

    Now, if I want to find the videos accessible by the current user which are also published:

    accessible_videos = Video.accessible_by(current_ability)
    Video.where(published: true).merge(accessible_videos)

    It returns the intersection of all published videos with the ones accessible by the current user.

Please drop in your suggestions/feedback in the comments below to help me improve.

Performance Tuning the Camel Parameters in a Backbase CXP Application

Backbase is an Omni-Channel Digital Banking platform that empowers financial institutions to accelerate their digital transformation and compete effectively in a digital-first world. It unifies functionality from traditional core systems and new FinTech capabilities into a seamless digital customer experience, thereby drastically improving any customer channel.
In any banking application, we interact with core banking for everything via a Middleware ESB. In a Backbase CXP application, we make all calls to the Middleware via Camel. The architecture and system interactions of a typical Backbase CXP application are shown below.

[Figure: Backbase CXP interaction]

In a recent Backbase CXP project that went live, we began experiencing slowness in the application when the number of concurrent users rose above 200, and it became difficult to use the iOS and Android apps that were consuming the Backbase CXP backend. We generated thread dumps at the time the system was hanging and analyzed them using the tools samurai and jvisualvm. Lots of the threads were in the WAITING state. Analyzing the thread dumps further, we found that many of these threads were waiting with a stack trace like the one below.

"http-nio-8080-exec-499" #679 daemon prio=5 os_prio=0 tid=0x00007fb1f4204000 nid=0xb685 in Object.wait() [0x00007fb148308000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(
- locked <0x0000000e853c1c30> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(
at org.apache.commons.httpclient.HttpClient.executeMethod(
at org.apache.commons.httpclient.HttpClient.executeMethod(

As we can see in the thread dump snippet above, the thread is waiting while trying to get a connection from the MultiThreadedHttpConnectionManager. We identified this as the problem that was causing so many threads to wait, which resulted in the slowness. In our codebase, we had created a common Camel route to connect to the Middleware for all calls, so this is the common HTTP endpoint that Camel invokes every time. Looking further into our codebase and the Camel source code, we discovered that the camel-core jar uses the Apache commons-httpclient.jar to make HTTP connections to the Middleware through the class MultiThreadedHttpConnectionManager. We found that Backbase uses its own default multithreaded connection manager, defined in backbase-ptc.xml, which is part of the jar ptc-core.jar. The property ptc.http.maxConnectionsPerHost controls the number of connections per host.

<bean id="ptc_httpConnectionManager" class="org.apache.commons.httpclient.MultiThreadedHttpConnectionManager">
    <property name="maxConnectionsPerHost" value="$ptc{ptc.http.maxConnectionsPerHost}"/>
    <property name="maxTotalConnections" value="$ptc{ptc.http.maxTotalConnections}"/>
</bean>

The default values of the max connections in this connection manager are far too low to handle 200 concurrent users.

## Maximum number of concurrent requests for one remote resource.

## Maximum total number of concurrent requests.


The approach we took to solve this problem was to make the Backbase CXP CamelContext use a different MultiThreadedHttpConnectionManager with higher limits. To change the default Camel context, we had to add the default backbase-integration.xml to portalserver/src/main/resources/META-INF/spring/backbase-integration.xml and then edit the file to attach the new MultiThreadedHttpConnectionManager to the Camel context using the following code.

<bean id="http" class="org.apache.camel.component.http.HttpComponent">
    <property name="camelContext" ref="bb-integration-context"/>
    <property name="httpConnectionManager" ref="myHttpConnectionManager"/>
</bean>

<bean id="myHttpConnectionManager" class="org.apache.commons.httpclient.MultiThreadedHttpConnectionManager">
    <property name="params" ref="myHttpConnectionManagerParams"/>
</bean>

<bean id="myHttpConnectionManagerParams" class="org.apache.commons.httpclient.params.HttpConnectionManagerParams">
    <property name="defaultMaxConnectionsPerHost" value="1000"/>
    <property name="maxTotalConnections" value="1000"/>
</bean>

With the above code, we defined myHttpConnectionManager to handle a higher load and added it to the default Camel context, bb-integration-context. With this change, the Camel context picked up the new connection manager, which could handle a higher load of requests. This removed the bottleneck for the HTTP connections made to call the ESB, and the application performed well.
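
To see why the pool limits matter, here is a minimal, self-contained sketch in plain Java. It is not the commons-httpclient or Backbase code: BoundedPoolDemo and countWaiters are hypothetical names, and a Semaphore merely stands in for the per-host connection pool. It counts how many concurrent callers find the pool exhausted, which is the situation the WAITING threads in the dump were in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: a Semaphore plays the role of the connection pool.
class BoundedPoolDemo {

    /** Returns how many of `callers` concurrent requests could not get a connection at once. */
    static int countWaiters(int maxConnections, int callers) {
        Semaphore pool = new Semaphore(maxConnections);
        AtomicInteger waiters = new AtomicInteger();
        CountDownLatch allTried = new CountDownLatch(callers);
        ExecutorService requests = Executors.newFixedThreadPool(callers);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < callers; i++) {
                pending.add(requests.submit(() -> {
                    boolean gotOne = pool.tryAcquire();   // is a connection free right now?
                    if (!gotOne) waiters.incrementAndGet();
                    allTried.countDown();
                    allTried.await();                     // keep connections busy until everyone has tried
                    if (!gotOne) pool.acquire();          // this is where a real request would sit waiting
                    pool.release();                       // hand the connection back
                    return null;
                }));
            }
            for (Future<?> f : pending) f.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            requests.shutdown();
        }
        return waiters.get();
    }

    public static void main(String[] args) {
        System.out.println(countWaiters(2, 200));    // prints 198: almost every request waits
        System.out.println(countWaiters(1000, 200)); // prints 0: nobody waits
    }
}
```

The pool size of 2 in the first call is an arbitrary small number chosen for the demonstration, not the actual Backbase default; the second call uses the value of 1000 from the configuration above.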

Another simpler approach could have been to just change the default values in file


We could have done this, but the ptc module was going to be removed in upcoming Backbase versions, so we stuck with our initial approach.


In any Backbase CXP application, the Camel connection bottleneck is bound to appear with the default values (especially if CXP runs on a single node), because all HTTP requests are sent to the Middleware; once the number of concurrent users reaches 200 or more, the application slows down.
This blog has shown how to solve this performance issue, which is caused by threads waiting in MultiThreadedHttpConnectionManager.

Selenium with C# and xUnit

In this blog I will explain how you can start automating your functional tests using C# and xUnit.

A little introduction about the tools and technologies we are going to use:
  • C# is an elegant and type-safe object-oriented language that enables developers to build a variety of secure and robust applications that run on the .NET Framework.
  • Selenium is a portable software testing framework for web applications. Selenium provides a record/playback tool for authoring tests without learning a test scripting language (Selenium IDE). It supports C#, Java, Groovy, Perl, PHP, Python and Ruby. The tests can then be run against most modern web browsers. Selenium deploys on Windows, Linux, and Macintosh platforms.
  • xUnit is a free, open source, community-focused unit testing tool for the .NET Framework. Written by the original inventor of NUnit v2, xUnit is the latest technology for unit testing C#, F#, VB and other .NET languages.

I am not going into the details, as there is plenty of information available on the Internet about these; we will focus on creating an actual xUnit test framework using C# and Selenium.


Prerequisites

  • Visual Studio (for this blog we are using VS 2015)

Let’s Start

  • Open Visual Studio

  • Click on New Project. In the window that opens, select a Class Library project, enter a name for the solution, say “Functional”, and for the project, say “Demo”, then click OK.

  • Right-click on the project “Demo” and select “Manage NuGet Packages…”

  • Search for the following packages and install them:
    • Selenium.WebDriver
    • Selenium.Chrome.WebDriver (chrome driver exe)
    • xUnit.runner.visualstudio (to discover xUnit tests)


A note: xUnit has its own way of working. Unlike NUnit or MSTest, there is no [SetUp] or [TestInitialize]; you achieve the same thing with a parameterless constructor.

You can find more information in the following article:

  • Now you are ready to automate your first test. Change the class name to Tests (or whatever you want) and start writing your first test.

  • Now build your code by right-clicking the project Demo or by pressing Ctrl + Shift + B, and you will be able to see your test in “Test Explorer”.

  • To run your test, right-click on the test itself or click Run All in the Test Explorer.


  • It will execute the test, and if the Assert passes, it will show your test case as passed.

This blog is just to give an idea of how you can start with Selenium using C# and xUnit; now you can start building a complete framework of your choice. If you need in-depth information or have any feedback, please mention it in the comments.

Why do we react?

Functional programming and reactive programming used to be more theoretical concepts for frontend developers, because they seemed like overkill for something as simple as web pages in an era when the frontend was a dumb, static representation of the server state.

But now things have changed. Redux creator Dan Abramov rightly compares asynchronicity and mutation to Coke and Mentos. Each is commonplace on its own, but when the two come together, the mixture can become explosive enough to get out of control very quickly.

Today we create multi-platform, high-performance, rapidly evolving interfaces to cater to fairly complex applications.

User interfaces, being event-driven in nature, have had a problem of not scaling well. Changes or events in one part of a page can affect some other page or part of a page. It feels like a fire under control to start with, but when these effects become causes of other changes in the application, the end of these cascading cause-and-effect cycles quickly becomes unpredictable. Also, when an application is architected with little decoupling and abstraction between different layers of responsibility, there is no room for sophisticated design patterns in the frontend. Reasons like these combine to give rise to `edge cases` and then `dirty-hacks` in code that cannot be covered by automated tests.

ReactJS (and the ecosystem around it) is built on the principles of functional reactive programming, where the application is always in a predictable state and the entire view is a consistent mapping of that state object. That allows us to make controlled changes to the data to cause a visual change in the application, rather than changing the visible page itself (which gets efficiently re-rendered to reflect the change in state data). It’s not just easy but also intuitive to add automated tests for these applications, without which a lot of programmers today would call even fresh code legacy. Emerging design patterns like Flux (put together by Facebook) make it possible to control the data flow and the side effects of data changes.

That said, ReactJS is neither a complete frontend solution nor the only one available today. It fits in as the View (V) in the Flux and MV* (popular in the recent past) paradigms. Its popularity and success have opened the door to emerging alternatives that can be even more performant under certain circumstances (e.g. Inferno). While React and React-like libraries are the preferred fit for the V (view) part of modern frontend applications, the rest of the pieces, such as M (model), C (controller), presenter, pipeline and state store, are carefully chosen depending on the requirements of the project.

With the kind of decoupling built into ReactJS, it can be used not only in web applications but also in mobile and desktop applications, with performance as good as native alternatives, if not better. This brings the great advantage of being able to efficiently code views for all devices and platforms in a common language (JavaScript) with the same engineers. Facebook and Netflix are good examples of apps that run on React across all platforms of consumption.


Why Akka?

When we have to write a concurrent program, our focus shifts from the real problem and most of our time is spent on ‘How to make it concurrent’. The challenge is to make the problem domain less coupled with the code we are writing to make it concurrent. One solution to this is introducing Akka into your system.

Akka is a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant applications on the JVM, according to the official Akka website.

I will talk about a few points worth noting when considering Akka as a toolkit for your concurrent applications.

Creating a thread is not just creating an object:

When we create a normal Java object, we are only allocating heap memory for that particular instance. Creating a thread, however, comes with additional responsibilities. Every thread needs separate memory for its own stack. In addition, a Java thread has a one-to-one correspondence with an OS-level thread. When a new thread is created, the OS comes into the picture and manages the life cycle of the newly created thread, which needs a program counter, a register set and a stack, and must be registered with the thread scheduler. In essence, thread creation is slower than creating normal Java objects. Also, because of these added costs, there is a practical limit on the maximum number of threads per process (a few thousand threads). When we deal with an actor system, we are only creating Java objects (actors), not threads. An actor uses the threads available in the thread pools as and when needed. As actors are normal objects, millions of actor instances can be held in memory at once.
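
The idea of many cheap objects sharing a few threads can be sketched with the JDK alone. This is only an illustration, not Akka: ManyActorsFewThreads and TinyActor are hypothetical names, and a fixed thread pool stands in for the pool of threads that actors borrow when they have work to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: "actors" are plain objects, threads come from a small shared pool.
class ManyActorsFewThreads {

    static final class TinyActor {
        private int handled; // private state; each actor gets exactly one task in this demo

        void onMessage(AtomicLong processed) {
            handled++;
            processed.incrementAndGet();
        }
    }

    /** Creates actorCount plain objects and lets threadCount threads serve all of them. */
    static long run(int actorCount, int threadCount) {
        List<TinyActor> actors = new ArrayList<>(actorCount);
        for (int i = 0; i < actorCount; i++) actors.add(new TinyActor());

        ExecutorService sharedPool = Executors.newFixedThreadPool(threadCount);
        AtomicLong processed = new AtomicLong();
        for (TinyActor actor : actors) {
            sharedPool.execute(() -> actor.onMessage(processed)); // borrow a thread only while working
        }
        sharedPool.shutdown();
        try {
            if (!sharedPool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(run(100_000, 4)); // prints 100000: 100,000 objects served by 4 threads
    }
}
```

Creating 100,000 threads instead would be far more expensive, which is the contrast the paragraph above draws.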

Communication by passing messages is one of the greatest forms of abstraction:

Writing a concurrent program is not just about threads or locks; it’s about managing access to state and, in particular, to shared mutable state. It’s also about controlling concurrent data access. In Akka, state is managed by the actor itself and can only be changed by passing messages. One actor can’t control state defined by another actor. If one actor wishes to get the current state of another actor or wants to update it, it has to send a message. The other actor then takes an action based on the message and responds with the new state, again in the form of a message. The state modification is abstracted from the outside world. The state of an actor is controlled by the actor itself, and it processes messages one at a time, so there is no need to add complexity to handle synchronization explicitly. Messages are passed asynchronously, so there is no blocking at all unless you explicitly wait for a reply, for example when using the ask (?) pattern. This approach makes it easier to reason about your program. The higher-level complexity of concurrency is handled by the library itself, and what you write is what you really need to do.
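
A minimal sketch of this message-passing style, again using only the JDK rather than the Akka API, might look as follows. CounterActor, Increment, GetCount, tell and ask are names chosen for this illustration. For brevity, the actor here owns a single-threaded executor as its mailbox, whereas Akka shares dispatcher threads among many actors.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of an actor: private state, changed only by messages handled one at a time.
class CounterActor {
    interface Message {}
    static final class Increment implements Message {}
    static final class GetCount implements Message {
        final CompletableFuture<Integer> replyTo = new CompletableFuture<>();
    }

    private int count; // never touched by any thread except the mailbox thread
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    /** Fire-and-forget send: enqueues the message and returns immediately. */
    void tell(Message message) {
        mailbox.execute(() -> receive(message));
    }

    /** Request-reply: the caller blocks only if it chooses to wait on the returned future. */
    CompletableFuture<Integer> ask() {
        GetCount query = new GetCount();
        tell(query);
        return query.replyTo;
    }

    private void receive(Message message) {
        if (message instanceof Increment) {
            count++;
        } else if (message instanceof GetCount) {
            ((GetCount) message).replyTo.complete(count); // the reply is itself a message
        }
    }

    void stop() {
        mailbox.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        CounterActor counter = new CounterActor();
        Thread[] senders = new Thread[4];
        for (int i = 0; i < senders.length; i++) {
            senders[i] = new Thread(() -> {
                for (int n = 0; n < 1000; n++) counter.tell(new Increment());
            });
            senders[i].start();
        }
        for (Thread sender : senders) sender.join();
        System.out.println(counter.ask().join()); // prints 4000, with no locks in user code
        counter.stop();
    }
}
```

Four threads send concurrently, yet the counter needs no synchronized block because only the mailbox thread ever reads or writes it.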

Location Transparency:

When we need our system to scale out, we need additional remote resources, and we need to modify the system to handle remoting, which has its own complexities. The Akka philosophy is ‘Don’t share mutable state at all’. Actors are responsible for making changes to their own state, which keeps the components loosely coupled. Loosely coupled components are easy to handle and process. Not sharing state implies that computations don’t even require shared resources, so it becomes easy to scale your system by using multiple cores that each do independent units of work. A distributed system is much easier to configure in Akka. A remote actor feels as if it exists in the local system. The only difference is the network latency, which is going to be present anyway, as the actor is on the other side of the network. The rest is the same: you pass messages in the usual way, and they are delivered to the path where the actor is located. A path can be local (akka://ActorSystem/user/…) or remote (akka.tcp://ActorSystem@host:port/user/…). Creating a replica set is easy as well. When one of the remote systems goes down, the state of an actor can survive if it has been persisted, and a new actor system can be created that initializes its actors from their persisted state.

Delegation of tasks:

When we have a big problem to solve, it is easier to split it into sub-problems, solve them individually and then compose the results. Akka works on the same philosophy. One actor does not do all of its work itself but spawns multiple children. It segregates responsibilities. Failures are part of the system, and sometimes they cannot be predicted. The only thing that can be done is to recover from those failures. Let’s say you have a server. It consists of multiple components, e.g. database, mailing, logging, etc. All components interact with each other. In a highly coupled system, it is really difficult to make decisions on behalf of a single component. If one of the components crashes, you cannot handle that failure in isolation, as other components depend on it. You might have to stop the entire server or restart it. But what if components were capable of self-healing, or smart enough to decide what to do on certain failures? Creating such a system is difficult, but it becomes easier with Akka’s fault-tolerance mechanism. The parent-child hierarchy of actors makes it easy to make decisions for the children. The supervisor can choose among various strategies, such as restarting all of its children if one fails or restarting only the child that fails. Akka gives developers enough flexibility to choose among the available options depending on their needs.
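
The "restart only the child that fails" strategy can be illustrated with a small JDK-only sketch. It does not use Akka's supervision API: SupervisorSketch, Worker, CountingWorker and Supervisor are hypothetical names, and a restart is modelled simply as replacing the failed child with a fresh instance.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a parent that restarts a failed child instead of crashing with it.
class SupervisorSketch {

    interface Worker {
        String handle(String message) throws Exception;
    }

    /** A child with its own state; it fails whenever it receives "boom". */
    static final class CountingWorker implements Worker {
        private int handled;

        @Override
        public String handle(String message) {
            if (message.equals("boom")) throw new IllegalStateException("boom");
            return "handled " + (++handled);
        }
    }

    static final class Supervisor {
        private final Supplier<Worker> childFactory;
        private Worker child;
        private int restarts;

        Supervisor(Supplier<Worker> childFactory) {
            this.childFactory = childFactory;
            this.child = childFactory.get();
        }

        /** Forwards the message; on failure, replaces the child and carries on. */
        String forward(String message) {
            try {
                return child.handle(message);
            } catch (Exception failure) {
                child = childFactory.get(); // restart: a fresh child with fresh state
                restarts++;
                return "restarted after: " + failure.getMessage();
            }
        }

        int restarts() {
            return restarts;
        }
    }

    public static void main(String[] args) {
        Supervisor supervisor = new Supervisor(CountingWorker::new);
        System.out.println(supervisor.forward("a"));    // handled 1
        System.out.println(supervisor.forward("boom")); // restarted after: boom
        System.out.println(supervisor.forward("b"));    // handled 1 (state was reset by the restart)
    }
}
```

The failure stays contained in the child, and the rest of the program keeps running, which is the isolation the paragraph above describes.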

To summarize, Akka lets developers think in terms of the problem at hand and focus only on what needs to be built. Handling concurrency is taken off the developer’s hands and is taken care of by Akka itself. Developers also don’t need to worry about remote systems, as Akka treats local and remote actors uniformly. With its flexible fault management, it helps create complex systems in a simpler way.

Crystal – Let’s Call It Ruby Plus Plus


In a world of thousands of languages, Crystal is one more addition to the list. It is a general-purpose, object-oriented programming language. It is a compiled language and compiles to highly optimized native code using LLVM (Low Level Virtual Machine) as the backend.

But what’s so special about it you ask?

Well, you may have heard about Ruby and how everyone who has used it loves the language. Crystal looks almost like Ruby, but it goes one step further and fixes some of its shortcomings, such as:

  •  Concurrency
  •  Speed


Enterprise Integration Pattern With Spring

Recently, in one of my projects, I got a requirement to poll a directory and its subdirectories at a constant rate and process the files residing in them to derive some business information. To implement this, we used Spring's implementation of the enterprise integration patterns for two reasons: first, we were already using Spring as our backend framework, and second, it enforces separation of concerns between business logic and integration logic in an intuitive way, with well-defined boundaries that promote reusability and portability.

What is Spring Integration?

Spring Integration is Spring's implementation of the enterprise integration patterns. It supports integration with external systems via declarative adapters, and these adapters provide a higher level of abstraction over Spring's support for remoting, messaging, and scheduling. It does not need a container or a separate process space and can be invoked from an existing program, as it is just a JAR that can be bundled with a WAR or a standalone system.

As I mentioned, it works using adapters. We created an InboundChannelAdapter as a Spring bean, which starts when the application boots and constantly polls the specified directory, using the specified Scanner and Filter, as follows:

    @Bean
    @InboundChannelAdapter(value = "fileIn", autoStartup = "true", poller = @Poller(fixedDelay = "500"))
    public MessageSource<File> fileMessageSource() throws Exception {
        FileReadingMessageSource fileReadingMessageSource = new FileReadingMessageSource();
        fileReadingMessageSource.setDirectory(new File(pollingDir));
        fileReadingMessageSource.setScanner(dirScanner());
        fileReadingMessageSource.setFilter(compositeFilter());
        return fileReadingMessageSource;
    }

The Scanner specified above is the strategy for scanning directories. We used the WatchServiceDirectoryScanner implementation along with a composite filter, as our requirement is to scan the directory and its subdirectories for files whose names match a predefined pattern. The scanner is assumed here to watch the same pollingDir as the adapter:

    @Bean
    public DirectoryScanner dirScanner() throws Exception {
        WatchServiceDirectoryScanner watchServiceDirectoryScanner = new WatchServiceDirectoryScanner(pollingDir);
        return watchServiceDirectoryScanner;
    }

In this post we will be polling for a file pattern ending with .csv in a /Users/ArpitAggarwal/directory/ directory, so we implemented SimplePatternFileListFilter provided by framework as below:

    @Bean
    public CompositeFileListFilter<File> compositeFilter() throws Exception {
        return new CompositeFileListFilter<>(getFileListFilterList("*.csv"));
    }

    private List<FileListFilter<File>> getFileListFilterList(
            final String pattern) {
        List<FileListFilter<File>> fileListFilterList = new ArrayList<>();
        fileListFilterList.add(new SimplePatternFileListFilter(pattern));
        return fileListFilterList;
    }
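
To make the effect of the `*.csv` pattern concrete, here is a minimal plain-Java sketch that does not use Spring Integration at all. It relies on the JDK's glob `PathMatcher`; the class name `CsvPatternSketch` and the sample file names are made up for illustration, and the framework's own matching may differ in edge cases.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spring Integration.
final class CsvPatternSketch {

    // Keeps only the names whose last path element matches the glob pattern.
    static List<String> accept(String pattern, List<String> fileNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> accepted = new ArrayList<>();
        for (String name : fileNames) {
            if (matcher.matches(Paths.get(name).getFileName())) {
                accepted.add(name);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Sample names are invented; only the two names ending in .csv should survive.
        List<String> names = Arrays.asList("orders.csv", "orders.csv.tmp", "sub/items.csv", "notes.txt");
        System.out.println(accept("*.csv", names));
    }
}
```

Matching against the file name only, rather than the whole path, mirrors the requirement that files in subdirectories are picked up as well.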

By default the framework keeps track of the files it has read in memory only. That does not meet our need, because we want to process each file only once, even across server restarts. We therefore use the framework's FileSystemPersistentAcceptOnceFileListFilter, which needs a directory location where the information about already processed files is saved on disk in the form of a properties file, configured as follows:

    @Bean
    public FileSystemPersistentAcceptOnceFileListFilter persistentFilter() throws Exception {
        FileSystemPersistentAcceptOnceFileListFilter fileSystemPersistentAcceptOnceFileListFilter = new FileSystemPersistentAcceptOnceFileListFilter(
                metadataStore(), "");
        return fileSystemPersistentAcceptOnceFileListFilter;
    }

    @Bean
    public PropertiesPersistingMetadataStore metadataStore() throws Exception {
        PropertiesPersistingMetadataStore propertiesPersistingMetadataStore = new PropertiesPersistingMetadataStore();
        return propertiesPersistingMetadataStore;
    }
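
The idea behind a persistent accept-once filter can be sketched in plain Java with `java.util.Properties`. This is only an illustration of the concept under stated assumptions, not the framework's implementation: the class `AcceptOnceSketch`, its methods and the sample path are hypothetical, and it assumes a file is identified by its absolute path plus last-modified timestamp.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Properties;

// Hypothetical stand-in for a persistent accept-once filter; not the framework class.
final class AcceptOnceSketch {
    private final File storeFile;
    private final Properties seen = new Properties();

    AcceptOnceSketch(File storeFile) {
        this.storeFile = storeFile;
        if (storeFile.length() > 0) {
            try (InputStream in = Files.newInputStream(storeFile.toPath())) {
                seen.load(in); // restore what earlier runs already processed
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Returns true the first time a path/timestamp pair is seen and records it on disk.
    boolean accept(String absolutePath, long lastModified) {
        String stamp = Long.toString(lastModified);
        if (stamp.equals(seen.getProperty(absolutePath))) {
            return false; // already processed in this or a previous run
        }
        seen.setProperty(absolutePath, stamp);
        try (OutputStream out = Files.newOutputStream(storeFile.toPath())) {
            seen.store(out, "processed files");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    // Creates an empty temporary properties file to act as the on-disk store.
    static File newTempStore() {
        try {
            File store = File.createTempFile("accept-once", ".properties");
            store.deleteOnExit();
            return store;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File store = newTempStore();
        System.out.println(new AcceptOnceSketch(store).accept("/data/a.csv", 1L)); // true: first sighting
        System.out.println(new AcceptOnceSketch(store).accept("/data/a.csv", 1L)); // false: simulated restart
    }
}
```

Creating a second `AcceptOnceSketch` over the same store file simulates a server restart: the in-memory state is gone, but the properties file still remembers the processed file.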

Integrating FileSystemPersistentAcceptOnceFileListFilter into the composite filter changes the getFileListFilterList method as follows:

    private List<FileListFilter<File>> getFileListFilterList(
            final String pattern) throws Exception {
        List<FileListFilter<File>> fileListFilterList = new ArrayList<>();
        fileListFilterList.add(new SimplePatternFileListFilter(pattern));
        // also skip files already recorded as processed
        fileListFilterList.add(persistentFilter());
        return fileListFilterList;
    }

That is all the basic configuration needed to watch a directory and its subdirectories for files ending with .csv.

Next we need to define the action to take once a file is read. For that the framework provides @ServiceActivator, which is placed, together with its inputChannel, on a method that takes the file as an argument, as follows:

    public class FileInServiceActivator {

        // Invoked for every file that arrives on the "fileIn" channel
        @ServiceActivator(inputChannel = "fileIn")
        public void run(File file) {
            String fileName = file.getAbsolutePath();
            System.out.println("File to be processed " + fileName);
        }
    }

The complete source code is hosted on GitHub.

Big Data Testing

Big Data Testing in Hadoop Ecosystem

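This blog is for people who want to understand what to test in the Big Data ecosystem and which scenarios to cover in Big Data Testing. We will cover what Big Data is, why it needs testing, the high-level architecture and its components, the testing scenarios, and a few commonly used tools.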

What is Big Data?

Big Data is the new buzzword in the industry, primarily because of the large amount of data generated daily. The term describes data that is large in size and grows exponentially with time. Big Data is characterized by the 4 V’s: Volume (amount of data), Velocity (speed of data flowing in and out), Variety (range of data types and sources) and Veracity (uncertainty of data). As data increases, it becomes difficult to process, handle and manage. Traditional computing infrastructure cannot handle Big Data efficiently, so new computing technologies have been created to manage huge amounts of data and process it more quickly than traditional systems.

As enterprises move towards Big Data, it becomes important to understand the systems and technologies used in order to get the best out of them. Enterprises face a new learning curve in this move. Learning the technologies is just the starting block; designing, testing and implementing are the bigger challenges when moving to a whole new technology.

Why is Big Data Testing required?

With the introduction of Big Data, it becomes very important to test the system correctly with appropriate data. A system that is not tested properly can affect the business significantly, so automation becomes a key part of Big Data Testing for verifying the application and its functionality. If testing is done incorrectly, it becomes very difficult to understand an error, how it occurred and what the probable solution is, and the mitigation steps can take a long time. The result is incorrect or missing data, and correcting it without affecting the data currently flowing through the system is again a huge challenge. Because the data is so important, a proper mechanism should be in place to ensure that data is not lost or corrupted and that failovers are handled.

We will primarily discuss Big Data Testing in Hadoop, which is widely used for processing and handling large amounts of data. Hadoop is a framework that provides a cluster of computing resources for processing huge amounts of data. Extending a Hadoop cluster is easy: nodes are added as required, which should be carefully planned during the design/requirement stage.

Big Data Architecture

Let us have a look at the high level Big Data architecture.

Big Data Architecture

In the above diagram, the Data Storage block contains the Data Ingestion layer and the Data Processing layer, and is used to store processed data. Ingested data is stored in HDFS, which acts as the input for data processing. The diagram also shows the data pipeline from data ingestion to data visualization.

Explanation of Big Data Components

The Data Ingestion layer is responsible for ingesting data into Hadoop. It is a preprocessing stage and the entry point for incoming data, and it is used for batch, file or event ingestion. This layer is critical: if data is corrupted or missing here, it cannot be processed, which can lead to complete loss of that data. Handling failures and failover is therefore very important. Storage formats also play a crucial role in this layer, because compressing the data reduces I/O.

The Data Processing layer is responsible for processing the ingested data and aggregating it as per business requirements, using business rules. Hadoop processes the data with Map-Reduce operations. It is very important to create proper alert mechanisms so that failures are caught and resolved as soon as possible.

The Data Storage layer is responsible for storing the processed data from Hadoop. Because the volume of generated data is huge, this layer must be designed so that it can hold all of it, and designed very carefully to prevent disk corruption or other failures that lead to loss of data. This layer is also referred to as the data warehouse, used for storing infrequently accessed, archived or old data.

The Data Visualization layer is responsible for visualizing the data that has been ingested, processed as per business rules and stored. It is used for understanding the data and gathering insights from it. The stored data used for visualization can be in any format (Excel file, JSON file, text file, Access file etc.), and it does not have to be stored in HDFS.

Big Data Testing Scenarios

Let us examine the scenarios in which Big Data Testing can be applied to each of the Big Data components:

Data Ingestion :-

This step is considered the pre-Hadoop stage, where data is generated from multiple sources and flows into HDFS. In this step the tester verifies that data is extracted properly and loaded into HDFS.

  • Ensure that proper data from multiple data sources is ingested, i.e. all required data is ingested as per the defined schema and data not matching the schema is not ingested. Data that does not match the schema should be stored for stats reporting purposes. Also ensure there is no data corruption.
  • Compare the source data with the ingested data to validate that the correct data is pushed.
  • Verify that the correct data files are generated and loaded into HDFS at the desired location.
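
The schema check from the first bullet can be tried on a small sample as sketched below. This is a plain-Java illustration rather than Hadoop code; the three-field schema (numeric id, non-empty name, numeric amount), the class name `IngestionSchemaCheck` and the sample records are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical ingestion check; the schema (id, name, amount) is an assumption.
final class IngestionSchemaCheck {
    final List<String> accepted = new ArrayList<>();
    final List<String> rejected = new ArrayList<>(); // kept for stats reporting

    // A record matches when it has 3 comma-separated fields, a numeric id,
    // a non-empty name and a numeric amount.
    static boolean matchesSchema(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 3 || fields[1].trim().isEmpty()) {
            return false;
        }
        try {
            Long.parseLong(fields[0].trim());
            Double.parseDouble(fields[2].trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Routes every source line to either the accepted or the rejected list.
    void ingest(List<String> sourceLines) {
        for (String line : sourceLines) {
            (matchesSchema(line) ? accepted : rejected).add(line);
        }
    }

    public static void main(String[] args) {
        IngestionSchemaCheck check = new IngestionSchemaCheck();
        check.ingest(Arrays.asList("1,alice,10.5", "2,bob", "x,carol,3", "4,dave,7"));
        // Accepted plus rejected must equal the number of source records.
        System.out.println("accepted=" + check.accepted.size() + " rejected=" + check.rejected.size());
    }
}
```

A useful invariant for the tester is that the accepted and rejected counts add up to the source count, so no record disappears silently.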

Data Processing :-

This step is used for validating Map-Reduce jobs. Map-Reduce is a programming model used for condensing a large amount of data into aggregated data. The ingested data is processed by executing Map-Reduce jobs, which produce the desired results. In this step the tester verifies that the ingested data is processed using Map-Reduce jobs and validates whether the business logic is implemented correctly.

  • Ensure Map-Reduce jobs run properly without any exceptions.
  • Ensure key-value pairs are correctly generated after the MR jobs.
  • Validate that business rules are implemented on the data.
  • Validate that data aggregation is implemented and that data is consolidated after the reduce operations.
  • Validate that data is processed correctly by the Map-Reduce jobs by comparing the output files with the input files.
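
On a small sample, the aggregation checks above can be rehearsed without a cluster. The following sketch is an in-memory stand-in for a Map-Reduce job, not Hadoop code; the "customer,amount" record layout, the class name `AggregationCheck` and the sample values are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a Map-Reduce job that sums amounts per customer.
final class AggregationCheck {

    // "Map" step: turn "customer,amount" into a key-value pair.
    // "Reduce" step: sum the values that share a key.
    static Map<String, Long> sumPerKey(List<String> lines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            String[] kv = line.split(",");
            totals.merge(kv[0], Long.parseLong(kv[1]), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("alice,10", "bob,5", "alice,7", "bob,1", "carol,3");
        Map<String, Long> output = sumPerKey(sample);

        // Compare output with input: the grand total must survive aggregation.
        long inputTotal = 10 + 5 + 7 + 1 + 3;
        long outputTotal = 0;
        for (long value : output.values()) {
            outputTotal += value;
        }
        System.out.println(output + " totalsMatch=" + (inputTotal == outputTotal));
    }
}
```

Because the sample is tiny, the expected totals per key can be worked out by hand and compared with the job output record by record.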

Note: For validation at the data ingestion or data processing layers, we should use a small set of sample data (in KBs or MBs). With a small sample we can easily verify that the correct data is ingested by comparing source data with output data at the ingestion layer. It also becomes easier to verify that the MR jobs run without errors, that business rules are correctly implemented on the ingested data, and that data aggregation is done correctly, by comparing the output file with the input file.

If we start testing at the data ingestion or data processing layers with large data (in GBs), it becomes very difficult to verify each input record against the output record, and to validate whether business rules are implemented correctly.

Data Storage :-

This step is used for storing the output data in HDFS or any other storage system (such as a data warehouse). In this step the tester verifies that the output data is correctly generated and loaded into the storage system.

  • Validate data is aggregated post Map-Reduce Jobs.
  • Verify that correct data is loaded into storage system & discard any intermediate data which is present.
  • Verify that there is no data corruption by comparing output data with HDFS (or any storage system) data.
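
One common way to compare output data with what ended up in storage is to compare checksums. The sketch below is a plain-Java illustration using SHA-256 from the JDK; the class name `ChecksumCompare` and the sample byte contents are hypothetical, and in practice the bytes would be read from the job output and from the storage system.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper that detects corruption by comparing SHA-256 digests.
final class ChecksumCompare {

    // Returns the SHA-256 digest of the data as a lowercase hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    // True when both byte sequences hash to the same digest.
    static boolean sameContent(byte[] jobOutput, byte[] stored) {
        return sha256Hex(jobOutput).equals(sha256Hex(stored));
    }

    public static void main(String[] args) {
        byte[] jobOutput = "alice,17\nbob,6\n".getBytes(StandardCharsets.UTF_8);
        byte[] stored = "alice,17\nbob,6\n".getBytes(StandardCharsets.UTF_8);
        byte[] corrupted = "alice,17\nbob,9\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(sameContent(jobOutput, stored));    // true
        System.out.println(sameContent(jobOutput, corrupted)); // false
    }
}
```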

Other types of testing a Big Data tester can perform:

  • Check whether proper alert mechanisms are implemented, such as mail on alert, sending metrics to CloudWatch, etc.
  • Check that exceptions or errors are displayed properly with an appropriate message, so that solving an error becomes easy.
  • Performance testing: process a random chunk of large data and monitor parameters such as the time taken to complete Map-Reduce jobs, memory utilization, disk utilization and other metrics as required.
  • Integration testing: test the complete workflow from data ingestion to data storage/visualization.
  • Architecture testing: test that Hadoop is highly available at all times and that failover services are properly implemented, so that data is processed even when nodes fail.

Note: For testing, it is very important to generate test data that covers various test scenarios (positive and negative). Positive test scenarios cover cases directly related to the desired functionality, using valid input. Negative test scenarios cover cases outside the desired functionality, such as invalid or unexpected input, to confirm that the system handles them gracefully.

List of a few tools used in Big Data

Data Ingestion – Kafka, Zookeeper, Sqoop, Flume, Storm, Amazon Kinesis.

Data Processing – Hadoop (Map-Reduce), Cascading, Oozie, Hive, Pig.

Data Storage – HDFS (Hadoop Distributed File System), Amazon S3, HBase.

At the end of this blog you should understand the various scenarios that can be tested in the Big Data domain. You know what Big Data is and why Big Data Testing is required, and you have seen the Big Data architecture with a brief explanation of its components. Lastly, a few tools used within a Big Data system were mentioned.

In the next blog we will look at a use case for a practical scenario as an example of Big Data Testing. It will cover the problem statement, followed by testing via the traditional method and the importance of creating and using automation scripts to automate testing in the Big Data domain.

Kafka Performance Blog

Big Data Testing: Apache Kafka Performance Benchmarking

In this blog we will start from the basic tools/scripts of Apache Kafka and discuss how performance testing and benchmarking can be done by running some load tests against the default configuration.


Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system.

Let’s go through its messaging terminology first:

  • Kafka maintains feeds of messages in categories called topics.
  • We'll call processes that publish messages to a Kafka topic producers.
  • We'll call processes that subscribe to topics and process the feed of published messages consumers.
  • Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
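
As a rough mental model of these terms, the toy class below keeps one append-only list per topic in memory. It is a single-process sketch with hypothetical names (`ToyBroker`, `publish`, `poll`) and does not use or resemble the real Kafka client API; real Kafka adds partitions, replication, persistence and networking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy, single-process model of the terminology; not the Kafka API.
final class ToyBroker {
    // One append-only log of messages per topic name.
    private final Map<String, List<String>> topics = new HashMap<>();

    // Producer side: append a message to the topic's log and return its offset.
    synchronized long publish(String topic, String message) {
        List<String> log = topics.computeIfAbsent(topic, t -> new ArrayList<>());
        log.add(message);
        return log.size() - 1;
    }

    // Consumer side: read every message at or after the given offset.
    synchronized List<String> poll(String topic, long fromOffset) {
        List<String> log = topics.getOrDefault(topic, new ArrayList<>());
        int start = (int) Math.min(Math.max(fromOffset, 0), log.size());
        return new ArrayList<>(log.subList(start, log.size()));
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        broker.publish("test", "hello");
        broker.publish("test", "world");
        System.out.println(broker.poll("test", 0)); // [hello, world]
        System.out.println(broker.poll("test", 1)); // [world]
    }
}
```

The point of the offset is that a consumer decides where to resume reading, while the broker only appends.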

So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.

For further information about Apache Kafka, please refer to the link below:

Kafka Documentation

So, while doing performance testing for Kafka, there are two aspects we need to take into consideration:
1. Performance at Producer End
2. Performance at Consumer End

We need to perform this test for both Producer and Consumer so that we know how many messages the Producer can produce and the Consumer can consume in a given time. With a large number of messages we can also check for data loss.

The main intent of this test is to find out the following stats:
1. Throughput (MB/sec) on size of data
2. Throughput (messages/sec) on number of messages
3. Total data
4. Total messages

Let’s go ahead with downloading and setting up Kafka, then starting ZooKeeper, the cluster, a producer and a consumer.

  • To download Kafka, refer to this link
  • Once it is downloaded, untar it and switch to the directory:
    tar -xzf kafka_2.9.1-
    cd kafka_2.9.1-
  • Kafka uses ZooKeeper, so you need to start it first:
    bin/ config/
  • Now start the Kafka server:
    bin/ config/
  • Once the server has started, create a topic, say “test”:
    bin/ --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
  • To check that the topic was created successfully, use the list command:
    bin/ --list --zookeeper localhost:2181
  • Now start the Producer and the Consumer as shown below:
    bin/ --broker-list localhost:9092 --topic test
    bin/ --zookeeper localhost:2181 --topic test
  • Send some messages by typing them on the Producer console. Once you press Enter, the same message should appear on the Consumer console.

Once the messages generated by the Producer are consumed by the Consumer, it shows that you have set up Kafka correctly.

Now let’s take the performance stats. Follow the steps below:
1. Launch a new terminal window.
2. Change directory to Kafka/bin.
3. Here you can find multiple shell scripts; we will use the following to take performance stats:

To see the help for both shell scripts (the perf tools), just type
./ --help
./ --help

for Producer and Consumer respectively.

Performance at Producer End

Type the following command on the console and hit the Enter key.
./ --broker-list localhost:9092 --topic test --messages 100

Let’s understand these command line options one by one:
  • The first parameter is “broker-list”. Here we provide the broker information, that is, the list of brokers' hosts and ports for bootstrap. This is a required parameter.
  • The second parameter is “topic”. This is also a required parameter and gives the message category, as discussed earlier.
  • The third parameter, “messages”, sets how many messages to produce and send while taking the stats. We set it to 100 for our first scenario.

Once the test has completed, some stats will be printed on the console, something like:

| start.time | end.time | compression | message.size | batch.size | | MB.sec | | nMsg.sec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2016-02-03 21:38:28:094 | 2016-02-03 21:38:28:449 | 0 | 100 | 200 | 0.01 | 0.0269 | 100 | 281.6901 |

  1. start.time and end.time show when the test started and completed.
  2. A compression value of ‘0’, as above, shows that message compression was off (the default).
  3. message.size shows the size of each message.
  4. batch.size indicates how many messages are sent in one batch; by default it is set to 200.
  5. The sixth column shows the total data sent to the cluster in MB.
  6. MB.sec indicates how much data was transferred in MB per second (throughput on size).
  7. The eighth column shows the total count of messages sent during this test.
  8. nMsg.sec shows how many messages were sent per second (throughput on count of messages).
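
The two throughput figures can be reproduced from the other values in the sample row. The sketch below is a plain-Java illustration and not part of the Kafka tooling; the class and method names are hypothetical. It assumes the tool divides by the wall-clock time between start.time and end.time and counts 1024 × 1024 bytes per MB, which is consistent with the producer numbers printed above but is not taken from the tool's documentation.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper that recomputes the throughput columns from a stats row.
final class ProducerThroughput {
    // Timestamp layout used in the printed stats, e.g. 2016-02-03 21:38:28:094
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss:SSS");

    // Elapsed wall-clock time between the two timestamps, in seconds.
    static double elapsedSeconds(String start, String end) {
        Duration elapsed = Duration.between(
                LocalDateTime.parse(start, STAMP), LocalDateTime.parse(end, STAMP));
        return elapsed.toMillis() / 1000.0;
    }

    // Throughput on count of messages.
    static double messagesPerSecond(long messages, double seconds) {
        return messages / seconds;
    }

    // Throughput on size, assuming 1 MB = 1024 * 1024 bytes.
    static double megabytesPerSecond(long messages, int messageSizeBytes, double seconds) {
        return (messages * (double) messageSizeBytes) / (1024 * 1024) / seconds;
    }

    public static void main(String[] args) {
        double seconds = elapsedSeconds("2016-02-03 21:38:28:094", "2016-02-03 21:38:28:449");
        System.out.println(String.format(Locale.ROOT, "nMsg.sec=%.4f", messagesPerSecond(100, seconds)));
        System.out.println(String.format(Locale.ROOT, "MB.sec=%.4f", megabytesPerSecond(100, 100, seconds)));
    }
}
```

With 100 messages of 100 bytes sent in 355 ms, this yields about 281.69 messages per second and about 0.0269 MB per second, matching the nMsg.sec and MB.sec values in the row.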

There are some more parameters which you can use while doing this performance test, like:

--csv-reporter-enabled : If set, the CSV metrics reporter will be enabled.

--initial-message-id : This is used for generating test data. If set, messages will be tagged with an ID and sent by the producer starting from this ID sequentially. Message content will be of String type, in the form 'Message:000…1:xxx…'. Using this parameter you will be able to see the messages being consumed on the consumer.

--message-size : It indicates the size of each message. It can be useful when you want to load test Kafka with some large messages.

--vary-message-size : If set, message size will vary up to the given maximum.

There are some other options as well, which can be used as needed during the Producer performance test.

For this blog, I took some performance numbers based on the number of messages, and the performance is shown in the graph inline.

Performance at Consumer End

Now let’s look at how we can take performance stats at the consumer end. Type the following command and hit the Enter key.
./ --topic test --zookeeper localhost:2181

Let’s understand its command line options:

The first parameter is “topic”; it is a required parameter and gives the message category.
The second parameter is “zookeeper”; it is also a required parameter and gives the connection string for the ZooKeeper connection in the form host:port.

Once the test has completed, some stats will be printed on the console, something like:

| start.time | end.time | fetch.size | | MB.sec | | nMsg.sec |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-02-04 11:29:41:806 | 2016-02-04 11:29:46:854 | 1048576 | 0.0954 | 1.9869 | 1001 | 20854.1667 |

  1. start.time and end.time show when the test started and completed.
  2. fetch.size shows the amount of data to fetch in a single request.
  3. The fourth column shows the size of all messages consumed.
  4. MB.sec indicates how much data was transferred in MB per second (throughput on size).
  5. The sixth column shows the total count of messages consumed during this test.
  6. nMsg.sec shows how many messages were consumed per second (throughput on count of messages).

The performance test for the Consumer is also based on the number of messages, and the result is shown in the graph inline.

Using these stats we can decide the batch size, message size and maximum number of messages that can be produced/consumed for a given configuration; in other words, we can benchmark numbers for Kafka.

All the above analysis was done using the default settings of Kafka. There are multiple other scenarios in which we can take performance stats for the Kafka Producer and Consumer, for example:

  1. Change number of topics
  2. Change async batch size
  3. Change message size
  4. Change number of partitions
  5. Network Latency
  6. Change number of Brokers
  7. Change number of Producer/Consumer etc.

The above-mentioned changes can be made in the properties files available in folder :

To understand the config files, you can also refer to the link provided at the beginning of the blog.

This blog just gives an initial idea of Apache Kafka performance testing and benchmarking. In further blogs we will discuss some more complex Kafka performance aspects.