Creating a Finite State Machine Distributed Workflow with Service Fabric Reliable Actors

The accompanying source code can be found at github.com/jsturtevant/azure-service-fabric-actor-fsm.

I recently had to manage a long running workflow inside an Azure Service Fabric solution. I remembered reading about using a Finite State Machine (FSM) with Actors in the book Programming Microsoft Azure Service Fabric by Haishi Bai. When I went looking for examples of how I might use an FSM with the Service Fabric Actor programming model I wasn’t able to find one, so I built an example of using an FSM with Service Fabric Actors.

Although Finite State Machines might sound scary, they are not too hard to implement or use once you learn some basic terminology. FSMs are great for many types of scenarios, but keeping track of long running workflows with a predetermined sequence of actions is a particularly good use case.

In Haishi’s book, he walks through the example of how it might be used in an ordering and shipping process to track items as they move through the ordering, manufacturing, shipping and invoicing product stages. In my case I needed to manage creating and configuring multiple resources within my Service Fabric cluster (applications and services) and Azure (IP addresses and Load balancers, etc). Each individual step required the previous step to be complete and could take anywhere from a few seconds to 10 minutes.

In such a distributed workflow, where many different services implement the individual steps, it can be a challenge to track and manage all of the different steps. A single Actor for each instance of the workflow simplifies the process and makes the workflow service easy to scale. For each new workflow instance a new actor is created, and that actor manages the state for the workflow.

An example of an Actor managing multiple states and services might look like:

Finite State Machine Workflow using Reliable Actors and multiple distributed services.

A few of the benefits of using a Reliable Actor with an FSM in this case are:

  • FSM will centralize the logic for the workflow into one location
  • the workflow service will scale well as each actor has independent state and logic for a given workflow instance
  • can use timed Reminders on the actor to trigger workflow timeouts and notify other services that something is wrong (see the sketch after this list)
  • monitoring status of the workflow is simplified
  • canceling a workflow is simplified (if in correct state)
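
As an aside on the Reminders bullet, here is a minimal, hedged sketch of what a workflow timeout could look like inside the actor. The reminder name, the 10 minute timeout, and the "timedOut" state key are illustrative assumptions, not part of the sample; the FSM actor class itself is introduced later in the article.

// The actor must implement IRemindable for reminders to be delivered.
internal class FSM : Actor, IFSM, IRemindable
{
    // ... the actor members shown later in the article ...

    private async Task StartWorkflowTimeoutAsync()
    {
        // Fire once after 10 minutes; a period of -1ms disables repetition.
        await this.RegisterReminderAsync(
            "workflow-timeout",             // illustrative reminder name
            null,                           // no state payload needed
            TimeSpan.FromMinutes(10),       // assumed timeout for the workflow
            TimeSpan.FromMilliseconds(-1)); // disable periodic firing
    }

    public async Task ReceiveReminderAsync(string reminderName, byte[] state, TimeSpan dueTime, TimeSpan period)
    {
        if (reminderName == "workflow-timeout")
        {
            // Illustrative: record that the workflow stalled so other
            // services querying the actor can react.
            await this.StateManager.SetStateAsync("timedOut", true);
        }
    }
}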

A notable trade off of using Reliable Actors is that they are inherently single threaded. If your workflow requires a high number of concurrent reads and writes, you might find that Actors are not a good choice due to throughput performance. In both of the use cases noted above this is not an issue, so the model maps nicely to Reliable Actors.

Create the Finite State Machine (FSM) with Service Fabric Actor

Instead of implementing an FSM myself I used the very popular Stateless FSM library. I chose this library because of its popularity and because it comes with the ability to externalize the state of the FSM. In this case I wanted to persist the state using a Service Fabric Stateful Reliable Actor.

You can install the library via NuGet (Install-Package Stateless) and check out some of the examples to get an understanding of how to use the library.
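
As a quick primer, here is a minimal standalone sketch loosely based on the phone-call example from the library’s documentation (the states and triggers are illustrative):

using Stateless;

// States and triggers are plain enums.
enum CallState { OffHook, Ringing, Connected }
enum CallTrigger { Dial, Answer, HangUp }

// Start the machine in the OffHook state.
var phone = new StateMachine<CallState, CallTrigger>(CallState.OffHook);

// Declare which trigger moves the machine from which state to which.
phone.Configure(CallState.OffHook).Permit(CallTrigger.Dial, CallState.Ringing);
phone.Configure(CallState.Ringing).Permit(CallTrigger.Answer, CallState.Connected);
phone.Configure(CallState.Connected).Permit(CallTrigger.HangUp, CallState.OffHook);

phone.Fire(CallTrigger.Dial);
Console.WriteLine(phone.State); // Ringing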

I adapted the BugTracker example to create the Service Fabric code samples below.

For the remainder of the article we will walk through some of the important parts of the solution.

Working with external State inside the Actor

Stateless uses a simple enum to track the state the FSM is in at any given time. We can use the Reliable Actor’s StateManager to save this value durably whenever the state machine changes state.

Using external state in Stateless is fairly easy:

//Possible FSM states and the private member variable that holds the current one
enum State { Open, Assigned, Deferred, Resolved, Closed }
State state;

//Initializing the Stateless StateMachine with external state: the two delegates
//read from and write back to the member variable.
machine = new StateMachine<State, Trigger>(() => this.state, s => this.state = s);

Saving the state durably (replicated across multiple nodes for failover) using Reliable Actors is also straightforward. Within any method of the Actor we can pass a key and value:

await this.StateManager.SetStateAsync("state", machine.State);

Now let’s see how we can use this to create a FSM using Service Fabric Reliable Actors.

Initializing the State of the Actor

Reliable Actors are virtual, which means they always “exist” and the lifetime of an actor is not tied to its in-memory representation. An individual actor can also be “garbage collected” (deactivated) from the system if not used for a period of time, although its state will be persisted. This means two things when working with the FSM in the actor framework:

  • The first time an actor is activated we need to set the initial FSM state and configure the FSM
  • When a new request comes in for an Actor that has been deactivated we will need to load the state and re-configure the FSM

The code to handle both first time activation and re-activation happens in the OnActivateAsync method and will look something like:

protected override async Task OnActivateAsync()
{
    // Handle Activation for first time and re-activation after being garbage collected
    ActorEventSource.Current.ActorMessage(this, $"Actor activated: {this.GetActorId()}");

    var savedState = await this.StateManager.TryGetStateAsync<State>("state");
    if (savedState.HasValue)
    {
        // has started processing
        this.state = savedState.Value;
    }
    else
    {
        // first load ever - initialize
        this.state = State.Open;
        await this.StateManager.SetStateAsync<State>("state", this.state);
    }

    ActorEventSource.Current.ActorMessage(this, $"Actor state at activation: {this.GetActorId()}, {this.state}");

    //Configure the Stateless StateMachine with the state.
    machine = new StateMachine<State, Trigger>(() => this.state, s => this.state = s);
    assignTrigger = machine.SetTriggerParameters<string>(Trigger.Assign);
    machine.Configure(State.Open)
        .Permit(Trigger.Assign, State.Assigned);
    machine.Configure(State.Assigned)
        .SubstateOf(State.Open)
        .OnEntryFromAsync(assignTrigger, assignee => OnAssigned(assignee))
        .PermitReentry(Trigger.Assign)
        .Permit(Trigger.Close, State.Closed)
        .Permit(Trigger.Defer, State.Deferred)
        .OnExitAsync(() => OnDeassigned());
    machine.Configure(State.Deferred)
        .OnEntryAsync(() => this.StateManager.SetStateAsync<string>("assignee", null))
        .Permit(Trigger.Assign, State.Assigned);
}

You will notice that the state machine gets saved in a local member variable. From here on out we will interact with our state machine in memory and persist the state object using the StateManager whenever the FSM transitions from one state to another.

Triggering a new state from an external system

In the case of managing a distributed workflow, the external services doing the work at each step of the workflow need a way to notify the workflow actor that their work has either completed or failed. We can create methods on the actor that enable clients to interact with the actor and workflow. If any service requests a state transition that is not valid for the current workflow state, an error will be returned to the client notifying them that the action is not available at this time.

The first step is to modify the interface that Actor implements:

public interface IFSM : IActor
{
    Task Close(CancellationToken cancellationToken);
    Task Assign(string assignee, CancellationToken cancellationToken);
    Task<bool> CanAssign(CancellationToken cancellationToken);
    Task Defer(CancellationToken cancellationToken);
}

[StatePersistence(StatePersistence.Persisted)]
internal class FSM : Actor, IFSM
{
  /// implement members and methods like OnActivateAsync here
}

The next step is to implement the methods and save the FSM state as the workflow transitions. Since the FSM is in memory when an actor is active we can work with the FSM’s actual state:

public async Task Assign(string assignee, CancellationToken cancellationToken)
{
    //trigger transition to next workflow state.
    await machine.FireAsync(assignTrigger, assignee);

    //check that local state is same as machine state then save.
    Debug.Assert(state == machine.State);
    await this.StateManager.SetStateAsync("state", machine.State);
}
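
The “action is not available” guard mentioned earlier falls out of the library: firing a trigger that the current state does not permit throws an InvalidOperationException, which surfaces back to the calling service. A hedged sketch of the CanAssign method from the interface (the sample may implement it differently) simply asks the machine whether the trigger is currently allowed:

public Task<bool> CanAssign(CancellationToken cancellationToken)
{
    // True only when the current state permits the Assign trigger.
    return Task.FromResult(machine.CanFire(Trigger.Assign));
}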

This creates a transition in the FSM (assignTrigger) and fires an event which could then call out to another service to kick off the next step of the workflow. Now that the next step in the workflow has been activated we can save the state of the machine. Calling the Assign method from an external service is simple:

ActorId actorId = new ActorId(Guid.NewGuid().ToString());
IFSM actor = ActorProxy.Create<IFSM>(actorId, new Uri("fabric:/ActorFSM/FSMActorService"));
var token = new CancellationToken();
await actor.Assign("Joe", token);

In the assignTrigger of the FSM we can also save the actual assignee durably using the Actor’s StateManager. This means that we can not only query the actor to get the current step of the workflow but also who is assigned. The event wired to FireAsync looks like:

async Task OnAssigned(string assignee)
{
    var _assignee = await this.StateManager.TryGetStateAsync<string>("assignee");
    if (_assignee.HasValue && assignee != _assignee.Value)
      await SendEmailToAssignee("Don't forget to help the new employee!");

    await this.StateManager.SetStateAsync("assignee", assignee);
    await SendEmailToAssignee("You own it.");
}

Getting current State of a workflow

A system outside our workflow might want to know what state a given workflow instance is in. This again is made simple through the Actor. What is interesting to note is that since the actor is single threaded, you will always get the latest state of the workflow.

Implement the current status as a method on the Actor:

public async Task<BugStatus> GetStatus()
{
  var statusHistory = await this.StateManager.TryGetStateAsync<StatusHistory>("statusHistory");
  var assignee = await this.StateManager.TryGetStateAsync<string>("assignee");

  var status = new BugStatus(this.machine.State);
  status.History = statusHistory.HasValue ? statusHistory.Value : new StatusHistory(machine.State);
  status.Assignee = assignee.HasValue ? assignee.Value : string.Empty;

  return status;
}
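
One assumption worth noting: for the proxy call below to compile, the IFSM interface shown earlier would also need to declare this method, something like:

Task<BugStatus> GetStatus();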

Getting the current status:

var actor1Status = await actor.GetStatus();
Console.WriteLine($"Actor 1 with {actorId} has state {actor1Status.CurrentStatus}");

Knowing the history of all the states a workflow has gone through will almost certainly be a requirement. Updating the workflow history as it moves through the FSM is straightforward using the machine’s machine.OnTransitionedAsync(LogTransitionAsync) event and wiring up a function to log it:

private async Task LogTransitionAsync(StateMachine<State, Trigger>.Transition arg)
{
    var conditionalValue = await this.StateManager.TryGetStateAsync<StatusHistory>("statusHistory");

    StatusHistory history;
    if (conditionalValue.HasValue)
    {
        history = StatusHistory.AddNewStatus(arg.Destination, conditionalValue.Value);
    }
    else
    {
        history = new StatusHistory(arg.Destination);
    }

    await this.StateManager.SetStateAsync<StatusHistory>("statusHistory", history);
}
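
The wiring itself is a single line; a reasonable placement (an assumption on my part) is at the end of OnActivateAsync, after the machine.Configure(...) calls shown earlier:

// Record every transition durably, regardless of which trigger caused it.
machine.OnTransitionedAsync(LogTransitionAsync);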

You can see that the GetStatus method previously implemented already returns that value.

Testing state persistence

As was discussed earlier, a Reliable Actor can be garbage collected from the system if not used within a given time period. I wanted to test this to make sure the activation code worked properly with the FSM. The actor framework gives you some configuration settings that you can modify to tweak your actor’s behavior. I used these settings to aggressively garbage collect the Actors to test the state handling.

Note: Only modify these Actor settings if you have data to back up making a change to the default settings. In other words, you should have sufficient monitoring set up that shows your actors being garbage collected too early or too late for your scenario.

During registration of the Actor with Service Fabric in program.cs you can adjust the ActorGarbageCollectionSettings. The first number is the scan interval (how often the system checks for idle actors) and the second number is the idle timeout (how long an actor must be idle before it is eligible for collection):

internal static class Program
{
  private static void Main()
  {
      try
      {
          ActorRuntime.RegisterActorAsync<FSM>(
              (context, actorType) => new ActorService(context, actorType, settings: new ActorServiceSettings()
              {
                  ActorGarbageCollectionSettings = new ActorGarbageCollectionSettings(10, 5)
              })).GetAwaiter().GetResult();

          Thread.Sleep(Timeout.Infinite);
      }
      catch (Exception e)
      {
          ActorEventSource.Current.ActorHostInitializationFailed(e.ToString());
          throw;
      }
  }
}

Then during my test I waited a period of time before working with the actors again and used the Service Fabric management APIs to make sure the Actors were deactivated:

ActorId actorId2 = new ActorId(Guid.NewGuid().ToString());
IFSM actor2 = ActorProxy.Create<IFSM>(actorId2, new Uri("fabric:/ActorFSM/FSMActorService"));
await actor2.Assign("Sue", token);

// Wait for actors to get garbage collected
await Task.Delay(TimeSpan.FromSeconds(20));

//Query management APIs to verify the actors were deactivated
var actorService = ActorServiceProxy.Create(new Uri("fabric:/ActorFSM/FSMActorService"), actorId);
ContinuationToken continuationToken = null;
var inactiveActors = new List<ActorInformation>();
do
{
    var queryResult = await actorService.GetActorsAsync(continuationToken, token);
    // accumulate inactive actors across all pages of results
    inactiveActors.AddRange(queryResult.Items.Where(x => !x.IsActive));
    continuationToken = queryResult.ContinuationToken;
} while (continuationToken != null);

foreach (var actorInformation in inactiveActors)
{
    Console.WriteLine($"\t {actorInformation.ActorId}");
}

Conclusion

Creating a distributed workflow with multiple steps and variable completion times can be a difficult problem. Using Service Fabric’s Reliable Actors with an FSM library like Stateless not only simplifies the code that needs to be written but also ensures the scalability of the solution because of the partitioned nature of Reliable Actors. If you don’t need lots of concurrent reads and writes then Reliable Actors could be a great solution.

You can find the complete sample code for this solution at https://github.com/jsturtevant/azure-service-fabric-actor-fsm.

Training links for Microsoft and Azure

I often get asked how to get trained up on a given Microsoft technology. Like anything, there is a lot of info out there, so here is a consolidated list of where I would get started. You can go to each resource and search for the technology you are interested in. I always start at the free tier and then work up from there if I feel I need/want more.

Microsoft Virtual Academy (free)

Good for a full overview of a technology, with both hands-on and presentation-style content.

https://mva.microsoft.com/

Channel9 (free)

Good for learning about a specific part of a technology; often covers more specific subtopics of a given tech.

https://channel9.msdn.com/

3rd party video content (pay)

More high-quality video training to supplement Channel9 and MVA.

LinkedIn Learning: https://learning.linkedin.com
Lynda.com: https://www.lynda.com
Pluralsight: https://www.pluralsight.com/

Professional programs (pay)

Very structured, professional role-based training that gives you a credential at the end.

Microsoft Academy: https://academy.microsoft.com/en-us/professional-program/

Certification Programs (pay for cert)

If you’re looking for certifications, the Microsoft Certification program is a good place to start. You can get a good idea of what is covered on each exam (and therefore the things you should generally know/learn) from the exam prep pages.

https://www.microsoft.com/en-us/learning/certification-overview.aspx

Online/in-person Training with Microsoft Partners (pay)

If you are looking for in-person or online training with instructors you can check out the following resources:

https://www.microsoft.com/en-us/learning/training.aspx
https://www.microsoft.com/en-us/learning/partners.aspx
https://www.microsoft.com/en-us/learning/on-demand-course-partners.aspx

Setting Global Shortcuts for Bash for Windows

I have started using Bash for Windows as part of my workflow. For a long time, I had been using the standard command prompt with the Git Bash tools hooked up to my path (using Chocolatey to install the tools in one command: choco install git -params '"/GitAndUnixToolsOnPath"'). This was nice because I could use the shortcut Windows Key + X then C to open a command prompt and get access to Linux-like tools such as ls, git and grep.

Now that I am using Bash for Windows I wanted to have a shortcut set up as well. I found it was really easy to do using the Shortcut key setting on the shortcut’s properties window. This process can be used for any application that has a shortcut set up. The basic steps are:

  1. Find the shortcut by pressing the Windows Key, then typing bash
  2. Right click on the shortcut and select Open file location

find shortcut for Bash for Windows

  3. Right click on the shortcut in the folder and select Properties
  4. Select Shortcut key and type your keyboard shortcut

setup global shortcuts for Bash for Windows

Deploying a Service Fabric Guest Executable with Configuration File

In Service Fabric, when your guest executable relies on a configuration file, the file must be put in the same folder as the executable. Examples of applications that need configuration are nginx or Traefik with a static configuration file.

The folder structure would look like:

|-- ApplicationPackageRoot
    |-- GuestService
        |-- Code
            |-- guest.exe
            |-- configuration.yml
        |-- Config
            |-- Settings.xml
        |-- Data
        |-- ServiceManifest.xml
    |-- ApplicationManifest.xml

You may need to pass the location of the configuration file via the Arguments tag in the ServiceManifest.xml. Additionally you need to set the WorkingFolder tag to CodeBase so the executable is able to see the folder when it starts. The valid options for the WorkingFolder parameter are specified in the docs. An example configuration of the CodePackage section of the ServiceManifest.xml file is:

<CodePackage Name="Code" Version="1.0.0">
  <EntryPoint>
    <ExeHost>
      <Program>traefik_windows-amd64.exe</Program>
      <Arguments>-c traefik.toml</Arguments>
      <WorkingFolder>CodeBase</WorkingFolder>
      <ConsoleRedirection FileRetentionCount="5" FileMaxSizeInKb="2048"/> 
    </ExeHost>
  </EntryPoint>
</CodePackage>

Note that the ConsoleRedirection tag also enables redirection of stdout and stderr, which is helpful for debugging.

Using the Visual Studio Team Services Agent Docker Images

There are two ways to create Visual Studio Team Services (VSTS) agents: Hosted and Private. Creating your own private agent for VSTS has some advantages, such as being able to install the specific software you need for your builds. Another advantage that becomes important, particularly when you start to build docker images, is the ability to do incremental builds. Incremental builds let you keep the source files, and in the case of docker the images, on the machine between builds.

If you choose the VSTS Hosted Linux machine, each time you kick off a new build you will have to re-download the docker images to the agent, because the agent used is destroyed after the build is complete. Whereas if you use your own private agent the machine is not destroyed, so the docker image layers will be cached, builds can be very fast, and you will minimize your network traffic as well.

Note: This article ended up a bit long; if you are simply looking for how to run VSTS agents using docker, jump to the end where there is one command to run.

Installing VSTS agent

VSTS provides you with an agent that is meant to get you started building your own agent quickly. You can download it from your agents page in VSTS. After it is installed you can customize the software available on the machine. VSTS has done a good job at making this pretty straightforward, but if you read the documentation you can see there are quite a few steps you have to take to get it configured properly.

Since we are using docker, why not use a docker container as our agent? It turns out that VSTS provides a docker image just for this purpose, and it takes only one command to configure and run an agent. You can find the source and the documentation on the Microsoft VSTS Docker Hub account.

There are many different agent images to choose from depending on your scenario (TFS, standard tools, or docker) but the one we are going to use is the image with Docker installed on it.

Using the Docker VSTS Agent

All the way at the bottom of the page there is a description of how to get started with the agent images that support docker. One thing to note is that it uses the host instance of Docker to provide the Docker support:

# note this doesn't work if you are building source with docker 
# containers from inside the agent
docker run \
  -e VSTS_ACCOUNT=<name> \
  -e VSTS_TOKEN=<pat> \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -it microsoft/vsts-agent:ubuntu-16.04-docker-1.11.2

This works for some things, but if you are using docker containers to build your source (and you should be considering it… it is even going to be supported soon through multi-stage docker builds), you may end up with some errors like:

2017-04-14T01:38:00.2586250Z unit-tests_1 | MSBUILD : error MSB1009: Project file does not exist.

or

2017-04-14T01:38:03.0135350Z Service 'build-image' failed to build: lstat obj/Docker/publish/: no such file or directory

You get these errors because the source code is downloaded into the VSTS agent docker container. When you run a docker container from inside the VSTS agent, the new container runs in the context of the host machine, because the VSTS agent uses the docker instance on the host machine, which doesn’t have your files (that’s what the line -v /var/run/docker.sock:/var/run/docker.sock does). If this hurts your brain, you are in good company.

Agent that supports using Containers to build source

Finally, we come to the part you are most interested in. How do we actually set up one of these agents and avoid the above errors?

There is an environment variable called VSTS_WORK that specifies where the work should be done by the agent. We can change the location of the directory and volume mount it so that when the docker container runs on the host it will have access to the files.

To create an agent that is capable of using docker in this way:

docker run -e VSTS_ACCOUNT=<youraccountname> \
  -e VSTS_TOKEN=<your-account-Private-access-token> \
  -e VSTS_WORK=/var/vsts \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/vsts:/var/vsts -d \
  microsoft/vsts-agent:ubuntu-16.04-docker-17.03.0-ce-standard

The important command here is -e VSTS_WORK=/var/vsts which tells the agent to do all the work in the /var/vsts folder. Then volume mounting the folder with -v /var/vsts:/var/vsts enables you to run docker containers inside the VSTS agent and still see all the files.

Note: you should change the image name above to the most recent one or the version you need.

Conclusion

Using the VSTS Docker Agents enables you to create and register agents with minimal configuration and ease. If you are using Docker containers inside the VSTS Docker Agent to build source files you will need to add the VSTS_WORK environment variable to get it to build properly.