
How to Build GraphQL APIs for Text Analytics in Python

David Mráz


GraphQL is a query language and a server-side runtime for building APIs. It is not tied to any specific database or programming language, so you can build your GraphQL server in Node.js, C#, Scala, Python, and more. The JavaScript ecosystem is currently the most mature when it comes to GraphQL, and it is usually a better fit for client-facing APIs and frontend applications. You can take a look at our blog to learn more about building software with React, GraphQL, and Node.js. However, Node.js is not the best fit for data science, so in this article we will look at building Python microservices with a GraphQL API. This will help us serve our machine learning models and other computations, such as text analytics, more effectively.

Running the Flask server

Flask is a minimalist framework for building Python servers. At Atheros, we use Flask to expose our GraphQL API and serve our machine learning models; requests from client applications are then forwarded by the GraphQL gateway. Overall, a microservice architecture allows us to use the best technology for each job, and it also lets us apply advanced patterns like schema federation. In this article, we will start small with an implementation of the so-called Levenshtein distance: we will use the well-known NLTK library and expose the Levenshtein distance functionality through a GraphQL API. We assume that you are familiar with basic GraphQL concepts, such as building GraphQL mutations.

Let's start by cloning our example repository with the following:

git clone git@github.com:atherosai/python-graphql.git

In our projects, we use Pipenv for managing Python dependencies. Once you are in the project folder, you can create the virtual environment with this:

pipenv shell

and install dependencies from Pipfile:

pipenv install

We usually define a couple of script aliases in our Pipfile to ease our development workflow:

[scripts]
dev = "env FLASK_APP=app.py env FLASK_ENV=development flask run"
prod = "env FLASK_APP=app.py env FLASK_ENV=production gunicorn --bind 0.0.0.0:5000 -w 8 wsgi:app --timeout 10000"
test = "python -m pytest"
test-watch = "ptw --runner 'python -m pytest -s'"

It allows us to run our dev environment easily with a command alias as follows:

pipenv run dev

The Flask server should then be exposed by default at port 5000. You can immediately move on to GraphQL Playground, which serves as an IDE for live documentation and query execution against GraphQL servers. GraphQL Playground uses so-called GraphQL introspection to fetch information about our GraphQL types. The following code initialises our Flask server:

from flask import Flask, render_template, request
from flask_graphql import GraphQLView
from server.schema.Schema import Schema

app = Flask(__name__, static_url_path='', template_folder='./public', static_folder='./public')

@app.route('/')
def hello_world():
    return app.send_static_file('index.html')

@app.errorhandler(404)
def page_not_found(error):
    return app.send_static_file('404.html')

@app.errorhandler(500)
def requests_error(error):
    return app.send_static_file('500.html')

# expose the GraphQL endpoint at /graphql;
# pass request to context to perform resolver validation
app.add_url_rule('/graphql', view_func=GraphQLView.as_view(
    'graphql',
    schema=Schema,
    get_context=lambda: {'request': request},
))

# point GraphQL Playground to the /graphql endpoint
@app.route('/playground')
def playground_render():
    return app.send_static_file('playground.html')

if __name__ == "__main__":
    app.run()

It is good practice to use a WSGI server when running a production environment. Therefore, we have also set up a script alias for gunicorn:

pipenv run prod

Levenshtein distance (edit distance)

The Levenshtein distance, also known as edit distance, is a string metric. It is defined as the minimum number of single-character edits needed to change one character sequence a into another sequence b. If we denote the lengths of these sequences by |a| and |b| respectively, we are looking for:

lev_{a,b}(|a|, |b|),

where

lev_{a,b}(i, j) = max(i, j)                            if min(i, j) = 0,
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1_{(a_i ≠ b_j)} )   otherwise.

1_{(a_i ≠ b_j)} is the so-called indicator function, which is equal to 0 when a_i = b_j and equal to 1 otherwise. lev_{a,b}(i, j) is the distance between the first i characters of a and the first j characters of b. For more on the theoretical background, feel free to check out the wiki.

In practice, let's say that someone misspelled "machine learning" and wrote "machinlt lerning". We would need to make the following edits:

Edit | Edit type    | Word state
-----|--------------|------------------
0    | -            | machinlt lerning
1    | Substitution | machinet lerning
2    | Deletion     | machine lerning
3    | Insertion    | machine learning

For these two strings we get a Levenshtein distance equal to 3. The Levenshtein distance has many applications, such as spell checkers, correction systems for optical character recognition, or similarity calculations.

Building a GraphQL server with graphene in Python

We will build the following schema in our article:

input LevenshteinDistanceInput {
  s1: String!
  s2: String!
  substitutionCost: Int = 1
  transpositions: Boolean = true
}

type LevenshteinDistancePayload {
  levenshteinDistance: Float
}

type Mutation {
  levenshteinDistance(
    input: LevenshteinDistanceInput!
  ): LevenshteinDistancePayload
}

type Query {
  healthcheck: Boolean!
}
Each GraphQL schema is required to have at least one query. We usually define our first query in order to healthcheck our microservice. The query can be called like this:

query {
  healthcheck
}

However, the main purpose of our schema is to enable us to calculate the Levenshtein distance. We will use variables to pass dynamic parameters in the following GraphQL document:

mutation levenshteinDistance($input: LevenshteinDistanceInput!) {
  levenshteinDistance(input: $input) {
    levenshteinDistance
  }
}

We have defined our schema so far in SDL format. In the Python ecosystem, however, we do not have libraries like graphql-tools, so we need to define our schema with a code-first approach. The schema is defined as follows using the Graphene library:

import graphene
from server.schema.levenshtein_calc import LevenshteinDistanceMutation

class Query(graphene.ObjectType):
    healthcheck = graphene.Boolean(required=True)

    def resolve_healthcheck(self, info):
        return True

class Mutation(graphene.ObjectType):
    levenshtein_distance = LevenshteinDistanceMutation.Field()

Schema = graphene.Schema(query=Query, mutation=Mutation)

We have followed the best practices for overall schema and mutations. Our input object type is written in Graphene as follows:

import graphene

class LevenshteinDistanceInput(graphene.InputObjectType):
    s1 = graphene.String(required=True)
    s2 = graphene.String(required=True)
    substitution_cost = graphene.Int(default_value=1)
    transpositions = graphene.Boolean(default_value=True)

This code essentially defines dynamic arguments for our mutation. Those are then passed to the function responsible for calculating the Levenshtein distance.

Each time we execute our mutation in GraphQL Playground:

mutation levenshteinDistance($input: LevenshteinDistanceInput!) {
  levenshteinDistance(input: $input) {
    levenshteinDistance
  }
}

with the following variables

{
  "input": {
    "s1": "test1",
    "s2": "test2"
  }
}

we obtain the Levenshtein distance between our two input strings. For our simple example of strings test1 and test2, we get 1. We can leverage the well-known NLTK library for natural language processing (NLP). The following code is executed from our resolver:

import nltk

def calculate_levenshtein_distance(input):
    return nltk.edit_distance(
        s1=input["s1"],
        s2=input["s2"],
        substitution_cost=input["substitution_cost"],
        transpositions=input["transpositions"],
    )

It is also straightforward to implement the Levenshtein distance ourselves using, for example, an iterative matrix, but I would suggest not reinventing the wheel and using the default NLTK functions.
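For illustration only, a minimal iterative-matrix (Wagner-Fischer) implementation could look like the sketch below; the function name and signature are ours, not NLTK's, and transpositions are not handled:

```python
def levenshtein(a: str, b: str, substitution_cost: int = 1) -> int:
    """Iterative dynamic-programming edit distance between strings a and b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else substitution_cost
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("machinlt lerning", "machine learning"))  # 3
print(levenshtein("test1", "test2"))  # 1
```

Keeping only the previous row of the matrix reduces memory from O(|a|·|b|) to O(|b|) while producing the same distances as the full matrix.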

Conclusion
GraphQL is a great technology for building APIs and it is very useful for exposing the output from machine learning models and other calculations. The Python ecosystem is more suitable for data science and libraries such as graphene help us build our GraphQL schema for machine learning microservices with ease.
