A New Solution to Vehicle Routing Problem

Ever wonder how Amazon delivers thousands of packages everyday? Or how Doordash brings your order to your doorstep?

When we have a list of locations to go to, how do these giant companies minimize cost and time to complete their deliveries at multiple locations?

The magic lies within theoretical complexity, these type of questions falls under the umbrella of Vehicle Routing Problem(VRP). Simply put, it asks: “What is the optimal set of routes for a group of vehicles to drive through in order to reach a given set of customers?” The seemingly simple solution is that it requires > 1 route to complete all deliveries.

Here we have a more realistic depiction of VSP on a map. Large companies such as Amazon and Uber, use VSP to deliver their products and services. Other applications of VRP include minimizing total carbon emissions and allocating charging demands and energy consumption of electric vehicles.

Source: https://www.caliper.com/maptitude/solutions/alternative-to-omnitracs-planner.htm

VRP is a combinatorial optimization problem. It is known to be a computationally difficult problem. VRP is NP-Hard. The vehicle routing problem was first researched by Georgia Dantzig and John Ramser in 1959, in which the algorithm was initially applied to petrol deliveries. Many exact(algorithms that always find the optimal solution to a given optimization problem) and heuristic(solve a problem in a faster and more efficient fashion than traditional methods by sacrificing optimality, accuracy, precision, or completeness for speed) algorithms have been proposed solutions that aim to provide optimal solutions. However, there is no perfect solution that meets the needs of every scenario — providing a fast and reliable solution remains a challenging task until these days.

Some of the well-known VRP solutions include:

  • Clarke and Wright savings heuristic (CW): applies to problems with flexible number of vehicles and works equally well for both directed and undirected problems.
  • Sweep heuristic (SW): applies to planar instances of the VRP consisting of two parts: (1). Split: viable clusters are initially formed rotating a ray centered at the depot, and (2). Travelling Salesman Problem(TSP): A vehicle routing is created for each cluster by solving a TSP.
  • Google’s Optimization tools (OR-Tools): one of the best open-source VRP solvers, leveraging first solution strategy with various heuristic and it may use Local Search(optional).

So far, the Clarke and Wright savings algorithm is the most widely used approach to solve vehicle routing problems. However, a newly founded solution by Nazari, M., Oroojlooy, A., Snyder, L. V., & Takáč, M, uses reinforcement learning and neural networks to solve VRP. Despite its first appearance in the computation world in 2018, this research has already been cited by almost 250 scholars. It trains a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. The model represents a parameterized stochastic policy and applies a policy gradient algorithm to optimize its parameters. What’s interesting is that, on capacitated VRP, this approach outperforms classical heuristics and Google’s Optimization-Tools on medium-sized instances in solution quality with comparable computation time.

Below is a list of key concepts to help us understand the complexity of this model:

  • Reinforcement learning: a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences — getting rewards/penalties. When it comes to designing, its number one rule is to never give the answer to the program. A good example is illustrated in this Pacman game.
Source: https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292
  • A Recurrent Neural Network(RNN): a type of neural network that contains loops, allowing information to be stored within the network.
Source: https://towardsdatascience.com/introducing-recurrent-neural-networks-f359653d7020
  • Policy: a strategy that an agent uses in pursuit of goals.
  • Stochastic policy: outputs a probability distribution over actions. It allows our agent to explore the state space without always taking the same action. Thus, it handles the exploration/exploitation trade-off without hard coding it. For instance, from the mouse and cheese diagram below, it means that instead of being sure of taking action “a” (for instance left), there is a probability we’ll take a different one (in this case 30% that we take south). The stochastic policy is used when the environment is uncertain. This is of course the use of randomization in decision making.
Source: https://medium.com/free-code-camp/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f


Now, let’s move on to the model.

The method also uses Y sub t to denote the decoded sequence up to time t, i.e., Y sub t = {y sub 0, · · ·, y sub t}. It tries to find a stochastic policy π that generates the sequence Y in a way that minimizes a loss objective while satisfying the problem constraints. The optimal policy π∗ will generate the ideal solution with probability 1. The model’s goal is to make π as close to π∗ as possible. It uses the probability chain rule to decompose the probability of generating sequence Y, i.e., P(Y |X sub 0), as follows:

is a recursive update of the problem representation with the state transition function f. Each component in the right-hand side of (1) is computed by the attention mechanism,


g: an affine function that outputs an input-sized vector

h sub t: the state of the RNN decoder that summarizes the information of previously decoded steps y sub 0, · · ·, y sub t.

The proposed attention mechanism employed in this model is illustrated in the figure below.

Figure 1: The embedding layer maps the inputs to a high-dimensional vector space. On the right, an RNN decoder stores the information of the decoded sequence. Then, the RNN hidden state and embedded input produce a probability distribution over the next input using the attention mechanism.

As seen in figure 1. This model is composed of two main components. The first is a set of embeddings that maps the inputs into a D-dimensional vector space. There may exist multiple embeddings corresponding to different elements of the input, but they are shared among the inputs. The second component of this model is a decoder that points to an input at every decoding step. The framework uses RNN to model the decoder network. Notice that both the static and dynamic elements are fed as the inputs to the decoder network.

An attention mechanism is based on the concept of directing the model to pay greater attention to certain factors when processing the data. Figure 1 illustrates the attention mechanism employed in this framework. At decoder step (i), a context-based attention mechanism is used to extract the relevant information from the inputs using a variable-length alignment vector a sub t. In other words, a sub t specifies how much every input data point might be relevant in the next decoding step t.

Let x¯^ 𝔦 sub t = (¯s^ 𝔦, ¯d^𝔦 sub t ) be the embedded input 𝔦, and h sub t ∈ R^D be the memory state of the RNN cell at decoding step t.

The alignment vector a sub t is then computed as

Here “;” means the concatenation of two vectors. By combining the context vector c sub t, the model computes conditional probabilities as

with the embedded inputs, and then normalizing the values with the softmax function, as follows:

Note: in (4)–(6), v sub a, v sub c, W sub a, and W sub c are trainable variables.

The embeddings and the attention mechanism are invariant to the input order. Hence, the order in which the inputs are fed into the network does not matter. The proposed model is applicable to tasks with no obvious input sequence, such as sorting. Making this model a better approach compared to the models that use pointer networks for combinatorial optimization problems.


To train the network, the study uses well-known policy gradient approaches. The model parameterizes the stochastic policy π with parameters θ. To constantly improve the policy, policy gradient methods use an estimate of the gradient of the expected return with respect to the policy parameters. In principle, the policy gradient algorithm contains two networks: (i) an actor network that predicts a probability distribution over the next action at any given decision step, and (ii) a critic network that estimates the reward for any problem instance from a given state.


In this experiment, the study employs two different decoders: (i) greedy in which the node (either customer or depot) with the highest probability is selected as the next destination at every decoding step, and (ii) beam search (BS), which keeps track of the most probable paths and then chooses the one with the minimum tour length. The results indicate that by applying the beam search algorithm, the quality of the solutions can be improved with only a slight increase in computation time. For faster training and generating feasible solutions, a masking scheme was used which sets the log-probabilities of infeasible solutions to −∞ or forces a solution if a particular condition is satisfied.

The experiment uses the following masking procedures: (i) nodes with zero demand are not allowed to be visited, (ii) all customer nodes will be masked if the vehicle’s remaining load is exactly 0, and(iii) the customers whose demands are greater than the current vehicle load are masked. Under this masking scheme, the vehicle must satisfy all of a customer’s demands when visiting it. However, if the situation being modeled does allow split deliveries, one vehicle can relax (iii). The relaxed masking allows split deliveries, so the solution can allocate the demands of a given customer into multiple routes. In fact, this property is an appealing behavior that is present in numerous real-world applications but is often neglected in classical VRP algorithms.


Figure 2: Log of ratio of solution time to the number of customer nodes using different algorithms.

The method is labeled with the “RL” prefix. The ratio stays almost the same for RL with different decoders. Conversely, the log of the run time for the Clarke-Wright and Sweep heuristics increases linearly with the number of nodes. This observation is one motivation for applying the RL framework to more general combinatorial problems since it suggests that this method scales well. For larger problems, this framework is faster than randomized heuristics.

The proposed framework is extendable to problems with multiple depots by constructing the corresponding state transition function and masking procedure. Additional side constraints can also be included: soft constraints can be applied by penalizing the rewards, or hard constraints such as time windows can be implemented through a masking scheme. However, designing such a scheme may even be harder than solving the optimization problem. Another interesting extension of this model is for VRPs with multiple vehicles. In the simplest case, one must only design a shared masking scheme to prevent the independent vehicles from pointing to the same customer nodes. Hence, it might be interesting to incorporating competition or collaboration among the vehicles that relate to multi-agent RL.

In sum, this method is quite appealing because it only requires a verifier to find feasible solutions and a reward signal indicating how well the policy performs. Once the trained model is available, it can be used many times, without needing to re-train for the new problems. Unlike many classical heuristics, this method scales well with increasing problem size and has a superior performance with competitive solution-time. However, this framework still needs further improvements.

The main takeaway of this article is that VRP is difficult to solve — it is NP-hard. However, VRP can be solved by reinforcement learning. Although this framework has not been widely researched and used, it appears to have high potential and is applicable to other combinatorial optimization problems, such as bin-packing, job-shop, and flow-shop.


As the sole author of this article, I (Sethatevy Bong) am responsible for all parts of this project, including background research, study analysis, and concluding segments. In addition, I also want to give credit to the author of the original paper (listed below), Nazari et. Al, and Bello et al. for developing neural combinatorial optimization frameworks as explained by this article.


https://www.biz.uiowa.edu/faculty/xzhou/schedule/Spring_18_files/DRL_for_VRP.pdf [ Original Paper]

https://arxiv.org/abs/1611.09940 [Similar framework, but using Pointer Network]


https://www.hindawi.com/journals/mpe/2017/5098183/ [reference for EV example]

https://neo.lcc.uma.es/vrp/solution-methods/ [VRP solutions]


https://www.semanticscholar.org/paper/Complexity-of-vehicle-routing-and-scheduling-Lenstra-Kan/8d9b83d213b172582fcdb930ebe3d33ada3c5a8e[reference for EV example]