DARPA ULTRALOG Final Report - Industrial and Manufacturing ...

Ultra*Log

PSU/IAI Final Report for Ultra*Log

Document Revision Number: 2.0
Date: 09/01/2005

Prepared for:
Defense Advanced Research Projects Agency
Information Systems Office
3701 North Fairfax Drive
Arlington, VA 22203-1714

Prepared by:
The Pennsylvania State University
Intelligent Automation, Inc.

Contact Persons:
Soundar Kumara (PSU)
skumara@psu.edu (814) 863-2359
Vikram Manikonda (IAI)
vikram@i-a-i.com (301) 294-5045
Wilbur Peng (IAI)
wpeng@i-a-i.com (301) 294-5045



Document History

Revision | Date     | Revised By     | Comments         | Date Reviewed with Team | Approved*
         | 08/24/05 | Soundar Kumara | Initial creation |                         |

* Document will be approved as part of the CCB process.

Software Release History

Date | Comments

LEGEND (example; legend/copyright statement optional)

Use, duplication, or disclosure by the Government is as set forth in the Rights in technical data noncommercial items clause DFAR 252.227-7013 and Rights in noncommercial computer software and noncommercial computer software documentation clause DFAR 252.227-7014.

© Copyright 2001



Contents

Contents .............................................................................................................. iii
Executive Summary .............................................................................................. v
1 Introduction ....................................................................................................... 7
2 Design and survivability of distributed multi-agent systems .............................. 7
  2.1 Designing a Network Infrastructure for Survivability of Multi-Agent Systems ................ 7
  2.2 Survivability of Multi-agent based Supply Networks: A Topological perspective ............. 7
  2.3 Survivability of a distributed multi-agent application – A performance control perspective ... 8
  2.4 Survivability through Implementation Alternatives in Large-scale Information Networks with Finite Load ... 8
3 Monitoring, situation identification and pattern extraction ................................. 8
  3.1 Situation Identification Using Dynamic Parameters in Complex Agent-Based Planning Systems ... 8
  3.2 Estimating global stress environment through local behavior in a multiagent-based planning system ... 8
  3.3 Using Predictors to Improve the Robustness of Multi-Agent Systems: Design and Implementation in Cougaar ... 8
  3.4 Survivability of Complex System – Support Vector Machine Based Approach ................ 8
4 Control .............................................................................................................. 8
  4.1 A Framework For Performance Control of Distributed Autonomous Agents .................. 8
  4.2 An Autonomous Performance Control Framework for Distributed Multi-Agent Systems: A Queueing Theory Based Approach ... 9
  4.3 Adaptive control for large-scale information networks through alternative algorithms to support survivability ... 9
  4.4 Self-organizing resource allocation for minimizing completion time in large-scale distributed information networks ... 9
  4.5 Efficient method of quantifying minimal completion time for component-based service networks: Network topology and resource allocation ... 9
  4.6 Market-based model predictive control for large-scale information networks: Completion time and value of solution ... 9
  4.7 Coordinating control decisions of software agents for adaptation to dynamic environments ... 9
5 CPE society modeling and performance analysis .............................................. 9
  5.1 Understanding agent societies using distributed monitoring and profiling .................... 9
  5.2 Reliable MAS Performance Prediction Using Queueing Models ................................. 9
6 Characterization and analysis of supply chains from complex systems perspective ......... 10
  6.1 Supply Chain Network: A Complex Adaptive Systems Perspective ............................ 10
  6.2 Decision Making in Logistics: A Chaos Theory Based Approach ............................... 10


Executive Summary

Ultra*Log is a Defense Advanced Research Projects Agency (DARPA) sponsored research project focused on creating a distributed agent-based architecture that is inherently survivable and capable of operating effectively in very chaotic environments. The project is pursuing the development of technologies to enhance the security, robustness, and scalability of large-scale, distributed agent-based systems operating in chaotic wartime environments. Ultra*Log's goal is to operate with up to 45% information infrastructure loss in a very chaotic environment, with no more than 20% capabilities degradation and no more than 30% performance degradation, for a period representing 180 days of sustained military operations in a major regional contingency.

To achieve these goals, we are concentrating on complexity studies for analysis, estimation and control. The efforts are geared towards realizing a robust theory for analyzing and controlling the complexity of distributed multi-agent systems. This would in turn help to define the theoretical and application grounds for adaptivity in distributed systems. The application area is military logistics, where the studies concentrate on sensing to logistics in a network-centric warfare environment. The research under Ultra*Log is expected to lay the foundation for the next generation of logistics.

In this document we discuss significant accomplishments of PSU/IAI as part of the Ultra*Log project. We present the results in the form of papers we published or submitted to refereed conferences and journals. During the course of the Ultra*Log project, we have proposed three research and development areas, namely:

1. Research in design and survivability of distributed multi-agent systems
2. Research in monitoring, situation identification and pattern extraction
3. Research in characterization, analysis and control of complex adaptive systems.

We discuss the details of several results and findings related to the above research areas in this document.

The total period of the project is four and a half years from the start of the project. The team members include The Pennsylvania State University (PSU) and Intelligent Automation, Inc., Rockville, MD, a sub-contractor to PSU. The regular team members from PSU include graduate students (Y. Hong, S. Lee, H. P. Thadakamalla, A. Surana, V. Narayanan, H. Gupta, N. Gnanasambandam, K. Tang, X. Ding, E. Pinto and U. N. Raghavan). These students worked under the direct supervision of Professor Kumara; most of them are Ph.D. students. In addition, Professors G. Natarajan and C. R. Rao* participated in the project. From IAI, V. Manikonda, W. Peng and H. Gupta were the participants.

* Winner of the National Medal of Science (Mathematics), awarded by President Bush in 2002.


1 Introduction

The main focus of the proposed research is breakthrough technology development based on chaos theory, knowledge mining, queueing theory and market-based control-theoretic principles, to improve the scalability, robustness and survivability of the Cougaar architecture. Specifically, our focus was on adaptive logistics. Such a development will introduce new operational capabilities in Cougaar in terms of:

• Dynamic fault isolation and recovery,
• Dynamic adaptation to the environment, and
• Variable fidelity of adaptive processes.

We have published or submitted many papers in refereed conferences and journals that address these principles. We have classified these papers into five sections, as given below:

• Design and survivability of distributed multi-agent systems
• Monitoring, situation identification and pattern extraction
• Control
• CPE society modeling and performance analysis
• Characterization and analysis of supply chains from a complex systems perspective

Please note that the papers are hyperlinked.

2 Design and survivability of distributed multi-agent systems

It is extremely important to design a multi-agent system architecture that is survivable even in wartime or critical situations. Survivability can be improved from both functional and topological perspectives. We published the following papers, in which we discuss different ways of improving survivability for distributed multi-agent systems.

2.1 Surana, A., Gautam, N., Kumara, S. R. T., and Greaves, M., "Designing a Network Infrastructure for Survivability of Multi-Agent Systems", IASTED Conference on Parallel and Distributed Computing and Networks, 2005.

2.2 Thadakamalla, H. P., Raghavan, U. N., Kumara, S. R. T. and Albert, R., "Survivability of Multi-agent based Supply Networks: A Topological Perspective," IEEE Intelligent Systems, Vol. 19, No. 5, 2004.


2.3 Gnanasambandam, N., Lee, S., Kumara, S. R. T., Gautam, N., Peng, W., Manikonda, V., Brinn, M. and Greaves, M., "Survivability of a distributed multi-agent application – A performance control perspective", IEEE Symposium on Multi-agent Security and Survivability (MAS&S 2005), Philadelphia, 2005.

2.4 Lee, S., and Kumara, S. R. T., "Survivability through Implementation Alternatives in Large-scale Information Networks with Finite Load," Proceedings of Open Cougaar Conference, July 2004.

3 Monitoring, situation identification and pattern extraction

An essential task for control is sensing. We have developed different tools to monitor and sense distributed systems. With the help of these tools, we devised many situation identification and pattern extraction algorithms based on chaos theory, knowledge mining and Kalman filtering principles. The following papers were published in this field of research.

3.1 Lee, S., Gautam, N., Kumara, S. R. T., Surana, A., Gupta, H., Hong, Y., Narayanan, V., and Thadakamalla, H. P., "Situation Identification Using Dynamic Parameters in Complex Agent-Based Planning Systems," Intelligent Engineering Systems Through Artificial Neural Networks, v 12, 2002.

3.2 Lee, S., and Kumara, S. R. T., "Estimating global stress environment through local behavior in a multiagent-based planning system," IEEE Conference on Automation Science and Engineering (CASE 05), Edmonton, Canada, August 2005.

3.3 Gupta, H., Hong, Y., Thadakamalla, H. P., Manikonda, V., Kumara, S. R. T. and Peng, W., "Using Predictors to Improve the Robustness of Multi-Agent Systems: Design and Implementation in Cougaar", Proceedings of Open Cougaar Conference, July 2004.

3.4 Hong, Y., Gautam, N., Kumara, S. R. T., Surana, A., Gupta, H., Lee, S., Narayanan, V., and Thadakamalla, H. P., "Survivability of Complex System – Support Vector Machine Based Approach," Intelligent Engineering Systems Through Artificial Neural Networks, v 12, 2002.

4 Control

The heart of adaptivity is control. In our work we therefore build different control frameworks and methods for distributed systems. The following papers were published in this research area.

4.1 Gnanasambandam, N., Lee, S., Kumara, S. R. T. and Gautam, N., "A Framework For Performance Control of Distributed Autonomous Agents," Industrial Engineering Research Conference (IERC), Atlanta, August 2005.


4.2 Gnanasambandam, N., Lee, S., Gautam, N., Kumara, S. R. T., Peng, W., Manikonda, V., Brinn, M. and Greaves, M., "An Autonomous Performance Control Framework for Distributed Multi-Agent Systems: A Queueing Theory Based Approach," Autonomous Agents and Multi-Agent Systems (AAMAS), Utrecht, Netherlands, July 2005.

4.3 Lee, S. and Kumara, S. R. T., "Adaptive control for large-scale information networks through alternative algorithms to support survivability", submitted to IEEE Transactions on Automatic Control.

4.4 Lee, S., Kumara, S. R. T. and Gautam, N., "Self-organizing resource allocation for minimizing completion time in large-scale distributed information networks", submitted to IEEE Transactions on Systems, Man, and Cybernetics.

4.5 Lee, S., Kumara, S. R. T. and Gautam, N., "Efficient method of quantifying minimal completion time for component-based service networks: Network topology and resource allocation", submitted to IEEE Transactions on Computers.

4.6 Lee, S., Kumara, S. R. T. and Gautam, N., "Market-based model predictive control for large-scale information networks: Completion time and value of solution", submitted to IEEE Transactions on Parallel and Distributed Systems.

4.7 Hong, Y. and Kumara, S. R. T., "Coordinating control decisions of software agents for adaptation to dynamic environments," 37th CIRP International Seminar on Manufacturing Systems (ISMS-2004), Budapest, Hungary, May 2004.

5 CPE society modeling and performance analysis

We have built a demo society, "CPEDemo", for identifying the key aspects of a continuous planning and execution scenario. This helps us identify and demonstrate key concepts in the argument for, and the concept of, "design for survivability". The following are the papers published related to the CPE society.

5.1 Peng, W., Manikonda, V. and Kumara, S. R. T., "Understanding agent societies using distributed monitoring and profiling," Proceedings of Open Cougaar Conference, July 2004.

5.2 Gnanasambandam, N., Lee, S., Gautam, N., Kumara, S. R. T., Peng, W., Manikonda, V., Brinn, M. and Greaves, M., "Reliable MAS Performance Prediction Using Queueing Models," IEEE Symposium on Multi-agent Security and Survivability (MAS&S 2004), Philadelphia, PA, 2004.


6 Characterization and analysis of supply chains from a complex systems perspective

With the advent of information technology, supply chains have acquired a complexity almost equivalent to that of biological systems. In the following papers, we argue why supply chains should be treated as complex systems and propose how various concepts, tools and techniques from the complex adaptive systems literature can be exploited to characterize and analyze supply chain networks.

6.1 Surana, A., Kumara, S. R. T., Greaves, M. and Raghavan, U. N., "Supply Chain Network: A Complex Adaptive Systems Perspective", International Journal of Production Research (to be published), 2005.

6.2 Kumara, S. R. T., Ranjan, P., Surana, A. and Narayanan, V., "Decision Making in Logistics: A Chaos Theory Based Approach", CIRP Annals, p. 381, 2003.


DESIGNING A NETWORK INFRASTRUCTURE FOR SURVIVABILITY OF MULTI-AGENT SYSTEMS

A. Surana
MIT, Cambridge, MA 02139
email: surana@mit.edu

N. Gautam, S. R. T. Kumara
Penn State University, University Park, PA 16802
email: {ngautam, skumara}@psu.edu

M. Greaves
DARPA, 3701 Fairfax Drive, Arlington, VA 22203-1714
email: mgreaves@darpa.mil

ABSTRACT

In this paper we consider a society of agents whose interactions are known. Our objective is to solve a strategic network infrastructure design problem to determine: (i) the number of nodes (usually computers or servers) and their processing speeds, (ii) the set of links between nodes and their bandwidths, and (iii) the assignment of agents to nodes. From a performance standpoint, on the one hand all the agents can reside on a single node, thereby stressing the processor; on the other hand, the agents can be distributed so that there is a maximum of one agent per node, thereby increasing the communication cost. From a robustness standpoint, since nodes and links can fail (possibly due to attacks), we would like to build a network that is least disruptive to the multi-agent system functionality. Although we do not explicitly consider tactical issues such as moving agents to different nodes upon failure, we would like to design an infrastructure that facilitates such agent migrations. We formulate and solve a mathematical program for the network infrastructure design problem by minimizing a cost function subject to satisfying quality of service (QoS) as well as robustness requirements. We test our methodology on Cougaar multi-agent societies.

KEY WORDS
network design, QoS, robustness, optimization.

1 Introduction

As the number of applications requiring distributed multi-agent systems (MAS) is continuously growing, it becomes extremely important to build a network infrastructure that can guarantee a survivable MAS architecture. By survivable we mean a system that is robust and secure, as well as able to provide excellent quality of service (QoS) even when stressed. For example, Brinn and Greaves [6] state that the Cougaar MAS in Ultra*Log [14] would be considered survivable if it would maintain at least x% of system capabilities and y% of system performance in the face of z% infrastructure loss and wartime loads (where x, y and z are provided by the system users).

In order to build such a survivable system, there are several decisions that need to be made at different time granularities. These can be broken down as strategic (once a year or just one time), tactical (once a week to once a day, depending on how often the configuration changes), and operational (typically milliseconds to seconds, depending on the granularity of information exchange) decisions. The strategic decisions typically involve designing the network infrastructure (in terms of both number and capacity) for the MAS, such as computers, servers, cables, etc. The tactical decisions include where to migrate the agents if a node fails or is cut off from the other nodes. Operational decisions include adaptive control methods for deciding which agent should process a task, the fidelity with which to process a task, etc.

There has been a lot of research related to (a) software technology, such as agent architecture, communication, migration, adaptation, learning, etc., and (b) networking, such as QoS provisioning, fault-tolerance, dependability, robustness, etc. However, there is very little research that combines the two and studies them from a systems engineering viewpoint. In this research we address that shortcoming. We focus on the strategic problem stated in the previous paragraph: designing a network infrastructure, in terms of hardware, to support a given society of agents and their interactions. This is with the understanding that, in order to solve the tactical and operational problems effectively, the strategic problem must favor a network design that would ease tactical and operational decisions.

We now present some of the related research, with the understanding that due to space restrictions it is difficult to cite all relevant articles in the literature. Andreoli et al. [2] consider a distributed software network infrastructure for agents performing search tasks (such as search engines). Optimization issues for MAS at the software level, such as load balancing using non-linear programming techniques, are studied in Aiello et al. [1] for a given hardware topology. In Hofmann et al. [9], a mobile intelligent agent system is built under conditions of low bandwidth to show that it could improve the efficiency of military tactical operations and that mobile agents would outperform static agents. Multi-agent hybrid systems that combine computational hardware and large-scale software residing on it, with an application to air-traffic management, are studied in Tomlin et al. [13]. In Kephart et al. [11], one of the emerging research areas, namely considering a distributed information system as analogous to biological ecosystems and social systems, is presented in order to study their survivability. Cancho and Sole [7] consider a complex network and show that simultaneously optimizing the link density and path distance in a graph (with a fixed set of nodes) leads to a scale-free topology which is robust to random attacks.

The remainder of this paper is organized as follows. In Section 2 we describe the strategic problem in detail. Then in Section 3 we formulate the problem as a mathematical program. We discuss various methods to solve the mathematical program in Section 4. Then we describe numerical examples and results in Section 5. Finally, we present our concluding remarks and directions for future work in Section 6.

2 Problem Description

We now present details of the strategic problem of designing a network infrastructure for a MAS. Distributed information systems (DIS) can be viewed as a reconfigurable network with (i) a computational infrastructure forming the backbone, and (ii) agents residing on it and moving around, consuming resources and providing services under uncertain and often hostile conditions. Each agent has access to different information and makes its own local decisions, but must work together with other agents for the achievement of a common, system-wide goal. In this research we consider a MAS and an underlying network infrastructure that can be modeled as a DIS.

One of the key inputs to the network design problem is the agent interaction pattern. A typical agent interaction tree is depicted in Figure 1. In the figure, the agents are nodes, and if there are arcs connecting two nodes, then the corresponding agents interact. The agents also specify the bandwidth requirement for their interaction. Besides the bandwidth (and interaction graph), another input to the design problem is the resource requirements, such as CPU and memory, from the host computer or server. Although the figure suggests a hierarchical network, the model does not require that. In addition, some or all of the agents can be identical (in terms of what they can do).

Figure 1. Example of an agent logical network

Given the inputs mentioned above and other inputs based on survivability requirements, the output of the design problem is a physical network of nodes and arcs, where nodes signify processors such as computers and servers, and arcs signify links (it is not required that there be a single link between two nodes; however, we use the capacity of the bottleneck link and treat it as a single link). The two extremes of design are placing all agents in one node and placing one agent in each node. In the case of all agents in one node, the bottleneck would be resources, i.e., whether the CPU and memory requirements of all agents can be met. However, if the agents are distributed such that there is one agent per node, a lot of time would be spent in communication between them. We assume that if two agents are on a single node, the available bandwidth for their interaction is infinite. In Figure 2, we take the logical network of agents (described in Figure 1) and build four nodes and three arcs to house the agents in a physical infrastructure.

Figure 2. Agent logical network residing in a physical network
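To make the placement trade-off concrete, the two extremes can be compared on a toy instance. The sketch below is purely illustrative — the agent names, CPU demands and bandwidths are hypothetical, not from the report — and uses a simple cost model in which co-located agents consume no link bandwidth:

```python
# Illustrative sketch (hypothetical data): a toy agent society with
# per-agent CPU demand (P^a_i) and pairwise bandwidth needs (B^a_ij).

cpu_demand = {"A": 2.0, "B": 1.5, "C": 1.0, "D": 0.5}
bandwidth = {("A", "B"): 10.0, ("A", "C"): 5.0, ("C", "D"): 2.0}

def placement_load(assign):
    """CPU load per node and total inter-node bandwidth for an
    agent -> node assignment. Co-located agents need no link bandwidth."""
    load = {}
    for agent, node in assign.items():
        load[node] = load.get(node, 0.0) + cpu_demand[agent]
    inter_bw = sum(bw for (i, j), bw in bandwidth.items()
                   if assign[i] != assign[j])
    return load, inter_bw

# Extreme 1: all agents on one node -> maximum CPU stress, zero link traffic.
load1, bw1 = placement_load({a: 0 for a in cpu_demand})
# Extreme 2: one agent per node -> balanced CPU, maximum link traffic.
load2, bw2 = placement_load({a: n for n, a in enumerate(cpu_demand)})

print(load1, bw1)  # {0: 5.0} 0.0
print(load2, bw2)  # one agent's demand per node, 17.0
```

Any real design falls between these extremes, which is exactly the trade-off the mathematical program in Section 3 navigates.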

3 Robust Design: Problem Formulation

In this section we formulate a robust design problem for DIS. As discussed in Section 2, a DIS consists of two critical components: computational hardware (processors and communication links) and a MAS residing on it. With this viewpoint we have the following robust design problem: given the software agents (MAS), their interaction pattern and their computational resource requirements, we want to decide (i) how much processing power to start with, i.e., how many processors and the capacity of each processor (to be selected from a given set of processors); (ii) how to lay out the physical network structure, i.e., how to connect the processors and what the capacity of each link should be (to be selected from a given set of bandwidths); and (iii) how to distribute agents on this network. These decisions are to be made so that we meet the survivability requirements and at the same time minimize the information infrastructure cost. We translate the survivability requirements into the following "specifications" for the design problem: (a) sufficient computational resources to start with, and their balanced utilization; (b) small average path length and diameter, measuring the connectivity; (c) resilience to complete node and link failures.
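Specification (b) can be checked numerically for any candidate backbone. A minimal sketch — assuming an unweighted, connected, undirected graph given as an adjacency list; the toy graph below is hypothetical — computes the average path length and diameter by breadth-first search:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted undirected graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def path_stats(adj):
    """Average path length l_avg(G) and diameter D(G) of a connected graph."""
    n = len(adj)
    total, diam = 0, 0
    for u in adj:
        d = bfs_dist(adj, u)
        total += sum(d.values())
        diam = max(diam, max(d.values()))
    return total / (n * (n - 1)), diam

# Toy 4-node path graph 0-1-2-3: l_avg = 5/3, D = 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(path_stats(adj))
```

A design with small l_avg(G) and D(G) keeps agent-to-agent communication paths short even before any failure is considered.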

Given the above specifications, it is clear that a robust design for a DIS would be one with the maximum possible computational resources and a fully connected backbone network. However, this would incur a very large infrastructure cost. This leads to the problem of optimally designing the backbone network and distributing agents on it such that it is fairly robust and at the same time cost-effective. In order to pose this trade-off systematically as a mathematical programming model, we first give a formal description of the various entities involved in the model.

3.1 Agent Society, Nodes <strong>and</strong> Links<br />

The MAS or the agent society is described by the computational<br />

resource each agent consumes <strong>and</strong> their interaction<br />

pattern. Let N A be the total number of agents in the society,<br />

indexed as {1, 2, · · · , N A }. Let for an i th agent P a i<br />

denote the computational resource (CPU <strong>and</strong> Memory) it<br />

consumes <strong>and</strong> let Ba ij denote the b<strong>and</strong>width it uses, if it<br />

interacts with agent j.<br />

A node represents a computer with a given processing power (power can refer to CPU, memory, etc.). Each agent in the given society has to be assigned to a node; as a result, each node can be assigned one or more agents. For agents that reside on the same node, the communication requirement is automatically satisfied. Let N_max be the total number of nodes, numbered {1, 2, ..., N_max}, initially chosen to distribute the agents on, and let N_i denote the decision variable

$$N_i = \begin{cases} 1, & \text{if node } i \text{ is selected from the } N_{max} \text{ nodes} \\ 0, & \text{otherwise,} \end{cases}$$

for 1 ≤ i ≤ N_max. Let P = {P_1, P_2, ..., P_{N_p}} denote the set of available processing powers for nodes, with an associated cost set C(P) = {C_{p_1}, C_{p_2}, ..., C_{p_{N_p}}}, and let Pn_ij be the decision variable

$$Pn_{ij} = \begin{cases} 1, & \text{if the } i\text{-th node uses a processor with power } P_j \\ 0, & \text{otherwise,} \end{cases}$$

for 1 ≤ i ≤ N_max and 1 ≤ j ≤ N_p. Furthermore, let A_d = [A_ij] denote the matrix of the distribution of agents on the nodes, where

$$A_{ij} = \begin{cases} 1, & \text{if agent } i \text{ resides on node } j \\ 0, & \text{otherwise,} \end{cases}$$

for 1 ≤ i ≤ N_A and 1 ≤ j ≤ N_max. It is assumed that there are no multiple links and no self-loops when the nodes are connected with communication links. Let X_ij be the decision variable

$$X_{ij} = \begin{cases} 1, & \text{if there is a link from node } i \text{ to } j \text{ and } i \neq j \\ 0, & \text{otherwise,} \end{cases}$$

for 1 ≤ i, j ≤ N_max. The matrix X = [X_ij] is symmetric, as the links connecting the nodes form the communication pathways and hence are undirected.

Consider the set V = {N_i | N_i ≠ 0, 1 ≤ i ≤ N_max} of occupied nodes and the corresponding index set I = {i | N_i ≠ 0, 1 ≤ i ≤ N_max}. Let E = {X_ij | X_ij ≠ 0 and i, j ∈ I}. We denote by G = (V, E) the graph with V as the set of vertices and E as the set of undirected edges. Let l_avg(G) be the average path length and D(G) the diameter of G. Note that if G consists of disconnected components, then l_avg(G) → ∞ and D(G) → ∞. Furthermore, we are only allowed to choose the capacity of links from an available set of bandwidths B = {B_1, B_2, ..., B_{N_b}}, with an associated cost set C(B) = {C_{b_1}, C_{b_2}, ..., C_{b_{N_b}}}. Let Br_ijl be the decision variable that is 1 if link X_ij uses capacity B_l and 0 otherwise, for 1 ≤ i, j ≤ N_max and 1 ≤ l ≤ N_b.

3.2 Problem Statement

With the notation of the previous section, let D = {N_max, N_i, Pn_ij, A_ij, X_ij, Br_ijl} denote the set of decision variables (all binary). We can state the network design problem as follows:

Objective: Let C denote the infrastructure cost; we desire to

$$\min C = \sum_{i=1}^{N_{max}} \sum_{j=1}^{N_p} C_{p_j} Pn_{ij} + \sum_{i=1}^{N_{max}} \sum_{j>i}^{N_{max}} \sum_{l=1}^{N_b} C_{b_l} Br_{ijl}, \qquad (1)$$

subject to the following constraints:

1. Resource Choice Constraints

$$\sum_{j=1}^{N_p} Pn_{ij} = N_i, \quad 1 \le i \le N_{max} \qquad (2)$$

$$\sum_{l=1}^{N_b} Br_{ijl} = X_{ij}, \quad 1 \le i \le N_{max} \text{ and } i < j \le N_{max} \qquad (3)$$

Constraints (2) and (3) restrict the choice to one type of processor per node and one type of bandwidth capacity per link, respectively.
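As a concrete illustration (not from the paper), the objective (1) and the resource-choice constraints (2) and (3) can be evaluated directly from the binary decision variables. The cost sets below reuse Table 1's values; the node and link choices are hypothetical.

```python
# Illustrative evaluation of objective (1) and constraints (2)-(3).
# Pn[i][j] = 1 if node i uses processor type j; Br[(i,j)] gives the
# one-hot bandwidth choice for link (i,j). Costs follow Table 1;
# the node and link choices below are hypothetical.
C_p = [1, 2, 3]          # processor costs C_p1..C_p3
C_b = [1, 2]             # bandwidth costs C_b1, C_b2

N_max = 3
Pn = [[1, 0, 0],         # node 0 -> cheapest processor
      [0, 1, 0],
      [0, 0, 1]]
Br = {(0, 1): [1, 0],    # link (0,1) -> bandwidth type 1
      (1, 2): [0, 1]}    # link (1,2) -> bandwidth type 2

# Constraint (2): each selected node picks exactly one processor type.
assert all(sum(row) == 1 for row in Pn)
# Constraint (3): each existing link picks exactly one bandwidth type.
assert all(sum(choice) == 1 for choice in Br.values())

# Objective (1): total processor cost plus total link-bandwidth cost.
cost = (sum(C_p[j] * Pn[i][j] for i in range(N_max) for j in range(3))
        + sum(C_b[l] * Br[e][l] for e in Br for l in range(2)))
print(cost)  # 1 + 2 + 3 + 1 + 2 = 9
```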

2. Agent Distribution Constraints

$$\sum_{j=1}^{N_{max}} A_{ij} = 1, \quad 1 \le i \le N_A \qquad (4)$$

$$\sum_{i=1}^{N_A} A_{ij} Pa_i + \Delta_1(j) \le \sum_{l=1}^{N_p} Pn_{jl} P_l, \quad 1 \le j \le N_{max} \qquad (5)$$

$$\sum_{l=1}^{N_A} \sum_{k=1}^{N_A} A_{li} A_{kj} (Ba_{lk} + Ba_{kl}) + \Delta_2(i, j) \le \sum_{t=1}^{N_b} Br_{ijt} B_t \qquad (6)$$

for 1 ≤ i ≤ N_max, i < j ≤ N_max, where Δ_1(j) ≥ 0 and Δ_2(i, j) ≥ 0 are given constants, which can vary with the node and the link, respectively.

Constraints (4) force each agent to be assigned to exactly one node. Constraints (5) guarantee that agents are assigned only to nodes that have been selected, and that the processing capacity chosen for a node meets the CPU requirements of all agents assigned to it; note that this constraint also leads to well-balanced initial CPU utilization by the agents. Similarly, constraints (6) ensure that the communication requirements, in terms of the bandwidth of the links between nodes, are met. Each constraint in (6) also forces a direct communication link to exist between two nodes whenever agents that communicate with each other reside on those separate nodes. The constants Δ_1(j) and Δ_2(i, j) provide for additional or redundant CPU and bandwidth in the network. This redundancy accounts for the additional computational resources that may be required due to factors such as variability in the agents' computational resource requirements, complete or partial loss of resources at nodes and links, and migration of agents between nodes. It should be noted that the effect of these constants can be absorbed into the processing requirements Pa_i and bandwidth requirements Ba_ij of the agents, and hereafter we assume that this has been done.
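A small feasibility check makes constraints (4), (5), and (6) concrete. The sketch below uses hypothetical values (not the authors' code), with Δ_1 and Δ_2 absorbed into Pa and Ba as assumed above.

```python
# Illustrative feasibility check for constraints (4)-(6); all values
# are hypothetical, and Delta_1, Delta_2 are taken as absorbed into
# the requirements Pa and Ba, as assumed in the text.
Pa = [3, 2, 4]                  # processing requirement Pa_i per agent
Ba = [[0, 1, 0],                # Ba[i][j]: bandwidth agent i uses with j
      [0, 0, 2],
      [0, 0, 0]]
A = [[1, 0], [1, 0], [0, 1]]    # agents 0, 1 on node 0; agent 2 on node 1
node_power = [5, 4]             # chosen processor capacity per node
link_capacity = {(0, 1): 3}     # chosen bandwidth per existing link

n_agents, n_nodes = len(A), len(A[0])

# (4): each agent is assigned to exactly one node.
assert all(sum(row) == 1 for row in A)

# (5): the aggregate processing demand on each node fits its capacity.
for j in range(n_nodes):
    demand = sum(A[i][j] * Pa[i] for i in range(n_agents))
    assert demand <= node_power[j]

# (6): traffic between agents on different nodes fits the link, and a
# direct link must exist between the nodes hosting them.
for i in range(n_nodes):
    for j in range(i + 1, n_nodes):
        traffic = sum(A[l][i] * A[k][j] * (Ba[l][k] + Ba[k][l])
                      for l in range(n_agents) for k in range(n_agents))
        if traffic > 0:
            assert (i, j) in link_capacity
            assert traffic <= link_capacity[(i, j)]
```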

3. Connectivity Constraints

$$X_{ij} \le N_i, \quad 1 \le i \le N_{max} \text{ and } 1 \le j \le N_{max} \qquad (7)$$

$$l_{avg}(G) \le l_{max} \qquad (8)$$

$$D(G) \le D_{max}, \qquad (9)$$

where l_max is the maximum allowable average path length and D_max is the maximum allowable diameter of the network. Constraints (7) enforce that links exist only between nodes that have been selected. Constraints (8) and (9) are related to network performance and also guarantee that G is connected. In general, constraints (8) and (9) cannot be expressed explicitly as equations in the decision variables and have to be verified algorithmically.
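For unweighted hop counts, a breadth-first search over the chosen topology suffices to verify (8) and (9). A minimal sketch (illustrative, not the paper's code):

```python
# Illustrative BFS check of constraints (8) and (9) on an unweighted
# topology; hop counts stand in for path lengths.
from collections import deque

def path_length_stats(adj):
    """Return (l_avg, D) for an undirected graph given as an adjacency
    list; both become infinite if the graph is disconnected."""
    n = len(adj)
    total, count, diameter = 0, 0, 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < n:                     # disconnected component
            return float("inf"), float("inf")
        total += sum(dist.values())
        count += n - 1
        diameter = max(diameter, max(dist.values()))
    return total / count, diameter

# A 4-node ring: l_avg = (1 + 1 + 2) / 3, D = 2.
l_avg, D = path_length_stats([[1, 3], [0, 2], [1, 3], [0, 2]])
assert l_avg <= 2 and D <= 2                  # constraints (8) and (9)
```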

4 Solution Methodology

The problem discussed in the previous section is similar in many respects to problems that often arise in the design of telecommunication networks [3], [4], [5]. For example, in [5] the authors consider the problem of "survivable network design" (SND), which seeks a minimum cost network over a given set of nodes and a set of possible edges between them, such that the connectivity requirement (specified as the minimum number of edge-disjoint paths needed between different nodes) is satisfied. The major distinction of our model from such formulations is that we consider the infrastructure design and the distribution of agents on this network simultaneously in the strategic design phase. Note that:

• The maximum number of nodes needed satisfies N_max ≤ N_A; otherwise the optimization problem has no feasible solution, as constraints (5) cannot be satisfied. Hence, we can always take N_max = N_A.

• Our problem is a generalization of the "bin-packing" problem [12]. The following distinctions from bin packing can be noted. There are two types of bins, the processors and the network links, and the agents are the objects to be packed. The capacity of both types of bins is variable and can be selected from a given set, rather than being fixed. There is a coupling between filling the two types of bins: as we fill the processors with agents, the bins that are the links connecting the processors also get filled, based on the agent distribution. Also, the additional constraints (7)-(9) on the diameter and average path length must be satisfied.

• Consider a special case of our optimization problem in which agents do not interact with each other, i.e., Ba_ij = 0 (1 ≤ i, j ≤ N_A); there is only one processor type, with capacity P and unit cost; and there are no constraints on l_avg(G) and D(G), i.e., D_max → ∞ and l_max → ∞. Under these conditions the problem reduces to the usual bin-packing problem:

$$\min C = \sum_{i=1}^{N_{max}} N_i, \qquad (10)$$

subject to

$$\sum_{j=1}^{N_A} A_{ij} = 1, \quad 1 \le i \le N_A, \qquad (11)$$

$$\sum_{i=1}^{N_A} A_{ij} Pa_i \le N_j P, \quad 1 \le j \le N_A. \qquad (12)$$

The bin-packing problem is known to be NP-hard in the strong sense [12]. Since our problem is a generalization of the bin-packing problem, it is also NP-hard in the strong sense. Given this, we either need to develop heuristics or use evolutionary algorithms to obtain solutions.
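For the bin-packing special case (10)-(12), a classic heuristic such as first-fit decreasing gives quick, near-optimal packings. This is a standard textbook routine shown for illustration; it is not the method used in this work.

```python
# First-fit decreasing for the bin-packing special case (10)-(12):
# a standard textbook heuristic, shown for illustration; it is not
# the solution method used in this work.
def first_fit_decreasing(requirements, capacity):
    """Pack agent requirements Pa_i onto identical nodes of the given
    capacity P; returns the per-node loads (one entry per node opened)."""
    bins = []
    for r in sorted(requirements, reverse=True):
        for b in range(len(bins)):
            if bins[b] + r <= capacity:
                bins[b] += r
                break
        else:
            bins.append(r)        # open a new node
    return bins

loads = first_fit_decreasing([4, 3, 3, 2, 2, 2], capacity=8)
# The cost in (10) is simply the number of nodes opened, len(loads).
```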

We have used Genetic Algorithms (GAs) with the following important features:

• Rather than a binary encoding, we used an integer coding of the decision variables.

• The initial pool of the population is generated randomly, with one feasible solution. The feasible solution can be obtained as follows. Start with N_A nodes, assign each agent to a separate node, and choose the lowest possible processing capacity from the available set P such that the processing requirement of each agent is satisfied. Connect the nodes whose agents interact with each other, and assign to each such link the lowest possible bandwidth from the available set B such that the communication requirements are satisfied.
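The seed-solution construction just described can be sketched as follows (requirement values are hypothetical; the available sets mirror Table 1):

```python
# Sketch of the feasible seed solution: one agent per node, cheapest
# sufficient processor per node, cheapest sufficient bandwidth per
# link between interacting agents. Requirement values are hypothetical;
# the available sets mirror Table 1.
P = [5, 7, 9]                    # available processing powers
B = [3, 6]                       # available link bandwidths
Pa = [4, 6, 8]                   # agent processing requirements
Ba = {(0, 1): 2, (1, 2): 5}      # total traffic between agent pairs

# Agent i goes on node i with the smallest adequate processor.
node_proc = [min(p for p in P if p >= req) for req in Pa]

# Interacting agents get a direct link with the smallest adequate bandwidth.
link_bw = {pair: min(b for b in B if b >= traffic)
           for pair, traffic in Ba.items()}
```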

• We used NSGA-II [8, 10] as the GA solver. It has the capability to automatically handle constraints, and it uses a mean-centric crossover and uniform bounded mutation operators for real-coded strings.
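For intuition, here is a generic integer-coded GA loop with elitist truncation selection. This is only a toy single-objective sketch and does not reproduce NSGA-II's non-dominated sorting or its constraint handling.

```python
# Toy integer-coded GA with elitist truncation selection, one-point
# crossover, and single-gene mutation. This is only a generic sketch:
# it does not reproduce NSGA-II's non-dominated sorting or its
# constraint handling.
import random

random.seed(0)

def evolve(fitness, genome_len, alphabet, pop_size=30, gens=60):
    pop = [[random.choice(alphabet) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness)                      # minimization
        survivors = pop[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, genome_len)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(genome_len)       # integer mutation
            child[i] = random.choice(alphabet)
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)

# Toy objective: minimize the sum of an 8-gene code over {0, 1, 2}.
best = evolve(lambda g: sum(g), genome_len=8, alphabet=[0, 1, 2])
```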

5 GA Application Examples and Results

5.1 Inputs

The following inputs are based on the notation in Section 3.2:

1. The cost structure for processing power and bandwidth used in the examples is shown in Table 1.

Table 1. Processing Cost and Bandwidth Cost

  i      1   2   3        i      1   2
  P_i    5   7   9        B_i    3   6
  C_pi   1   2   3        C_bi   1   2

2. The agent societies in the examples considered were generated randomly. The processing requirements Pa_i and the bandwidth requirements Ba_ij for the agents were sampled from a uniform distribution. However, the structure of the agent societies was in all cases restricted to be hierarchical. This is motivated by the fact that most organizations, as in command and control or in society at large, have a hierarchical structure. Note that the optimization problem and formulation we have considered is general enough to be applied to an agent society with any underlying structure. The agent societies differ in the number of agents (Table 2), the nature of branching in the hierarchical structure, and the variation in processing and bandwidth requirements for agents. In Figure 3, each node in the tree is labeled by an agent number A_i and its processing requirement Pa_i, whereas each link between agents A_i and A_j (if they interact) is labeled by (Ba_ij + Ba_ji), their communication requirement.

Figure 3. A military supply chain as an agent society

3. The restrictions on the maximum allowable diameter D_max and average path length l_max for the DIS are listed in Table 2.

5.2 Output: GA Results

Figure 4. GA Result: DIS Layout for MSC, C = 130

The costs C obtained by running the GA are tabulated below (Table 2), while the DIS layout is shown in Figure 4, next to the corresponding agent society. In the figure, each node in the graph is labeled by the processing power P_i followed by the agents assigned to that node, while each link is labeled by the bandwidth B_i chosen for it. Note that many agents can be assigned to the same node; in Figure 4, for example, agents A12 and A13 are both assigned to the node labeled P9: A12 A13.


Table 2. GA Results

  Agent Society No.   N_A   D_max, l_max   GA cost C   Ratio C/N_A
  1                     4   2, 2                  15          3.75
  2                     7   2, 2                  27          3.86
  3                    10   5, 4                  30          3.00
  4                    12   5, 4                  33          2.75
  5                    15   5, 4                  50          3.33
  6                    19   5, 4                  61          3.21
  7                    24   5, 4                  87          3.63
  8                    33   5, 4                 108          3.27
  9                    40   5, 4                 164          4.10
  10                   50   5, 4                 188          3.76

Table 2 shows that the optimal cost-per-agent ratio C/N_A is fairly constant, with small variation. This may be a result of the cost structure assumed and of the particular instances of the agent societies considered. The observation, however, has the following implication: given a very large agent society, say with N_A = 5,000 agents, we can decompose it into smaller agent societies, solve the optimization problem for each sub-society, and then combine the solutions to solve the overall problem. Due to the constancy of the ratio C/N_A, this heuristic should lead to solutions that are fairly close to optimal.

As a final example, we consider one of the realistic agent societies developed in the Ultra*Log program [14], [6]. The society, shown in Figure 3, represents a typical military supply chain (MSC). The structure of the society is an exact replica of the true society; however, the processing and bandwidth requirements for the agents have been assigned randomly. The result obtained by the GA is shown in Figure 4.

6 Conclusion and Future Work

In this paper we have systematically studied the issue of survivability of DIS. Based on this study, we formulated a robust design problem for DIS. We showed that this problem is NP-hard in the strong sense and used GAs to obtain solutions for a number of example agent societies. We also considered a realistic agent society representing a military supply chain, and showed that our robust design problem formulation results in a fairly survivable DIS.

Survivability of DIS is an emerging area, and future research is possible in a number of directions: refining the robust design problem we have posed and developing heuristic solution methodologies for it, and building mechanisms for survivability against other types of attacks, such as security breaches and denial-of-service (DoS) attacks. Most of the problems stated above are nothing new for biological systems, which have routinely solved them for literally millions of years. Can we draw inspiration from the structures discovered in biology to solve problems of distributed systems? We believe that the quest for "open-ended" survivability of DIS can be achieved only by exporting biological mechanisms into software systems.

Acknowledgements

The authors acknowledge DARPA (Grant # MDA972-1-1-0038 under the Ultra*Log Program) and NSF (Grant # ANI-0219747 under the ITR program) for their generous support of this research. Special thanks to the anonymous reviewers for their comments and suggestions.

References

[1] W. Aiello, B. Awerbuch, B. M. Maggs, and S. Rao. Approximate load balancing on dynamic and asynchronous networks. In ACM Symposium on Theory of Computing, pages 632–641, 1993.

[2] J. Andreoli, U. Borghoff, R. Pareschi, S. Bistarelli, U. Montanari, and F. Rossi. Constraints and agents for a decentralized network infrastructure. In AAAI Workshop, Menlo Park, California, USA: AAAI Press, pages 39–44, 1997.

[3] A. Balakrishnan and K. Altinkemer. Using a hop-constrained model to generate alternative communication network designs. ORSA Journal on Computing, 4(2), 1992.

[4] A. Balakrishnan, T. L. Magnanti, and P. Mirchandani. A dual-based algorithm for multi-level network design. Management Science, 40(5):567–581, 1994.

[5] A. Balakrishnan, T. L. Magnanti, and P. Mirchandani. Connectivity-splitting models for survivable network design. Submitted, 2003.

[6] M. Brinn and M. Greaves. Leveraging agent properties to assure survivability of distributed multi-agent systems. https://docs.ultralog.net/dscgi/ds.py/Get/File-4088/AA03-SurvivabilityOfDMAS.pdf, 2002.

[7] R. F. Cancho and R. V. Sole. Optimization in complex networks. http://arxiv.org/PS_cache/cond-mat/pdf/0111/0111222.pdf, 2001.

[8] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multi-objective genetic algorithm: NSGA-II. http://www.iitk.ac.in/kangal/pub.htm, 2000.

[9] M. O. Hofmann, A. McGovern, and K. R. Whitebread. Mobile agents on the digital battlefield. In Proceedings of the 2nd International Conference on Autonomous Agents (Agents'98), pages 219–225, 1998.

[10] KanGAL. Kanpur Genetic Algorithms Laboratory. http://www.iitk.ac.in/kangal/pub.htm.

[11] J. O. Kephart, T. Hogg, and B. A. Huberman. Collective behavior of predictive agents. Physica D, 42:48–65, 1990.

[12] S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley and Sons, 1990.

[13] C. Tomlin, G. Pappas, and S. Sastry. Conflict resolution for air traffic management: A case study in multi-agent hybrid systems. IEEE Transactions on Automatic Control, 43(4), 1998.

[14] ULTRALOG. A DARPA program on logistics information system survivability. http://www.ultralog.net/.


Dependable Agent Systems

Survivability of Multiagent-Based Supply Networks: A Topological Perspective

Hari Prasad Thadakamalla, Usha Nandini Raghavan, Soundar Kumara, and Réka Albert, Pennsylvania State University

Supply chains involve complex webs of interactions among suppliers, manufacturers, distributors, third-party logistics providers, retailers, and customers.

You can improve a multiagent-based supply network's survivability by concentrating on the topology and its interplay with functionalities.

Although fairly simple business processes govern these individual entities, real-time capabilities and global Internet connectivity make today's supply chains complex. Fluctuating demand patterns, increasing customer expectations, and competitive markets also add to their complexity.

Supply networks are usually modeled as multiagent systems (MASs). 1 Because supply chain management must effectively coordinate among many different entities, a multiagent modeling framework based on explicit communication between these entities is a natural choice. 1 Furthermore, we can represent these multiagent systems as a complex network with entities as nodes and the interactions between them as edges. Here we explore the survivability (and hence dependability) of these MASs from the view of these complex supply networks.

Today's supply networks aren't dependable, or survivable, in chaotic environments. For example, Figure 1 shows how mediocre a typical supply network's reaction to a node or edge failure is compared to a network with built-in redundancy.

Survivability is a critical factor in supply network design. Specifically, supply networks in dynamic environments, such as military supply chains during wartime, must be designed more for survivability than for cost effectiveness. The more survivable a network is, the more dependable it will be.

We present a methodology for building survivable large-scale supply network topologies that can extend to other large-scale MASs. Building survivable topologies alone doesn't, however, make an MAS dependable. To create survivable, and hence dependable, multiagent systems, we must also consider the interplay between network topology and node functionalities.

A topological perspective

To date, the survivability literature has emphasized network functionalities rather than topology. To be survivable, a supply network must adapt to a dynamic environment, withstand failures, and be flexible and highly responsive. These characteristics depend not only on node functionality but also on the topology in which the nodes operate.

The components of survivability

From a topological perspective, the following properties encompass survivability; we denote them as survivability components.

The first is robustness. A robust network can sustain the loss of some of its structure or functionalities and maintain connectedness under node failures, whether the failure is random or a targeted attack. We measure robustness as the size of the network's largest connected component, in which a path exists between any pair of nodes in that component.

1541-1672/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society. IEEE INTELLIGENT SYSTEMS
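Under that definition, robustness can be measured directly: remove nodes and track the size of the largest connected component. A minimal sketch (the example network is illustrative, not from the article):

```python
# Measuring robustness as the largest connected component left after
# node failures; the hub-and-spoke example network is illustrative.
def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

# Hub-and-spoke: a targeted hub attack shatters the network, while a
# random leaf failure barely affects it.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
assert largest_component(adj) == 4
assert largest_component(adj, removed={0}) == 1   # hub lost
assert largest_component(adj, removed={3}) == 3   # leaf lost
```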

The second is responsiveness. A responsive network provides timely services and effective navigation. A low characteristic path length (the average of the shortest path lengths from each node to every other node) leads to better responsiveness, which determines how quickly commodities or information proliferate throughout the network.

The third is flexibility. This property depends on the presence of alternate paths. Good clustering properties ensure alternate paths to facilitate dynamic rerouting. The clustering coefficient, defined as the ratio between the number of edges among a node's first neighbors and the total possible number of edges between them, characterizes the local order in a node's neighborhood.
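The clustering coefficient just defined can be computed per node as follows (a minimal sketch; the example graph is hypothetical):

```python
# Per-node clustering coefficient as defined above: edges among a
# node's first neighbors over the total possible number of such edges.
# The example graph is hypothetical.
def clustering(adj, node):
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbors[j] in adj[neighbors[i]])
    return links / (k * (k - 1) / 2)

# Triangle 0-1-2 plus a pendant node 3 attached to node 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
```

For node 0, only one of the three possible neighbor pairs, (1, 2), is linked, giving a coefficient of 1/3; the pendant node 3 has coefficient 0.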

The fourth is adaptivity. An adaptive network can rewire itself efficiently, that is, restructure or reorganize its topology on the basis of environmental shifts, to continue providing efficient performance. For example, if a supplier can't reliably meet a customer's demands, the customer should be able to choose another supplier.

A typical supply chain with a tree-like or hierarchical structure lacks these four properties: the clustering coefficient is nearly zero, and the characteristic path length scales linearly with the number of nodes (or agents) N. In designing complex agent networks with built-in survivability, conventional optimization tools won't work because of the problem's extremely large scale. When networks were smaller, we could understand their overall behavior by concentrating on the individual components' properties. But as networks expand, this becomes impossible, so we shift focus to the statistical properties of the collective behavior.

Using topologies

Studying complex networks such as protein interaction networks, regulatory networks, social networks of acquaintances, and information networks such as the Web is illuminating the principles that make these networks extremely resilient to their respective chaotic environments. The core principles extracted from this exploration will prove valuable in building robust models for survivable complex agent networks.

Complex-network theory currently offers random-graph, small-world, and scale-free network topologies as likely candidates for survivable networks (see the sidebar "Complex Networks" for more on this topic). Evaluating these for survivability (see Figure 2), we find that no one topology consistently outperforms the others. For example, while small-world networks have better clustering properties, scale-free networks are significantly more robust to random attacks.

Figure 1. How redundancy affects survivability. (a) A part of the multiagent system for military logistics modeled using the UltraLog (www.ultralog.net) program. This example models each entity, such as main support battalion (MSB), forward support battalion (FSB), and battalion, as a software agent. (We've changed the agents' names for security reasons.) In the current scenario, MSBs send the supplies to the FSBs, who in turn forward these to the battalions. (b) A modified military supply chain with some redundancy built into it. This network performs much better in the event of node failures and hence is more dependable than the first network.

So, we can't directly use these

topologies to build supply networks. We can, however, use their evolution principles to build supply chain networks that perform well with respect to all of the survivability components.

Researchers have studied complex networks in part to find ways to design evolutionary algorithms for modeling networks

SEPTEMBER/OCTOBER 2004 www.computer.org/intelligent


Complex Networks

Social scientists, among the first to study complex networks extensively, focused on acquaintance networks, where nodes represent people and edges represent the acquaintances between them. Social psychologist Stanley Milgram posited the "six degrees of separation" theory: that in the US, a person's social network has an average acquaintance path length of six. 1 This turns out to be a particular instance of the small-world property found in many real-world networks, which, despite their large size, have a relatively short path between any two nodes.

An early effort to model complex networks introduced random graphs for modeling networks with no obvious pattern or structure. 2 A random graph consists of N nodes, and two nodes are connected with a connection probability p. Random graphs are statistically homogeneous because most nodes have a degree (that is, the number of edges incident on the node) close to the graph's average degree, and significantly small and large node degrees are exponentially rare.

However, studying the topologies of diverse large-scale networks found in nature reveals a more complex and unpredictable dynamic structure. Two measures quantifying network topology found to differ significantly in real networks are the degree distribution (the fraction of nodes with degree k) and the clustering coefficient. Later modeling efforts focused on trying to reproduce these properties.3,4 Duncan Watts and Steven Strogatz introduced the concept of small-world networks to explain the high degree of transitivity (order) in complex networks.5 The Watts-Strogatz model starts from a regular 1D ring lattice on L nodes, where each node is joined to its first K neighbors. Then, with probability p, each edge is rewired with one end remaining the same and the other end chosen uniformly at random, without allowing multiple edges (more than one edge joining a pair of vertices) or loops (edges joining a node to itself). The resulting network is a regular lattice when p = 0 and a random graph when p = 1, because all edges are rewired. This network class displays a high clustering coefficient for most values of p, but as p → 1, it behaves like a random graph.
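The rewiring procedure can be sketched as follows. This is our illustrative reading, not code from the article: `watts_strogatz` is an invented name, and a rewiring that would create a self-loop or duplicate edge is simply skipped (the original edge is kept) rather than redrawn, a simplification of the published model:

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Ring lattice on n nodes, each joined to its k nearest neighbors
    (k even); each edge is then rewired with probability p, keeping one
    endpoint fixed and redrawing the other uniformly at random."""
    rng = random.Random(seed)
    edges = {(i, (i + offset) % n)
             for i in range(n) for offset in range(1, k // 2 + 1)}
    result = set()
    for (u, v) in sorted(edges):
        if rng.random() < p:
            w = rng.randrange(n)
            # Skip rewirings that would create a loop or a multiple edge.
            if w != u and (u, w) not in edges and (w, u) not in edges \
                    and (u, w) not in result and (w, u) not in result:
                result.add((u, w))
                continue
        result.add((u, v))
    return result

lattice = watts_strogatz(20, 4, p=0.0)           # p = 0: the pure ring lattice
noisy = watts_strogatz(20, 4, p=0.3, seed=2)     # a few long-range shortcuts
```

With p = 0 the output is exactly the regular lattice; increasing p introduces the long-range shortcuts that collapse the characteristic path length while most of the local clustering survives.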

Albert-László Barabási and Réka Albert later proposed an evolutionary model based on growth and preferential attachment leading to a network class, scale-free networks, with power-law degree distribution.6 Many real-world networks' degree distribution follows a power law, fundamentally different from the peaked distribution observed in random graphs and small-world networks. Barabási and Albert argued that a static random graph of the Watts-Strogatz model fails to capture two important features of large-scale networks: their constant growth and the inherent selectivity in edge creation. Complex networks such as the Web, collaboration networks, or even biological networks are growing continuously with the creation of new Web pages, the birth of new individuals, and gene duplication and evolution. Moreover, unlike random networks where each node has the same chance of acquiring a new edge, new nodes entering the scale-free network don't connect uniformly to existing nodes but attach preferentially to higher-degree nodes. This reasoning led Barabási and Albert to define two mechanisms:

• Growth: Start with a small number of nodes—say, m0—and assume that every time a node enters the system, m edges are pointing from it, where m < m0.

• Preferential attachment: Every time a new node enters the system, each edge of the newly connected node preferentially attaches to a node i with degree k_i with the probability

Π_i = k_i / Σ_j k_j

Research has shown that the second mechanism leads to a network with power-law degree distribution P(k) ~ k^(–γ) with exponent γ = 3. Barabási and Albert dubbed these networks "scale free" because they lack a characteristic degree and have a broad tail of degree distribution. Following the proposal of the first scale-free model, researchers have introduced many more refined models, leading to a well-developed theory of evolving networks.7
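The two mechanisms combine into a short simulation. This is a hedged sketch, not the authors' code: the seed core is taken to be a small ring, and degree-proportional sampling is implemented by maintaining a list in which each node appears once per incident edge:

```python
import random

def barabasi_albert(n, m, seed=0):
    """Growth + preferential attachment: each new node brings m edges,
    each attaching to node i with probability k_i / sum_j k_j."""
    rng = random.Random(seed)
    m0 = m + 1                                  # small seed core, here a ring
    edges = [(i, (i + 1) % m0) for i in range(m0)]
    attachment = [v for e in edges for v in e]  # node i appears k_i times
    for new in range(m0, n):
        targets = set()                         # a set forbids multiple edges
        while len(targets) < m:
            targets.add(rng.choice(attachment)) # degree-proportional draw
        for t in targets:
            edges.append((new, t))
            attachment += [new, t]
    return edges

edges = barabasi_albert(500, 2, seed=3)
deg = {}
for u, v in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1
```

Running this produces the heavy tail the text describes: the mean degree stays near 2m, while a handful of early "rich get richer" hubs accumulate far more edges.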

Protein-to-protein interactions in metabolic and regulatory networks and other biological networks also show a striking ability to survive under extreme conditions. Most of these networks' underlying properties resemble the three most familiar networks found in the literature (see Figure 1 in the article).

Complex networks are also vulnerable to node or edge losses, which disrupt the paths between nodes or increase their length and make communication between them harder. In severe cases, an initially connected network breaks down into isolated components that can no longer communicate. Numerical and analytical studies of complex networks indicate that a network's structure plays a major role in its response to node removal. For example, scale-free networks are more robust than random or small-world networks with respect to random node loss.8 Large scale-free networks will tolerate the loss of many nodes yet maintain communication between those remaining. However, they're sensitive to removal of the most-connected nodes (by a targeted attack on critical nodes, for example), breaking down into isolated pieces after losing just a small percentage of these nodes.

References

1. S. Milgram, "The Small World Problem," Psychology Today, vol. 2, May 1967, pp. 60–67.
2. P. Erdős and A. Rényi, "On Random Graphs I," Publicationes Mathematicae, vol. 6, 1959, pp. 290–297.
3. S.N. Dorogovtsev and J.F.F. Mendes, "Evolution of Networks," Advances in Physics, vol. 51, no. 4, 2002, pp. 1079–1187.
4. M.E.J. Newman, "The Structure and Function of Complex Networks," SIAM Rev., vol. 45, no. 2, 2003, pp. 167–256.
5. D.J. Watts and S.H. Strogatz, "Collective Dynamics of 'Small-World' Networks," Nature, vol. 393, June 1998, pp. 440–442.
6. A.-L. Barabási and R. Albert, "Emergence of Scaling in Random Networks," Science, vol. 286, Oct. 1999, pp. 509–512.
7. R. Albert and A.-L. Barabási, "Statistical Mechanics of Complex Networks," Reviews of Modern Physics, vol. 74, Jan. 2002, pp. 47–97.
8. R. Albert, H. Jeong, and A.-L. Barabási, "Error and Attack Tolerance of Complex Networks," Nature, vol. 406, July 2000, pp. 378–382.

26 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS



with distinct properties found in nature. A network's evolutionary mechanism is designed such that the network's inherent properties emerge owing to the mechanism. For example, small-world networks were designed to explain the high clustering coefficient found in many real-world networks, while the "rich get richer" phenomenon used in the Barabási-Albert model explains the scale-free distribution.2

Similarly, we seek to design supply networks with inherent survivability components (see Figure 3), obtaining these components by coining appropriate growth mechanisms. Of course, having all the aforementioned properties in a network might not be practically feasible—we'd likely have to negotiate trade-offs depending on the domain. Also, domain specificities might make it inefficient to incorporate all properties. For instance, in a supply network, we might not be able to rewire the edges as easily as we can in an information network, so we would concentrate more on obtaining other properties such as low characteristic path length, robustness to failures and attacks, and high clustering coefficients. So, the construction of these networks is domain specific.

Establishing edges between network nodes is also domain specific. For instance, in a supply network, a retailer would likely prefer to have contact with other geographically convenient nodes (distributors, warehouses, and other retailers). At the same time, nodes in a file-sharing network would prefer to attach to other nodes known to locate or hold many shared files (that is, nodes of high degree).

Obtaining the survivability components

While evolving the network on the basis of domain constraints, we need to incorporate four traits into the growth model for obtaining good survivability components.

The first is low characteristic path length. During network construction, establish a few long-range connections between nodes that require many steps to reach one from another.

The second is good clustering. When two nodes A and B are connected, new edges from A should prefer to attach to neighbors of B, and vice versa.

Figure 2. Comparing the survivability components of random, small-world, and scale-free networks:

• Random: peaked (Poisson) degree distribution P(k); characteristic path length scales as log(N); clustering coefficient equals p (the connection probability); similar responses to both random and targeted attacks.

• Small-world: peaked degree distribution; characteristic path length scales linearly with N for small p and as log(N) for higher p; clustering coefficient is high, but as p → 1 the network behaves like a random graph; response to attacks is similar to random networks, because the degree distribution is similar.

• Scale-free: power-law degree distribution P(k); characteristic path length scales as log(N)/log(log(N)); clustering coefficient is ((m–1)/2)(log(N)/N), where m is the number of edges with which a node enters; highly resilient to random failures while being very sensitive to targeted attacks.

Figure 3. The transition from a supply chain to a survivable supply network. (The hierarchical manufacturer-warehouse-retailer chain becomes a networked topology in which failed nodes and failed edges can be bypassed through alternate paths.)

SEPTEMBER/OCTOBER 2004 www.computer.org/intelligent 27

Dependable Agent Systems

Figure 4. Snapshots of the modeled networks during their growth (preferential attachment, random attachment, and the proposed attachment rules), where the nodes number 70. MSBs are green, FSBs are red, and battalions are blue.

The third is robustness to random and targeted failure. Preferential attachment—where new nodes entering the network don't connect uniformly to existing nodes but attach preferentially to higher-degree nodes (see the sidebar for more details)—leads to scale-free networks

with very few critical and many not-so-critical nodes. Here we measure a node's criticality in terms of the number of edges incident on it. So, these networks are robust to random failures (the probability that a critical node fails is very small) but not to targeted attacks (attacking the very few critical nodes would devastate the network). Also, it's not practically feasible to have all nodes play an equal role in the system—that is, be equally critical. Thus, the network should have a good balance of critical, not-so-critical, and noncritical nodes.

The fourth is efficient rewiring. Rewiring edges in a network might or might not be feasible, depending on the domain. But where it is feasible, it should preserve the other three traits.

Although complete graphs come equipped with good survivability components, they clearly aren't cost effective. Allowing every agent in an agent system to communicate with every other agent uses system bandwidth inefficiently and could completely bog down the system. So the amount of redundancy results from a trade-off between cost and survivability.

An illustration

Suppose we want to build a topology for a military supply chain that must be survivable in wartime. First, we broadly classify the network nodes into three types:

• Battalions prefer to attach to a highly connected node so that the supplies from different parts of the network will be transported to them in fewer steps. Battalions also require quick responses, so they prefer the subsequent links to attach to nodes at convenient shorter distances (in our model we considered a fixed distance of two).

• A forward support battalion (FSB) prefers to attach to highly connected nodes so that its supplies proliferate faster in the network. The supply range from an FSB goes up to a particular distance (at most three in our model).

• A main support battalion (MSB) also prefers to attach to a highly connected node to enable its supplies to proliferate faster in the network. We assume an unrestricted supply reach from an MSB, thus facilitating some long-range connections.

In a conventional logistics network, the MSBs supply commodities (such as ammunition, food, and fuel) to the FSBs, who in turn forward them to the battalions. Our approach doesn't restrict node functionalities as such—for example, we assume that even a battalion can supply commodities to other battalions if necessary.

Figure 5. How our proposed network performed: (a) the log-log plot of the degree distribution for all three networks (Models 1, 2, and 3); (b) the characteristic path length of the proposed network against the log of the number of nodes.



Growth mechanisms

Start with a small number of nodes—say, m0—and assume that every time a node enters the system, m edges are pointing from it, where m < m0. Battalions, FSBs, and MSBs enter the system in a certain ratio l:m:n where l > m > n:

• A battalion has one edge pointing from it and a second edge added with a probability p.
• An FSB has three edges pointing from it.
• An MSB has five edges pointing from it.

The attachment rules applied depend on which node type enters the system:

• For a battalion, the first edge attaches to a node i of degree k_i with the probability Π_i = k_i / Σ_j k_j. The second edge, which exists with a probability p, attaches to a randomly chosen node at a distance of two.

• For an FSB, the first edge attaches to a node i of degree k_i with the probability Π_i = k_i / Σ_j k_j. The subsequent edges attach to a randomly chosen node at a distance of at most three.

• For an MSB, each edge attaches preferentially to a node i with degree k_i with the probability Π_i = k_i / Σ_j k_j.

Table 1. Simulation results.

                             Model 1 (random)    Model 2 (preferential)    Model 3 (proposed)
Clustering coefficient       0.0038–0.0039       0.013–0.019               0.35–0.39
Characteristic path length   5.26–5.36           4.09–4.25                 4.69–4.79
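The three attachment rules above can be sketched as follows. This is our illustrative reading of the model, not the authors' implementation: the helper names are invented, the battalion's distance-two target is approximated by any non-neighbor within two hops, and an attachment that lands on an existing neighbor or on the new node itself is simply skipped:

```python
import random
from collections import deque

def preferential_target(adj, rng):
    # Node i is drawn with probability k_i / sum_j k_j.
    pool = [v for v, nbrs in adj.items() for _ in nbrs]
    return rng.choice(pool)

def nodes_within(adj, start, dist):
    # Nodes reachable from start in at most dist hops (excluding start).
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        v, d = queue.popleft()
        if d > 0:
            found.append(v)
        if d < dist:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append((w, d + 1))
    return found

def add_node(adj, kind, p, rng):
    new = max(adj) + 1
    adj[new] = set()
    first = preferential_target(adj, rng)          # first edge: preferential
    adj[new].add(first); adj[first].add(new)
    count, reach = {"battalion": (1 if rng.random() < p else 0, 2),
                    "fsb": (2, 3),
                    "msb": (4, None)}[kind]
    for _ in range(count):
        if reach is None:                          # MSB: every edge preferential
            t = preferential_target(adj, rng)
        else:                                      # battalion/FSB: nearby node
            cands = [v for v in nodes_within(adj, new, reach)
                     if v != new and v not in adj[new]]
            if not cands:
                break
            t = rng.choice(cands)
        if t != new and t not in adj[new]:         # skip loops and duplicates
            adj[new].add(t); adj[t].add(new)

rng = random.Random(7)
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}   # small seed ring
kinds = ["battalion"] * 25 + ["fsb"] * 4 + ["msb"]        # the 25:4:1 ratio
for _ in range(60):
    add_node(adj, rng.choice(kinds), p=0.5, rng=rng)
```

Drawing the node type from the 25:4:1 pool reproduces the entry ratio in expectation; the battalion and FSB rules attach most edges close to the new node, which is what drives up the clustering coefficient relative to pure preferential attachment.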

Simulation and analysis

Using this method, we built a network of 1,000 nodes with l, m, and n being 25, 4, and 1 (we obtained these values from the current configuration of the military logistics network used in the UltraLog program) and p = 1/2. We compared this network's survivability with that of two other networks built using similar mechanisms, except that one used purely preferential attachment rules (similar to scale-free networks) and the other used purely random attachment rules (similar to random networks) (see Figure 4). All three networks had an equal number of edges and nodes to ensure a fair comparison.

We refer to the networks built from random, preferential, and proposed attachment rules as Models 1, 2, and 3, respectively. As we noted earlier, a typical military supply chain (see Figure 1a) with a tree-like or hierarchical structure has deficient survivability components, making it vulnerable to both random and targeted attacks. Models 1, 2, and 3 outperform the typical supply network in all survivability components.

Figure 5a shows the three models' degree distribution. As expected, the preferential-attachment network has a heavier tail than the other two networks. We measured survivability components for all three networks. The clustering coefficient for Model 3 was the highest (see Table 1). The Model 3 attachment rules, especially those for battalions and FSBs, contribute implicitly to the clustering coefficient, unlike the attachment rules in the other models.

The proposed network model's characteristic path length measured between 4.69 and 4.79 despite the network's large size (1,000 nodes). This value puts it between the preferential and random attachment models. Also, as Figure 5b shows, the characteristic path length increases in the order of log(N) as N increases. Model 3 clearly displays small-world behavior.

To measure network robustness, we removed a set of nodes from the network and evaluated its resilience to disruptions. We considered two attack types: random and targeted. To simulate random attacks, we removed a set of randomly chosen nodes; for targeted attacks, we removed a set of nodes selected strictly in order of decreasing node degree. To determine robustness, we measured how the size of each network's largest connected component, characteristic path length, and maximum distance within the largest connected component changed as a function of the number of nodes removed. We expect that in a robust network the size of the largest connected component is a considerable fraction of N (usually O(N)), and the distances between nodes in the largest connected component don't increase considerably.

For random failures, Figure 6 shows that Model 3's robustness nearly matches that of the preferential-attachment network (note that scale-free networks are highly resilient to random failures). Also, the decrease in the largest connected component's size is linear with respect to the number of nodes removed, which corresponds to the slowest possible decrease. So, we can safely conclude that these networks are robust to random failures—most of the nodes in the network have a degree less than four, and removing smaller-degree nodes impacts the networks much less than removing high-degree nodes (called hubs).

Figure 6. Responses of the three networks to random attacks, plotted as (a) the size of the largest connected component, (b) characteristic path length, and (c) maximum distance in the largest connected component against the percentage of nodes removed from each network.

Figure 7. The three networks' responses to targeted attacks, plotted as (a) the size of the largest connected component, (b) characteristic path length, and (c) maximum distance in the largest connected component against the percentage of nodes removed from each network.

The Authors

Hari Prasad Thadakamalla is a PhD student in the Department of Industrial and Manufacturing Engineering at Pennsylvania State University, University Park. His research interests include supply networks, search in complex networks, stochastic systems, and control of multiagent systems. He obtained his MS in industrial engineering from Penn State. Contact him at hpt102@psu.edu.

Usha Nandini Raghavan is a PhD student in industrial and manufacturing engineering at Pennsylvania State University, University Park. Her research interests include supply chain management, graph theory, complex adaptive systems, and complex networks. She obtained her MSc in mathematics from the Indian Institute of Technology, Madras. Contact her at uxr102@psu.edu.

Soundar Kumara is a Distinguished Professor of industrial and manufacturing engineering. He holds joint appointments with the Department of Computer Science and Engineering and the School of Information Sciences and Technology at Pennsylvania State University. His research interests include complexity in logistics and manufacturing, software agents, neural networks, and chaos theory as applied to manufacturing process monitoring and diagnosis. He's an elected active member of the International Institute of Production Research. Contact him at skumara@psu.edu.

Réka Albert is an assistant professor of physics at Pennsylvania State University and is affiliated with the Huck Institutes of the Life Sciences. Her main research interest is modeling the organization and dynamics of complex networks. She received her PhD in physics from the University of Notre Dame. She is a member of the American Physical Society and the Society for Mathematical Biology. Contact her at ralbert@phys.psu.edu.

These networks’ responses to targeted<br />

attacks are inferior compared to their resilience<br />

to r<strong>and</strong>om attacks (see Figure 7). The<br />

size of the largest component decreases much<br />

faster for the proposed network than for the<br />

other two networks, but the proposed network<br />

performs better on the other two robustness<br />

measures. That is, the distances in the connected<br />

component are considerably smaller<br />

when more than 10 percent of nodes are<br />

removed.<br />

We can improve robustness to targeted attacks by introducing constraints in the attachment rules. Here we assume that node type constrains its degree—that is, MSBs, FSBs, and battalions can't have more than m1, m2, and m3 edges, respectively, incident on them. This is a reasonable assumption because in military logistics (or any organization's logistics management, for that matter), the suppliers might not be able to cater to more than a certain number of battalions or other suppliers. Initial experiments (see Figure 8) show that a network with these constraints displayed improved robustness to targeted attacks while not deviating much from the clustering coefficient. However, as we restrict how many links a node can receive, the network's characteristic path length increases (see Table 2). Clearly a trade-off exists between robustness to targeted attacks and the average characteristic path length.

Figure 8. The proposed network's responses to targeted attacks (size of the largest connected component against the percentage of nodes removed) for different values of m1, m2, and m3: (m1 = 4, m2 = 10, m3 = 25), (m1 = 4, m2 = 8, m3 = 12), and (m1 = 3, m2 = 6, m3 = 10).

Table 2. The proposed network's characteristic path length for different m1, m2, and m3 values.

Values of m1, m2, and m3           Characteristic path length
m1 = ∞, m2 = ∞, m3 = ∞             4.4
m1 = 4, m2 = 10, m3 = 25           6.2
m1 = 4, m2 = 8, m3 = 12            7.1
m1 = 3, m2 = 6, m3 = 10            8.0
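A degree cap of this kind can be folded into preferential attachment by restricting the candidate pool to nodes still under their type's limit. The sketch below uses our own names and a minimal roulette-wheel selection; it is not the authors' code:

```python
import random

def capped_preferential_target(degrees, caps, kinds, rng):
    """Degree-proportional choice restricted to nodes still under the
    degree cap assigned to their type (m1, m2, m3 in the text)."""
    eligible = [v for v in degrees if degrees[v] < caps[kinds[v]]]
    weights = [degrees[v] for v in eligible]
    total = sum(weights)
    if total == 0:                       # all eligible nodes isolated
        return rng.choice(eligible)
    r = rng.uniform(0, total)            # roulette-wheel selection
    acc = 0.0
    for v, w in zip(eligible, weights):
        acc += w
        if r <= acc:
            return v
    return eligible[-1]

# Node 0 (a battalion capped at 3 edges) is already full, so it can no
# longer absorb attachments even though it has the highest degree.
degrees = {0: 3, 1: 1, 2: 1}
caps = {"battalion": 3, "msb": 25}
kinds = {0: "battalion", 1: "msb", 2: "msb"}
target = capped_preferential_target(degrees, caps, kinds, random.Random(5))
```

Capping the hubs spreads edges across more nodes, which is precisely why robustness to targeted attacks improves while the characteristic path length grows.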

The fourth measure of survivability, network adaptivity, relates more to node functionality than to topology. Node functionality should facilitate the ability to rewire. For example, if a supplier can't fulfill a customer's demands, the customer seeks an alternate supplier—that is, the edge connected to the supplier is rewired to be incident on another supplier. Our model rewires according to its attachment rules. We conjecture that in such a case, other survivability components (clustering coefficient, characteristic path length, and robustness) will be intact. But to make a stronger argument we need more analysis in this direction.

The growth mechanism we describe is more like an illustration because real-world data aren't available, but we can always modify it to incorporate domain constraints. For example, we've assumed that a new node can attach preferentially to any node in the network, which might not be a realistic assumption. If specific geographical constraints are known, we can modify our mechanism to make the new node entering the system attach preferentially only within a set of nodes that satisfy the constraints.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. We acknowledge DARPA for funding this work under grant MDA972-01-1-0038 as part of the UltraLog program.

References

1. J.M. Swaminathan, S.F. Smith, and N.M. Sadeh, "Modeling Supply Chain Dynamics: A Multiagent Approach," Decision Sciences, vol. 29, no. 3, 1998, pp. 607–632.
2. A.-L. Barabási and R. Albert, "Emergence of Scaling in Random Networks," Science, vol. 286, Oct. 1999, pp. 509–512.




Survivability of a Distributed Multi-Agent Application - A Performance Control Perspective

Nathan Gnanasambandam, Seokcheon Lee, Soundar R.T. Kumara, Natarajan Gautam
Pennsylvania State University, University Park, PA 16802
{gsnathan, stonesky, skumara, ngautam}@psu.edu

Wilbur Peng, Vikram Manikonda
Intelligent Automation Inc., Rockville, MD 20855
{wpeng, vikram}@i-a-i.com

Marshall Brinn
BBN Technologies, Cambridge, MA 02138
mbrinn@bbn.com

Mark Greaves
DARPA IXO, Arlington, VA 22203
mgreaves@darpa.mil

Abstract

Distributed Multi-Agent Systems (DMAS) such as supply chains functioning in highly dynamic environments need to achieve maximum overall utility during operation. The utility from maintaining performance is an important component of their survivability. This utility is often met by identifying trade-offs between quality of service and performance. To adaptively choose the operational settings for better utility, we propose an autonomous and scalable queueing-theory-based methodology to control the performance of a hierarchical network of distributed agents. By formulating the MAS as an open queueing network with multiple classes of traffic, we evaluate the performance and subsequently the utility, from which we identify the control alternative for a localized, multi-tier zone. When the problem scales, another larger queueing network can be composed using zones as building blocks. This method advocates the systematic specification of the DMAS's attributes to aid real-time translation of the DMAS into a queueing network. We prototype our framework in Cougaar and verify our results.

1. Introduction<br />

Distributed multi-agent systems (DMAS), through adaptivity,<br />

have enormous potential to act as the “brains” behind<br />

numerous emerging applications such as computational<br />

grids, e-commerce hubs, supply chains <strong>and</strong> sensor<br />

networks [13]. The fundamental hallmark of all these applications<br />

is dynamic <strong>and</strong> stressful environmental conditions,<br />

of one type or the other, in which the MAS as a whole must<br />

survive even as it suffers temporary or permanent damage.<br />

While the survival notion necessitates adaptivity to diverse<br />

conditions along the dimensions of performance, security<br />

<strong>and</strong> robustness, delivering the correct proportion of these<br />

quantities can be quite a challenge. From a performance<br />

standpoint, a survivable system can deliver excellent Quality<br />

of Service (QoS) even when stressed. A DMAS could be<br />

considered survivable if it can maintain at least x% of system<br />

capabilities <strong>and</strong> y% of system performance in the face<br />

of z% of infrastructure loss <strong>and</strong> wartime loads (x, y, z are<br />

user-defined) [7].<br />

We address a piece of the survivability problem by building<br />

an autonomous performance control framework for the<br />

DMAS. It is desirable that the adaptation framework be<br />

generic <strong>and</strong> scalable especially when building large-scale<br />

DMAS such as UltraLog [2]. For this, one can utilize a<br />

methodology similar to Jung <strong>and</strong> Tambe [19], composing<br />

the bigger society from smaller building blocks (i.e. agent<br />

communities). Although Jung <strong>and</strong> Tambe [19] successfully<br />

employ strategies for co-operativeness <strong>and</strong> distributed<br />

POMDP to analyze performance, an increase in the number<br />

of variables in each agent can quickly render POMDP ineffective<br />

even in reasonably sized agent communities due to<br />

the state-space explosion problem. In [27], Rana <strong>and</strong> Stout<br />

Figure 1. Operational Layers forming the MAS<br />

identify data-flows in the agent network and model scalability<br />

with Petri nets, but their focus is on identifying synchronization<br />

points, deadlocks <strong>and</strong> dependency constraints<br />

with coarse support for performance metrics relating to delays<br />

<strong>and</strong> processing times for the flows. Tesauro et al. [34]<br />

propose a real-time MAS-based approach for data centers<br />

that is self-optimizing based on application-specific utility.<br />

While [19, 27] motivate the need to estimate performance of<br />

large DMAS using a building block approach, [34] justifies<br />

the need to use domain specific utility whose basis should<br />

be the network’s service-level attributes such as delays, utilization<br />

<strong>and</strong> response times.<br />

We believe that by using queueing theory we can analyze<br />

data-flows within the agent community with greater granularity<br />

in terms of processing delays <strong>and</strong> network latencies<br />

<strong>and</strong> also capitalize on using a building block approach by<br />

restricting the model to the community. Queueing theory<br />

has been widely used in networks <strong>and</strong> operating systems<br />

[5]. However, the authors have not seen the application<br />

of queueing to MAS modeling and analysis. Since agents<br />

lend themselves to being conveniently represented as a network<br />

of queues, we concentrate on engineering a queueing<br />

theory based adaptation (control) framework to enhance the<br />

application-level performance.<br />

Inherently, the DMAS can be visualized as a multi-layered<br />

system, as depicted in Figure 1. The top-most<br />

layer is where the application resides, usually conforming<br />

to some organization such as mesh, tree etc. The infrastructure<br />

layer not only abstracts away many of the complexities<br />

of the underlying resources (such as CPU, bandwidth),<br />

but more importantly provides services (such as Message<br />

Transport) and agent-to-agent services (such as naming,<br />

directory, etc.). The bottom-most layer is where the<br />

actual computational resources, memory and bandwidth reside.<br />

Most studies in the literature do not make this distinction<br />

<strong>and</strong> as such control is not executed in a layered<br />

fashion. Some studies such as [35, 17], consider controlling<br />

attributes in the physical or infrastructural layers so<br />

that some properties (eg. robustness) could result <strong>and</strong>/or<br />

the facilities provided by these layers are taken advantage<br />

of. Often, this requires rewiring the physical layer, availability<br />

of an infrastructure-level service or the ability of the<br />

application to share information with underlying layers in<br />

a timely fashion for control purposes. In this initial work,<br />

we consider control only due to application-level trade-offs<br />

such as quality of service versus performance <strong>and</strong> assume<br />

that infrastructure level services (such as load-balancing,<br />

priority scheduling) or physical level capabilities (such as<br />

rewiring) are not possible. While we intend to extend the<br />

approach to multi-layered control, it must be noted that it<br />

is not always possible for the application (or the application<br />

manager) to have access to all the underlying layers due<br />

to security reasons. In autonomic control of data centers,<br />

the application manager may have complete authority over<br />

parameters in the physical layer (servers, buffers, network),<br />

the infrastructure (middle-ware) <strong>and</strong> the applications. However,<br />

in DMAS scenarios, especially when dealing with mobile<br />

agents (as an application), trust between the layers is<br />

often partial, forcing them to negotiate parameters through<br />

authorized channels. Hence, each layer must be capable of<br />

adapting with minimum cross-layer dependencies.<br />

Our contribution in this work is to combine queueing<br />

analysis <strong>and</strong> application-level control to engineer a generic<br />

framework that is capable of self-optimizing its domain-specific<br />

utility. Secondly, we provide a methodology for<br />

engineering a self-optimizing DMAS to assure application-level<br />

survivability. While we see utility improvements<br />

by adopting application-level adaptivity, we understand that<br />

further improvement may be gained by utilizing the adaptive<br />

capabilities of the underlying layers.<br />

Before we consider the details of our framework, we<br />

classify the performance control approaches in literature in<br />

Section 2. We present the details for our Cougaar based<br />

test-bed system in Section 3. The architectural details of<br />

our framework are provided in Section 4. We provide an empirical<br />

evaluation in Section 5 <strong>and</strong> finally conclude with discussions<br />

<strong>and</strong> future work in Section 6.<br />

2. Background <strong>and</strong> Motivation<br />

2.1 Approaches in Literature<br />

Because of the diversity of literature on control frameworks<br />

<strong>and</strong> performance evaluation, we examined a representative<br />

subset primarily on the basis of control objective,<br />

(component) interdependence <strong>and</strong> autonomy, generality,<br />

composability, real-time capability (off-line/on-line<br />

control) <strong>and</strong> layering in control architecture.


In some AI based approaches such as [32, 10], behavioral<br />

or rule based controllers are employed to make the<br />

system exhibit particular behavior based upon logical reasoning<br />

or learning. While performance is not the objective,<br />

layered learning is an interesting capability that may<br />

be helpful in a large scale MAS. Learning may be from a<br />

statistical sense as well where the parameters of a transfer<br />

function are learnt from empirical data to subsequently<br />

enforce feedback control [8]. Another architectural framework<br />

called MONAD [37], utilizes a hierarchical <strong>and</strong> distributed<br />

behavior-based control module, with immense flexibility<br />

through scripting for role and resource allocation,<br />

and co-ordination. While many of these approaches favor<br />

the “sense-plan-act” or “sense <strong>and</strong> respond” paradigm <strong>and</strong><br />

some partially support flexibility through scripting, some<br />

important unanswered questions are what happens when<br />

system size changes, can all axioms <strong>and</strong> behaviors be learnt<br />

a priori, and what is the performance impact of size (i.e.<br />

scalability)?<br />

Control theoretic approaches in software performance<br />

optimization are becoming important [22, 29], with software<br />

becoming increasingly complex, multi-layered<br />

<strong>and</strong> having real-time requirements. However, because of<br />

the dynamic system boundaries, size, varying measures of<br />

performance and non-linearity in DMAS, it is very complex<br />

to design a strict control-theoretic process [21].<br />

Some approaches such as [21, 34] take the heuristic path,<br />

with occasional analogs to control theory, with an emphasis<br />

on application or domain-specific utility. Kokar et al.<br />

[22] refer to this utility as benefit function <strong>and</strong> elaborate on<br />

various analogs between software systems <strong>and</strong> traditional<br />

control systems. From the perspective of autonomic control<br />

of computer systems, Bennani <strong>and</strong> Menasce [4] study<br />

the robustness of self-management techniques for servers<br />

under highly variable workloads. Although queueing theory<br />

has been used in this work, any notion of components<br />

being distributed or agent-based seems to be absent.<br />

Furthermore, exponential smoothing or regression based<br />

load-forecasting may not be sufficient to address situations<br />

caused by wartime dynamics, catastrophic failure <strong>and</strong> distributed<br />

computing. Nevertheless, in our approach we have<br />

a notion of controlling a distributed application’s utility using<br />

queueing theory.<br />

Numerous market-based control mechanisms are available<br />

in literature such as [24, 9, 12, 6]. In market-based<br />

control systems, agents emulate buyers <strong>and</strong> sellers in a<br />

market acting only with locally available information yet<br />

helping us realize global behavior for the community of<br />

agents. While these methods are very effective <strong>and</strong> offer<br />

desirable properties such as decentralization, autonomy <strong>and</strong><br />

control hierarchy, they have been used for resource allocation<br />

[24, 9] <strong>and</strong> resource control [6]. The Challenger [9]<br />

system seeks to minimize mean flow time (job completion<br />

time minus job origination time); each task is allocated to the agent<br />

providing the least processing time. Load balancing is another<br />

application as applied by Ferguson et al. [12]. Resource allocation<br />

<strong>and</strong> load-balancing can be thought of as infrastructure<br />

level services, that agent frameworks such as Cougaar<br />

[1] provide, and hence in our work we focus on application-level<br />

performance <strong>and</strong> the associated utility to the DMAS.<br />

Finite state machines, hybrid automata and their<br />

variants have been the focus of many research efforts in agent<br />

control as in [11, 23]. The idea here is to utilize the states<br />

of the multi-agent system to represent, validate, evaluate,<br />

<strong>and</strong> choose plans that lead the system towards the goal. Often,<br />

the drawback here is that when the number of agents<br />

increases, the state-space approaches tend to become intractable.<br />

Heuristics have widely been used in controlling multiagent<br />

systems primarily in the following sense: searching<br />

<strong>and</strong> evaluating options based on domain knowledge <strong>and</strong><br />

picking a course of action (maybe a compound action composed<br />

of a schedule of individual actions) eventually. The<br />

main idea in recent heuristics based control as exemplified<br />

by [36, 26, 31] is that schedules of actions are chosen based<br />

upon requirements such as costs, feasibilities for real-time<br />

contexts, complexity, quality etc. Opportunistic planning,<br />

an interesting idea mentioned in Soto et al. [31], refers<br />

to best-effort planning (maximum quality) given the<br />

available resources. These meta-heuristics offer very effective,<br />

special-purpose solutions to control agent behavior,<br />

however to be more flexible, we separate the performance<br />

evaluation <strong>and</strong> the domain-specific application utility computation.<br />

Given that we have a model for performance estimation<br />

(whose parameters <strong>and</strong> state-space are known), dynamic<br />

programming (DP) and its adaptive version, reinforcement<br />

learning (RL), <strong>and</strong> model predictive control (MPC) have<br />

been used to find the control policy [3, 33, 20, 28, 25].<br />

Since the complexity of finding the optimal policy grows<br />

exponentially with the state space [3] <strong>and</strong> convergence has<br />

to be ensured in RL [33, 20], we take an MPC-like approach<br />

in our work for finding quick solutions in real-time. We discuss<br />

this further in Section 4.<br />

2.2 Related Work<br />

In large scale MAS applications, performance estimation<br />

<strong>and</strong> modeling itself can be a formidable task as illustrated<br />

by [16] in the UltraLog [2] context. UltraLog [2],<br />

built on Cougaar [1], uses for heuristic control a host of<br />

architectural features such as operating modes, conditions,<br />

<strong>and</strong> plays <strong>and</strong> play-books as described in [21]. Helsinger<br />

et al. [15] incorporate the aforementioned features into<br />

their closed-loop heuristic framework that balances the different<br />

dimensions of system survivability through targeted



defense mechanisms, trade-offs <strong>and</strong> layered control actions.<br />

The importance of high-level, system specifications (interchangeably<br />

called TechSpecs, specification database, component<br />

database) has been emphasized in many places such<br />

as [18, 21, 14]. These specifications contain component-wise<br />

static input/output behavior, operating requirements<br />

<strong>and</strong> control actions of agents along with domain measures<br />

of performance <strong>and</strong> computation methodologies [14].<br />

Also, queueing network based methodologies for offline<br />

<strong>and</strong> design-time performance evaluation have been applied<br />

<strong>and</strong> validated in [14, 30]. Building on these ideas, we build<br />

a real-time framework with queueing based performance<br />

prediction capabilities.<br />


(a) MAS building<br />

block: Community<br />


(b) Agent society formed by composing<br />

communities<br />

2.3 Problem Statement<br />

Since the application is the top-most layer, the survivability<br />

of a DMAS depends on its ability to leverage its<br />

knowledge of the domain, the system’s overall utility <strong>and</strong><br />

available control-knobs. The utility of the application is the<br />

combined benefit along several conflicting (eg. completeness<br />

<strong>and</strong> timeliness [7, 2]) <strong>and</strong>/or independent (eg. confidentiality<br />

<strong>and</strong> correctness [7, 2]) dimensions, which the application<br />

tries to maximize in a best-effort sense through<br />

trade-offs. Understandably, in a distributed multi-agent<br />

setting, mechanisms to measure, monitor <strong>and</strong> control this<br />

multi-criteria utility function become hard <strong>and</strong> inefficient,<br />

especially under conditions of scale-up. Given that the application<br />

does not change its high-level goals, task-structure<br />

or functionality in real-time, it is beneficial to have a framework<br />

that assists in the choice of operational modes (eg.<br />

plan quality) that maximize the utility from performance.<br />

Hence, the research objective of this work is to design <strong>and</strong><br />

develop a generic, real-time, self-controlling framework for<br />

DMAS, that utilizes a queueing network model for performance<br />

evaluation <strong>and</strong> a learned utility model to select an<br />

appropriate control alternative.<br />

2.4 Solution Methodology<br />

This research concentrates on adjusting the applicationlevel<br />

parameters or operating modes (opmodes for short)<br />

within the distributed agents to make an autonomous choice<br />

of operational parameters for agents in a reasonable-sized<br />

domain (called an agent community). The choice of opmodes<br />

is based on the perceived application-level utility of<br />

the combined system (i.e. the whole community) that current<br />

environmental conditions allow. We assume that the<br />

application’s utility depends on the choice of opmodes at<br />

the agents constituting the community because the opmodes<br />

directly affect the performance. A queueing network model<br />

is utilized to predict the impact of DMAS control settings<br />

<strong>and</strong> environmental conditions on steady-state performance<br />

Figure 2. MAS Community <strong>and</strong> Society<br />

(in terms of end-to-end delays in flows), which in turn is<br />

used to estimate the application-level utility. After evaluating<br />

<strong>and</strong> ranking several alternatives from among the feasible<br />

set of operational settings on the basis of utility, the best<br />

choice is picked.<br />
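The evaluate-and-rank step described above can be sketched as follows. This is a minimal illustration, not the CPE implementation: the opmode choices, the delay predictor and the utility weights are all hypothetical stand-ins for the queueing model and the learned utility model.<br />

```python
from itertools import product

# Hypothetical per-agent opmode choices (e.g. plan-quality levels).
OPMODE_CHOICES = {"Brigade": [1, 2, 3], "Battalion": [1, 2, 3], "Company": [1, 2]}

def predict_performance(opmodes, load):
    """Stand-in for the queueing model: predicted end-to-end delay.
    Higher plan quality k means slower processing; load scales waiting."""
    return sum(0.1 * k * load for k in opmodes.values())

def utility(opmodes, delay):
    """Hypothetical domain utility: plan-quality benefit minus delay penalty."""
    return sum(opmodes.values()) - 2.0 * delay

def best_opmode_set(load):
    """Enumerate candidate opmode sets O', predict P', rank by utility U'."""
    candidates = [dict(zip(OPMODE_CHOICES, combo))
                  for combo in product(*OPMODE_CHOICES.values())]
    return max(candidates, key=lambda o: utility(o, predict_performance(o, load)))

print(best_opmode_set(load=1.0))   # light load favors high-quality opmodes
print(best_opmode_set(load=10.0))  # heavy load favors cheap opmodes
```

Under light load the delay penalty is small, so the highest-quality opmodes win; under heavy load the ranking flips, which is exactly the quality-versus-timeliness trade-off described above.<br />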

3. Overview of Application (CPE) Scenario<br />

The Continuous Planning <strong>and</strong> Execution (CPE) Society<br />

is a command and control (C2) MAS built on Cougaar<br />

(DARPA Agent Framework [1]) that serves as the test-bed<br />

for performance control. Designed as a building block for<br />

larger scale MAS, the primary CPE prototype consists of<br />

three tiers (Brigade, Battalion, Company) as shown in Figure<br />

2a. While the discussion is mainly with respect to the<br />

structure of CPE, the system can be grown by combining<br />

many CPE communities to form large agent societies as<br />

shown in Figure 2b.<br />

CPE embodies a complete military logistics scenario<br />

with agents emulating roles such as suppliers, consumers<br />

<strong>and</strong> controllers all functioning in a dynamic <strong>and</strong> hostile (destructive)<br />

external environment. Embedded in the hierarchical<br />

structure of CPE are both command and control, and<br />

superior-subordinate relationships. The subordinates compile<br />

sensor updates <strong>and</strong> furnish them to superiors. This<br />

enables the superiors to perform the designated function<br />

of creating plans (for maneuvering <strong>and</strong> supply) as well as<br />

control directives for downstream subordinates. Upon receipt<br />

of plans, the subordinates execute them. The supply<br />

agents replenish consumed resources periodically. This<br />

high-level system definition is to be executed continuously<br />

by the application with maximum achievable performance<br />

in the presence of stresses that include temporary <strong>and</strong> catastrophic<br />

failure. Stresses associated with wartime situations<br />

cause the resource allocation (CPU, memory, bandwidth)<br />

<strong>and</strong> offered load (due to increased planning requirements)


Figure 3. Traffic flow within CPE<br />

to fluctuate immensely.<br />

As part of the application-level adaptivity features, a set<br />

of opmodes are built into the system. Opmodes allow individual<br />

tasks (such as plans, updates, control) to be executed<br />

at different qualities or to be processed at different rates. We<br />

assume that TechSpecs for the CPE application (similar to<br />

[14]) are available to be utilized by the control framework.<br />

Although CPE and UltraLog are functionally distinct,<br />

the same flavor of activities is reflected in both. Both of<br />

them share the same Cougaar infrastructure; execute planning<br />

in dynamic, distributed settings with similar QoS requirements;<br />

and each is one application with physically<br />

distributed components interconnected by task flows (as<br />

shown in Figure 3 in the case of CPE), wherein the individual<br />

utilities of the components contribute to the global<br />

survivability.<br />

4. Architecture of the Performance Control<br />

Framework<br />

The distributed performance control framework that<br />

accomplishes application-level survivability while operating<br />

amidst infrastructure/physical layer <strong>and</strong> environmental<br />

stresses is represented in Figure 4. This representation consists<br />

of activities, modules, knowledge repositories <strong>and</strong> information<br />

flow through a distributed collection of agents.<br />

The features for adaptivity are solely at the application level<br />

without considering infrastructure or physical level adaptivity<br />

such as dynamically allocating processor share or adjusting<br />

the buffer sizes.<br />

Figure 4. Architecture Overview<br />

Architecture Overview<br />

When the application is stressed by an amount S by the<br />

underlying layers (due to under-allocation of resources)<br />

<strong>and</strong> the environment (due to increased workloads during<br />

wartime conditions), the DMAS Controller has to examine<br />

all its performance-related variables from set X <strong>and</strong> the<br />

current overall performance P in order to adapt. The variables<br />

that need to be maintained are specified in the TechSpecs<br />

and may include delays, time-stamps, utilizations and<br />

their statistics. They are collected in a distributed fashion<br />

through measurement points (MP ) which are “soft” storage<br />

containers that reside inside the agents and contain information<br />

on what, when <strong>and</strong> how they should be measured.<br />

The DMAS Controller knows the set of flows F that traverse<br />

the network <strong>and</strong> the set of packet types T from the<br />

TechSpecs. With (F, T, X, C), where C is a suggestion<br />

based on prior effectiveness from the DMAS Controller, the<br />

Model Builder can select a suitable queueing model template<br />

Q. The Control Set Evaluator knows the current opmode<br />

set O as well as the set of possible opmodes, OS<br />

from TechSpecs. To evaluate the performance due to a candidate<br />

opmode set O ′ , the Control Set Evaluator uses the<br />

Queueing Model with a scaled set of operating conditions<br />

X ′ . Once the performance P ′ is estimated by the Queueing<br />

Model, it can be cached in the performance database PDB<br />

<strong>and</strong> then sent to the Utility Calculator. The Utility Calculator<br />

computes the domain utility (U ′ ) due to (O ′ , P ′ )<br />

<strong>and</strong> caches it in the utility database, UDB. Subsequently,<br />

the optimal operating mode O ∗ is identified <strong>and</strong> sent to the


DMAS Controller. The functional units of the architecture<br />

are distributed but for each community that forms part of<br />

a MAS society, O ∗ will be calculated by a single agent.<br />

We now examine the functionality and role of each<br />

component of the framework in greater detail.<br />
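A compressed sketch of the pipeline just described, using the component names from Figure 4. The classes below are illustrative local stand-ins: in the actual framework these are distributed Cougaar components, and the queueing evaluation and utility weighting shown here are placeholder formulas, not the real models.<br />

```python
# Illustrative local stand-ins for the Figure 4 components (the real ones
# are distributed agents/plugins; all formulas here are placeholders).

class ModelBuilder:
    def build(self, flows, types, measurements, suggestion):
        # Select/parameterize a queueing template Q from (F, T, X, C).
        return {"flows": flows, "types": types, "X": measurements}

class QueueingModel:
    def evaluate(self, template, opmodes, conditions):
        # Placeholder prediction of performance P' (an end-to-end delay).
        return sum(conditions.values()) / max(sum(opmodes.values()), 1)

class UtilityCalculator:
    def utility(self, opmodes, performance):
        return sum(opmodes.values()) - performance  # hypothetical weighting

def control_step(flows, types, X, C, candidates):
    """One pass of the loop: build Q, score each candidate O', return O*."""
    Q = ModelBuilder().build(flows, types, X, C)
    model, calc = QueueingModel(), UtilityCalculator()
    PDB, UDB = {}, {}  # performance and utility caches
    for i, O_prime in enumerate(candidates):
        X_prime = {k: v * 1.1 for k, v in X.items()}  # scaled conditions
        PDB[i] = model.evaluate(Q, O_prime, X_prime)
        UDB[i] = calc.utility(O_prime, PDB[i])
    return candidates[max(UDB, key=UDB.get)]  # O*, sent to the DMAS Controller

print(control_step([], [], {"x": 1.0}, None, [{"a": 1}, {"a": 5}]))
```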

4.1 Self-Monitoring Capability<br />

Any system that wants to control itself should possess<br />

a clear specification of the scope of the variables it has to<br />

monitor. The TechSpecs is a distributed structure that supports<br />

this purpose by housing meta-data about all variables,<br />

X, that have to be monitored in different portions of the<br />

community (refer to [14]). The data/statistics, collected in a<br />

distributed way, are then aggregated to assist in choosing control alternatives<br />

by the top-level controller that each community will<br />

possess.<br />

The attributes that need to be tracked are formulated<br />

in the form of measurement points (MP ). For example,<br />

one simple measurement could be specified as<br />

{what = delay, when = every packet, how =<br />

timestamp at receiving end − timestamp at sending end}<br />

which is subsequently stored in an MP. Each agent can<br />

look up its own TechSpecs <strong>and</strong> from time-to-time forward<br />

a measurement to its superior. The superior can analyze<br />

this information (eg. calculate statistics such as mean or<br />

variance) <strong>and</strong>/or add to this information <strong>and</strong> forward it<br />

again. We have measurement points for time-periods, timestamps,<br />

operating-modes, control <strong>and</strong> generic vector-based<br />

measurements. These measurement points can be chained<br />

for tracking information for a flow such that information is<br />

tagged-on at every point the flow traverses. For the sake of<br />

reliability, the information that is contained in these agents<br />

is replicated at several points, so that when packets arrive<br />

late or not at all, previously stored<br />

packets <strong>and</strong> their corresponding information can be utilized<br />

for control purposes.<br />
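A measurement point of the kind described above can be sketched as follows. The class and field names are our own illustration; real MPs are declared through TechSpecs rather than hard-coded, and the delay example mirrors the {what, when, how} specification given earlier.<br />

```python
class MeasurementPoint:
    """A "soft" storage container for one monitored attribute (a sketch;
    actual MPs are specified via TechSpecs, not hard-coded like this)."""
    def __init__(self, what, when, how):
        self.what, self.when, self.how = what, when, how
        self.samples = []

    def record(self, packet):
        # Apply the "how" rule to each observed packet.
        self.samples.append(self.how(packet))

    def mean(self):
        # A superior could compute such statistics before forwarding.
        return sum(self.samples) / len(self.samples)

# Example from the text: per-packet delay = receive timestamp - send timestamp.
delay_mp = MeasurementPoint(
    what="delay", when="every packet",
    how=lambda pkt: pkt["recv_ts"] - pkt["send_ts"])

delay_mp.record({"send_ts": 10.0, "recv_ts": 10.4})
delay_mp.record({"send_ts": 11.0, "recv_ts": 11.6})
print(delay_mp.mean())
```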

4.2 Self-Modeling Capability<br />

One of the key features of this framework is that it<br />

has the capability to choose a type of performance model<br />

for analyzing the current system configuration from several<br />

queueing model templates provided. The type of model that<br />

is utilized is based on the accuracy, the computation time<br />

<strong>and</strong> the history of effectiveness of the model. For example,<br />

a simulation based queueing model may be very accurate<br />

but cannot evaluate enough alternatives in limited time, in<br />

which case an analytical model (such as BCMP [5], QNA<br />

[38]) is preferred.<br />

The inputs to the model builder are the flows that traverse<br />

the network (F ), the types of packets (T ) <strong>and</strong> the current<br />

configuration of the network. If at a given time, we know<br />

that there are n agents interconnected in a hierarchical fashion<br />

then the role of this unit is to represent that information<br />

in the required template format (Q). The current number<br />

of agents is known to the controller by tracking the measurement<br />

points. For example, if there is no response from<br />

an agent for a sufficient period of time, then for the purpose<br />

of modeling, the controller may assume the agent to<br />

be non-existent. In this way dynamic configurations can<br />

be h<strong>and</strong>led. On the other h<strong>and</strong>, TechSpecs do m<strong>and</strong>ate<br />

connections according to superior-subordinate relationships<br />

thereby maintaining the flow structure at all times. Once the<br />

modeling is complete, the MAS has the capability to analyze<br />

its current performance using the selected type of model.<br />

The MAS also has the flexibility to choose another model<br />

template for a different iteration.<br />
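The timeout-based handling of dynamic configurations can be sketched as below. The timeout value, data shapes and function names are hypothetical; they only illustrate how the Model Builder might prune unresponsive agents while keeping the TechSpecs-mandated superior-subordinate link structure.<br />

```python
# Sketch: drop unresponsive agents from the queueing template
# (timeout value and data shapes are hypothetical illustrations).

TIMEOUT = 30.0  # seconds without a measurement => assume agent is gone

def build_template(last_heard, links, now):
    """last_heard: agent -> timestamp of its last forwarded measurement.
    links: (superior, subordinate) pairs mandated by TechSpecs."""
    alive = {a for a, t in last_heard.items() if now - t <= TIMEOUT}
    # Keep only flows whose endpoints are both believed alive.
    active_links = [(s, d) for s, d in links if s in alive and d in alive]
    return {"nodes": sorted(alive), "links": active_links}

template = build_template(
    last_heard={"Brigade": 100.0, "Battalion": 95.0, "Company1": 40.0},
    links=[("Brigade", "Battalion"), ("Battalion", "Company1")],
    now=110.0)
print(template)  # Company1 has been silent too long and is pruned
```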

4.3 Self-Evaluating Capability<br />

The evaluation capability, the first step in control, allows<br />

the MAS to examine its own performance under a given<br />

set of plausible conditions. This prediction of performance<br />

is used for the elimination of control alternatives that may<br />

lead to instabilities. Our notion of performance evaluation<br />

is similar to [34]. While Tesauro et al. [34] compute the<br />

resource level utility functions (based on the application<br />

manager’s knowledge of the system performance model)<br />

that can be combined to obtain a globally optimal allocation<br />

of resources, we predict the performance of the MAS<br />

as a function of its operating modes in real-time (within<br />

Queueing Model) <strong>and</strong> then use it to calculate its global utility<br />

(some more differences are pointed out in Section 4.4).<br />

By introducing a level of indirection, we may get some<br />

desirable properties because we separate an application’s<br />

domain-specific utility computation from performance prediction<br />

(or analysis). This theoretically enables us to predict<br />

the performance of any application whose TechSpecs are<br />

clearly defined <strong>and</strong> then compute the application-specific<br />

utility. In both cases, control alternatives are picked based<br />

on best-utility. We discuss the notion of control alternatives<br />

in Section 4.4. Also, our performance metrics (<strong>and</strong> hence<br />

utility) are based on service-level attributes such as end-to-end<br />

delay <strong>and</strong> latency, which is a desirable attribute of<br />

autonomic systems [34].<br />

When plan, update <strong>and</strong> control tasks (as mentioned in<br />

Section 3) flow in this heterogeneous network of agents<br />

in predefined routes (called flows), the processing <strong>and</strong> wait<br />

times of tasks at various points in the network are not alike.<br />

This is because the configuration (number of agents allocated<br />

on a node), resource availability (load due to other<br />

contending software) <strong>and</strong> environmental conditions at each<br />

agent are different. In addition, the tasks themselves can be<br />

of varying qualities or fidelities that affect the time taken<br />

to process that task. Under these conditions, performance is


Table 1. Notation<br />

Symbol: Description<br />

N: Total # of nodes in the community<br />

λ ij: Average arrival rate of class j at node i<br />

1/µ ijk: Average processing time of class j at node i at quality k<br />

M: Total number of classes<br />

T i: Routing probability matrix for class i<br />

W ijk: Steady-state waiting time for class j at node i at quality k<br />

Q ij: Set of qualities at which a class j task can be processed at node i<br />

estimated on the basis of the end-to-end delay involved in a<br />

“sense-plan-respond” cycle.<br />

The primary performance prediction tool that we use is the Queueing Network Model (QNM) [5]. The QNM is the representation of the agent community in the queueing domain. As the first step of performance estimation, the agent community needs to be translated into a queueing network model. Table 1 lists the notation used in this section.<br />

Inputs and outputs at a node are regarded as tasks. The rate at which tasks of class j are received at node i is captured by the arrival rate λ_ij. Actions by agents consume time, so they are abstracted as processing rates (µ_ij). Further, each task can be processed at a quality k ∈ Q_ij, so the processing rates are represented as µ_ijk. Statistics of processing times are maintained at each agent in the PDB to arrive at a linear regression model between quality k and µ_ijk. Flows are associated with classes of traffic denoted by the index j. If a connection exists between two nodes, it is converted to a transition probability p_ij, where i is the source and j is the target node. Typically, we consider flows originating from the environment, getting processed and exiting the network, making the agent network an open queueing network [5]. Since multiple flows may pass through a single node, we consider multi-class queueing networks where each flow is associated with a class. Performance metrics such as delays for the "sense-plan-respond" cycle are captured in terms of average waiting times, W_ijk. As mentioned earlier, TechSpecs is a convenient place to embed information such as flows and Q_ij.<br />
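As an illustration of the per-agent statistic kept in the PDB, a plain least-squares fit of observed processing rates against quality levels could look like the sketch below. The sample numbers and the helper name `fit_line` are hypothetical, not from the paper; the point is only that µ_ijk can be extrapolated for untried qualities from logged (quality, rate) pairs.

```python
# Sketch: fit a least-squares line through observed (quality k, rate mu_ijk)
# pairs so processing rates at untried qualities can be estimated.

def fit_line(points):
    """Ordinary least squares for y = a + b*x over (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Hypothetical observations: higher quality -> slower processing rate.
samples = [(1, 8.0), (2, 6.1), (3, 3.9)]
a, b = fit_line(samples)
mu_at_quality_4 = a + b * 4  # extrapolated rate for an untried quality
```

In the framework described above, such a fitted model would be refreshed as new processing-time statistics accumulate at the agent.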

The choice of QNM depends on the number of classes, the arrival distribution and the processing discipline, as well as on a suggestion C by the DMAS controller, which makes this choice based on the history of prior effectiveness. Some analytical approaches to estimating performance can be found in [5, 38]. In the context of agent networks, Jackson and BCMP queueing networks are used to estimate performance in [14]. Extending this work, we support several templates of queueing models (such as BCMP [5], Whitt's QNA [38], Jackson [5], M/G/1 and a simulation) that can be utilized for performance prediction.<br />
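To make the open-network computation concrete, here is a minimal sketch of a Jackson-style end-to-end delay estimate. The three-node "sense-plan-respond" chain, the rates and the routing matrix are illustrative assumptions, and each node is treated as a single-class M/M/1 station for simplicity (the paper's multi-class, multi-quality case is richer than this).

```python
# Sketch: open Jackson network delay estimate for a hypothetical 3-node chain.

def jackson_delays(lam0, P, mu, iters=100):
    """Solve the traffic equations lam_j = lam0_j + sum_i lam_i * P[i][j]
    by fixed-point iteration, then return per-node mean sojourn times
    1/(mu_i - lam_i) for M/M/1 nodes, plus the effective arrival rates."""
    n = len(lam0)
    lam = lam0[:]
    for _ in range(iters):
        lam = [lam0[j] + sum(lam[i] * P[i][j] for i in range(n))
               for j in range(n)]
    return [1.0 / (mu[i] - lam[i]) for i in range(n)], lam

# sense -> plan -> respond: external tasks enter node 0 and exit after node 2.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
lam0 = [2.0, 0.0, 0.0]  # external arrivals (tasks/sec) at the sense node
mu = [5.0, 4.0, 6.0]    # processing rates per node

delays, lam = jackson_delays(lam0, P, mu)
end_to_end = sum(delays)  # expected delay of one sense-plan-respond cycle
```

For a feed-forward routing matrix like this one, the fixed-point iteration converges exactly after a few passes; a general network would warrant a direct linear solve.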

4.4 Self-Controlling Capability<br />

In contrast to [34], we deal with optimizing the domain utility of a single MAS that is distributed, rather than with allocating resources optimally to multiple applications that have a good idea of their utility functions (through policies). As mentioned before, opmodes allow quality of service (task quality and response time) to be traded off against performance. We assume there is a maximum ceiling R on the amount of resources, and that the available resources fluctuate depending on stresses S = S_e + S_a, where S_e are the stresses from the environment (e.g., multiple contending applications, or changes in the infrastructural or physical layers) and S_a are the application stresses (e.g., increased tasks). The DMAS controller receives from the measurement points (MP) a measurement of the actual performance P and a vector of other statistics (relating to X). Also, at the top level, the overall utility U(P, S) = Σ_n w_n x_n is known, where x_n is the actual utility component and w_n is the associated weight specified by the user or another superior agent. We cannot change S, but we can adjust P to obtain better utility. Since P depends on O, a vector of opmodes collected from the community, we can use the QNM to find the O* (and hence P*) that maximizes U(P, S) for a given S from within the set OS. In words, we find the vector of opmodes O* that maximizes domain utility at the current S and opmodes O. This computation is performed in the Utility Calculator module using a learned utility model based on the UDB.<br />
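A toy version of this best-utility selection loop might look as follows. The stand-in delay model, the weights and the three quality levels are all hypothetical: the real framework predicts performance with a QNM and a learned utility model, which is replaced here by a simple analytic stand-in.

```python
# Sketch: enumerate candidate opmode vectors O', predict performance with a
# stand-in model, and keep the candidate that maximizes U = sum(w_n * x_n).

from itertools import product

def predict_delay(opmodes, stress):
    # Stand-in for the QNM prediction: higher-quality opmodes cost more
    # processing time, and stress inflates the delay.
    return sum(0.1 * q * q for q in opmodes) * (1.0 + stress)

def utility(opmodes, stress, weights=(1.0, 0.5)):
    # U = w_0 * (solution quality) - w_1 * (predicted delay)
    quality = sum(opmodes) / len(opmodes)
    return weights[0] * quality - weights[1] * predict_delay(opmodes, stress)

def best_opmodes(stress, levels=(1, 2, 3), n_agents=3):
    # Exhaustive search over candidate opmode vectors O'.
    return max(product(levels, repeat=n_agents),
               key=lambda O: utility(O, stress))

low = best_opmodes(0.25)   # under light stress, higher quality pays off
high = best_opmodes(0.75)  # under heavy stress, quality is traded away
```

With these made-up coefficients the search backs off from the highest quality level as stress grows, which mirrors the quality-versus-response-time trade-off described above; a deployed controller would restrict the search to the feasible set OS rather than enumerate everything.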

In addition to differences pointed out thus far, here are<br />

some more differences between this work <strong>and</strong> [34]:<br />

• Tesauro et al. [34] assume that the application manager in Unity has a model of system performance, which we do not assume. Although they allude to a modeler module, they do not explain the details of their performance model. We use a queueing network model constructed in real time to estimate the performance for any set of opmodes O′ by taking the current opmodes O and scaling them appropriately, based on observed histories (X) to X′, in the Control Set Evaluator.<br />

• Because of the interactions involved and the complexity of performance modeling [19, 27], it may be time-consuming to utilize statistical inference and learning mechanisms in real time. This is why we use an analytical queueing network model to estimate performance quickly.


[Figure 5. Results Overview: utility (from -1000 to 1000) versus stress S (from 0.2 to 0.8) for the Default Policy and the Controlled framework.]<br />

• Another difference is that [34] assumes operating-system support for tuning parameters such as buffer sizes and operating-system settings, which may not be available in many MAS-based situations because of mobility, security and real-time constraints. Besides, in addition to estimating performance, the queueing model may have the capability to eliminate instabilities in a queueing sense, which is not apparent in the other approach.<br />

• But most importantly, their work reflects a two-level hierarchy in which the resource manager mediates several application environments to obtain maximum utility for the data center, whereas our work takes the perspective of a single, self-optimizing application that tries to be survivable by maximizing its own utility.<br />

In spite of these differences, it is interesting to see that the self-controlling capability can be achieved, with or without explicit layering, in real-world applications.<br />

5. Empirical Evaluation on CPE Test-bed<br />

The aforementioned framework was implemented within CPE, which we use as a test-bed for our experimentation. The main goal of this experimentation was to examine whether application-level adaptivity led to any utility gains in the long run. The superior agents in CPE continuously plan maneuvers for their subordinates, which are executed by the lower-rung nodes. We subjected the entire distributed community to random stresses by simulating enemy clusters of varying sizes and arrival rates. These stresses translated into the need to perform the distributed "sense-plan-respond" cycle more frequently, causing increased load and traffic in the network of agents. The stresses were created by a world agent whose main purpose was to simulate warlike dynamics within our test-bed.<br />

The CPE prototype consists of 14 agents spread across a physical layer of 6 CPUs. We used the prototype CPE framework to run 36 experiments at two stress levels (S = 0.25 and S = 0.75). There were three layers of hierarchy, as shown in Figure 2a, with three-way branching at each level and one supply node. The community's utility function was based on the achievement of real goals in military engagements, such as terminating or damaging the enemy, and on reducing the penalty involved in consuming resources such as fuel or sustaining damage. To keep our queueing models simple, we assumed that external arrivals were Poisson while service times were generally distributed; under this assumption, a BCMP or M/G/1 queueing model could be selected by the framework for real-time performance estimation. To cater to general arrival processes, the framework also contains a QNA-based and a simulation-based model. The baseline for comparison was the do-nothing policy (default), in which we let the Cougaar infrastructure manage conditions of high load. Although our framework did better than any fixed set of opmodes at the two stress levels, as shown in Figure 5, we show instantaneous and cumulative utility for two opmodes (Default A, B) in particular in Figure 6. We noticed that in the long run the framework enhanced the utility of the application compared to the default policy.<br />

At both stress levels, the controlled scenario performed better than the default, as shown in Figure 6. We did observe oscillations in the instantaneous utility, which we attribute to the imprecision of the stress predictions. Stresses vary relatively fast, on the order of seconds, while the control granularity was on the order of minutes. Since this is a military engagement situation with no regular stress patterns, the higher-stress case is hard to cope with. In contrast to MAS applications in data centers, where load can be attributed to time-of-day and other seasonal effects, accurate load predictions are not possible for MAS applications simulating wartime loads. We think this could be why our utility falls in the latter case. In subsequent work, we intend to enhance Cougaar's capability to support the application layer by having it guarantee some end-to-end delay requirements.<br />

6. Conclusions <strong>and</strong> Future Work<br />

In this paper, we were able to successfully control a real-time DMAS to achieve better overall utility in the long run, thus making the application survivable. Utility improvements were made through application-level trade-offs between quality of service and performance. We utilized a queueing-network-based framework for performance analysis and subsequently used a learned utility model for computing the overall benefit to the DMAS (i.e., the community). While Tesauro et al. [34] employ a resource arbiter to maximize the combined utility of several application environments in a data center scenario, we focus on using queueing


[Figure 6. Sample Results: (a) instantaneous utility (stress 0.25), (b) cumulative utility (stress 0.25), (c) instantaneous utility (stress 0.75), (d) cumulative utility (stress 0.75); each panel plots utility against time (sec.) for the Controlled, Default A and Default B policies.]<br />

theory to maximize the utility from the performance of a single distributed application, given that it has been allocated some resources. We think that the approaches are complementary, with this study providing empirical evidence to support the observation by Jennings and Wooldridge in [18] that agents can be used to optimize distributed application environments, including themselves, through flexible high-level (i.e., application-level) interactions.<br />

Furthermore, this work has yielded a general architectural lesson. We believe that any distributed application has flows of traffic and requires service-level attributes, such as response times, utilization or component delays, to be optimized. The paradigm we have chosen can capture such quantities and help evaluate choices that may lead to better application utility. This concept of breaking the application into flows and allowing a real-time model-based predictor to steer the system into regions of higher utility is quite general.<br />

While we continue the empirical evaluation, we keep the building blocks small to ensure scalability and to reduce interactions. We utilize TechSpecs to distribute knowledge and meta-data, thus re-emphasizing the separation principle. Subsequently, we hope to broaden the layered control approach to encompass infrastructure-level control within the framework. Another avenue for improvement is to design self-protecting mechanisms so that the security aspect of the framework is reinforced.<br />

Acknowledgements<br />

This work was performed under DARPA UltraLog Grant # MDA 972-01-1-0038. The authors wish to acknowledge DARPA for its generous support.<br />

References<br />

[1] Cougaar open source site. http://www.cougaar.org.<br />

<strong>DARPA</strong>.<br />

[2] Ultralog program site. http://dtsn.darpa.mil/ixo/.<br />

<strong>DARPA</strong>.<br />

[3] A. G. Barto, S. J. Bradtke, <strong>and</strong> S. Singh. Learning to<br />

act using real-time dynamic programming. Artificial<br />

Intelligence, 72:81–138, 1995.<br />

[4] M. N. Bennani <strong>and</strong> D. A. Menasce. Assessing the<br />

robustness of self-managing computer systems under<br />

highly variable workloads. International Conference<br />

on Autonomic Computing, 2004.<br />

[5] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley and Sons, Inc., 1998.<br />

[6] J. Bredin, D. Kotz, <strong>and</strong> D. Rus. Market-based resource<br />

control for mobile agents. Autonomous Agents, 1998.<br />

[7] M. Brinn and M. Greaves. Leveraging agent properties to assure survivability of distributed multi-agent systems. Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems, 2003.<br />

[8] T. Chao, F. Shan, <strong>and</strong> S. X. Yang. Modeling <strong>and</strong> design<br />

monitor using layered control architecture. Autonomous<br />

Agents <strong>and</strong> Multi-Agent Systems, 2002.<br />

[9] A. Chavez, A. Moukas, and P. Maes. Challenger: A multi-agent system for distributed resource allocation. Agents, 1997.<br />

[10] L. Chen, K. Bechkoum, <strong>and</strong> G. Clapworthy. A logical<br />

approach to high-level agent control. Agents, 2001.<br />

[11] A. E. Fallah-Seghrouchni, I. Degirmenciyan-Cartault,<br />

<strong>and</strong> F. Marc. Modelling, control <strong>and</strong> validation of<br />

multi-agent plans in dynamic context. Autonomous<br />

Agents <strong>and</strong> Multi-Agent Systems, 2004.<br />

[12] D. Ferguson, Y. Yemini, <strong>and</strong> C. Nikolaou. Microeconomic<br />

algorithms for load balancing in distributed<br />

computer systems. Proceedings of the International<br />

Conference on Distributed Systems, 1988.<br />

[13] I. Foster, N. R. Jennings, <strong>and</strong> C. Kesselman. Brain<br />

meets brawn: Why grid <strong>and</strong> agents need each other.<br />

Autonomous Agents <strong>and</strong> Multi-Agent Systems, 2004.<br />

[14] N. Gnanasambandam, S. Lee, N. Gautam, S. R. T. Kumara, W. Peng, V. Manikonda, M. Brinn, and M. Greaves. Reliable MAS performance prediction using queueing models. IEEE Multi-Agent Security and Survivability Symposium, 2004.<br />

[15] A. Helsinger, K. Kleinmann, <strong>and</strong> M. Brinn. A framework<br />

to control emergent survivability of multi agent<br />

systems. Autonomous Agents <strong>and</strong> Multi-Agent Systems,<br />

2004.<br />

[16] A. Helsinger, R. Lazarus, W. Wright, <strong>and</strong> J. Zinky.<br />

Tools <strong>and</strong> techniques for performance measurement<br />

of large distributed multi-agent systems. Autonomous<br />

Agents <strong>and</strong> Multi-Agent Systems, 2003.<br />

[17] Y. Hong <strong>and</strong> S. R. T. Kumara. Coordinating control<br />

decisions of software agents for adaptation to dynamic<br />

environments. Working Paper, Dept. of IME, Pennsylvania<br />

State University, University Park, PA, 2004.<br />

[18] N. R. Jennings <strong>and</strong> M. Wooldridge. H<strong>and</strong>book of<br />

Agent Technology, chapter Agent-Oriented Software<br />

Engineering. AAAI/MIT Press, 2000.<br />

[19] H. Jung <strong>and</strong> M. Tambe. Performance models for large<br />

scale multi-agent systems: Using distributed pomdp<br />

building blocks. Proceedings of the Second Joint Conference<br />

on Autonomous Agents <strong>and</strong> Multi-Agent Systems,<br />

July 2003.<br />

[20] L. P. Kaelbling, M. L. Littman, <strong>and</strong> A. Moore. Reinforcement<br />

learning: A survey. Journal of Artificial<br />

Intelligence Research, 4:237–285, 1996.<br />

[21] K. Kleinmann, R. Lazarus, <strong>and</strong> R. Tomlinson. An infrastructure<br />

for adaptive control of multi-agent systems.<br />

IEEE Conference on Knowledge-Intensive<br />

Multi-Agent Systems, 2003.<br />

[22] M. M. Kokar, K. Baclawski, <strong>and</strong> Y. A. Eracar. Control<br />

theory-based foundations of self-controlling software.<br />

IEEE Intelligent Systems, pages 37–45, May/June<br />

1999.<br />

[23] K. C. Lee, W. H. Mansfield, <strong>and</strong> A. P. Sheth. A framework<br />

for controlling cooperative agents. IEEE Computer,<br />

1993.<br />

[24] T. W. Malone, R. Fikes, K. R. Grant, and M. T. Howard. Enterprise: A Market-like Task Scheduler for Distributed Computing Environments. Elsevier, Holland, 1988.<br />

[25] M. Morari <strong>and</strong> J. H. Lee. Model predictive control:<br />

past, present <strong>and</strong> future. Computers <strong>and</strong> Chemical Engineering,<br />

23(4):667–682, 1999.<br />

[26] A. Raja, V. Lesser, <strong>and</strong> T. Wagner. Toward robust<br />

agent control in open environments. Agents, 2000.<br />

[27] O. F. Rana and K. Stout. What is scalability in multi-agent systems? Proceedings of the Fourth International Conference on Autonomous Agents, 2000.<br />

[28] J. B. Rawlings. Tutorial overview of model predictive<br />

control. IEEE Control Systems, 20(3):38–52, 2000.<br />

[29] R. Sanz <strong>and</strong> K.-E. Arzen. Trends in software <strong>and</strong> control.<br />

IEEE Control Systems Magazine, June 2003.<br />

[30] F. Sheikh, J. Rolia, P. Garg, S. Frolund, <strong>and</strong> A. Shepard.<br />

Performance evaluation of a large scale distributed<br />

application design. World Congress on Systems<br />

Simulation, 1997.<br />

[31] I. Soto, M. Garijo, C. A. Iglesias, and M. Ramos. An agent architecture to fulfill real-time requirements. Agents, 2000.<br />

[32] P. Stone <strong>and</strong> M. Veloso. Using decision tree confidence<br />

factors for multi-agent control. Autonomous<br />

Agents, 1998.<br />

[33] R. S. Sutton, A. G. Barto, <strong>and</strong> R. J. Williams. Reinforcement<br />

learning is direct adaptive optimal control.<br />

IEEE Control Systems, 12(2):19–22, 1992.<br />

[34] G. Tesauro, D. M. Chess, W. E. Walsh, R. Das,<br />

I. Whalley, J. O. Kephart, <strong>and</strong> S. R. White. A multiagent<br />

systems approach to autonomic computing. Autonomous<br />

Agents <strong>and</strong> Multi-Agent Systems, 2004.<br />

[35] H. P. Thadakamalla, U. N. Raghavan, S. R. T. Kumara,<br />

<strong>and</strong> R. Albert. Survivability of multi-agent supply<br />

networks: A topological perspective. IEEE Intelligent<br />

Systems: Dependable Agent Systems, 19(5):24–<br />

31, September/October 2004.<br />

[36] R. Vincent, B. Horling, V. Lesser, <strong>and</strong> T. Wagner. Implementing<br />

soft real-time agent control. Agents, 2001.<br />

[37] T. Vu, J. Go, G. Kaminka, M. Veloso, and B. Browning. MONAD: A flexible architecture for multi-agent control. Autonomous Agents and Multi-Agent Systems, 2003.<br />

[38] W. Whitt. The queueing network analyzer. The Bell<br />

System Technical Journal, 62(9):2779–2815, 1983.


Proceedings of the 1st Open Cougaar Conference 1<br />

Survivability through Implementation Alternatives<br />

in Large-scale Information Networks with Finite Load<br />

Seokcheon Lee <strong>and</strong> Soundar Kumara<br />

Department of <strong>Industrial</strong> <strong>and</strong> <strong>Manufacturing</strong> Engineering<br />

The Pennsylvania State University<br />

University Park, PA 16802<br />

{stonesky, skumara}@psu.edu<br />

Abstract<br />

We study a large-scale information network, which is<br />

composed of distributed software components linked with<br />

each other through a task flow structure. The service<br />

provided by the network is to produce a global solution to<br />

a given problem, which is an aggregate solution of<br />

partial solutions from processing tasks. Quality of Service<br />

of this network is determined by the value of the global<br />

solution <strong>and</strong> time for generating the global solution.<br />

Survivability of the network is the capability to provide<br />

high Quality of Service by utilizing implementation<br />

alternatives as control actions, in the presence of<br />

accidental failures <strong>and</strong> malicious attacks. In this paper<br />

we develop an adaptive control mechanism to support<br />

survivability. We stress two desirable properties in<br />

designing the mechanism: scalability <strong>and</strong> predictability.<br />

To address adaptivity, we model the stress environment indirectly by quantifying the resource availability of the system. We build a mathematical programming model incorporating this resource availability, which predicts Quality of Service as a function of control actions. By periodically solving the programming model and taking optimal control actions based on recent resource availability, the system can adapt predictably to the changing stress environment. But as the programming model becomes large-scale and complex, we agentify the components of the network from a control point of view so that the system can solve the large-scale programming model in a decentralized mode. We provide an auction-based market as the decentralized coordination mechanism.<br />

1. Introduction<br />

Critical infrastructures are becoming increasingly dependent on networked systems in many domains for automation or organizational integration. Though such infrastructure can improve efficiency and effectiveness, these systems can easily be exposed to various adverse events such as accidental failures and malicious attacks [1]. Two metrics, namely survivability and scalability, can be used to characterize these systems. Survivability is defined as “the capability of a<br />

system to fulfill its mission, in a timely manner, in the<br />

presence of attacks, failure, or accidents” [2]. One<br />

promising way to achieve survivability is through<br />

adaptivity: changing the system behavior to achieve the<br />

system goal in response to the changing environment [3].<br />

One important consideration of an adaptation is<br />

predictability. Unpredictable adaptation can sometimes<br />

result in worse performance than without adaptation [4].<br />

Scalability is defined as: “the ability of a solution to some<br />

problem to work when the size of the problem increases”<br />

(From Dictionary of Computing at<br />

http://wombat.doc.ic.ac.uk). As the size of networked systems grows, scalability becomes a critical issue when developing practical software systems [5].<br />

As software systems have grown larger and more complex, component technology has become one of the foremost topics in the computing community [6][7]. A component is a<br />

reusable program element, with which developers can<br />

build the systems needed by simply wiring all the<br />

components together. To support flexible usage of the<br />

components in various forms, the components must be<br />

independent, self-contained, <strong>and</strong> highly specialized. In<br />

component-based software systems, components interact<br />

with each other through a task flow structure with each<br />

component specialized for specific tasks.<br />

We study a large-scale information network, which is<br />

composed of distributed software components linked with<br />

each other through a task flow structure. A problem given<br />

to the network is decomposed into a set of tasks for some<br />

of software components <strong>and</strong> those tasks are propagated<br />

through the task flow structure. The service provided by<br />

the network is to produce a global solution to the given<br />

problem, which is an aggregate solution of partial<br />

solutions from processing tasks. Each component can<br />

process a task using one of available implementation<br />

alternatives, which trade off processing time <strong>and</strong> value of<br />

partial solution. Quality of Service (QoS) of this network<br />

is determined by the value of the global solution <strong>and</strong> time<br />

for generating the global solution. Survivability of the<br />

network is the capability to provide high QoS in the<br />

presence of accidental failures <strong>and</strong> malicious attacks. A



promising approach for dealing with such large-scale systems is multiagent systems (MAS); we therefore agentify the components from a purely control point of view. In MAS, agents address the scalability issue by computing solutions locally and then using this information in a social way. In this paper we develop a multiagent-based adaptive control mechanism with scalability and predictability to support the survivability of large-scale networks.<br />

Specifically, in Section 2 we discuss the problem domain, and in Section 3 we formally define the problem in detail. We review previous control approaches in Section 4. We design an adaptive control mechanism in Section 5 and show empirical results in Section 6. Finally, we discuss implications and possible extensions of our work in Section 7.<br />

2. Problem domain<br />

The networks we study in this paper represent<br />

distributed <strong>and</strong> component-based architectures. As an<br />

instance, Cougaar (Cognitive Agent Architecture:<br />

http://www.cougaar.org) developed by <strong>DARPA</strong> (Defense<br />

Advanced Research Project Agency), follows such an<br />

architecture for building large-scale multiagent systems.<br />

Recently, there have been efforts to combine the<br />

technologies of agents <strong>and</strong> components to improve the<br />

way of building large-scale software systems [8][9][10].<br />

While component technology focuses on reusability,<br />

agent technology focuses on processing complex tasks as<br />

a community. Cougaar is in line with this trend. In Cougaar, a software system is composed of agents, and an agent is composed of components (called plugins). The task flow structure in such systems is that of the components, as a combination of intra-agent and inter-agent task flows. Since the agents in Cougaar can be distributed in both the geographical and the information-content sense, networks implemented in Cougaar have a distributed, component-based architecture.<br />

UltraLog (http://www.ultralog.net) networks are<br />

military supply chain planning systems implemented in<br />

Cougaar. Agents in those networks represent<br />

organizations in military supply chains. The objective of an UltraLog network is to provide an appropriate logistics plan for a military operational plan. The system produces a logistics plan by decomposing the operational plan into logistics tasks and processing them through a task flow structure. The system performs initial planning for a given operation and continuous replanning in execution mode to cope with logistics-plan deviations or operational-plan changes. As the scale of the operation increases, there can be thousands of agents working together to generate a logistics plan.<br />

Initial planning or replanning generates a logistics plan<br />

as a global solution, which is an aggregate of individual<br />

schedules built by plugins through their task flow<br />

structure. Each plugin can implement one of its available<br />

implementation alternatives which trade off processing<br />

time <strong>and</strong> quality of the schedule. Quality of service is<br />

determined by two metrics, quality of logistics plan <strong>and</strong><br />

plan completion time. These two metrics directly affect<br />

the performance of the operation.<br />

Planning and replanning of UltraLog networks are instances of the research problem considered here. An UltraLog network cannot work in isolation from the outside world, because it utilizes external databases and users must be able to access the system. This inevitable connection to the outside exposes the system to malicious attacks in addition to accidental failures. The question, then, is: how can we make this system survivable, so that it generates high-quality logistics plans in a timely manner in the presence of accidental failures and malicious attacks?<br />

3. Problem specification

In this section we formally define the problem by detailing the network model. We concentrate on computational (CPU) resources, assuming that the system is computation-bounded.

3.1. Network model

We define four elements of the network to clarify its mechanics: network configuration, implementation alternatives, quality of service, and stress environment.

Network configuration

A network is composed of a set of agents A, each agent located on its own machine. The task flow structure of the network, which defines the precedence relationships between agents, is a directed acyclic graph with a positive real number assigned to each link. A link number l_ij (i ≠ j) indicates the number of tasks generated for successor agent j when agent i processes a task from its queue. Once the accumulated task count for a successor agent exceeds one, the corresponding integer number of tasks is sent to that agent. Using real numbers lets us represent a wide range of task flow structures, including non-integer aggregation and expansion.

A problem given to a network is decomposed into root tasks for some agents, and those tasks are propagated through the task flow structure.
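The accumulate-and-dispatch rule above can be sketched as follows (an illustrative Python sketch; the `propagate` helper and its data layout are our own, not part of the Cougaar/UltraLog implementation):

```python
from collections import defaultdict, deque

def propagate(root_tasks, links):
    """Propagate root tasks through an acyclic task flow structure.

    root_tasks: {agent: integer number of root tasks}
    links: {(i, j): l_ij}, the (possibly fractional) number of tasks
           generated for successor j per task processed by agent i.
    Returns the total number of tasks processed by each agent.
    """
    processed = defaultdict(int)       # tasks processed, per agent
    pending = defaultdict(float)       # fractional accumulators, per agent
    queue = deque(root_tasks.items())  # (agent, task count) work items
    while queue:
        i, n = queue.popleft()
        processed[i] += n
        for (src, j), l_ij in links.items():
            if src != i:
                continue
            pending[j] += n * l_ij
            whole = int(pending[j])    # dispatch once accumulation >= 1
            if whole:
                pending[j] -= whole
                queue.append((j, whole))
    return dict(processed)
```

With l_ab = 0.5, three tasks at `a` dispatch only one whole task to `b`, leaving 0.5 accumulated, which matches the "over one" dispatch rule in the text. Termination is guaranteed because the graph is acyclic.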

Implementation alternatives

An agent can have multiple implementation alternatives for processing a task. Different alternatives trade off CPU time against solution value, with more CPU time yielding a higher solution value.

Proceedings of the 1st Open Cougaar Conference 3

Because optimal mixed alternatives can be found, an agent has a monotonically increasing convex function, the value function, giving CPU time as a function of value. We call the value in this function the value mode, which the agent selects as its decision variable. A value function is defined by three components:

⟨ f_i(v_i), v_i(min), v_i(max) ⟩

This says that agent i's expected CPU time to process a task is f_i(v_i) under value mode v_i, where v_i(min) ≤ v_i ≤ v_i(max).

Quality of service

A problem given to the network is decomposed into root tasks for some agents, and those tasks are propagated through the task flow structure. The service provided by the network is to produce a global solution to the given problem, an aggregate of the partial solutions obtained from processing tasks. The QoS of the network is determined by the value of the global solution and the cost of the completion time needed to generate it. The value of the global solution is the sum of the partial solution values, and the cost of completion time is given by a cost function CCT(T), which is a monotonically

increasing function of the completion time T. Let v_i^d denote the value mode used by agent i to process its d-th task, and e_i the number of tasks agent i processes to completion. Then QoS can be calculated as:

QoS = Σ_{i∈A} Σ_{d=1}^{e_i} v_i^d − CCT(T)

Stress environment
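Given the per-task value modes and a cost function, the QoS formula can be evaluated directly (an illustrative helper; the names and the linear cost are our own, not from the paper):

```python
def qos(task_values, cct, T):
    """QoS = sum over agents i and completed tasks d of v_i^d, minus CCT(T).

    task_values: {agent: [value mode used for each completed task]}
    cct: completion-time cost function CCT(T)
    """
    total_value = sum(sum(values) for values in task_values.values())
    return total_value - cct(T)
```

For example, two agents contributing task values 2 + 3 and 1, with a linear cost CCT(T) = 4T and T = 1, yield QoS = 6 − 4 = 2.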

Survivability stresses, such as accidental failures and malicious attacks, affect the system by consuming resources, either directly or indirectly through activating the defense mechanisms deployed as remedies against them. For example, a denial-of-service attack consumes resources directly, while the relevant defense mechanism also consumes resources for resistance, recognition, and recovery [1]. From the viewpoint of the agents in the network, we treat both the survivability stresses and their remedies as the stress environment.

The stress environment space is high-dimensional and evolving [11][12]. But since we concentrate on computational CPU resources, a stress environment can be regarded as a set of threads residing on the machines of the network and sharing resources with the agents. These threads, the stressors, may hold priorities or weights for resource allocation under admission control, or may steal resources without admission.

3.2. Problem definition

In this paper we develop an adaptive control mechanism with scalability and predictability to support the survivability of large-scale networks. The system must adapt to the changing stress environment to provide high QoS, using the implementation alternatives (v) as its control variables:

arg max_v QoS

We discuss several characteristics of the problem that help in understanding it and in developing an appropriate control mechanism:

• Large-scale network: The network can be large-scale, as the number of agents and nodes grows with the scale of the problem given to the network.
• Finite time horizon: The time horizon for the network to generate a global solution is finite.
• Indecomposable QoS: QoS cannot be decomposed into individual elements' performance, because one of the two conflicting QoS elements is the completion time, which is common throughout the network.
• Complex dynamics: Agents interact with each other through task flow and with stressors through shared resources. Because these interactions run in parallel with control actions, the dynamics of the system are intrinsically complex, especially in large-scale networks.
• Non-availability of statistics: Statistics such as arrival rates and service rates are neither fixed nor given; they change as the system evolves, and the stress environment changes as well.

4. Control approaches in dynamic systems

Dynamic systems are generally controlled with either centralized or decentralized approaches.

4.1. Centralized approaches

There are three centralized control approaches: dynamic programming (DP), reinforcement learning (RL), and model predictive control (MPC). DP solves the optimality equation to produce reactive strategies in the form of an optimal closed-loop control policy, a rule specifying the optimal action as a function of state and time [13]. It assumes that the structure of the dynamic model is fixed and that the model parameters are known in advance. DP gives the absolutely optimal policy, but the complexity of solving the optimality equation grows exponentially with the dimension of the state space. RL is an adaptive version of DP that develops a policy in real time when the model parameters are unknown [14][15]. It takes longer to converge than DP, the cost of performing exploration in addition to exploitation.

In MPC, for each current state an optimal open-loop control policy is designed over a finite time horizon by solving a static mathematical programming model based on an explicit process model [13][16][17][18][19]. The design process is repeated at the next observed state, so the feedback forms a closed-loop policy reactive to each current system state. Although MPC does not give the absolutely optimal policy in a stochastic environment, it is easy to adapt to new contexts by explicitly handling the objective function or constraints. However, it requires effort to develop process models and has scalability problems.

4.2. Decentralized approaches

There are three decentralized control approaches: market-based, insect-behavioral, and learning-based. Market-based control works through the interaction of local agents in the same way as economic markets [20]. Agents trade with one another using a relatively simple mechanism, yet desirable global objectives can often be realized; these approaches have been applied to distributed processor allocation problems. Insect-behavioral approaches are inspired by the effective and adaptive behavior of social insect colonies such as ants, bees, wasps, and termites [21]. An important and interesting behavior of ant colonies is their foraging, in particular how ants find the shortest paths between food sources and their nest. Algorithms based on foraging behavior have been applied to routing problems in communication networks and on the shop floor. Similarly, wasp algorithms have been proposed, inspired by wasps' task allocation behavior, and applied to shop-floor routing problems. Reinforcement learning can be used without prior knowledge of the system model; by having agents learn from their own experience, the method can operate in decentralized mode, and such approaches have been applied to routing problems in communication networks [22].

5. Control mechanism

DP and RL are inefficient in terms of scalability and agility, both of which are important considerations in our problem. In addition, the dynamic model in our problem is not fixed and is only partially known, owing to the unpredictable stress environment. Decentralized approaches are scalable and robust, but they lack agility and optimality. We choose an MPC-style approach for its benefits with respect to complexity, optimality, and agility; however, we must overcome its scalability problem.

5.1. Overall control procedure

As discussed, we develop an adaptive control mechanism that provides high QoS under a changing stress environment while ensuring scalability and predictability. To achieve adaptivity we model the stress environment indirectly, quantifying the resource availability of the system through sensors. We build a mathematical programming model that incorporates this resource availability and predicts QoS as a function of control actions. By periodically solving the programming model and taking the optimal control actions given recent resource availability, the system adapts predictably to the changing stress environment. But because the programming model can be large-scale, we provide a decentralized coordination mechanism that solves it in a decentralized mode.

5.2. Sensors

We provide two types of sensors, a load sensor and a resource sensor, located in each agent; they measure the statistics that form the coefficients of the mathematical programming model.

Load sensor

A load sensor measures the future load L_i of agent i, the number of tasks it will process in the future. Initially, each agent identifies its future load by combining its own root tasks with the tasks expected from its predecessor agents. After this initialization, agents update their future loads by counting down as they process tasks.
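A minimal sketch of this countdown bookkeeping (the `LoadSensor` class name and interface are illustrative assumptions, not the UltraLog API):

```python
class LoadSensor:
    """Future-load sensor: the load starts as the agent's own root tasks
    plus the tasks expected from predecessors, then counts down as the
    agent completes tasks."""

    def __init__(self, root_tasks, expected_inflow):
        # L_i = own root tasks + expected arrivals from predecessors
        self.load = root_tasks + expected_inflow

    def task_done(self):
        self.load -= 1

    def read(self):
        return self.load
```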

Resource sensor

A resource sensor measures resource availability, defined as the fraction of the resource available when an agent requests it. Within a given time window we define two measurements for this statistic: request time and execution time. Request time is the duration for which an agent requests the resource, i.e., the duration for which its queue length (including the task in service) is greater than zero. Execution time is the duration for which the agent actually utilizes the resource. Agent i's resource availability between two subsequent control points (k−1, k) is calculated as:

RA_i^(k−1, k) = execution time in (k−1, k) / request time in (k−1, k)
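The windowed request/execution bookkeeping can be sketched as follows (class name and interface are illustrative; returning RA = 1.0 for an empty window is an assumption of this sketch, not specified in the paper):

```python
class ResourceSensor:
    """Per-agent resource-availability sensor over one control window.

    Call request(dt) for time spent with a non-empty queue, execute(dt)
    for time actually spent on the CPU; read() returns
    RA = execution time / request time and resets the window.
    """

    def __init__(self):
        self.request_time = 0.0
        self.execution_time = 0.0

    def request(self, dt):
        self.request_time += dt

    def execute(self, dt):
        self.execution_time += dt

    def read(self):
        ra = (self.execution_time / self.request_time
              if self.request_time > 0 else 1.0)  # assumed default
        self.request_time = self.execution_time = 0.0
        return ra
```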


Proceedings of the 1st Open Cougaar Conference 5<br />

5.3. Mathematical programming model

Agents can estimate their future resource availability from the resource availability observed in the past. An agent i estimates its future resource availability RA_i^f using the resource availability observed in the last control period. The service time to process a task can then be predicted directly as a function of the value mode by incorporating this estimate:

f_i(v_i) / RA_i^f

Based on this we build a mathematical programming model. Let T be the completion time and t the current time. An agent's optimal mode is a pure mode, common to all its tasks, because of the convexity of the value function. When agents use pure modes such that their total service times are at most T − t, each agent can complete its tasks by approximately T, because in the worst case tasks arrive at the constant rate L_i / (T − t). In other words, the completion time is dominated by the bottleneck agents, those with the maximal total service time for their future loads:

T − t ≈ Max_{i∈A} [ L_i · f_i(v_i) / RA_i^f ]

So, given a completion time T, each agent can select the maximal mode whose total service time is at most T. That is, it is optimal for each agent to select a mode maximizing

L_i · v_i

subject to

L_i · f_i(v_i) / RA_i^f ≤ T − t.

Using this optimality condition we can formulate the control problem as a mathematical programming model that maximizes QoS by trading off the value of the solution against the cost of completion time. Select the v_i's and T satisfying:

Max  Σ_{i∈A} L_i · v_i − CCT(T)
s.t. L_i · f_i(v_i) / RA_i^f ≤ T − t    for all i ∈ A
     v_i(min) ≤ v_i ≤ v_i(max)          for all i ∈ A
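Because for a fixed T each agent's best mode has a closed form (the largest mode whose total service time fits in T − t), the program can be solved by a one-dimensional search over T. The sketch below is illustrative, not the authors' solver; the dict keys, the invertible value function `f_inv`, the linear cost CCT(T) = c·T, and the grid search are all assumptions:

```python
def optimize(agents, cct_slope, t, T_grid):
    """Solve the QoS program by searching over completion time T.

    agents: list of dicts with keys L (future load), RA (forecast
    resource availability RA_i^f), vmin, vmax, and f_inv, the inverse
    of the value function (CPU time -> value mode).
    Returns (best QoS, best T).
    """
    best = (float("-inf"), None)
    for T in T_grid:
        total = -cct_slope * T           # CCT(T) = c * T, an assumption
        feasible = True
        for a in agents:
            # largest mode whose total service time fits in T - t
            v = min(a["vmax"], a["f_inv"]((T - t) * a["RA"] / a["L"]))
            if v < a["vmin"]:
                feasible = False         # even the minimal mode overruns T
                break
            total += a["L"] * v
        if feasible and total > best[0]:
            best = (total, T)
    return best
```

With one agent (L = 10, RA = 1, f(v) = v, modes in [1, 5]) and c = 0.5, the trade-off peaks where the agent just reaches its maximal mode.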

5.4. Decentralization

The next question is how to decentralize the mathematical programming model for scalability and robustness. Beyond these properties, decentralization yields a byproduct: information security. As discussed earlier, our effort is to support survivability, and if information is revealed directly to others the system is not survivable from the viewpoint of information security. Decentralization therefore also helps survivability with respect to information security.

One branch of distributed control approaches decentralizes structured mathematical programming models; two popular methods in this branch are decomposition methods and auction/bidding algorithms. We decentralize the mathematical programming model through a non-iterative auction mechanism, the so-called multiple-unit auction with variable supply [23], in which a seller may be able and willing to adjust the supply as a function of the bidding. In the programming model we have built, all the agents are coupled with one another, but the model has a special structure: the objective function and constraints become separable once the single variable T is fixed. This characteristic makes it possible to convert the model into an auction in which the completion time T is an unbounded resource whose supply can be adjusted as a function of the bidding.

In the designed auction, agents bid for T and the seller decides T* from the bids by maximizing its utility net of cost, while supplying enough that the minimum requirements of the agents are fulfilled. After the seller broadcasts T*, agents select their optimal value modes by maximizing their own utility.

Agents' bids:

⟨ b_i(T), T_i(min) ⟩

b_i(T) = L_i · f_i^{−1}( (T − t) · RA_i^f / L_i )   if T ≤ L_i · f_i(v_i(max)) / RA_i^f + t
b_i(T) = L_i · v_i(max)                             otherwise

T_i(min) = L_i · f_i(v_i(min)) / RA_i^f + t

Seller's decision problem:

Max_T  Σ_{i∈A} b_i(T) − CCT(T)
s.t.   T ≥ Max_{i∈A}( T_i(min) )

Agents' decision:

v_i* = f_i^{−1}( (T* − t) · RA_i^f / L_i )   if T* ≤ L_i · f_i(v_i(max)) / RA_i^f + t
v_i* = v_i(max)                              otherwise
+ t


Proceedings of the 1st Open Cougaar Conference 6<br />

The auction mechanism just described, though a decentralized coordination scheme, still incorporates a centralized seller. Since a centralized auction can still exhibit scalability and robustness problems, we introduce a hierarchical auction mechanism. Suppose T_a* is the optimal completion time of agent group a and T_b* that of agent group b. If a ⊂ b, then:

T_a* ≤ T_b*

Using this property we can convert the auction mechanism into a hierarchical one, with multiple auction markets structured hierarchically. Each auction solves its decision problem based on the bids from its agents or subordinate auctions, and makes a bid to its superior auction with a T greater than or equal to its own optimal completion time. This hierarchical structure improves on the central auction mechanism in both scalability and robustness: scalability improves because bids and decisions are distributed across multiple auctions in the hierarchy, and robustness improves because there is no single point of failure.
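The monotonicity property gives each market a simple floor to respect: since T*_a ≤ T*_b whenever a ⊂ b, a superior market's minimum admissible T is the largest optimum bid up from below. An illustrative sketch (the tree encoding and function name are assumptions; a real superior auction would also re-optimize against CCT at this floor):

```python
def completion_floor(auction):
    """Minimum admissible completion time at a market in the hierarchy:
    a leaf market bids its locally solved T*; an interior market's floor
    is the maximum over its subordinate markets' floors."""
    if "T_star" in auction:                       # leaf: solved locally
        return auction["T_star"]
    return max(completion_floor(sub) for sub in auction["subs"])
```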

6. Empirical results

We ran several experiments to validate the proposed control approach through discrete-event simulation.

6.1. Experimental design

The network is composed of fifteen agents in a convergent structure, as in figure 1, with each link number l_ij set to 1. Each agent in the lowest tier has 200 root tasks. All agents share the same linear value function, and the cost of completion time is linear, as described in the figure.

CCT(T) = 4T

Figure 1. Experimental network configuration (a convergent tree: leaf agents A_8–A_15, each holding 200 root tasks, feed A_4–A_7, then A_2 and A_3, converging on A_1)

To observe adaptive behavior we assign a weight w_i to each agent and a weight w′_i to a stressor residing on the same machine, for proportional resource sharing between them. A stressor, which has infinite work (it continuously requires resources), can impose different levels of stress on the agent directly through w′_i: when w′_i is zero there is no stress, and as it increases the stress level increases.

We implement the stress environment simply using weighted round-robin scheduling, in which each thread receives a number of quanta in proportion to its weight.
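The proportional-share rule can be sketched as follows (an illustrative helper; real weighted round-robin schedulers hand out quanta one turn at a time rather than in a single batch):

```python
def wrr_shares(weights, quanta):
    """Split a number of scheduling quanta among threads in proportion
    to their weights, as in weighted round-robin scheduling."""
    total = sum(weights.values())
    return {name: round(quanta * w / total) for name, w in weights.items()}
```

With the experimental weights w_i = 0.1 for the agent and w′_i = 1 for its stressor, the stressor receives roughly ten times the agent's share of CPU quanta, which is the stress the run imposes.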

We set up four experimental conditions, as in table 1. In the stressed conditions we stress agent A_4 in the middle of the run. The distribution of CPU time in the value function is either deterministic or exponential; when using a stochastic value function we ran 5 experiments.

Table 1. Experimental conditions

Condition  Stress      Value function
Con1       W/o stress  Deterministic
Con2       W/o stress  Exponential
Con3       W/ stress   Deterministic
Con4       W/ stress   Exponential

* Control period: 100
* w_i: 0.1; w′_4: 1 during (500, 1000)

We use three control modes for each experimental condition, shown in table 2. AC denotes the adaptive control mechanism we have developed.

Table 2. Control modes for experimentation

Control mode  Description
FL            Fixed lowest value mode
FH            Fixed highest value mode
AC            Adaptive control

6.2. Results

Experimental results are summarized in table 3. The proposed adaptive control showed significant advantages over the non-adaptive cases under all conditions.

Table 3. Experimental results

       FL                  FH                  AC
       T     V      QoS    T     V      QoS    T     V      QoS
Con1   1656  13558  6934   6313  30643  5391   1663  22898  16245
Con2   1652  13547  6942   6302  30643  5435   1723  22982  16089
Con3   1656  13558  6934   6313  30643  5391   1966  23401  15539
Con4   1652  13547  6942   6371  30643  5159   2024  23495  15401

* T: Completion time
* V: Value of solution

The adaptive behavior under the proposed control mechanism is shown in figures 2 and 3 for the deterministic case and figures 4 and 5 for the stochastic case. These plot the time series of the decision variables at each control point. Under stress the system changes its behavior, adapting to the new environment; once the stress is removed, the system adapts again.

Figure 2. Adaptive value mode under Con3

Figure 3. Adaptive optimal T under Con3

Figure 4. Adaptive value mode under Con4

Figure 5. Adaptive optimal T under Con4

7. Summary and conclusions

Typical information networks emerge as a result of automation or organizational integration. These networks are large-scale, with distributed, component-based architectures. Because such networks are easily exposed to adverse events such as accidental failures and malicious attacks, their survivability needs to be studied.

In this paper we studied these emerging networks with the aim of supporting survivability by exploiting implementation alternatives. Adopting an MPC-style approach for its benefits with respect to complexity, optimality, and agility, we developed an adaptive control mechanism with scalability and predictability. To achieve adaptivity we modeled the stress environment indirectly by quantifying the resource availability of the system. We built a mathematical programming model incorporating this resource availability, which predicts QoS as a function of control actions. By periodically solving the programming model and taking the optimal control actions given recent resource availability, the system adapts predictably to the changing stress environment. Because the programming model can be large-scale and complex, we agentified the components of the network from a control point of view so that the system can solve the model in a decentralized mode, and we provided a hierarchical auction mechanism for coordination. We showed the effectiveness of our approach with respect to QoS and adaptivity under different experimental conditions.

Our approach can be extended to network configurations in which multiple agents on one machine share resources. In that case there is a good opportunity to improve system performance by allocating resources appropriately among the agents.

To implement the proposed control mechanism in an information network such as an UltraLog network, several elements discussed during the development of the mechanism must be devised. Each component needs a value function and sensors. To coordinate the components through the hierarchical auction market, sellers need to be built with appropriate optimization algorithms, and to provide the necessary information to the auction market, components and sellers must be able to make bids. Since the system makes periodic decisions, a seller at the top of the hierarchy may send market-opening messages to the market participants periodically.

Acknowledgements

Support for this research was provided by DARPA (Grant #MDA972-01-1-0038) under the UltraLog program. We thank Dr. Mark Greaves (DARPA), Marshall Brinn, Beth DePass, and Aaron Helsinger (all of BBN) for their suggestions on this work.

References

[1] S. Jha and J. M. Wing, "Survivability analysis of networked systems", Proceedings of the 23rd International Conference on Software Engineering, pp. 307-317, 2001.

[2] R. Ellison, D. Fisher, H. Lipson, T. Longstaff, and N. Mead, "Survivable network systems: An emerging discipline", Technical Report CMU/SEI-97-153, Software Engineering Institute, Carnegie Mellon University, 1997.

[3] J. E. Eggleston, S. Jamin, T. P. Kelly, J. K. MacKie-Mason, W. E. Walsh, and M. P. Wellman, "Survivability through market-based adaptivity: The MARX project", DARPA Information Survivability Conference and Exposition, 2000.

[4] S. Bowers, L. Delcambre, D. Maier, C. Cowan, P. Wagle, D. McNamee, A. L. Meur, and H. Hinton, "Applying adaptation spaces to support quality of service and survivability", DARPA Information Survivability Conference and Exposition, 2000.

[5] O. F. Rana and K. Stout, "What is scalability in multi-agent systems?", Fourth International Conference on Autonomous Agents, 2000.

[6] B. Meyer, "On to components", IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.

[7] P. Clements, "From subroutines to subsystems: Component-based software development", in A. W. Brown (ed.), Component-Based Software Engineering, IEEE Computer Society Press, pp. 3-6, 1996.

[8] F. M. T. Brazier, C. M. Jonker, and J. Treur, "Principles of component-based design of intelligent agents", Data and Knowledge Engineering, vol. 41, no. 1, pp. 1-28, 2002.

[9] H. J. Goradia and J. M. Vidal, "Building blocks for agent design", Fourth International Workshop on Agent-Oriented Software Engineering, pp. 17-30, 2003.

[10] R. Krutisch, P. Meier, and M. Wirsing, "The AgentComponent approach: Combining agents and components", Net.ObjectDays, 2003.

[11] A. P. Moore, R. J. Ellison, and R. C. Linger, "Attack modeling for information security and survivability", Technical Note CMU/SEI-2001-TN-001, Software Engineering Institute, Carnegie Mellon University, 2001.

[12] F. Moberg, "Security analysis of an information system using an attack tree-based methodology", Master's thesis, Automation Engineering Program, Chalmers University of Technology, 2000.

[13] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming", Artificial Intelligence, vol. 72, pp. 81-138, 1995.

[14] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control", IEEE Control Systems, vol. 12, no. 2, pp. 19-22, 1992.

[15] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey", Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.

[16] J. B. Rawlings, "Tutorial overview of model predictive control", IEEE Control Systems, vol. 20, no. 3, pp. 38-52, 2000.

[17] M. Morari and J. H. Lee, "Model predictive control: Past, present and future", Computers and Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999.

[18] M. Nikolaou, "Model predictive controllers: A critical synthesis of theory and industrial needs", Advances in Chemical Engineering Series, Academic Press, 2001.

[19] S. J. Qin and T. A. Badgwell, "A survey of industrial model predictive control technology", Control Engineering Practice, vol. 11, pp. 733-764, 2003.

[20] S. Clearwater (ed.), Market-Based Control: A Paradigm for Distributed Resource Allocation, World Scientific Publishing, 1996.

[21] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press, 1999.

[22] S. Kumar, "Confidence based dual reinforcement Q-routing: An on-line adaptive network routing algorithm", Technical Report AI98-267, Department of Computer Sciences, The University of Texas at Austin, 1998.

[23] Y. Lengwiler, "The multiple unit auction with variable supply", Economic Theory, vol. 14, pp. 373-392, 1999.


Proceedings of the 1st Open Cougaar Conference



SITUATION IDENTIFICATION USING DYNAMIC PARAMETERS IN<br />

COMPLEX AGENT-BASED PLANNING SYSTEMS<br />

SEOKCHEON LEE, N. GAUTAM, S. KUMARA, Y. HONG, H. GUPTA,<br />

A. SURANA, V. NARAYANAN, H. THADAKAMALLA, M. BRINN, M.<br />

GREAVES<br />

Department of Industrial Engineering<br />

The Pennsylvania State University<br />

University Park, PA 16802<br />

ABSTRACT<br />

Survivability of multi-agent systems is a critical problem. Real-life systems are<br />
constantly subject to environmental stresses, including scalability,<br />
robustness and security stresses. It is important that a multi-agent system adapt<br />
itself to varying stresses and still operate within acceptable performance<br />
regions. Such adaptivity comprises identifying the state of the agents,<br />
relating it to stress situations, and then invoking control rules (policies). In<br />
this paper, we study a supply chain planning system implemented in COUGAAR<br />
(Cognitive Agent Architecture), developed by DARPA (Defense Advanced<br />
Research Projects Agency), develop a methodology to identify behavior<br />
parameters, and relate those parameters to stress situations. We verify the<br />
proposed method experimentally.<br />

1. INTRODUCTION<br />

Survivability of multi-agent systems is a critical problem. Real-life systems are<br />
inherently distributed and are constantly subject to environmental and internal stresses.<br />
These include scalability, robustness and security stresses. It is important that a<br />
multi-agent system adapt itself to varying stresses and still operate within an acceptable<br />
performance region. Such adaptivity comprises identifying the state of the agents,<br />
relating it to stress situations, and then invoking control rules (policies). One of the<br />
fundamental problems is agent state (behavior) identification.<br />
In this paper, we study a supply chain planning society called Small Supply Chain<br />
(SSC), implemented in COUGAAR (Cognitive Agent Architecture), developed by<br />
DARPA (Defense Advanced Research Projects Agency), and develop a methodology for<br />
behavior parameter identification and for relating it to stress situations. The two important<br />
steps in our methodology are: 1. Identify the most discriminable behavior parameter set<br />
for situation identification; 2. Apply it to situation identification. To identify the most<br />
discriminable behavior parameter set we collect time series data from one of the<br />
agents in SSC (TAO) and compute 38 statistical and deterministic parameters to represent<br />
the collected time series. In essence, these 38 parameters are the features of agent state. In<br />
our earlier work (Ranjan et al., 2002) we proved that SSC shows chaotic behavior from an<br />
inventory fluctuation point of view and computed chaos indicators (which we call<br />
deterministic parameters, without loss of generality). Though we compute 38 different<br />
parameters, the next question we address is whether all of them are really useful and necessary<br />
for identifying stress situations. We therefore develop a discriminability index and<br />
identify the most discriminable behavior parameter set based on this index as a<br />



representative parameter set for identifying stress situations. Using those<br />
parameters we develop a nearest-neighbor-classification-based method to identify stress<br />
situations.<br />

2. SSC (SMALL SUPPLY CHAIN) SOCIETY<br />

SSC is a COUGAAR society for supply chain planning composed of 26 agents.<br />
Each agent generates a logistics plan depending on its relative position in the supply chain.<br />
TAO is an important agent of the SSC and we have selected it to test our schema. Figure<br />
1 shows the detailed view. In TAO, GenerateProjection Tasks are expanded to Supply<br />
Tasks, which are for internal consumption. Each Supply Task is expanded to a Withdrawal<br />
Task, which is allocated to the inventory asset. Supply Tasks are also transferred from other<br />
agents; they too are expanded to Withdrawal Tasks, which are allocated to the inventory asset.<br />
MaintainInventory Tasks, which are for the maintenance of inventory assets in TAO, are<br />
expanded to Supply Tasks. Each of these Supply Tasks is allocated to other agents.<br />
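The task expansion flow just described can be sketched in miniature (a hypothetical illustration only; the function and field names below are invented and are not the actual COUGAAR plugin API):<br />

```python
# Minimal sketch of TAO's internal task expansion (hypothetical names,
# not the actual COUGAAR plugin API).

inventory_allocations = []  # stands in for the inventory asset

def expand_generate_projection(task, n_supply=3):
    """GenerateProjection -> Supply Tasks for internal consumption."""
    return [{"verb": "Supply", "parent": task} for _ in range(n_supply)]

def expand_supply(task):
    """Supply -> Withdrawal Task, allocated to the inventory asset."""
    withdrawal = {"verb": "Withdrawal", "parent": task}
    inventory_allocations.append(withdrawal)
    return withdrawal

# Each GenerateProjection Task yields Supply Tasks, and each Supply
# Task yields a Withdrawal Task against the inventory asset.
gp_task = {"verb": "GenerateProjection"}
for supply in expand_generate_projection(gp_task):
    expand_supply(supply)
```

Supply Tasks transferred from other agents would pass through the same expansion, and MaintainInventory Tasks would expand to Supply Tasks allocated to supplier agents.<br />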

[Figure omitted: task flow diagram within TAO, showing MaintainInventory, GenerateProjection, ProjectSupply, Supply, ProjectWithdrawal, and Withdrawal Tasks against the Inventory Asset.]<br />
Figure 1. TAO in SSC<br />

3. STRESSES AND BEHAVIOR<br />

For the sake of analysis we have parameterized the stress situations and the system<br />
behavior.<br />

3.1 Stress<br />

Stress refers to survivability stress and includes scalability, security, and robustness<br />
stresses. Scalability is defined as the ability of a solution to a problem to work when the<br />
size of the problem increases. Survivability (regarding security and robustness) is<br />
defined as "the capability of a system to fulfill its mission, in a timely manner, in the<br />
presence of attacks, failures, or accidents" (Ellison et al., 1997). There can be diverse<br />
stress situations, but in this paper we consider stress situations formed by the two scalability<br />
stress types given below:<br />
• Problem Complexity: Problem complexity is determined by the complexity of<br />
the planning task. This has many aspects, and we have chosen one stress type,<br />
the OpTempo of each agent, which represents operation tempo.<br />
• Query Frequency: Each agent provides a query service for its planning<br />
information to human operators. We have chosen the query frequency (number of<br />
query requests per second) to each agent as the other stress type.<br />
Although the SSC society is composed of 26 agents, there are only 8 agents that are<br />
directly affected by OpTempo. We define three stress levels: Low/Medium/High. So, the size<br />
of our stress situation space becomes 3^34.<br />



3.2 Behavior<br />

In the SSC society an agent's behavior can be described by its Task groups' behaviors.<br />
Behaviors can be represented by time series. We define four different time series (Task<br />
arrival, Time to solution sorted by generation sequence, Time to solution sorted by<br />
completion sequence, and Queue length). A time series may be characterized using<br />
deterministic and statistical parameters as shown in Table 1.<br />
Deterministic characterization makes it possible to handle non-stationary, non-periodic,<br />
irregular time series, including chaotic deterministic time series. In this study<br />
we use five different deterministic behavior parameters. Since the dynamics of a<br />
deterministic dynamical system are unknown, we cannot reconstruct the original<br />
attractor that gave rise to the observed time series. Instead, we seek the embedding space<br />
where we can reconstruct an attractor from the scalar data that preserves the invariant<br />
characteristics of the original unknown attractor, using the delay coordinates proposed by<br />
Packard et al. (1980) and justified by Takens (1981). Average mutual information has<br />
been suggested by Fraser and Swinney (1986) for choosing the time delay. Schuster<br />
(1989) proposed a nearest neighbor algorithm as the basis for choosing the embedding<br />
dimension. The local dimension has been used to define the number of dynamical variables<br />
that are active in the embedding dimension (Abarbanel et al., 1998). The most popular<br />
measure of an attractor's dimension is the correlation dimension, first defined by<br />
Grassberger and Procaccia (1983). A method to measure the largest Lyapunov exponent,<br />
sensitivity to initial conditions as a measure of chaotic dynamics, was proposed by Wolf<br />
et al. (1985). We have systematically studied the use of these methods from the literature<br />
and computed 38 different behavioral parameters to characterize the four time series we<br />
have considered. These 38 parameters are shown in Table 1.<br />

Table 1. Behavioral parameters (18 statistical and 20 deterministic, 38 in total)<br />
Statistical parameters, computed for the Task Arrival, Time to Solution, and Queue Length<br />
series (the two Time to Solution orderings share the same statistical values):<br />
# of events, Average, Minimum, Maximum, Radius, Variance<br />
Deterministic parameters, computed for each of the four time series (Task Arrival,<br />
Time to Solution (Generation), Time to Solution (Completion), Queue Length):<br />
ami, e_dim, l_dim, c_dim, l_exp<br />
ami: average mutual information, e_dim: embedding dimension, l_dim: local dimension,<br />
c_dim: correlation dimension, l_exp: Lyapunov exponent<br />
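To make the statistical half of Table 1 concrete, the six statistical parameters could be computed for a time series as follows (a sketch; 'Radius' is not defined in the paper, so we assume here it means half the range, which is an assumption):<br />

```python
import statistics

def statistical_parameters(series):
    """Six statistical behavior parameters of Table 1 for one time series.
    'Radius' is assumed to be half the range, (max - min) / 2 -- an
    assumption, since the paper does not define it."""
    return {
        "# of events": len(series),
        "Average": statistics.mean(series),
        "Minimum": min(series),
        "Maximum": max(series),
        "Radius": (max(series) - min(series)) / 2,
        "Variance": statistics.variance(series),  # sample variance
    }

# e.g., inter-arrival times (seconds) of four task events:
params = statistical_parameters([1.0, 2.0, 4.0, 3.0])
```

Note that these values are invariant to reordering of the series, which is why the two Time to Solution orderings share one set of statistical parameters.<br />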

4. EXPERIMENTATION AND RESULTS<br />

We ran several simulations of SSC to identify the most discriminable behavior<br />

parameter set.



4.1 Experimental configuration<br />

[Figure omitted: experimental configuration, in which online experimentation (SSC/TAO behavior and stressor-injected stress situations) feeds a database, from which offline analysis generates the parameter table.]<br />
Figure 2. Experimental configuration<br />

In this experimentation we store the event data from TAO and the parameters of the stress<br />
situation from the stressor into an online database, and then from the database we construct<br />
the parameter table with stress parameters and behavior parameters, as shown in Fig. 2. The<br />
experimental matrix is shown in Table 2.<br />

Table 2. Experimental matrix<br />

TestID OpTempo Query Repetition<br />

PRE001 Low to all agents Low to all agents 10<br />

PRE002 High to all agents Low to all agents 10<br />

PRE003 Medium to all agents Low to all agents 10<br />

PRE004 Medium to all agents High to all agents 10<br />

4.2 Results<br />

Reduction of stress space<br />

Figure 3 shows an example of the '# of events' parameter in each experiment, repeated<br />
10 times under four different stress conditions. We identified the stresses that have no<br />
significant effect on the society's behavior by comparing the behavior parameters under<br />
different conditions. The results show:<br />
• No significant difference between the Low and Medium levels of OpTempo stress<br />
• No significant effect of the query frequency stress<br />

[Figure omitted: plot of '# of Events from Task Arrival' (roughly 1040 to 1120) over 40 experiments for conditions PRE001 through PRE004.]<br />
Figure 3. Comparison of a behavior parameter in different stress conditions<br />

This leads to the reduction of the stress space from 3^34 to 2^8 (OpTempo Low/High for 8<br />
agents).<br />

Discriminability of behavior parameters



Not all of the behavior parameters are equally good at discriminating between<br />
stress situations. Therefore, there is a need for a measure of the discriminating power of each<br />
of the behavior parameters. We call this measure the discriminability index (DI). The DI can be<br />
represented as the ratio between sensitivity to the stress situations and random variation,<br />
defined as:<br />
Discriminability Index (DI) = [Σ(µ − µi)^2 / n] / [Σ si^2 / n] = Σ(µ − µi)^2 / Σ si^2 (1)<br />
µ: Average of parameter values<br />
µi: Average of parameter values from the ith condition<br />
si: Standard deviation of parameter values from the ith condition<br />
n: Number of conditions<br />
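Equation (1) translates directly into code; the sketch below assumes a balanced design (the same number of repetitions per condition), as in Table 2:<br />

```python
import statistics

def discriminability_index(groups):
    """Discriminability index of Eq. (1).
    `groups` maps each stress condition to the list of values a behavior
    parameter took under that condition (balanced design assumed)."""
    cond_means = [statistics.mean(v) for v in groups.values()]
    mu = statistics.mean(cond_means)  # grand mean over conditions
    numerator = sum((mu - m) ** 2 for m in cond_means)  # sensitivity
    denominator = sum(statistics.variance(v)            # random variation
                      for v in groups.values())
    return numerator / denominator

# A parameter whose condition means differ widely relative to its
# within-condition variance scores a high DI:
di = discriminability_index({"Low": [0.0, 1.0, 2.0],
                             "High": [10.0, 11.0, 12.0]})
```

A parameter with identical values in every condition would have a numerator of zero and hence a DI of zero, while '# of events' in Table 3 scores far above the rest.<br />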

We ranked the 38 behavior parameters using the DI; the top 5 are shown in Table<br />
3. As shown in the table, '# of events' from the task arrival time series was the most<br />
discriminable behavior parameter. Because this parameter is sensitive to different<br />
stress situations and has small variation within the same stress situation, its DI is far<br />
larger than those of the other parameters.<br />

Table 3. Discriminability index (DI) of behavior parameters<br />

Rank DI Time Series Behavior Parameter<br />

1 2477 Task arrival # of events<br />

2 6 Time to solution Variance<br />

3 5 Time to solution Radius<br />

4 4 Time to solution Average<br />

5 4 Time to solution Maximum<br />

5. SITUATION IDENTIFICATION<br />

Results from the preliminary experimentation showed that '# of events' from the task<br />
arrival time series (# of tasks) is the most discriminable behavior parameter in our stress<br />
space. So, assuming that the input to an agent affects its output depending on that agent's<br />
stress situation, we can identify the OpTempo of an agent by using the four features of '# of<br />
tasks' shown in Fig. 4.<br />

[Figure omitted: an agent, stressed by its OpTempo, receives ProjectSupply and Supply tasks and emits ProjectSupply and Supply tasks; the four features are # of ProjectSupply from outside, # of Supply from outside, # of ProjectSupply to outside, and # of Supply to outside.]<br />
Figure 4. Features for situation identification<br />

We performed an initial design of experiments and constructed a database of the<br />
behavior parameters from 100 experiments. Each agent's OpTempo is randomly chosen,<br />
and the parameters are computed and stored in the database. Given new experimental<br />
data, we select the nearest neighbor from the base database using the Euclidean<br />
distance between feature vectors. The stress level of the nearest neighbor is used as the<br />
stress estimate. We estimated the stress level for 100 new experimental data sets using this<br />
approach. The results of the estimation are shown in Table 4. The stress was identified<br />
correctly for half of the agents, but not for the other half.<br />

Table 4. Stress estimation result<br />

Stress Correct estimation Stress Correct estimation<br />

OpTempo of agent 1 54% OpTempo of agent 5 100%<br />

OpTempo of agent 2 100% OpTempo of agent 6 94%<br />

OpTempo of agent 3 56% OpTempo of agent 7 53%<br />

OpTempo of agent 4 100% OpTempo of agent 8 46%<br />
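The nearest-neighbor estimation step can be sketched as follows (a minimal version; the feature vectors and labels below are hypothetical stand-ins for the four '# of tasks' features of Fig. 4):<br />

```python
import math

def estimate_stress(query, database):
    """Return the stress label of the record in `database` whose feature
    vector is nearest to `query` in Euclidean distance.
    `database` is a list of (feature_vector, stress_label) pairs."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(database, key=lambda record: euclidean(query, record[0]))
    return label

# Hypothetical base database of 4-dimensional task-count feature vectors:
base = [((100, 80, 90, 70), "Low"), ((150, 120, 140, 110), "High")]
estimate = estimate_stress((148, 118, 139, 112), base)
```

In the experiments the database holds the feature vectors from the 100 base runs, and each new run is labeled with its nearest neighbor's stress level.<br />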

6. CONCLUSIONS<br />

In this paper, we developed a methodology for extracting features from the time series<br />
of an agent-based supply chain planning society (behavior parameters) and relating them to<br />
stress situations. We identified '# of tasks' as the most discriminable behavior parameter<br />
among our 38 statistical and deterministic parameters in our stress space. Using this parameter<br />
we validated the method's ability to identify stress situations using nearest neighbor<br />
classification. Although our analysis showed that the deterministic parameters do not<br />
discriminate stress situations in our stress space, they may still be good<br />
indicators under other stress spaces, such as security and robustness stresses.<br />

ACKNOWLEDGEMENTS<br />

Support for this research was provided by DARPA (Grant #: MDA 972-01-1-0563) under<br />
the UltraLog program.<br />

REFERENCES<br />

Abarbanel, H. D. I., Gilpin, M. E., Rotenberg, M., 1998, Analysis of Observed Chaotic Data, Springer.<br />
Ellison, R. J., Fisher, D. A., Linger, R. C., Lipson, H. F., Longstaff, T., Mead, N. R., 1997, "Survivable Network Systems: An Emerging Discipline", Technical Report CMU/SEI-97-TR-013, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA.<br />
Fraser, A. M., and Swinney, H., 1986, "Independent coordinates for strange attractors from mutual information", Physical Review A, Vol. 33, pp. 1134-1140.<br />
Grassberger, P., and Procaccia, I., 1983, "Characterization of Strange Attractors", Physical Review Letters, Vol. 50, pp. 346-349.<br />
Grassberger, P., and Procaccia, I., 1983, "Measuring the Strangeness of Strange Attractors", Physica D, Vol. 9, pp. 189-208.<br />
Packard, N. H., Crutchfield, J. P., Farmer, J. D., and Shaw, R. S., 1980, "Geometry from a Time Series", Physical Review Letters, Vol. 45, pp. 712-716.<br />
Ranjan, P., Kumara, S., Surana, A., Manikonda, V., Greaves, M., Peng, W., 2002, "Decision Making in Logistics: A Chaos Theory Based Analysis", AAAI Spring Symposium, Technical Report SS-02-03, pp. 130-136.<br />
Schuster, H. G., 1989, Deterministic Chaos: An Introduction, VCH Verlagsgesellschaft, Weinheim.<br />
Takens, F., 1981, "Detecting strange attractors in turbulence", Dynamical Systems and Turbulence, pp. 366-381, Springer, Berlin.<br />
Wolf, A., Swift, J. B., Swinney, H. L., and Vastano, J., 1985, "Determining Lyapunov Exponents from a Time Series", Physica D, Vol. 16, pp. 285-317.<br />


Estimating Global Stress Environment by Observing Local<br />

Behavior in Distributed Multiagent Systems<br />

Seokcheon Lee and Soundar Kumara<br />

Department of Industrial and Manufacturing Engineering,<br />

The Pennsylvania State University<br />

University Park, PA 16802 USA<br />

Abstract—A multiagent system can be considered survivable<br />
if it adapts itself to varying stresses without considerable<br />
performance degradation. Such adaptivity comprises<br />
identifying the behavior of the agents in a society, relating it<br />
to stress situations, and then invoking control rules. This<br />
problem is a hard one, especially in distributed multiagent<br />
systems, wherein the agent behaviors tend to be nonlinear and<br />
dynamic. In this paper, we study a supply chain planning<br />
system implemented in COUGAAR (Cognitive Agent<br />
Architecture) and develop a methodology for identifying the<br />
behavior of agents through their behavioral parameters, and<br />
for relating those parameters to stress situations. One important<br />
aspect of our approach is that we identify the stress situations<br />
of the agents in the society by observing the local behavior of one<br />
representative agent. This approach is motivated by the fact<br />
that, in deterministic dynamical systems, a local time series can<br />
carry information about the dynamics of the entire system.<br />
We validate our approach empirically by identifying the stress<br />
situations using a k-nearest neighbor algorithm based on the<br />
behavioral parameters.<br />

I. INTRODUCTION<br />
Survivability is defined as "the capability of a system to<br />
fulfill its mission, in a timely manner, in the presence of<br />
attacks, failures, or accidents" [1]. This definition considers<br />

security and robustness stresses as components of the stress<br />
environment. With the increasing size of networked systems,<br />
scalability becomes a critical issue for a system to fulfill its<br />
mission [2]. We argue that scalability is also an important<br />
component of survivability and hence of the stress<br />
environment. In this paper we consider only scalability<br />
stress in dealing with survivability.<br />

Survivability of multiagent systems is a critical problem.<br />
As infrastructures become large-scale and increasingly<br />
dependent on networked systems for automation or<br />
organizational integration, this capability becomes more and<br />
more important. Real-life systems are inherently distributed<br />

This work was supported in part by DARPA under Grant MDA 972-01-1-0038.<br />
S. Lee is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (phone: 814-863-4799; fax: 814-863-4745; e-mail: stonesky@psu.edu).<br />
S. Kumara is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (e-mail: skumara@psu.edu).<br />

and are constantly subject to environmental and internal<br />
stresses. Hence, it is important that a multiagent system<br />
adapt itself to varying stresses and maintain its<br />
performance within acceptable bounds.<br />
The three important constituents of adaptivity are: agent<br />
behavior identification, mapping the agent behavior to the<br />
environment (stresses), and invoking the appropriate control<br />
rules (policies).<br />

In this paper, we study a supply chain planning system,<br />
the Small Supply Chain (SSC) society implemented in<br />
COUGAAR (Cognitive Agent Architecture:<br />
http://www.cougaar.org), as an example system. We develop<br />
a methodology to identify the stress situations of the agents<br />
in the society by observing the local behavior of one<br />
representative agent (called TAO). This information can<br />
subsequently be used to devise and invoke control policies.<br />
The two important steps in our methodology are: 1. Extract<br />
meaningful behavioral parameters for situation<br />
identification; 2. Apply these parameters to situation<br />
identification. We collect time series data from TAO and<br />
compute 38 statistical and deterministic parameters to<br />
represent its behavior. In essence, these 38 parameters are<br />
the features of the behavior of the society, as we can assume,<br />
from the theory of deterministic dynamical systems [3], [4],<br />
that the behavior of TAO carries information about the<br />
dynamics of the entire system. Not all of the 38 parameters<br />
are equally important or independent. We therefore develop<br />
a discriminability index of the parameters, based on which<br />
we extract meaningful behavioral parameters. Using those<br />
selected parameters we develop a k-nearest neighbor<br />
classification based method to identify the stress situations<br />
of agents in the society.<br />

The organization of the paper is as follows. In section II<br />
we discuss the SSC society. In section III, we parameterize<br />
stress situations and behavior. In section IV we analyze the<br />
results from preliminary experimentation to build the<br />
methodology. In section V we implement our approach to<br />
identify the stress situations. Finally, in section VI, we<br />
conclude our work.<br />

II. SSC (SMALL SUPPLY CHAIN) SOCIETY<br />

SSC is a COUGAAR society for military supply chain<br />
planning composed of 26 agents, with 17 agents performing<br />
the actual planning. COUGAAR has a distributed and<br />
component-based architecture in which agents are<br />
geographically distributed and process their specific types of<br />
tasks. The objective of the SSC society is to generate a<br />
logistics plan for a given military operation. Each agent,<br />
representing an organization of the military supply chain,<br />
processes tasks received from other agents or generated<br />
internally. Those tasks are allocated to assets after being<br />
expanded or aggregated. The allocations in an agent<br />
trigger the generation of tasks to its supplier agents to refill<br />
the assets. When tasks from customers are allocated in a<br />
supplier agent, the results are fed back to the customer<br />
agents. Fig. 1 shows the task flow structure of the SSC<br />
society. TAO (Agent 3), which provides direct logistics<br />
support to combat units (Agents 1 and 2), is an important<br />
agent of the SSC society with respect to its relationships with<br />
other agents and its volume of tasks. We have selected it as a<br />
representative agent to test our schema.<br />


III. STRESS AND BEHAVIOR<br />

For analysis purposes we parameterize the stress and<br />
behavioral spaces.<br />

A. Stress<br />


There are diverse survivability stresses with respect to<br />
scalability, security, and robustness [5]–[7]. For<br />
implementation purposes, we consider three types of<br />
scalability stresses as follows:<br />
- Network topology: We consider one aspect of scalability<br />
stress to be adding or removing agent(s) connected to TAO<br />
in the existing topology. Theoretically we could randomly<br />
add or remove any agent; however, we consider only agent 1<br />
and TAO together. The three stress levels we impose on TAO<br />
are: removing agent 1, having one agent of agent 1's type<br />
connected, and adding one more agent of agent 1's type.<br />

- Problem Complexity: Problem complexity is determined<br />
by the complexity of the planning tasks. This includes many<br />
aspects, and we have chosen OpTempo to implement this<br />
stress type, which represents the tempo of military<br />
operations. We define three stress levels of OpTempo for<br />
each of the 16 agents other than agent 1: Low, Medium,<br />
and High.<br />
[Figure omitted: task flow network of the SSC society, showing agents 1 through 17, with TAO (agent 3) supporting combat units 1 and 2.]<br />
Fig. 1. SSC society<br />

- User Query: Each agent provides a query service for its<br />
planning information to human operators. We have chosen<br />
query frequency, the number of query requests per second,<br />
to implement this stress type. We define three stress<br />
levels of query frequency for each of the 16 agents other<br />
than agent 1: Low, Medium and High.<br />

The size of the stress space is very large: combining these<br />
three types of stresses, it becomes 3^33<br />
(3 × 3^16 × 3^16).<br />

B. Behavior<br />

In the SSC society an agent's behavior can be abstracted by<br />
observing the agent's task processing. We define four<br />
different time series related to the agent's task processing, as<br />
follows:<br />
- Task arrival: Task inter-arrival times from other agents<br />
as well as from TAO itself<br />
- Time to solution sorted by generation sequence: Time<br />
durations taken to complete a task from its generation,<br />
sorted by generation sequence<br />
- Time to solution sorted by completion sequence: Time<br />
durations taken to complete a task from its generation,<br />
sorted by completion sequence<br />
- Queue length: Number of tasks that are waiting for<br />
processing<br />

A time series can be characterized using deterministic and<br />
statistical parameters. We have systematically studied the<br />
use of the methods from the literature and computed 38<br />
different behavioral parameters to characterize the four time<br />
series we have considered. These 38 parameters, composed<br />
of 18 statistical and 20 deterministic parameters (from<br />
dynamical systems theory), are shown in Table I. They<br />
represent the features of the agent's behavior.<br />

TABLE I<br />
BEHAVIORAL PARAMETERS<br />
Statistical parameters, computed for the Task Arrival, Time to Solution, and Queue Length<br />
series (the two Time to Solution orderings share the same statistical values):<br />
# of events, Average, Minimum, Maximum, Radius, Variance<br />
Deterministic parameters, computed for each of the four time series (Task Arrival,<br />
Time to Solution (Generation), Time to Solution (Completion), Queue Length):<br />
AMI, E_Dim, L_Dim, C_Dim, L_Exp<br />
AMI: Average Mutual Information, E_Dim: Embedding Dimension,<br />
L_Dim: Local Dimension, C_Dim: Correlation Dimension, L_Exp:<br />
Lyapunov Exponent<br />

Deterministic characterization makes it possible to handle<br />
non-stationary, non-periodic, irregular time series, including<br />
chaotic deterministic time series. In this paper we use five<br />
different deterministic behavioral parameters. Since the<br />
dynamics of a deterministic dynamical system are unknown,<br />
we cannot reconstruct the original attractor that gives rise<br />
to the observed time series. Instead, we seek the embedding<br />
space where we can reconstruct an attractor from the scalar<br />
data that preserves the invariant characteristics of the<br />
original unknown attractor, using delay coordinates [3], [4].<br />
This motivates us to characterize the system dynamics of the<br />
society by observing local behavior. Average mutual<br />
information has been suggested for selecting the time delay<br />
[8]. A nearest neighbor algorithm on which to base the<br />
choice of the embedding dimension is proposed in [9]. The<br />
local dimension has been used to define the number of<br />
dynamical variables that are active in the embedding<br />
dimension [10]. The most popular measure of an attractor's<br />
dimension is the correlation dimension [11], [12]. In [13] a<br />
method to measure the largest Lyapunov exponent,<br />
sensitivity to initial conditions as a measure of chaotic<br />
dynamics, is proposed. As these parameters are well<br />
documented in the references we have given, we do not<br />
undertake a detailed explanation.<br />
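For reference, once the delay tau (chosen by average mutual information [8]) and the embedding dimension m (chosen by the nearest neighbor criterion [9]) are fixed, the delay-coordinate reconstruction itself is straightforward (a sketch of the standard construction of [3], [4]):<br />

```python
def delay_embed(series, m, tau):
    """Delay-coordinate vectors x_t = (s_t, s_{t+tau}, ..., s_{t+(m-1)tau})
    reconstructed from a scalar time series s."""
    n_vectors = len(series) - (m - 1) * tau
    return [tuple(series[t + k * tau] for k in range(m))
            for t in range(n_vectors)]

# Toy series; each vector is a point in the m-dimensional embedding space.
vectors = delay_embed([0, 1, 2, 3, 4, 5], m=3, tau=2)
```

The deterministic parameters (C_Dim, L_Exp, etc.) are then computed over this reconstructed point cloud rather than over the raw scalar series.<br />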

TABLE II<br />

EXPERIMENTAL MATRIX<br />

TestID OpTempo Query Replication<br />

PRE001 Low to all agents Low to all agents 10<br />

PRE002 High to all agents Low to all agents 10<br />

PRE003 Medium to all agents Low to all agents 10<br />

PRE004 Medium to all agents High to all agents 10<br />

For all experiments, the number of agent 1 instances is one<br />

B. Results<br />
1) Reduction of stress space: We identified the stress<br />
situations that have no significant effect on the system<br />
dynamics by analyzing the behavioral parameters. Fig. 3<br />
shows an example of the '# of events' parameter from the<br />
'Task Arrival' time series in four different stress conditions.<br />
By analyzing all 38 parameters systematically we concluded<br />
that:<br />
- There is no significant difference between the Low and<br />
Medium levels of OpTempo stress.<br />
- There is no significant effect of the query frequency stress.<br />
This analysis leads to the reduction of the stress space from<br />
3^33 to 3 × 2^16 (the number of agent 1 instances: 0/1/2;<br />
OpTempo for each of the 16 agents other than agent 1:<br />
Low/High).<br />

IV. PRELIMINARY EXPERIMENTATION<br />

We ran several experiments to reduce the stress space by<br />

removing ineffective stress situations (stresses which do not<br />

change the existing behavior of a given agent). In addition<br />

we use the experiments to extract meaningful behavioral<br />

parameters from the 38 behavioral parameters we computed.<br />

In the following we undertake a detailed explanation.<br />

A. Experimental configuration<br />

# of events<br />

1120<br />

1110<br />

1100<br />

1090<br />

1080<br />

1070<br />

1060<br />

1050<br />

1040<br />

PRE001 PRE002 PRE003 PRE004<br />

SSC<br />

TAO<br />

1030<br />

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39<br />

Experiment<br />

Stressor<br />

Stress<br />

Situation<br />

Behavior<br />

Online Experimentation<br />

Database<br />

Parameter<br />

Generation<br />

Fig. 2. Experimental configuration<br />

Parameter<br />

Table<br />

Offline Analysis<br />

In this experimentation we store event data from TAO<br />

<strong>and</strong> the stress parameters from stressor (i.e., injector of<br />

stresses) into an online database, <strong>and</strong> then from the database<br />

we construct the parameter table with stress parameters <strong>and</strong><br />

behavioral parameters as shown in Fig. 2. The experimental<br />

matrix in this preliminary experimentation is shown in Table<br />

II. There are four different experimental conditions with<br />

different OpTempo <strong>and</strong> query frequency levels. We replicate<br />

each condition ten times.<br />

Fig. 3. Variance of a behavioral parameter<br />

2) Discriminability of behavioral parameters: All 38 behavioral parameters may not be equally good at helping to identify stress situations. It is important to select good parameters, especially among the deterministic parameters, because they are computationally expensive. There is therefore a need for a measure of the discriminating power of the parameters. We developed such an index, the Discriminability Index (DI), defined in (1) as the ratio between sensitivity to the stress situations and random variation.

DI = [ Σ_{i=1}^{n} (μ − μ_i)² / n ] / [ Σ_{i=1}^{n} s_i² / n ] = Σ_{i=1}^{n} (μ − μ_i)² / Σ_{i=1}^{n} s_i²    (1)

μ: average of parameter values
μ_i: average of parameter values in the i-th condition
s_i: standard deviation of parameter values in the i-th condition
n: number of conditions


A DI value greater than one implies that the particular parameter helps in discriminating between the situations (greater discriminating power). We calculated DI values for the 38 behavioral parameters and selected those with DI values larger than one. This resulted in 10 parameters, comprising eight statistical and two deterministic parameters, as shown in Table III. '# of events' from 'Task Arrival' and 'Time to Solution' was the most discriminating behavioral parameter. Note that '# of events' is the same for both time series, as arrived tasks are processed.
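As a sketch, (1) can be computed directly from the replicated runs. This is an illustrative implementation, not the paper's code: the division of numerator and denominator by n cancels, and the `groups` mapping (condition to the parameter values observed in its replications) is a hypothetical data layout.

```python
from statistics import mean, pstdev

def discriminability_index(groups):
    """DI from (1): variation of the per-condition means around the overall
    mean, divided by the within-condition (random) variation.
    `groups` maps each stress condition to its list of parameter values."""
    all_values = [v for vals in groups.values() for v in vals]
    mu = mean(all_values)
    between = sum((mu - mean(vals)) ** 2 for vals in groups.values())
    within = sum(pstdev(vals) ** 2 for vals in groups.values())
    return between / within
```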

TABLE III
DISCRIMINABILITY INDEX OF BEHAVIORAL PARAMETERS

Rank  DI      Time Series                    Parameter
1     2477.5  Task Arrival/Time to Solution  # of events
2     5.7     Time to Solution (G)           Variance
3     5.1     Time to Solution (G)           Radius
4     4.4     Time to Solution (G)           Average
5     4.2     Time to Solution (G)           Maximum
6     2.9     Queue Length                   # of events
7     2.8     Queue Length                   Maximum
8     2.2     Queue Length                   AMI
9     1.2     Queue Length                   Average
10    1.1     Time to Solution (C)           L_Exp

(G): Generation, (C): Completion

V. SITUATION IDENTIFICATION

The results of the preliminary experimentation show that 10 of the 38 behavioral parameters have better discriminating power in the stress space. Using them as features, we identify the stress situations with the k-nearest neighbor classification algorithm.

A. k-nearest neighbor algorithm

The k-nearest neighbor algorithm, one of the instance-based learning methods, is conceptually straightforward; the summary reported here is based on [14]. In this algorithm, learning is simply storing the training instances, each of which corresponds to a point in the n-dimensional feature space. Given a new query instance, its k nearest neighbors are retrieved from memory and used to classify it. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. One problem with this algorithm is its sensitivity to noisy axes in high-dimensional problems. Normalizing each feature does not resolve this, because the Euclidean distance can become very noisy in high-dimensional problems where only a few of the features carry classification information. The solution is to modify the Euclidean metric with a set of weights that represents the information content, or goodness, of each feature. Given a set of weights w, the distance between two normalized instances x_i and x_j with m features can then be calculated as in (2).

d(x_i, x_j) = Σ_{k=1}^{m} w[k] (x_i[k] − x_j[k])²    (2)
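A minimal sketch of classification under the weighted distance of (2) follows. The function and data layout are illustrative assumptions, not the implementation used in this work, and ties in the majority vote are broken arbitrarily.

```python
def weighted_knn(train, query, weights, k):
    """Classify `query` by majority vote among the k training instances
    nearest under the weighted (squared) Euclidean distance of (2).
    `train` is a list of (feature_vector, label) pairs; a weight of zero
    removes a feature from the distance entirely."""
    def dist(x, y):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)
```

Setting the weights proportionally to DI reproduces the best-performing configuration reported in Section V-B.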

B. Empirical results

We performed 200 experiments with the same experimental configuration as in Fig. 2 to construct the database of training instances. Each training instance is represented by the 10 behavioral parameters. In this experimentation each agent's OpTempo (Low/High) and the number of agent 1 (0/1/2) are randomly chosen. Given a new instance, we select its 20 nearest neighbors (10% of the population of training instances) from the database and use them to estimate the stress situations of the agents in the society. To assess the effectiveness of DI we use 12 different sets of weights to calculate the distance. In the first 10 sets only one parameter is considered, with all other parameters' weights set to zero. We weight the parameters equally in the 11th set and proportionally to DI in the 12th set.

We estimated the stress situations for 100 new instances using those 12 sets of weights. To eliminate the noise from agents that have no significant effect on the behavioral parameters, we removed from the analysis those agents that could not be identified correctly more than 2/3 of the time under any weight set. Through this procedure only 8 agents were selected. The correct-estimation results for the different weight sets are shown in Fig. 4.


On the whole, the ability of a behavioral parameter to identify the stress situations correlates well with its DI: the higher a parameter's DI rank, the higher its estimation accuracy. When the parameters are weighted equally, performance falls in the middle of the range; when they are weighted proportionally to DI, performance is highest. This result demonstrates the effectiveness of DI in capturing the goodness of the behavioral parameters for situation identification. Fig. 5 shows the performance for each agent when weighting proportionally to DI: the farther an agent is located from the TAO, the more the performance degrades.

[Fig. 4. Correct estimation using different weight sets: correctness (%) plotted against DI rank, with the equally weighted and DI-weighted sets shown for comparison.]


[Fig. 5. Correct estimation with proportional weights to DI: the society network around TAO, annotated with each selected agent's identification accuracy (e.g., 100% and 95% for agents nearest TAO, declining to 52-64% for more distant agents).]

VI. CONCLUSIONS

In this paper we developed a methodology for extracting features by characterizing time series and relating them to stress situations in distributed multi-agent systems. One important aspect of our approach is that we identify the stress situations of the agents in the society by observing the local behavior of one representative agent. This approach is motivated by the fact that, in deterministic dynamic systems, a local time series can carry information about the dynamics of the entire system. Identifying the situations of other agents is important when agents are interdependent in networked systems.

For a large society, this approach would allow us to predict the stress levels of other agents in the society, which helps in invoking an appropriate control policy. For example, by studying the local behavior of TAO during a certain period, we may be able to estimate that agent 11's OpTempo is high with 62% accuracy. This may lead us to reduce the number of tasks sent to that agent, since a high OpTempo requires more computational resources.

To extract meaningful behavioral parameters, we collected time series data from a representative agent and computed 38 statistical and deterministic parameters to represent its behavior. The Discriminability Index, defined in this paper as a measure of the discriminating power of the parameters, appears to be a promising direction for agent behavior estimation. Using the selected parameters, we validated our approach by identifying the stress situations with the k-nearest neighbor algorithm, using the index values as weights. Although our analysis showed that the deterministic parameters do not have significant ability to identify stress situations in our stress space, they may be good indicators in other stress spaces, such as security and robustness stresses.


REFERENCES

[1] R. Ellison, D. Fisher, H. Lipson, T. Longstaff, and N. Mead, "Survivable network systems: An emerging discipline," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU/SEI-97-153, 1997.
[2] O. F. Rana and K. Stout, "What is scalability in multi-agent systems?" in Proc. 4th Int. Conf. Autonomous Agents, 2000, pp. 56-63.
[3] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw, "Geometry from a time series," Physical Review Letters, vol. 45, pp. 712-716, 1980.
[4] F. Takens, "Detecting strange attractors in turbulence," in Dynamical Systems and Turbulence, D. Rand and L.-S. Young, Eds. Berlin: Springer, 1981, pp. 366-381.
[5] A. P. Moore, R. J. Ellison, and R. C. Linger, "Attack modeling for information security and survivability," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Note CMU/SEI-2001-TN-001, 2001.
[6] F. Moberg, "Security analysis of an information system using an attack tree-based methodology," M.S. thesis, Automation Engineering Program, Chalmers University of Technology, Sweden, 2000.
[7] S. Jha and J. M. Wing, "Survivability analysis of networked systems," in Proc. 23rd Int. Conf. Software Engineering, 2001, pp. 307-317.
[8] A. M. Fraser and H. L. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, pp. 1134-1140, 1986.
[9] H. G. Schuster, Deterministic Chaos: An Introduction. Weinheim: VCH Verlagsgesellschaft, 1989.
[10] H. D. I. Abarbanel, M. E. Gilpin, and M. Rotenberg, Analysis of Observed Chaotic Data. New York: Springer, 1998.
[11] P. Grassberger and I. Procaccia, "Characterization of strange attractors," Physical Review Letters, vol. 50, pp. 346-349, 1983.
[12] P. Grassberger and I. Procaccia, "Measuring the strangeness of strange attractors," Physica D, vol. 9, pp. 189-208, 1983.
[13] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano, "Determining Lyapunov exponents from a time series," Physica D, vol. 16, pp. 285-317, 1985.
[14] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997, pp. 230-236.


Using Predictors to Improve the Robustness of Multi-Agent Systems: Design and Implementation in Cougaar

†Himanshu Gupta, ‡Yunho Hong, ‡Hari Prasad Thadakamalla, †Vikram Manikonda, ‡Soundar Kumara and †Wilbur Peng

†Intelligent Automation Incorporated
7519 Standish Place, Suite 200, Rockville, MD 20855
{hgupta, vikram, wpeng}@i-a-i.com

‡Industrial and Manufacturing Engineering
310 Leonhard Building, The Pennsylvania State University, University Park, PA 16802
{yyh101, hpt102, skumara}@psu.edu

Abstract

In this paper we discuss the use of predictors as a means to improve the robustness of a multi-agent system in the event of information attacks that result in a loss of communication between agents. We focus on an adaptive logistics application developed under DARPA's Ultralog program using the Cougaar agent infrastructure. The objective of the predictors is to estimate key "state variables," such as demand and inventory, in the absence of communication, allowing logistics planning and execution to continue when communication resources are limited or lost. Prediction schemes based on model-based linear state estimation and on moving averages are discussed. A generalized software implementation of the predictors as plugins within a Cougaar agent, and approaches to reconciling any errors between the "estimated" and "actual" states when communication is restored, are also discussed. Experimental results are presented from the implementation of the predictors in a logistics society subject to simulated communication losses and variable changes to the operational plan.

1 Introduction

Agent-based technology provides a natural solution for inherently complex, distributed and decentralized systems, where a desired solution emerges as a set of autonomous, interacting entities execute and optimize their individual and group behavior in a dynamically changing environment. Adaptive logistics is one such example. In this setting, agents represent logistics entities such as Units of Action (UA), Forward Support Battalions (FSB), brigades and companies. These agents, distributed across various physical and logical boundaries, collaborate to perform logistics sustainment operations such as forecasting logistics consumption trends, identifying potential shortfalls, and planning, executing, monitoring and re-planning logistics operations in a dynamically changing environment. When deployed in a battlefield environment, the agent infrastructure is subject to several stresses, such as wartime loads (e.g., CPU stress due to variable loads), information attacks (e.g., denial of service, communication loss, reduced bandwidth) and kinetic attacks (e.g., loss of hardware resources). For successful deployment, the overall agent infrastructure needs to be robust and resilient to these stresses and attacks.

In this paper we discuss the use of predictors as a means to achieve robust behavior in the event of information attacks that result in a loss of communication between agents. We use an adaptive logistics application developed under DARPA's Ultralog program using the Cougaar agent infrastructure [9] as a testbed to motivate, implement and test the predictor designs. Two prediction schemes are discussed: the first is based on a linear state estimation approach that models agent interactions as a dynamical system, and the second is based on moving averages.

The paper is organized as follows. In Section 2 we discuss the modeling approach adopted to build the predictors. In Section 3 the software implementation of the predictors as plugins within a Cougaar agent is discussed. In Section 4 we discuss experimental results from the implementation of the predictors in a logistics system built on Cougaar, along with the approaches adopted to tune the predictors based on historical data. Section 5 presents conclusions and possible future areas of research and development.

2 Predictor Design and Algorithms

In this section we discuss the predictor algorithms that were implemented in Cougaar. Before presenting the technical details of the algorithms, we give a brief description of the logistics application domain, as some aspects of the design and implementation are specific to the application.

2.1 Logistics Scenario

The Cougaar multi-agent society considered in this effort is the Full society developed as part of DARPA's Ultralog program (see [1] for more details). Full is a military supply chain logistics society that encompasses many different supply classes. Each agent in the society represents a military unit performing a certain logistics operation in the supply chain. For example, the TRANSCOM agent represents the transportation command authority for the US military; it issues directives to its subordinate units regarding the transportation to be provided to a particular agent for a particular type of shipment. Figure 1 shows the organizational structure of the prototype Full society.

Figure 1. Full society hierarchical structure

There are five main supply chain threads in the prototype military logistics society: (i) Ammunition; (ii) Petroleum, Oil and Lubricants (BulkPOL and PackagedPOL); (iii) Subsistence (Food, Water); (iv) Repair Parts; and (v) Transportation.

Within each supply chain there exists a customer-supplier relationship between the various agents. A customer makes requests for various items (POL, ammunition, etc.) to its supplier, and the supplier in turn attempts to meet these demands from its current inventory or forwards the requests up the supply chain hierarchy. Thus, depending on its position in the hierarchy, a supplier can also be the customer of another agent.

Figure 2 shows part of the supply chain. Here FSB is a supplier, and ARBN and INFBN are customers of FSB (note that FSB is in turn a customer of MSB). These agents send demand requests to the FSB, where they are managed by its Inventory Manager. Based on the operation plan, Optempo and current inventory, each customer (agent) requests items from its supplier.

Figure 2. Predictor Implementation in Agent Network


2.2 Role of Predictors

As mentioned earlier, when deployed in a battlefield this agent society is subject to several stresses related to varying wartime loads and to kinetic and information warfare. These stresses may result in node failures, denial of service and other network-related faults that cause a loss of communication between agents. In this decentralized application, the inability of customers and suppliers to make or meet requests can significantly impact the performance of the various operational units.

In this setting, predictors can play an important role in maintaining supply chain connectivity while network-related faults are repaired. The predictors provide the ability to approximate the "expected behavior" by continuing to make appropriate demand and supply projections.

We focus on two classes of predictors: (i) a customer predictor that resides at the supplier agent and estimates the customer's demand when communications are lost; and (ii) a supplier predictor, inserted at the customer agent, that predicts the allocation results generated by the supplier for the demand tasks. As shown in Figure 2, the customer predictor residing on FSB forecasts each customer's (ARBN and INFBN) demand when communications are lost. In a similar fashion, the supplier predictor residing on the INFBN agent predicts the supplier's behavior. These agents use the predicted values and continue executing their functionality.

Depending on the accuracy of the predictors, the predicted states are typically not identical to the actual states. Thus when communications are restored and actual demand and supply values become available, any errors between estimated and actual values need to be resolved. This process, termed "reconciliation," requires any surplus tasks to be rescinded and new tasks to be added for any shortfalls. The predictors in turn need to update their models based on the available data.

2.3 Predictor Design

2.3.1 Customer Predictor

The customer predictor is implemented in the form of two plugins. One plugin is used during the planning mode, where it collects data about the customer-supplier relationship and the items involved, as well as the Optempo for these items. The other plugin is used during the execution mode; it monitors the demand from the customer and predicts the demand when there is a communication loss. Figure 3 shows the framework of the customer predictor during the execution mode.

Figure 3. Customer Predictor

2.3.2 Supplier Predictor

This predictor is also built in the form of two plugins, one residing at the supplier and the other at the customer. The plugin at the supplier periodically sends snapshots of the inventory levels for each item in all the supply classes to the plugin at the customer, which uses this information to predict the allocation results of the demand tasks. The design of the supplier predictor is shown diagrammatically in Figure 4.

Figure 4. Supplier Predictor

2.4 Predictor Algorithms

Based on the nature of the supply chain dynamics (uncertainty in demand, model complexity), the duration of communication loss and the computational requirements, different approaches for the predictors were investigated, ranging from dynamical systems to classification theory to traditional forecasting [3,4,6,7,8]. While some of our research [5] and prototypes indicate that better prediction results may be obtained using a non-parametric method such as a support vector machine or a radial basis function neural network, from an implementation and computational perspective this becomes impractical: as the society size scales, it becomes increasingly difficult to generate historical patterns for each agent and classify its behavior states under varying system configurations and environments. Given the practical nature of the logistics application, we adopted a more generic, computationally inexpensive approach based on moving averages and linear-model-based estimation schemes. In this section the design and implementation of the two schemes are discussed.

2.4.1 Model-based State Estimator

This approach is motivated in part by traditional approaches to estimation such as the linear Kalman filter [2], where the system state is estimated by propagating the state using a model of the system and updating the state using actual state measurements. The implementation adopted here is fairly simplistic but has been shown to perform well in a large number of settings.

Recall that a Cougaar-based logistics society operates in two modes: a planning mode and an execution mode. In the planning mode, a logistics plan is generated from the demands projected by each of the agents, based on the anticipated operational plan and Optempo. In the execution mode this plan is executed, and modifications and re-planning are done as needed based on actual demands.

The approach used in the model-based state estimator is to use plan-time information as the "best" estimate for the state (demand) in the event that actual state information is not available.

In this approach, plan-time data is used to build a linear estimator model for the system of the form

x(k+1|k) = x(k) + u(k)    (1)

where x(k) is the demand at time k and u(k) is the request for the "change in demand" at time k. Since x(k) is available at each time step during the planning mode, u(k) is explicitly computed as

u(k) = x(k+1) − x(k)    (2)

for each time step and saved. Thus, based on the data available during plan generation, a model with states and inputs for each time step is built. During the execution mode, as time evolves, the estimator projects the demand for the next time step using (1) and corrects its estimate based on the actual demand as follows:

x̂(k+1|k+1) = x(k+1|k) + K( x(k+1|k) − x_m(k+1|k+1) )    (3)

Here x̂ denotes the estimated state, K is the filter gain and x_m is the actual (measured) demand. K is computed offline from historical data and execution demand error covariances. In the event of a communication loss at time j, the demand x_cl for the next time step is projected as

x_cl(j+1) = x̂(j|j) + u(j)    (4)

Using the above model, the customers and suppliers continue to execute their functionality with the estimated states until communication is restored. At that point, any differences between the estimated demands and the actual demands during the communication loss are reconciled, and supplies and demands are adjusted accordingly.
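The propagate/correct/coast cycle of (1), (3) and (4) can be sketched as follows. The class and its interface are illustrative assumptions rather than the Cougaar plugin code: `plan_inputs` holds the plan-time inputs u(k) computed via (2), `gain` is the offline-computed filter gain K, and the correction term follows the sign convention of (3) as written.

```python
class DemandEstimator:
    """Sketch of the model-based state estimator of Section 2.4.1."""
    def __init__(self, x0, plan_inputs, gain):
        self.x = x0            # current estimate x_hat(k|k)
        self.u = plan_inputs   # plan-time inputs u(k) from (2)
        self.K = gain          # filter gain K, computed offline
    def step(self, k, measured=None):
        pred = self.x + self.u[k]      # propagate the demand using (1)
        if measured is None:
            self.x = pred              # communication lost: coast on the plan, as in (4)
        else:
            self.x = pred + self.K * (pred - measured)  # correct per (3)
        return self.x
```

When communication returns, the difference between the coasted estimates and the actual demands is what the reconciliation step resolves.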

2.4.2 Moving Average Model

In the moving average approach the forecasted demand for day t, denoted F_t, for the time window i is given by

F_t = ( Σ_{k=t−i}^{t−1} D_k ) / i

where D_k is the demand for the k-th day. For example, let the time window be 4. Then the forecasted demand for day 10 is given by

F_10 = ( Σ_{k=6}^{9} D_k ) / 4 = (D_6 + D_7 + D_8 + D_9) / 4

We evaluate the effectiveness of the forecasting method with the following two error criteria, chosen depending on the requirements of the system; the corresponding optimal time window i is calculated using these criteria.

Error Criterion 1: difference between the average of the demands and the average of the forecasted values,

E_t = ( Σ_{k=1}^{t} D_k ) / t − ( Σ_{k=1}^{t} F_k ) / t

where t = 1, 2, 3, …, D_t denotes the demand on day t, F_t denotes the forecasted value for day t and E_t denotes the error at day t.

Error Criterion 2: difference between the daily demand and the daily forecast value,

E_t′ = |D_t − F_t|
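The moving-average forecast and a window search against Error Criterion 2 can be sketched as below. This is illustrative, not the paper's implementation: days are 0-indexed here, and minimizing the mean daily error over candidate windows is one reasonable selection rule among several.

```python
def moving_average_forecast(demand, window):
    """F_t = average of the previous `window` daily demands; forecasts
    exist once `window` days of history are available."""
    return {t: sum(demand[t - window:t]) / window
            for t in range(window, len(demand))}

def best_window(demand, candidates):
    """Pick the candidate window minimizing the mean daily error
    |D_t - F_t| (Error Criterion 2) over the historical demand."""
    def mean_error(w):
        f = moving_average_forecast(demand, w)
        return sum(abs(demand[t] - f[t]) for t in f) / len(f)
    return min(candidates, key=mean_error)
```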



3 Implementing Predictors as Cougaar<br />

Plugins<br />

A generalized predictor framework was adopted to<br />

implement predictors in Cougaar to ensure component<br />

reusability with minimal code replication. In the adopted<br />

approach each algorithm does not need a separate<br />

implementation but extends the predictor<br />

implementation interface to make use of the data<br />

collection <strong>and</strong> other functionalities. The predictors were<br />

implemented as plugins providing a lightweight<br />

implementation capability for different agents without<br />

any risk of corrupting or jamming other services<br />

provided by the architecture. The predictors are coupled<br />

with the application domain, i.e., in our case, the<br />

logistics application, <strong>and</strong> hence are not part of the<br />

Cougaar core release but are available with the logistics<br />

functionality package.<br />

3.1 The Generalized Predictor Framework<br />

The implementation of predictor framework has the<br />

following features <strong>and</strong> services:<br />

• It has a set of two plugins, one for the planning mode and the other for the execution mode. Each<br />

plugin differs in the type of tasks it subscribes to<br />

<strong>and</strong> the task processing logic.<br />

• The plugins automatically identify the customers<br />

for a given agent in which the plugin is inserted.<br />

Thus the predictor does not need to know in<br />

advance what agents represent its customers.<br />

• The plugins automatically identify the supply<br />

classes (e.g., Ammunition, Subsistence) and<br />

their respective items for each of its customers that<br />

the supplier predictor serves.<br />

• The plan time plugin subscribes to dem<strong>and</strong><br />

projection tasks <strong>and</strong> generates a dem<strong>and</strong>/day model<br />

for each unique customer-supplyclass-item<br />

relationship. It publishes the data structure or the<br />

model to the blackboard.<br />

• The execution time plugin subscribes to actual<br />

dem<strong>and</strong> tasks <strong>and</strong> generates dem<strong>and</strong>/day quantity<br />

values for each unique customer-supplyclass-item<br />

relationship.<br />

• The execution time plugin subscribes to the model<br />

generated by the plan time plugin to update the<br />

model with actual demand values (this feature is used by the linear state estimator approach). The plugin does not subscribe to the model when other approaches, such as the moving average, are used.<br />

• Different algorithm implementations can be hooked<br />

into the execution time plugin to do prediction <strong>and</strong><br />

updates on the model.<br />

• A predictor servlet implementation that can turn the predictor ON, OFF or SLEEP manually. This feature is used during testing in the event of low CPU availability, high memory usage, etc. SLEEP mode refers<br />

to a passive data collection mode with no<br />

communication loss whereas OFF mode refers to<br />

completely shutting down the predictor.<br />

• The execution time plugin can access the<br />

communication loss object to automatically change<br />

the predictor mode to ON/SLEEP.<br />

• The framework is rehydration compatible. This<br />

enables the agent to store the data model <strong>and</strong><br />

current state values to keep functioning normally<br />

when rehydrated.<br />

The above framework is robust <strong>and</strong> generic enough<br />

to be plugged into different agents <strong>and</strong> offers a plug <strong>and</strong><br />

play mechanism to hook up different algorithms for<br />

prediction. It should be noted that in the current<br />

implementation the type of algorithm (model-based,<br />

average-based etc) cannot be chosen at run time <strong>and</strong> is<br />

implemented as a rule in the society definition. Future<br />

work involves algorithm selection as a run-time<br />

capability.<br />
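The plug-and-play idea can be sketched as a small interface that the execution-time plugin drives. This is a hypothetical, simplified Python sketch; the real plugins are Java Cougaar components, and the class and method names below are ours, not Cougaar's:

```python
class PredictorAlgorithm:
    """Interface an algorithm implements to be hooked into the execution-time
    plugin; `key` stands for a unique customer-supplyclass-item relationship."""
    def update(self, key, day, actual_demand):
        raise NotImplementedError
    def predict(self, key, day):
        raise NotImplementedError

class MovingAveragePredictor(PredictorAlgorithm):
    """Moving-average algorithm plugged into the framework."""
    def __init__(self, window=4):
        self.window = window
        self.history = {}  # key -> list of observed daily demands
    def update(self, key, day, actual_demand):
        self.history.setdefault(key, []).append(actual_demand)
    def predict(self, key, day):
        recent = self.history[key][-self.window:]
        return sum(recent) / len(recent)

# During execution the plugin feeds observed demand tasks to the algorithm,
# then asks for a forecast once communications are lost.
p = MovingAveragePredictor(window=3)
for day, d in enumerate([10.0, 12.0, 14.0, 16.0]):
    p.update(("FSB", "Ammunition", "item-1"), day, d)
f = p.predict(("FSB", "Ammunition", "item-1"), day=4)  # (12 + 14 + 16) / 3 = 14.0
```

Swapping in a different algorithm (e.g. the linear state estimator) then only requires another `PredictorAlgorithm` subclass, which is the reuse-without-replication property described above.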

3.2 Reconciliation Mechanism<br />

Once the communications are back up, a<br />

reconciliation mechanism has been developed that<br />

resolves the differences between the actual <strong>and</strong><br />

predicted values to avoid any overages or shortages.<br />

Furthermore, as during run-time actual dem<strong>and</strong>s are<br />

often available for some period into the future, the<br />

mechanism only uses estimated dem<strong>and</strong>s for those days<br />

(after communication loss) where actual dem<strong>and</strong> was<br />

not available. The impact of reconciling the predictions<br />

with the actual dem<strong>and</strong> after communication is restored<br />

is significant since it eliminates/reduces the cascading<br />

effect/bullwhip effect up the supply chain. Also, it<br />

reduces the re-planning of tasks <strong>and</strong> resources which<br />

might have resulted due to shortages or overages.<br />
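In pseudocode terms the reconciliation rule is simply "prefer actual demand wherever it is known; keep the prediction only where no actual demand arrived." A hedged Python sketch (the data layout is our own; the actual mechanism operates on Cougaar tasks):

```python
def reconcile(actual, predicted, outage_days):
    """Resolve actual vs. predicted demand after communications are restored.

    `actual` and `predicted` map day -> quantity. Only outage days are
    adjusted, and estimates are kept only for days with no actual demand,
    since actuals are often available for some period into the future.
    """
    reconciled = {}
    for day in outage_days:
        if day in actual:
            reconciled[day] = actual[day]     # actual demand overrides the estimate
        else:
            reconciled[day] = predicted[day]  # keep the estimate otherwise
    return reconciled

# Outage on days 43-45; actual demand later arrives for days 43 and 44 only.
plan = reconcile(actual={43: 5.0, 44: 6.0},
                 predicted={43: 4.0, 44: 4.0, 45: 4.0},
                 outage_days=[43, 44, 45])
```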

4 Experimental Results<br />

4.1 Customer Predictor using Model-based<br />

State Estimation<br />

For our implementation and analysis, we consider two agents: FSB as the customer and MSB as the supplier. MSB provides FSB support for different supply classes, viz., Ammunition, Fuel, Subsistence and Consumable, which in turn have various types and items. In total, more than 100 items are supplied by MSB to FSB.<br />


Extensive testing <strong>and</strong> validation of the algorithms<br />

was performed on a number of societies with varying<br />

number of agents. The results obtained with the linear<br />

estimator approach seem very encouraging across<br />

different supply classes. Due to space limitations we show results for only a few of these items 1 . Figure 5 shows actual demand and predicted demand values for an Ammunition item. We can see that the predicted values are very close to the actual values and match the reorder periods of the actual demand. The communications were cut for 12 days (days 43-55), during which we can still see the predictions, but no actual demand tasks. In Figure 6 (for a Subsistence item)<br />

observe that the predicted dem<strong>and</strong> catches up with the<br />

actual dem<strong>and</strong>. The initial error is due to the initial<br />

model inaccuracies that are reduced as the model is<br />

updated with observed data.<br />

Figure 5. Actual vs. Predicted values for Ammo item<br />

Figure 6. Actual vs. Predicted values for Subsistence item<br />

Figure 7 <strong>and</strong> Figure 8 show the planned dem<strong>and</strong><br />

(model), actual dem<strong>and</strong>, predicted dem<strong>and</strong> without<br />

communication loss <strong>and</strong> predicted dem<strong>and</strong> with<br />

communication loss for BulkPOL item <strong>and</strong> Subsistence<br />

item respectively. As actual dem<strong>and</strong> values roll in, the<br />

measurement equations reduce the error causing the<br />

model to closely mimic the actual dem<strong>and</strong> pattern. With<br />

communication loss, the predictor uses the last updated<br />

value to forecast dem<strong>and</strong>.<br />
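The behavior described here, errors shrinking as measurements roll in and the last corrected state being held through an outage, can be illustrated with a simplified scalar stand-in for the linear state estimator. The fixed gain below is an assumption made for the illustration; the actual estimator uses Kalman-style measurement equations [2]:

```python
def track_demand(planned, actual, gain=0.5):
    """Correct a planned-demand model with observed demand.

    `planned` maps day -> model demand; `actual` maps day -> observed demand
    (missing entries represent outage days). A running offset plays the role
    of the state correction: each observation pulls it toward the actual
    pattern, and during a communication loss the last updated value is
    carried forward.
    """
    estimates, offset = {}, 0.0
    for day in sorted(planned):
        prior = planned[day] + offset
        if day in actual:
            offset += gain * (actual[day] - prior)  # measurement update
            estimates[day] = planned[day] + offset
        else:
            estimates[day] = prior                  # comm loss: hold last state
    return estimates

# The model plans 10/day but real demand runs at 12; day 3 is an outage day.
est = track_demand({1: 10.0, 2: 10.0, 3: 10.0}, {1: 12.0, 2: 12.0})
```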

Figure 7. With & W/o Comm. Loss Prediction<br />

for BulkPOL item<br />

1 Some of the experimental results in this paper are shown for Tiny<br />

<strong>and</strong> Small societies, which are smaller versions of Full society. Note<br />

that since the agents are generic, the dem<strong>and</strong> patterns are similar in all<br />

the societies.<br />

Figure 8. With & W/o Comm. Loss Prediction<br />

for Subsistence item


4.2 Moving Average Model Results<br />

Table 1 shows some typical data collected for the<br />

moving average model based predictor. Each column<br />

shows the dem<strong>and</strong> sent to the supplier (MSB) from the<br />

customer (FSB). On each execution day the customer<br />

sends the demand for about 20 days ahead. Suppose there is a communication loss on day 51; the predictor then forecasts the demand for day 52.<br />

The graphs below (Figure 9, Figure 10 <strong>and</strong> Figure<br />

11) show some of the results for the moving average<br />

based predictor. These results are a few representative<br />

examples of several runs performed across a number of<br />

Cougaar societies for various supply classes. The results<br />

show that the forecasted dem<strong>and</strong> is quite close to the<br />

original dem<strong>and</strong> thus validating the methodology of the<br />

predictor.<br />

Table 1. Dem<strong>and</strong> Requests from FSB to<br />

MSB<br />

Figure 9. Comparison of the forecasted<br />

dem<strong>and</strong> with the actual dem<strong>and</strong> for BulkPOL in<br />

Small society<br />

Figure 10. Comparison of the forecasted<br />

dem<strong>and</strong> with the actual dem<strong>and</strong> for BulkPOL in<br />

Tiny society<br />

Figure 11. Comparison of the forecasted<br />

dem<strong>and</strong> with the actual dem<strong>and</strong> for<br />

Ammunition in Tiny society


5 Conclusions <strong>and</strong> Future Research<br />

The generic predictor framework provides core<br />

functionality to Cougaar, making Cougaar a more<br />

survivable agent infrastructure. The predictor plugins<br />

can be invoked by any agent participating in the<br />

logistics supply process. Different algorithms can be<br />

hooked into the framework to use the data <strong>and</strong> other<br />

predictor services, hence eliminating the need to write<br />

predictor plugins from scratch. Initial studies show that<br />

the estimator models work well for items across<br />

different supply classes with the prediction almost<br />

mimicking the actual dem<strong>and</strong> values. However, due to<br />

the low variability <strong>and</strong> uncertainty of the observed<br />

dem<strong>and</strong>, the performance of predictors has not been<br />

extensively tested. Testing with variable dem<strong>and</strong> is<br />

currently in progress. Furthermore, as a certain class of<br />

predictors seems to perform better for a particular class<br />

of data, hybrid approaches to intelligently selecting the<br />

predictor algorithms based on data-type <strong>and</strong> dem<strong>and</strong> are<br />

being investigated. One such approach is to use a<br />

SMART predictor (Figure 12) as explained below. Here<br />

a smart predictor would monitor the dem<strong>and</strong> coming<br />

from the customers <strong>and</strong> choose which method should be<br />

used during the communication loss.<br />

We observe that each method (Model based state<br />

estimator <strong>and</strong> Moving-average) gives good forecasts for<br />

certain types of data. Thus, a SMART predictor that chooses the type of predictor to be used depending on the situation would result in better forecasts.<br />
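A minimal sketch of the selection logic such a SMART predictor could use follows. The scoring rule (mean absolute error on recent one-step-ahead forecasts) and the interface are our assumptions, and the demand series is a synthetic trend:

```python
def smart_select(history, predictors, window=10):
    """Score each candidate predictor on recent one-step-ahead forecasts and
    return the name of the best one to use during a communication loss.

    `history` is a list of daily demands; `predictors` maps a name to a
    function (past_demands, day) -> forecast.
    """
    days = range(len(history) - window, len(history))
    def mean_abs_error(name):
        f = predictors[name]
        return sum(abs(history[d] - f(history[:d], d)) for d in days) / window
    return min(predictors, key=mean_abs_error)

# On a steadily growing demand a last-value forecast lags less than a
# 4-day moving average, so the selector picks it.
choice = smart_select(
    [float(d) for d in range(20)],
    {"moving_average": lambda past, d: sum(past[-4:]) / 4,
     "last_value": lambda past, d: past[-1]})
```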

Figure 12. Description of SMART Predictor<br />

6 Acknowledgements<br />

This research was performed under the DARPA Ultralog effort and was supported by DARPA grant MDA972-1-1-0038 and Contract 2087-IAI-ARPA-0038. We would like to thank Dr. Mark Greaves, Marshall Brinn and Beth DePass for their support, comments and insightful discussions. We would also like to thank Lora Goldston for her support in the development of the reconciliation code and in the testing and integration of the predictor algorithms.<br />

7 References<br />

1. Ultra*Log Adaptive Logistics Defense Team Plan, Revised version 2.0, 2003.<br />

2. Welch, G. and Bishop, G., An Introduction to the Kalman Filter, Department of Computer Science, University of North Carolina at Chapel Hill, TR 95-041, March 2002.<br />

3. Moody, J. and Darken, C. J., Fast learning in networks of locally-tuned processing units, Neural Computation, Vol. 1, pp. 281-294, 1989.<br />

4. Rätsch, G., Onoda, T. and Müller, K.-R., Soft margins for AdaBoost, Machine Learning, Vol. 42, No. 3, pp. 287-320, March 2001.<br />

5. Hong, Y., Gautam, N., Kumara, S. R. T., Surana, A., Gupta, H., Lee, S., Narayanan, V., Thadakamalla, H., Greaves, M. and Brinn, M., Survivability of Complex System - Support Vector Machine Based Approach, Artificial Neural Networks in Engineering (ANNIE) Conf., 2002.<br />

6. Osuna, E. E., Freund, R. and Girosi, F., 1997, Support Vector Machines: Training and Applications, Technical Report AIM-1602, MIT A.I. Lab.<br />

7. Vapnik, V. N., 1998, Statistical Learning Theory, John Wiley & Sons, Inc., New York.<br />

8. Burges, C. J. C., 1998, A Tutorial on Support Vector Machines for Pattern Recognition, Knowledge Discovery and Data Mining, Vol. 2, No. 2, pp. 121-167.<br />

9. Cougaar Website (www.cougaar.org)<br />



SURVIVABILITY OF COMPLEX SYSTEM – SUPPORT VECTOR<br />

MACHINE BASED APPROACH<br />

Y. HONG, N. GAUTAM, S. R. T. KUMARA, A. SURANA, H. GUPTA,<br />

S. LEE, V. NARAYANAN, H. THADAKAMALLA<br />

The Dept. of <strong>Industrial</strong> Engineering, The Pennsylvania State University,<br />

University Park, PA 16802<br />

M. BRINN<br />

BBN Technologies, Cambridge, MA<br />

M. GREAVES<br />

<strong>DARPA</strong> IXO/JLTO, Arlington, VA 22203<br />

ABSTRACT<br />

Logistic systems, which are inherently distributed, can in general be classified as complex systems. Survivability of these systems under varying environmental conditions is of paramount importance. Different environmental conditions in<br />

which the logistic system resides are translated into several stresses. These in<br />

turn will manifest as internal stresses. Logistic systems can be modeled as a<br />

collection of software agents. Each agent’s behavior is a result of the stresses<br />

imposed. Predicting the agents’ collective behavior is of paramount importance<br />

to ensure survivability. Analytical modeling of such systems becomes very difficult, if not impossible. In this paper, we study a supply chain in which a<br />

real life scenario is used. We implement the supply chain in Cougaar<br />

(Cognitive Agent Architecture developed by <strong>DARPA</strong>) <strong>and</strong> develop a predictor,<br />

based on Support Vector Machine. We report our methodology <strong>and</strong> results with<br />

real-life experiments <strong>and</strong> stress scenarios.<br />

INTRODUCTION<br />

Logistic systems can be classified as complex systems (Choi et al., 2001,<br />

Baranger, http://necsi.org/projects/baranger/cce.pdf). Logistic systems have<br />

many components such as suppliers <strong>and</strong> distributors at several stages. These<br />

components are distributed geographically but interdependent. At each<br />

component some form of nonlinear decision making process goes on. Typically<br />

the system would respond in a stable manner to external disturbances. But due to<br />

information delay, inherent feedback structure <strong>and</strong> nonlinear components<br />

unstable phenomena can arise which may ultimately manifest as chaotic<br />

behavior. Efficient resource allocation <strong>and</strong> collective oscillations (of say<br />

inventory levels) are some examples of emergent behavior shown by supply<br />

chains. They have structure at many scales; each component itself represents a simple supply chain. The components compete due to resource limitations but collaborate/cooperate to maximize their gains, which is another characteristic feature of a complex system.<br />

The survivability of logistic systems under varying environmental<br />

conditions is of paramount importance. Survivability is itself an emergent property of a logistic system, and it represents the ability of the system<br />



to function critically even under adverse conditions. We refer to these adverse<br />

conditions as stresses. In order to improve the survivability, agents should detect<br />

stresses <strong>and</strong> take appropriate actions so that they can adapt to stress conditions.<br />

Due to the lack of analytical tools for predicting the emergent behavior of a complex system from its component behavior, simulation is the primary tool for designing and optimizing such systems. In this paper, we would like to show how an<br />

agent learns to detect stresses as the first step towards improving survivability.<br />

We implemented the supply chain in COUGAAR (Cognitive Agent Architecture<br />

developed by <strong>DARPA</strong>) as a simulation model. Through an extensive design of<br />

experiments we subjected the supply chain to various stress conditions <strong>and</strong> made<br />

the agents learn to predict them using Support Vector Machines.<br />

THE SMALL SUPPLY CHAIN (SSC)<br />

We built a multi-agent system for a small supply chain using Cougaar<br />

version 8.6.0 (http://www.cougaar.org). Cougaar is an open source multi-agent<br />

architecture <strong>and</strong> is appropriate for modelling large-scale distributed supply chain<br />

management applications. We call our supply chain system ‘the Small Supply<br />

Chain (SSC)’.<br />

Each agent in SSC represents an individual organization such as a retailer<br />

<strong>and</strong> a supplier in the supply chain. Figure 1 represents dem<strong>and</strong> flows in this<br />

small supply chain.<br />

[Figure 1 diagram: demand flows linking Suppliers 1-3, Factories 1-2, Warehouses 1-2, Distribution Centers 1-2, Wholesaler 1 and Retailers 1-3.]<br />

Figure 1. Demand flows in the Small Supply Chain (SSC)<br />

STRESS TYPES AND LEVELS<br />

After some preliminary experiments <strong>and</strong> observations we used the<br />

following stress conditions to show our approach.<br />

Stress 1. Changing OPTEMPO. The SSC works according to a Logistics<br />

Plan. The plan for each agent is prespecified. Every activity of each agent has an<br />

OPTEMPO value which represents the level of the activity. Changing<br />

OPTEMPOs can result in a different plan. OPTEMPO can have one of the three<br />

values, ‘low’, ‘medium’ <strong>and</strong> ‘high’.<br />

Stress 2. Adding <strong>and</strong> Dropping agents. Dropping agents can represent<br />

situations such as communication loss due to physical accidents or cyber attacks.<br />

When an agent is dropped, its supporting agents will not receive tasks from the dropped agent and its retailer agents will not receive responses from the<br />



dropped agent. These changes will affect planning significantly. By adding new<br />

retailer agents, we can evaluate how sensitive the SSC is to scalability. The addition of a new retailer agent increases the load on the other supplier agents.<br />

PREDICTORS<br />

In Cougaar, every agent has its own blackboard. During logistics planning,<br />

the intermediate planning results are continuously accumulated on that<br />

blackboard. Therefore, by observing the blackboard we can recognize the<br />

planning state. Our idea is to detect stresses by observing the blackboard. Each<br />

agent should have the ability to detect the stresses coming from outside so that it<br />

can make a decision autonomously to h<strong>and</strong>le the stresses.<br />

In this work, for each agent we build a separate supervised learning model.<br />

Many types of task classes are instantiated on the blackboard. The collection of<br />

the number of tasks of each type represents the state of the agent. A task is a<br />

Java class in Cougaar that represents a logistic requirement or activity. Tasks<br />

are generated successively along the supply chain starting from the tasks of the<br />

retailers. The learning model takes the state of the agent as input feature <strong>and</strong><br />

predicts the corresponding stress type <strong>and</strong> level. The pattern recognition model -<br />

predictor - is built using the Support Vector Machine.<br />

In order to prepare training <strong>and</strong> test data, the blackboard of each agent is<br />

monitored and data is stored into a database during the experiments by a monitoring facility, which consists of a specialized Plugin and a separate server machine. The Plugin is a Java class provided by Cougaar. The pattern recognition model is<br />

trained by the data from the database off-line.<br />

SUPPORT VECTOR MACHINES (SVM)<br />

A Support Vector Machine is a pattern recognition method. It has been popular since the mid-90s because of its theoretical clarity and good performance. Many pattern recognition applications have been reported since this theory was developed by Vapnik (Müller, et al., 2001), which also exemplify its superiority over similar techniques. Moghaddam and Yang (2002) applied SVM to appearance-based gender classification and showed that it is superior to other classifiers such as the nearest-neighbor and radial basis function classifiers. Liang and Lin (2001) showed that SVM has better performance than conventional neural networks in the detection of delayed gastric emptying. For an exhaustive review, we refer the reader to (Burges, 1998), (Chapelle et al., 1999) and (Müller et al., 2001).<br />

The main idea of SVM is to separate the classes with a surface that<br />

maximizes the margin between them. This is an approximate implementation of<br />

the Structural Risk Minimization induction principle (Osuna, et al., 1997). To<br />

construct a classifier for a given data set, an SVM solves a quadratic programming problem with each variable corresponding to a data point. When the size of the data set is large, it requires special techniques such as decomposition to handle the large number of variables. Basically, the SVM is a linear classifier. Thus, in order to handle a dataset that is not separable by a linear function, inner-product kernel functions are used. The role of the inner-product kernel functions is to convert an inner product of low dimensional data<br />



points into a corresponding inner product in high dimensional space without<br />

actual mapping. The principle of the mapping is based on Mercer’s theorem<br />

(Vapnik 1998). By doing so, the SVM handles nonlinearly separable cases.<br />

The selection of kernel functions depends on the problem; an appropriate function should be chosen by performing experiments. Other control parameters for the SVM are the extra cost for errors (represented as C), the loss sensitivity constant (ε_insensitivity) and the maximum number of iterations. The extra cost for errors, C, is a cost assigned to training errors in nonlinearly separable cases. A larger C corresponds to assigning a higher penalty to errors (Burges 1998). The loss sensitivity constant (ε_insensitivity) represents the allowable error range for the prediction values.<br />

SVMs are basically developed as binary classifiers. Currently a lot of<br />

research is being done in the area of multi-class SVM. We use BSVM 2.0 which<br />

is the multi-class SVM program suggested by Hsu <strong>and</strong> Lin (2002).<br />
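Two pieces of this pipeline are easy to make concrete: the input features (counts of each task type on the blackboard) and the RBF kernel with gamma = 1/(number of input features), which is the BSVM 2.0 default used in the TRAINING section below. The sketch is ours: the task lists are invented, and it shows only the feature and kernel computations, not BSVM's full quadratic-programming training:

```python
import math

def task_count_features(blackboard_tasks, task_types):
    # State of an agent: one count per task type, e.g. ProjectSupply (PS)
    # and ProjectWithdraw (PW) for the Retailer 2 agent.
    return [sum(1 for t in blackboard_tasks if t == typ) for typ in task_types]

def rbf_kernel(x, y):
    # Radial basis function kernel with gamma = 1 / (number of input features)
    gamma = 1.0 / len(x)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

x = task_count_features(["PS", "PS", "PW", "PS"], ["PS", "PW"])  # [3, 1]
y = task_count_features(["PS", "PW", "PW"], ["PS", "PW"])        # [1, 2]
k = rbf_kernel(x, y)  # exp(-((3-1)^2 + (1-2)^2) / 2) = exp(-2.5)
```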

EXPERIMENT CONDITIONS<br />

We simulated the SSC under various stress conditions. Twelve stress conditions are made through combinations of the following factors: the number of Retailer 1 agents (zero, one, two), the OPTEMPO of Retailer 2 (LOW, HIGH) and the OPTEMPO of Retailer 3 (LOW, HIGH).<br />

Of a total of 313 data sets, 219 were used for training and 94 for testing the prediction power. The conditions and numbers of experiments are shown in Table 1.<br />

Condition Retailer 1 Retailer 2 Retailer 3 Training Test Total<br />

1 Zero LOW LOW 25 11 36<br />

2 Zero HIGH LOW 19 8 27<br />

3 Zero LOW HIGH 18 8 26<br />

4 Zero HIGH HIGH 17 7 24<br />

5 One LOW LOW 29 13 42<br />

6 One HIGH LOW 17 7 24<br />

7 One LOW HIGH 18 7 25<br />

8 One HIGH HIGH 18 8 26<br />

9 Two LOW LOW 21 9 30<br />

10 Two HIGH LOW 12 5 17<br />

11 Two LOW HIGH 13 6 19<br />

12 Two HIGH HIGH 12 5 17<br />

Gr<strong>and</strong> Total - - - 219 94 313<br />

Table 1. The stress condition <strong>and</strong> number of experiments<br />

TRAINING<br />

Through preliminary studies, apart from the experiments tabulated above, we found that not all stress conditions affect all agents. Thus, we prepared different classification definitions for the training set depending on the agent (see Table 2).<br />

As the classification definitions are different for different agents, the input<br />

features are also different. The tasks used as input features in each agent are<br />

shown in table 3.



The option ‘multi-class bound-constrained support vector classification’ in BSVM 2.0 is selected. For the other settings, we use the default options of BSVM 2.0, such as the radial basis function kernel with gamma = 1/(the number of input features).<br />

Retailer 2, Warehouse 2:<br />
Class 1: condition 1,3,5,7,9,11 (Retailer 2 LOW)<br />
Class 2: condition 2,4,6,8,10,12 (Retailer 2 HIGH)<br />

Retailer 1:<br />
Class 1: condition 1,2,3,4 (zero Retailer 1)<br />
Class 2: condition 5,6,7,8 (one Retailer 1)<br />
Class 3: condition 9,10,11,12 (two Retailer 1)<br />

Factory 2, Supplier 2:<br />
Class 1: condition 1,3 (Retailer 2 LOW at zero Retailer 1)<br />
Class 2: condition 2,4 (Retailer 2 HIGH at zero Retailer 1)<br />
Class 3: condition 5,6,7,8,9,10,11,12 (all other cases)<br />

Warehouse 1, Factory 1:<br />
12 classes; each condition is regarded as one class<br />

Retailer 3, Distribution Center 2:<br />
Class 1: condition 1,2,5,6,9,10 (LOW Retailer 3)<br />
Class 2: condition 3,4,7,8,11,12 (HIGH Retailer 3)<br />

Distribution Center 1, Supplier 1:<br />
Class 1: condition 1,2 (LOW Retailer 3 at zero Retailer 1)<br />
Class 2: condition 3,4 (HIGH Retailer 3 at zero Retailer 1)<br />
Class 3: condition 5,6 (LOW Retailer 3 at one Retailer 1)<br />
Class 4: condition 7,8 (HIGH Retailer 3 at one Retailer 1)<br />
Class 5: condition 9,10 (LOW Retailer 3 at two Retailer 1)<br />
Class 6: condition 11,12 (HIGH Retailer 3 at two Retailer 1)<br />

Wholesaler 1:<br />
Class 1: condition 9,10,11,12 (two Retailer 1)<br />
Class 2: condition 1,2,3,4,5,6,7,8 (all other conditions)<br />

Table 2. The class definitions by agent<br />

Agent Features Agent Features<br />

Retailer 2 PS, PW Warehouse 1 PS, PW, OS<br />

Distribution Center 1 W, OPS, OS Factory 1 TP, W, OPS, OS<br />

Retailer 3 PS, PW Supplier 1 TR, TP, OTP<br />

Distribution Center 2 PS, PW Retailer 1 PS, PW<br />

Factory 2 TP Warehouse 2 TP, W<br />

Supplier 2 OTP Wholesaler 1 S<br />

* PS = ProjectSupply, PW = ProjectWithdraw, W =Withdraw, TP = Transport,<br />

TR = Transit, S = Supply, OPS = ProjectSupply coming from outside,<br />

OS = Supply coming from outside, OTP = Transport coming from outside<br />

Table 3. The input features by agents<br />

Agent Success rate Agent Success rate<br />

Retailer 2 100% Warehouse 1 100%<br />

Retailer 1 100% Distribution Center 1 100%<br />

Retailer 3 100% Factory 1 22.34%<br />

Distribution Center 2 100% Supplier 1 40.43%<br />

Factory 2 100% Warehouse 2 84.04%<br />

Supplier 2 100% Wholesaler 1 86.17%<br />

Table 4. The success rate to classify the stress condition at each agent<br />

RESULTS<br />

Table 4 contains the test results. Overall performance is good. In addition, we can see that the agents near the retailers in the supply chain can detect stresses well. The Warehouse 1 agent can detect all the stress types exactly even though it is far from the retailers (see Figure 1).<br />

CONCLUSIONS<br />

We have shown an effective application of a pattern recognition model for detecting stresses by observing the internal state of each agent. Each agent has its own SVM since the influence of the same stress on different agents can differ. Some agents near the retailers can detect stresses very well. However, it is hard to detect the influence of a stress on agents which are far from the retailers. The overall performance of each agent's predictor is good. Constructing the capability for stress detection is the first step towards improving the survivability of a multi-agent system. This result is important because we can pursue further research on how to dampen the effect of stresses based on the results of this study. Based on the currently detected state, each agent can change its behavior (ordering or planning) to adapt to stress conditions without serious performance degradation of the overall supply chain. In addition, our approach is generally useful because it is very hard to model a complex system analytically.<br />

ACKNOWLEDGEMENTS<br />

Support for this research was provided by <strong>DARPA</strong> (Grant#: MDA 972-01-<br />

1-0563) under the UltraLog program.<br />

REFERENCES<br />

Baranger, Michel, “Chaos, Complexity, <strong>and</strong> Entropy – A physics talk for non-physicists,” MIT-<br />

CTP-3112, http://necsi.org/projects/baranger/cce.pdf.<br />

Burges, C. J. C., 1998, “A Tutorial on Support Vector Machines for Pattern Recognition,”<br />

Knowledge Discovery <strong>and</strong> Data Mining, Vol. 2, No. 2, pp. 121-167.<br />

Chapelle, O., Haffner, P. <strong>and</strong> Vapnik, V. N., 1999, “Support Vector Machines for Histogram-Based<br />

Image Classification,” IEEE Transactions on Neural Networks, Vol. 10, No. 5, pp. 1055-1064.<br />

Choi, T., Dooley, K. and Rungtusanatham, M., 2001, “Supply Networks and Complex Adaptive<br />

Systems: control versus emergence,” Journal of Operations Management, Vol. 19, pp 351-366.<br />

Dooley, K., 2002, “Simulation Research Methods,” Companion to Organizations, Joel Baum (ed.),<br />

London: Blackwell, pp. 829-848.<br />

Hsu, Chih-Wei <strong>and</strong> Lin, Chih-Jen, 2002, “A Comparison of Methods for Multiclass Support Vector<br />

Machines,” IEEE Transactions on Neural Networks, Vol. 13, No. 2, pp. 415-425.<br />

Liang, H. <strong>and</strong> Lin, Z., 2001, “Detection of Delayed Gastric Emptying from Electrogastrograms with<br />

Support Vector Machine,” IEEE Transactions on Biomedical Engineering, Vol. 13, No. 2, pp.<br />

415-425.<br />

Moghaddam, B. <strong>and</strong> Yang, M., 2002, “Learning Gender with Support Faces,” IEEE Transactions on<br />

Pattern Analysis <strong>and</strong> Machine Intelligence, Vol. 24, No. 5, pp. 707-711.<br />

Müller, K., Mika, S., Rätsch, G., Tsuda, K. and Schölkopf, B., 2001, “An Introduction to Kernel-Based<br />

Learning Algorithms,” IEEE Transactions on Neural Networks, Vol. 12, No. 2, pp. 181-202.<br />

Osuna, E. E., Freund R. <strong>and</strong> Girosi, F., 1997, Support Vector Machines: Training <strong>and</strong> Applications,<br />

Technical <strong>Report</strong> AIM-1602, MIT A.I. Lab.<br />

Vapnik, V. N., 1998, Statistical Learning Theory, John wiley & sons, Inc, New York.


A Framework for Performance Control of<br />

Distributed Autonomous Agents<br />

Nathan Gnanasambandam, Seokcheon Lee, Soundar R.T. Kumara and Natarajan Gautam

310 Leonhard Building, The Pennsylvania State University, University Park, PA, 16802, USA<br />

Abstract<br />

We propose an autonomous <strong>and</strong> scalable queueing theory-based methodology to control the performance of a hierarchical network<br />

of distributed agents. Multi-agent systems (MAS) such as supply chains functioning in highly dynamic environments<br />

need to achieve maximum overall utility during operation. Hence, the objective of the control framework is to identify the<br />

trade-offs between quality <strong>and</strong> performance <strong>and</strong> adaptively choose the operational settings to posture the MAS for better utility.<br />

By formulating the MAS as an open queueing network with multiple classes of traffic, we evaluate the performance and

subsequently the utility, from which we identify the control alternative for a localized, multi-tier zone.<br />

Keywords: Queueing Network, Multi-Agent Systems, Performance control.<br />

1 Introduction<br />

With the growing adoption of agent-oriented software systems [1] and the increased deployment of distributed multi-agent systems (DMAS) for numerous emerging applications such as computational grids, e-commerce hubs, supply chains and sensor networks, we are faced with large-scale distributed agents whose performance needs to be estimated and controlled. Often, these DMAS operate in dynamic and stressful environmental conditions of one type or another, in which the MAS as a whole must survive. While the notion of survival necessitates adaptivity to diverse conditions along the dimensions of performance, security and robustness, delivering the correct proportion of these quantities can be quite a challenge. In this paper, we address a piece of this problem by building an autonomous performance control framework for MAS.

While building large multi-agent societies (such as UltraLog [2]), it is desirable that the associated adaptation framework<br />

be generic <strong>and</strong> scalable. One way to do this is to utilize a methodology similar to Jung <strong>and</strong> Tambe [3], where the bigger<br />

society is composed of smaller building blocks, in this case corresponding to communities of agents. Although strategies for co-operativeness and distributed POMDPs have been utilized to analyze performance in [3], an increase in the number of variables in each agent can quickly render the POMDP approach ineffective even in reasonably sized agent communities, due to the state-space explosion problem. In [4], Rana and Stout identify data-flows in the agent network and model scalability with Petri

nets, but their focus is on identifying synchronization points, deadlocks <strong>and</strong> dependency constraints with coarse support for<br />

performance metrics relating to delays <strong>and</strong> processing times for the flows. In a recent architecture for autonomic computing,<br />

Tesauro et al. [5] build a real-time MAS-based framework that is self-optimizing based on application-specific utility. While<br />

[3, 4] motivate the need to estimate performance of large DMAS using a building block approach, [5] justifies the need to use<br />

domain specific utility whose basis should be the network’s service-level attributes such as delays, utilization <strong>and</strong> response<br />

times.<br />

We believe that by using queueing theory we can analyze data-flows within the agent community with greater granularity in<br />

terms of processing delays <strong>and</strong> network latencies <strong>and</strong> also capitalize on using a building block approach by restricting the model<br />

to the agent community. Queueing theory has been widely used in networks <strong>and</strong> operating systems [6]. However, the authors<br />

have not seen the application of queueing to MAS modeling and analysis. Since agents lend themselves to being conveniently represented as a network of queues, we concentrate on engineering a queueing theory-based adaptation (control) framework to

enhance the application-level performance.<br />

Inherently, the DMAS can be visualized as a multi-layered system, as depicted in Figure 1a. The top-most layer is where the application resides, usually conforming to some organization such as a mesh, a tree, etc. The infrastructure layer not only abstracts away many of the complexities of the underlying resources (such as CPU and bandwidth) but, more importantly, provides services (such as message transport) and agent-to-agent aiding services (such as naming and directory services). The bottom-most layer is where the actual computational resources, memory and bandwidth reside. Most studies in the literature do not make this distinction, and as such control is not executed in a layered fashion. Some studies, such as [7, 8], consider controlling attributes in the physical or infrastructural layers so that certain properties (e.g., survivability) result and/or the facilities provided by these layers are taken advantage of. Often, this requires rewiring the physical layer, the availability of an infrastructure-level service, or the ability of the application to share information with underlying layers in a timely fashion for control purposes. In this work, we consider control only through application-level trade-offs, such as quality of service versus performance, and assume that infrastructure-level services (such as load balancing or priority scheduling) and physical-level capabilities (such as rewiring) are not available. This does not exclude the possibility that in future we can combine all approaches to achieve multi-layered control.


Our contribution in this work is to combine queueing analysis and application-level control to engineer a generic framework that is capable of self-optimizing its domain-specific utility.

(a) Operational Layers  (b) Framework Architecture

Figure 1: MAS framework

1.1 Problem Statement<br />

Typically, the top-most layer in the computing infrastructure (here the DMAS-based application) possesses maximum transparency to the system’s overall utility, control knobs and domain knowledge. The utility of the application is the combined benefit along several conflicting (e.g., completeness and timeliness [9, 2]) and/or independent (e.g., confidentiality and correctness [9, 2]) dimensions, which the application tries to maximize in a best-effort sense through trade-offs. Understandably, in a distributed

multi-agent setting, mechanisms to measure, monitor <strong>and</strong> control this multi-criteria utility function become hard <strong>and</strong> inefficient,<br />

especially under conditions of scale-up. Given that the application does not change its high-level goals, task-structure or<br />

functionality in real-time, it is beneficial to have a framework that assists in the choice of operational modes (or opmodes) in a<br />

distributed way. Hence, the research objective of this work is to design <strong>and</strong> develop a generic, real-time framework for DMAS,<br />

that utilizes a queueing network model for performance evaluation <strong>and</strong> a learned utility model to select an appropriate control<br />

alternative.<br />

1.2 Solution Methodology<br />

The focus of this research is to autonomously adjust the application-level parameters (or opmodes) of the distributed agents in a reasonably sized domain called an agent community. The choice of opmodes is based on the perceived application-level utility of the combined system (i.e. the whole community) that current environmental conditions allow. We assume that the application’s utility depends on the choice of opmodes at the agents

constituting the community because the opmodes directly affect the performance. A queueing network model is utilized to<br />

predict the impact of DMAS control settings <strong>and</strong> environmental conditions on steady-state performance (in terms of end-to-end<br />

delays in tasks), which in turn is used to estimate the application-level utility. After evaluating <strong>and</strong> ranking several alternatives<br />

from among the feasible set of operational settings on the basis of utility, the best choice is picked.<br />

2 Architecture of the Performance Control Framework<br />

We implement the performance control framework for the Continuous Planning and Execution (CPE) Society, which is a command and control MAS built on Cougaar (the DARPA agent platform [10]). While we describe the functionality of the components

of the framework (Figure 1b) in this section, we highlight the autonomic capabilities that are built into the system.<br />

2.1 Overview of Application (CPE) Scenario<br />

In our set-up, the primary building block consists of three tiers in the application layer. CPE embodies a complete military<br />

logistics scenario with agents emulating roles such as suppliers, consumers <strong>and</strong> controllers all functioning in a dynamic <strong>and</strong><br />

hostile (destructive) external environment. Embedded in the hierarchical structure of CPE are both command and control,

<strong>and</strong> superior-subordinate relationships. The subordinates compile sensor updates <strong>and</strong> furnish them to superiors. This enables<br />

the superiors to perform the designated function of creating plans (for maneuvering <strong>and</strong> supply) as well as control directives


for downstream subordinates. Upon receipt of plans, the subordinates execute them. The supply agents replenish consumed<br />

resources periodically. This high level system definition is the functionality of CPE that it seeks to perform repeatedly with<br />

maximum utility while residing in the application layer.As part of the application-level adaptivity features, a set of opmodes<br />

are built into the system. Opmodes allow individual tasks (such as plans, updates, control) to be executed at different qualities<br />

or to be processed at different rates. We assume that TechSpecs (machine-readable technical specifications of agent behavior) for the CPE scenario are available to be utilized by the control

framework. The framework that accomplishes the aforementioned goal of CPE in a distributed fashion while performing at a<br />

maximum possible level of utility is represented in Figure 1b.<br />

2.2 Self-Monitoring Capability<br />

Any system that wants to control itself should possess a clear specification of the scope of the variables it has to monitor. The<br />

TechSpecs is a distributed structure that supports this purpose by housing all variables, X, that have to be monitored in different<br />

portions of the community (or sub-system). The data/statistics collected in a distributed way are then aggregated by the top-level controller that each community possesses, to assist in the choice of control alternatives.

The attributes that need to be tracked are formulated in the form of measurement points (MP ). The measurement points are<br />

“soft” storage containers residing inside the agents, and contain information on what should be measured, where, and how frequently. Each agent can look up its own TechSpecs and, from time to time, forward that information to its superior. The superior can

analyze this information (e.g., calculate statistics such as delay and delay-jitter) and/or add to this information and forward it again.

We have measurement points for time-periods, time-stamps, operating-modes, control <strong>and</strong> generic vector-based measurements.<br />

These measurement points can be chained for tracking information for a flow such that information is tagged-on at every point<br />

the flow traverses. For the sake of reliability, the information that is contained in these agents is replicated at several points, so<br />

that when packets do not arrive on time, or do not arrive at all, previously stored packets can be utilized for control purposes.
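As an illustration, the chained measurement-point mechanism described above might be sketched as follows. This is a minimal sketch under our own assumptions; the class names, fields and the spread-based jitter statistic are hypothetical and are not taken from the UltraLog/Cougaar implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Sample:
    agent: str       # agent that recorded the value
    attribute: str   # e.g. "delay", "time-stamp", "opmode"
    value: float

@dataclass
class MeasurementPoint:
    """A 'soft' storage container residing inside an agent."""
    agent: str
    samples: List[Sample] = field(default_factory=list)

    def record(self, attribute: str, value: float) -> None:
        self.samples.append(Sample(self.agent, attribute, value))

    def forward(self, superior: "MeasurementPoint") -> None:
        # Chaining: the superior tags the subordinate's samples onto its
        # own store, so information accumulates at every point a flow
        # traverses (and can itself be forwarded again up the hierarchy).
        superior.samples.extend(self.samples)

def delay_statistics(mp: MeasurementPoint) -> Tuple[float, float]:
    """Aggregate statistics a superior might compute: mean delay and a
    simple spread-based delay-jitter (assumes at least one delay sample)."""
    delays = [s.value for s in mp.samples if s.attribute == "delay"]
    return sum(delays) / len(delays), max(delays) - min(delays)
```

Replication for reliability would then amount to forwarding the same samples to more than one superior.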

2.3 Self-Modeling Capability<br />

One of the key features of this framework is that it has the capability to choose a type of model for representing itself for the<br />

purpose of performance evaluation. The system is equipped with several queueing model templates that it can utilize to analyze itself. The type of model that is utilized at any given moment is based on accuracy, computation time and history of effectiveness. For example, a simulation-based queueing model may be very accurate but unable to evaluate enough

alternatives in limited time, in which case an analytical model (such as BCMP, QNA [11]) is preferred.<br />

The inputs to the model builder are the flows that traverse the network (F), the types of packets (T) and the current configuration of the network. If, at a given time, we know that there are n agents interconnected in a hierarchical fashion, then the role of this unit is to represent that information in the required template format (Q). The current number of agents is known to the

controller by tracking the measurement points. For example, if there is no response from an agent for a sufficient period of time,<br />

then for the purpose of modeling, the controller may assume the agent to be non-existent. In this way dynamic configurations can be handled. On the other hand, TechSpecs do mandate connections according to superior-subordinate relationships, thereby maintaining the flow structure at all times. Once the modeling is complete, the MAS has the capability to analyze its current performance using the selected type of model. The MAS does have the flexibility to choose another model template for a

different iteration.<br />
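The template-selection trade-off described above (accuracy versus computation time versus history of effectiveness) can be sketched as follows. The particular feasibility test and the equal-weight blend of accuracy and effectiveness are our own illustrative assumptions, not the paper's exact criterion:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ModelTemplate:
    name: str            # e.g. "BCMP", "QNA", "Jackson", "M/G/1", "simulation"
    accuracy: float      # predictive accuracy observed so far, in [0, 1]
    eval_time: float     # seconds to evaluate one control alternative
    effectiveness: float # running score of past control decisions, in [0, 1]

def select_template(templates: List[ModelTemplate],
                    n_alternatives: int,
                    time_budget: float) -> ModelTemplate:
    """Keep only templates that can evaluate every control alternative
    within the time budget, then rank the survivors by a blend of
    accuracy and historical effectiveness."""
    feasible = [t for t in templates
                if t.eval_time * n_alternatives <= time_budget]
    if not feasible:                 # nothing fits: fall back to the fastest
        return min(templates, key=lambda t: t.eval_time)
    return max(feasible,
               key=lambda t: 0.5 * t.accuracy + 0.5 * t.effectiveness)
```

Under a tight budget this prefers a fast analytical template (e.g. Jackson) over an accurate but slow simulation, matching the preference stated above.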

2.4 Self-Evaluating Capability<br />

The evaluation capability, the first step in control, allows the MAS to examine its own performance under a given set of plausible<br />

conditions. This prediction of performance is used for the elimination of control alternatives that may lead to instabilities. Our<br />

notion of performance evaluation is similar to [5]. While Tesauro et al. [5] compute the resource level utility functions (based<br />

on the application manager’s knowledge of system performance) that can be combined to obtain a globally optimal allocation of<br />

resources, we predict the performance of the MAS as a function of its operating modes in real-time (within Queueing Model) <strong>and</strong><br />

then use it to calculate its global utility. By introducing a level of indirection, we may get some desirable properties (explained<br />

in Section 4.2) because we separate an application’s domain-specific utility computation from performance prediction (or<br />

analysis). This theoretically enables us to predict the performance of any application whose TechSpecs are clearly defined <strong>and</strong><br />

then compute the application-specific utility. In both cases, control alternatives are picked based on best-utility. We discuss<br />

the notion of control alternatives in Section 2.5. Also, our performance metrics (and hence utility) are based on service-level attributes such as end-to-end delay and latency, which is desirable for autonomic systems [5].

When plan, update <strong>and</strong> control tasks (as mentioned in Section 2.1) flow in this heterogeneous network of agents in predefined<br />

routes (called flows), the processing <strong>and</strong> wait times of tasks at various points in the network are not alike. This is because<br />

the configuration (number of agents allocated on a node), resource availability (load due to other contending software) and environmental conditions at each agent are different. In addition, the tasks themselves can be of varying qualities or fidelities that affect the time taken to process each task. Under these conditions, performance is estimated on the basis of the end-to-end

delay involved in a “sense-plan-respond” cycle.


Table 1: Notation

    Symbol    Description
    N         Total # of nodes in the community
    λ_ij      Average arrival rate of class j at node i
    1/µ_ijk   Average processing time of class j at node i at quality k
    M         Total number of classes
    T_i       Routing probability matrix for class i
    W_ijk     Steady-state waiting time for class j at node i at quality k
    Q_ij      Set of qualities at which a class j task can be processed at node i

The primary performance prediction tool that we use is the Queueing Network Model (QNM) [6]. The QNM is the representation of the agent community in the queueing domain. As the first step of performance estimation, the agent community needs to be translated into a queueing network model. Table 1 provides the notation used in this section. Inputs and outputs

at a node are regarded as tasks. The rate at which tasks of class j are received at node i is captured by the arrival rate λ_ij. Actions by agents consume time, so they get abstracted as processing rates µ_ij. Further, each task can be processed at a quality k ∈ Q_ij, which causes the processing rates to be represented as µ_ijk. Statistics of processing times are maintained at each agent in the Performance Database (PDB) to arrive at a linear regression model between quality k and µ_ijk. Flows get associated with classes of traffic denoted by the index j. If a connection exists between two nodes, this is converted to a transition probability p_ij, where i is the source and j is the target node. Typically, we consider flows originating from the environment,

getting processed and exiting the network, making the agent network an open queueing network [6]. Since we may typically have multiple flows through a single node, we consider multi-class queueing networks where the flows are associated with a class. Performance metrics such as delays for the “sense-plan-respond” cycle are captured in terms of average waiting times,

W_ijk. As mentioned earlier, TechSpecs is a convenient place where information such as flows and Q_ij can be embedded.

The choice of QNM depends on the number of classes, arrival distribution <strong>and</strong> processing discipline as well as a suggestion<br />

C by the DMAS controller that makes this choice based upon history of prior effectiveness. Some analytical approaches to<br />

estimate performance can be found in [6, 11]. In the context of agent networks, Jackson <strong>and</strong> BCMP queueing networks have<br />

been applied to estimate the performance in [12]. By extending this work we provide several templates of queueing models<br />

(such as BCMP, Whitt’s QNA [11], Jackson, M/G/1, a simulation) that can be utilized for performance prediction.<br />
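To make the translation concrete, the following is a minimal sketch of the open-network computation for a single-class, single-quality instance of the model above, assuming M/M/1 behavior at each agent (the multi-class BCMP/QNA templates named above generalize this). The traffic equations λ_j = γ_j + Σ_i λ_i p_ij are solved by fixed-point iteration, and per-node waiting times combine via Little's law:

```python
def effective_arrival_rates(gamma, P, iters=200):
    """Solve the traffic equations lam_j = gamma_j + sum_i lam_i * P[i][j]
    by fixed-point iteration. gamma holds the external (environment)
    arrival rates; P is the routing matrix, substochastic because every
    task eventually exits the open network."""
    n = len(gamma)
    lam = list(gamma)
    for _ in range(iters):
        lam = [gamma[j] + sum(lam[i] * P[i][j] for i in range(n))
               for j in range(n)]
    return lam

def end_to_end_delay(gamma, P, mu):
    """Mean end-to-end delay of a task under M/M/1 assumptions at each
    agent: sojourn time W_i = 1/(mu_i - lam_i) per node."""
    lam = effective_arrival_rates(gamma, P)
    if any(l >= m for l, m in zip(lam, mu)):
        raise ValueError("unstable node: arrival rate >= service rate")
    W = [1.0 / (m - l) for l, m in zip(lam, mu)]
    # Total expected number in system divided by the total external
    # arrival rate gives the mean end-to-end delay (Little's law).
    return sum(l * w for l, w in zip(lam, W)) / sum(gamma), W
```

For a two-agent tandem flow (sense at the subordinate, plan at the superior) with γ = [1, 0], µ = [2, 2], this yields a sojourn time of 1 at each node and a mean end-to-end delay of 2.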

2.5 Self-Controlling Capability<br />

In contrast to [5], we deal with optimization of the domain utility of a MAS that is distributed, rather than allocating resources in<br />

an optimal fashion to multiple applications that have a good idea of their utility function (through policies). As mentioned before, opmodes allow for trading off quality of service (task quality and response time) against performance. We are assuming there is a

maximum ceiling R on the amount of resources, and the available resources fluctuate depending on the stresses S = S_e + S_a, where S_e are the stresses from the environment (i.e. multiple contending applications, changes in the infrastructural or physical layers) and S_a are the application stresses (i.e. increased tasks). The DMAS controller receives from the measurement points (MP) a

measurement of the actual performance P and a vector of other statistics (X) about task processing times. Also, at the top level the overall utility U(P, S) = ∑_n w_n x_n is known, where x_n is the actual utility component and w_n is the associated weight. We cannot change S, but we can adjust P to get better utility. Since P depends on O, which is a vector of opmodes

collected from the community, we can use the QNM to find O* and hence P* that maximize U(P, S) for a given S. In words, we find the vector of opmodes O* that maximizes domain utility at the current S, and update O. This computation is performed in

the Utility Calculator module using a utility model that is learned and stored in the Utility Database (UDB). This formulation, although found independently, matches the self-optimization notion in [5], but some differences exist, as follows. Tesauro et al.

[5] assume that the system’s knowledge includes a performance model, which we do not assume. We use a queueing network<br />

model to estimate the performance in real-time for any set of opmodes O′ by taking the current set of opmodes O and scaling them appropriately based on observed histories (X) to X′ in the Control Set Evaluator. Also, we deal with a single MAS with

an overall utility function for the entire distributed functionality (within the community). Because of the interactions involved<br />

and complexity of performance modeling [3, 4], it may be time-consuming to utilize inferencing and learning mechanisms in

real-time. This is why we use an analytical queueing network to get the performance estimate quickly. Another difference is<br />

that in [5], they assume operating system support which may not be true in many MAS-based situations because of mobility,<br />

security <strong>and</strong> real-time constraints. Furthermore, in addition to the estimation of performance, the queueing model may have<br />

the capability to eliminate instabilities in a queueing sense, which is not apparent in the other approach. But in spite of these

differences, it is interesting to see that self-controlling capability can be achieved, with or without explicit layering, in a couple<br />

of real-world applications.
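The control step described in this section — evaluate each feasible opmode vector O with the queueing model, estimate U(P, S), and pick O* — can be sketched as an evaluate-and-rank loop. This is an illustrative sketch: the exhaustive enumeration and the callback signatures are our own assumptions, and the toy performance and utility functions in the test below stand in for the QNM and the learned utility model:

```python
from itertools import product

def best_opmodes(opmode_choices, predict_performance, utility, stress):
    """Enumerate the feasible opmode vectors O, predict the performance
    P = predict_performance(O, S) with the queueing network model, and
    return the O* that maximizes the domain utility U(P, S)."""
    best_O, best_u = None, float("-inf")
    for O in product(*opmode_choices):       # feasible set of settings
        try:
            P = predict_performance(O, stress)
        except ValueError:
            continue    # QNM flags this setting as unstable: eliminate it
        u = utility(P, stress)               # U(P, S) = sum_n w_n * x_n
        if u > best_u:
            best_O, best_u = O, u
    return best_O, best_u
```

For small communities exhaustive enumeration suffices; for larger ones the same evaluate-and-rank loop would be applied to a sampled or pruned subset of opmode vectors.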


Figure 2: Results overview (utility versus stress S for the Default Policy and Controlled cases)

(a) Instantaneous Utility (stress 0.25)  (b) Cumulative Utility (stress 0.25)
(c) Instantaneous Utility (stress 0.75)  (d) Cumulative Utility (stress 0.75)

Figure 3: Sample results (utility versus time in seconds for the Controlled, Default A and Default B cases)

3 Empirical Evaluation on CPE Test-bed<br />

We utilized the prototype CPE framework to run 36 experiments at two stress levels (S = 0.25 <strong>and</strong> 0.75). The scenario<br />

consisted of 14 agents, besides a world agent that created random scenarios in military logistics for the agents to react to. There

were three layers of hierarchy with a three-way branching at each level <strong>and</strong> one supply node. The community’s utility function<br />

was based on the achievement of real goals in military engagements such as terminating or damaging the enemy <strong>and</strong> reducing<br />

the penalty involved in consuming resources such as fuel or sustaining damage. We also assumed, for the model selection process, that the external arrivals were Poisson and the service times were exponentially distributed; under this assumption a BCMP or M/G/1 queueing model could be utilized. (To cater to general arrival processes, our framework also contains QNA- and simulation-based models.) We used the Cougaar-based default control without additional support from our framework as the

baseline (denoted as Default A <strong>and</strong> Default B) <strong>and</strong> found that controlling the agent community using our framework (denoted<br />

as controlled) was beneficial in the long run. The overview of the results is provided in Figure 2.<br />

At both stress levels, the controlled scenario performed better than the default, as shown in Figure 3. We did observe oscillations

in the instantaneous utility, and we attribute this to imprecision in the prediction of stresses: stresses vary relatively fast, on the order of seconds, while the control granularity was on the order of minutes. Since this is a military engagement situation following no pre-determined stress patterns, the higher-stress case is hard to cope with; we think this could be the reason why our utility falls in the latter case.


4 Conclusions <strong>and</strong> Future Work<br />

4.1 Conclusions<br />

In this paper, we were able to successfully control a real-time MAS to achieve overall better utility in the long run using<br />

application-level trade-offs between quality of service <strong>and</strong> performance. We utilized a queueing network based framework for<br />

performance analysis <strong>and</strong> subsequently used a learned utility model for computing the overall benefit to the MAS (i.e. community).<br />

While Tesauro et al. [5] have found a similar construction to improve utility in multiple applications, we concentrated<br />

on optimizing the utility of a single distributed application using queueing theory. We think that the approaches are complementary,<br />

with this study providing empirical evidence to support the observation in [1] that agents can be used to optimize<br />

distributed application environments, including themselves, through flexible high-level (i.e. application-level) interactions.<br />

4.2 Discussion <strong>and</strong> Future Work<br />

We believe that keeping the building-blocks small <strong>and</strong> the number of interactions (between performance <strong>and</strong> utility models)<br />

minimal may assist in making the framework more flexible and scalable. For example, if the system size increases, we can consider

a superior agent or human user to be at the next higher level controlling the weights in the utility function without affecting the<br />

performance model. The larger system with supervisory control would then be analyzed using another higher-level QNM or a<br />

network of networks. TechSpecs has assisted this effort to a large extent, re-emphasizing the well-founded separation principle<br />

(separating knowledge/policy <strong>and</strong> mechanism) in the computing field. While we think that the aforementioned architectural<br />

principles have been well-utilized, we hope to broaden the layered control approach to encompass the infrastructural-level<br />

control into the framework. Another avenue for improvement is to design self-protecting mechanisms within our framework so<br />

that the security aspect of the framework is reinforced.<br />

Acknowledgments<br />

The work described here was performed under the <strong>DARPA</strong> UltraLog Grant#: MDA972-1-1-0038. The authors wish to acknowledge<br />

<strong>DARPA</strong> for their generous support.<br />

References<br />

[1] Jennings, N. R. <strong>and</strong> Wooldridge, M., 2000, “Agent-Oriented Software Engineering”, H<strong>and</strong>book of Agent Technology,<br />

AAAI/MIT Press.<br />

[2] UltraLog Program Site. www.ultralog.net. <strong>DARPA</strong>.<br />

[3] Jung, H., and Tambe, M., 2003, “Performance Models for Large Scale Multi-Agent Systems”, Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems.

[4] Rana, O. F., and Stout, K., 2000, “What is Scalability in Multi-Agent Systems?”, Proceedings of the Fourth International Conference on Autonomous Agents.

[5] Tesauro, G., Chess, D. M., Walsh, W. E., Das, R., Whalley, I., Kephart, J. O., <strong>and</strong> White, S. R., 2004, “A Multi-Agent<br />

Systems Approach to Autonomic Computing”, Autonomous Agents <strong>and</strong> Multi-Agent Systems.<br />

[6] Bolch, G., de Meer, H., Greiner, S., <strong>and</strong> Trivedi, K. S., 1998, Queueing Networks <strong>and</strong> Markov Chains: Modeling <strong>and</strong><br />

Performance Evaluation with Computer Science Applications. John Wiley <strong>and</strong> Sons.<br />

[7] Thadakamalla, H. P., Raghavan, U. N., Kumara, S. R. T., <strong>and</strong> Albert, R., 2004, “Survivability of Multi-Agent Systems - A<br />

Topological Perspective”, IEEE Intelligent Systems: Dependable Agent Systems, vol. 19, no. 5, pp. 24-31, Sep/Oct 2004.<br />

[8] Hong, Y., and Kumara, S. R. T., 2004, “Coordinating Control Decisions of Software Agents for Adaptation to Dynamic Environments”, Working Paper, Marcus Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, PA.

[9] Brinn, M., <strong>and</strong> Greaves, M., 2003, “Leveraging Agent Properties to Assure Survivability of Distributed Multi-Agent<br />

Systems”, in the Proceedings of the Second Joint Conference on Autonomous Agents <strong>and</strong> Multi-Agent Systems.<br />

[10] Cougaar Open Source Site. www.cougaar.org. <strong>DARPA</strong>.<br />

[11] Whitt, W., 1983, “The Queueing Network Analyzer”, The Bell System Technical Journal, vol. 62, no. 9, pp. 2779-2815.<br />

[12] Gnanasamb<strong>and</strong>am, N., Lee, S., Gautam, N., Kumara, S. R. T., Peng, W., Manikonda, V., Brinn, M., <strong>and</strong> Greaves, M.,<br />

2004, “Reliable MAS Performance Prediction Using Queueing Models”, IEEE Multi-Agent Security <strong>and</strong> Survivability<br />

Symposium.


An Autonomous Performance Control Framework for Distributed Multi-Agent Systems: A Queueing Theory Based Approach

Nathan Gnanasambandam (gsnathan@psu.edu), Seokcheon Lee (stonesky@psu.edu), Soundar R. T. Kumara (skumara@psu.edu)

Pennsylvania State University, 310 Leonhard Building, University Park, PA 16802

ABSTRACT

Distributed Multi-Agent Systems (DMAS) such as supply chains functioning in highly dynamic environments need to achieve maximum overall utility during operation. The utility from maintaining performance is an important component of their survivability. This utility is often met by identifying trade-offs between quality of service and performance. To adaptively choose the operational settings for better utility, we propose an autonomous and scalable queueing-theory-based methodology to control the performance of a hierarchical network of distributed agents.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Design studies, modeling techniques, performance attributes

General Terms

Performance

Keywords

Multi-Agent Systems, Survivability, Queueing Models

1. INTRODUCTION

With the emerging popularity of distributed multi-agent systems as application platforms, it is necessary that they survive dynamic and stressful environmental conditions, even partial permanent damage. While the survival notion necessitates adaptivity to diverse conditions along the dimensions of performance, security and robustness, delivering the correct proportion of these quantities can be quite a challenge. From a performance standpoint, a survivable system can deliver excellent Quality of Service (QoS) even when stressed. A DMAS could be considered survivable if it can maintain at least x% of system capabilities and y% of system performance in the face of z% of infrastructure loss and wartime loads (x, y, z are user-defined) [1].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AAMAS'05, July 25-29, 2005, Utrecht, Netherlands. Copyright 2005 ACM 1-59593-094-9/05/0007 ...$5.00.

We address a piece of the survivability problem by building an autonomous performance control framework for the DMAS, drawing on the idea of composing the bigger society from smaller building blocks (i.e., agent communities) [3]. Identifying data-flows in the agent network (similar to [4]) and utilizing the network's service-level attributes such as delays, utilization and response times as a basis for its utility (as in [5]), we build a self-optimizing framework for DMAS. We believe that by using queueing theory we can analyze data-flows within the agent community as a network of queues with greater granularity in terms of processing delays and network latencies, and also capitalize on the building-block approach by restricting the model to the community. We contribute by engineering a queueing-theory-based adaptation (control) framework to enhance the performance of the application layer, which can be visualized as residing over the infrastructure (logical layer or middleware) and the physical layer (resources such as CPU, bandwidth).

2. FRAMEWORK ARCHITECTURE

Building on the ideas of high-level system specifications (or TechSpecs) and utilizing queueing network models (QNMs) for performance estimation as in [2], we build a real-time framework for application-level survivability. This framework is represented in Figure 1 and consists of activities, modules, knowledge repositories and information flow through a distributed collection of agents.

2.1 Architecture Overview

When the DMAS is stressed by an amount S by the underlying layers (due to under-allocation of resources) and the environment (due to increased workloads during wartime conditions), the DMAS Controller has to examine all its performance-related variables from set X and the current overall performance P in order to adapt. The variables that need to be maintained are specified in the TechSpecs and may include delays, time-stamps, utilization and their statistics. They are collected in a distributed fashion through the measurement points MP, which are “soft” storage containers residing inside the agents and contain information on what, where and how frequently they should be measured. The DMAS Controller knows the set of flows F that traverse the network and the set of packet types T from the TechSpecs. With {F, T, X, C}, where C is a suggestion from the DMAS Controller, the Model Builder can select a suitable queueing model template Q. The Control Set Evaluator knows the current operating mode (opmode) set O as well as the set of possible opmodes, OS, from TechSpecs. To evaluate the performance due to a candidate opmode set O′, the Control Set Evaluator uses the Queueing Model with a scaled set of operating conditions X′. Once the performance P′ is estimated by the Queueing Model it can be cached in the performance database PDB and then sent to the Utility Calculator. The Utility Calculator computes the domain utility due to (O′, P′) and caches it in the utility database, UDB. Subsequently, the optimal opmode set O* is identified and sent to the DMAS Controller. The functional units of the architecture are distributed, but for each community that forms part of a MAS society, O* will be calculated by a single agent. We now examine the capabilities of the framework.

[Figure 1: Architecture Overview — block diagram of the TechSpecs, measurement points (MP), Model Builder, Queueing Model, Control Set Evaluator, Utility Calculator and DMAS Controller, with the performance and utility databases (PDB, UDB), situated between the stressed application/user layer and the physical/infrastructure layer.]

2.1.1 Self-Monitoring Capability

TechSpecs acts as a distributed structure that contains meta-data about all variables, X, that have to be monitored in different portions of the community. The data/statistics collected in a distributed way are then aggregated to assist in choosing control alternatives by the top-level controller that each community possesses. Each agent can look up its own TechSpecs and from time to time forward a measurement to its superior. The superior can analyze this information (e.g., calculate statistics such as delay and delay-jitter) and/or add to this information and forward it again.
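The forwarding-and-aggregation step above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the function name and the choice of mean and standard deviation as the aggregated statistics are our own assumptions.

```python
import statistics

def aggregate(measurements_by_agent):
    """Combine per-agent delay samples (forwarded to a superior) into
    community-level statistics, as a superior/top-level controller might."""
    merged = [s for samples in measurements_by_agent.values() for s in samples]
    return {
        "mean_delay": statistics.mean(merged),
        "delay_jitter": statistics.pstdev(merged),  # jitter taken as std. dev.
        "count": len(merged),
    }

# Two hypothetical agents forwarding raw delay samples (ms).
samples = {"agent_a": [10.0, 12.0], "agent_b": [11.0, 13.0]}
stats = aggregate(samples)
```

A superior could forward `stats` upward unchanged, or merge it with its own samples before forwarding, matching the "add to this information and forward it again" behavior described above.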

2.1.2 Self-Modeling Capability

One of the key features of this framework is its capability to choose a type of model for representing itself for the purpose of performance evaluation. The system is equipped with several queueing model templates with which it can analyze the current system configuration. The inputs to the Model Builder are the flows that traverse the network (F), the types of packets (T) and the current configuration of the network. Given that there are n agents interconnected in a hierarchical fashion, this unit represents the information in the required template format (Q), which is subsequently used to analyze the current performance.


2.1.3 Self-Evaluating Capability

The evaluation capability allows the MAS to examine its own performance under a given set of plausible conditions. This prediction of performance is used to eliminate control alternatives that may lead to instabilities. Given that a variety of tasks traverse the heterogeneous network of agents in predefined routes (called flows), the processing and wait times of tasks at various points in the network are not alike because of dissimilar configurations, resource availabilities and/or environmental stresses. Under these conditions, performance is evaluated in terms of end-to-end delays for the “sense-plan-respond” cycles.
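As an illustration of this kind of evaluation, the sketch below computes a flow's end-to-end delay as the sum of per-agent sojourn times. The M/M/1 assumption is purely for concreteness (the paper does not fix a queueing discipline here), and all rates are invented.

```python
def mm1_sojourn(arrival_rate, service_rate):
    """Mean time in system (wait + service) for an M/M/1 queue."""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

def end_to_end_delay(route, rates):
    """Sum per-agent sojourn times along a flow's predefined route.

    rates maps agent name -> (arrival_rate, service_rate); heterogeneity
    in configuration or stress shows up as different rates per agent."""
    return sum(mm1_sojourn(*rates[agent]) for agent in route)

# A hypothetical sense-plan-respond flow with dissimilar service rates.
rates = {"sense": (2.0, 4.0), "plan": (2.0, 3.0), "respond": (2.0, 5.0)}
delay = end_to_end_delay(["sense", "plan", "respond"], rates)
```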

2.1.4 Self-Controlling Capability

Since tasks can be processed at various pre-defined qualities, opmodes allow for trading off quality of service (task quality) against performance (end-to-end response time). The available resources fluctuate depending on stresses S = S_e + S_a, where S_e are the stresses from the environment (i.e., multiple contending applications) and S_a are the application stresses (i.e., increased tasks). Using the current measured performance P and the measured stress S, the DMAS Controller relates the overall utility (U) as U(P, S) = Σ_n w_n x_n, where x_n is the actual utility component and w_n is the associated weight specified by the user. To adjust P to get the best achievable utility under S, the following is done. Since P depends on O, which is a vector of opmodes collected from the community, we can use the QNM to find O* and hence P* that maximizes U(P, S) for a given S from within the set OS. In words, we find the vector of opmodes (O*) that maximizes domain utility at the current S. The utility computation is performed in the Utility Calculator module using a learned utility model based on UDB.
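The search for O* described above can be sketched as a brute-force enumeration over candidate opmode vectors. The `predict` stub below stands in for the queueing model and is invented for illustration; a real Control Set Evaluator would query the QNM instead.

```python
from itertools import product

def utility(perf, weights):
    """U(P, S) = sum of weighted utility components w_n * x_n."""
    return sum(w * x for w, x in zip(weights, perf))

def best_opmodes(opmode_options, predict, weights):
    """Enumerate candidate opmode vectors O', predict P' for each with the
    performance model, and return the utility-maximizing O*."""
    return max(product(*opmode_options),
               key=lambda o: utility(predict(o), weights))

def predict(opmodes):
    # Toy stand-in for the queueing model: higher opmodes raise quality
    # (first component) but also delay (second component, negated).
    quality = sum(opmodes)
    delay = sum(m * m for m in opmodes)
    return (quality, -delay)

# Two agents, each with opmodes {1, 2, 3}; weights favor quality 4:1.
o_star = best_opmodes([(1, 2, 3)] * 2, predict, weights=(4.0, 1.0))
```

Enumeration is exponential in the number of agents, which is one reason the paper restricts the model to a community rather than the whole society.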

3. CONCLUSIONS

We combined queueing analysis and application-level control to engineer a generic framework that is capable of self-optimizing its domain-specific utility to assure application-level survivability. While application-level adaptivity yields improvement in utility, further gains are possible by leveraging the underlying layers.

4. ADDITIONAL AUTHORS

Additional authors: Natarajan Gautam (Pennsylvania State University, email: ngautam@psu.edu), Wilbur Peng and Vikram Manikonda (IAI Inc., email: wpeng,vikram@i-a-i.com), Marshall Brinn (BBN Technologies, email: mbrinn@bbn.com) and Mark Greaves (DARPA IXO, email: mgreaves@darpa.mil).

5. REFERENCES

[1] M. Brinn and M. Greaves. Leveraging agent properties to assure survivability of distributed multi-agent systems. Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems, 2003.

[2] N. Gnanasambandam, S. Lee, N. Gautam, S. R. T. Kumara, W. Peng, V. Manikonda, M. Brinn, and M. Greaves. Reliable MAS performance prediction using queueing models. IEEE Multi-Agent Security and Survivability Symposium, 2004.

[3] H. Jung and M. Tambe. Performance models for large scale multi-agent systems: Using distributed POMDP building blocks. Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems, July 2003.

[4] O. F. Rana and K. Stout. What is scalability in multi-agent systems? Proceedings of the Fourth International Conference on Autonomous Agents, 2000.

[5] G. Tesauro, D. M. Chess, W. E. Walsh, R. Das, I. Whalley, J. O. Kephart, and S. R. White. A multi-agent systems approach to autonomic computing. Autonomous Agents and Multi-Agent Systems, 2004.


Manuscript for IEEE Transactions on Automatic Control

ADAPTIVE CONTROL FOR LARGE-SCALE INFORMATION NETWORKS THROUGH ALTERNATIVE ALGORITHMS TO SUPPORT SURVIVABILITY*

Seokcheon Lee† and Soundar Kumara‡

†‡ Department of Industrial & Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802

† Phone: 814-863-4799; Fax: 814-863-4745; E-mail: stonesky@psu.edu

‡ Corresponding author. Phone: 814-863-2359; Fax: 814-863-4745; E-mail: skumara@psu.edu

ABSTRACT

As modern networks can be easily exposed to various adverse events such as malicious attacks and accidental failures, there is a need to study their survivability. We study a large-scale information network composed of distributed software components linked together through a task flow structure. The service provided by the network is to produce a global solution to a given problem, which is an aggregate of the partial solutions of individual tasks. Quality of service of the network is determined by the value of the global solution and the time taken to generate it. In this paper we design an adaptive control mechanism along the lines of model predictive control to support the survivability of such networks by utilizing alternative algorithms. To address adaptivity we model the stress environment by quantifying resource availability through sensors. We build a mathematical programming model with the resource availability incorporated, which predicts quality of service as a function of the alternative algorithms. The programming model is decentralized through an auction market without any degradation of solution optimality. By periodically opening the auction market, the system can achieve desirable performance adaptive to changing stress environments while assuring the scalability property. We verify the designed control mechanism empirically.

Key Words: Adaptive control, survivability, alternative algorithms, scalability

* This work is supported in part by DARPA under Grant MDA 972-01-1-0038.



1. Introduction

Critical infrastructures have become increasingly dependent on networked systems in many domains for automation or organizational integration. Though such infrastructure can improve efficiency and effectiveness, these systems can be easily exposed to various adverse events such as malicious attacks and accidental failures [1]. Two metrics, namely survivability and scalability, can be used to evaluate the efficiency and effectiveness of these systems. Survivability is defined as "the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents" [2]. One promising way to achieve survivability is through adaptivity: changing the system behavior to achieve the system goal in response to the changing environment [3]. But unpredictable adaptation can sometimes result in worse performance than no adaptation at all [4]. Scalability is defined as "the ability of a solution to some problem to work when the size of the problem increases" (from the Dictionary of Computing at http://wombat.doc.ic.ac.uk). As the size of networked systems grows, scalability becomes a critical issue when developing practical software systems [5].

As software systems grow larger and more complex, component technology has become one of the important research topics in the computing community [6][7]. A component is a reusable program element, with which developers can build the systems they need simply by defining the components' specific roles and wiring them together. In networks with a component-based architecture, each component is highly specialized for specific tasks. Another emerging technology is adaptive software [8][9]. Adaptive software has alternative algorithms for the same numerical problem and a switching function for selecting the best algorithm in response to environmental changes. As modern operating environments are highly dynamic, adaptive software becomes an important tool to achieve portable high performance.



We study a large-scale information network, which is composed of distributed software components linked together through a task flow structure. A problem given to the network is decomposed into root tasks for some components, and those tasks are propagated through the task flow structure to other components. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. The service provided by the network is to produce a global solution to the given problem, which is an aggregate of the partial solutions of individual tasks. Each component can have alternative algorithms to process a task, which trade off processing time and the value of the partial solution. Quality of Service (QoS) of the network is determined by the value of the global solution and the time taken to generate it (i.e., the completion time). Survivability of the network is the capability to provide high QoS in the presence of adverse events such as malicious attacks and accidental failures. In this paper we design an adaptive control mechanism to support the survivability of such networks by utilizing alternative algorithms.

The organization of this paper is as follows. In Section 2 we discuss the problem domain and in Section 3 formally define the problem in detail. We design the control mechanism in Sections 4 through 7 and show empirical results in Section 8. Finally, we discuss implications and possible extensions of our work in Section 9.

2. Problem domain

The networks we study represent distributed, component-based architectures for providing a solution to a given problem. A problem is decomposed in terms of root tasks and solved by distributed components through a task flow structure. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. When the size of a problem becomes large, the size of the network as well as the number of tasks for each component can be large. One can imagine a wide range of scientific and engineering problems that can be solved with such architectures.

Cougaar (Cognitive Agent Architecture: http://www.cougaar.org), developed by DARPA (Defense Advanced Research Projects Agency), is such an architecture for building large-scale multi-agent systems. Recently, there have been efforts to combine the technologies of agents and components to improve the building of large-scale software systems [10]-[12]. While component technology focuses on reusability, agent technology focuses on processing complex tasks as a community. Cougaar is in line with this trend. In Cougaar a software system comprises agents, and an agent comprises components (called plugins). The task flow structure in those systems is that of components, a combination of intra-agent and inter-agent task flows. As the agents in Cougaar can be distributed both in the geographical and in the information-content sense, the networks implemented in Cougaar have a distributed, component-based architecture.

UltraLog (http://www.ultralog.net) networks are military supply chain planning systems implemented in Cougaar [13]-[17]. Each agent in these networks represents an organization of the military supply chain and has a set of components specialized for each functionality (allocation, expansion, inventory management, etc.) and class (ammunition, water, fuel, etc.). The objective of an UltraLog network is to provide an appropriate logistics plan for a given military operational plan. A logistics plan is a global solution which is an aggregate of the individual schedules built by components. An operational plan is decomposed into logistics requirements of each thread for each agent, and a requirement is further decomposed into root tasks (one task per day) for a designated component. As a result, a component can have hundreds of root tasks depending on the horizon of an operation, and thousands of tasks to process as the root tasks are propagated. As the scale of operation increases there can be thousands of agents (tens of thousands of components) working together to generate a logistics plan. The system performs initial planning and continuous replanning to cope with logistics plan deviations or operational plan changes. Initial planning and replanning are both instances of the current research problem.

QoS of these networks is determined by the quality of the logistics plan (value of solution) and the (plan) completion time. These two metrics directly affect the performance of an operation. As the networks work in a military environment, they are especially vulnerable to malicious attacks and accidental failures. Now, the question is: how can we make such a system survivable, so that it generates high-quality logistics plans in a timely manner in the presence of such adverse events?

3. Problem specification

In this section we formally define the problem by detailing the network configuration, control action, and stress environment. We focus on computational CPU resources, assuming that the system is computation-bounded.

3.1 Network configuration

A network is composed of a set of components A, and each component resides in its own machine.¹ The task flow structure of the network, which defines the precedence relationships between components, is an arbitrary directed acyclic graph. A problem given to the network is decomposed in terms of root tasks for some components, and those tasks are propagated through the task flow structure. Each component processes one of the tasks in its queue (which holds root tasks as well as tasks from predecessor components) and then sends it to its successor components. We denote the number of root tasks of component i as rt_i. Fig. 1 shows an example network composed of four components. Each of A_1 and A_2 has 100 root tasks. A_3 and A_4 have no root tasks, but they have 200 and 100 tasks respectively from their corresponding predecessors.

[Fig. 1. An example network — four components A_1 through A_4; A_1 and A_2 each have 100 root tasks, A_3 and A_4 have none of their own.]

¹ For simplicity we consider the cases where there is one component in a machine. Though the designed control mechanism is also applicable to resource-sharing environments, we may then need to consider resource allocation in addition, as will be discussed in Section 9.
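A minimal sketch of the task-propagation rule above, in Python. The wiring used in the example (A1→A3, A2→A3, A2→A4) is a guess that reproduces the counts quoted for Fig. 1, since the original figure does not survive in this text; the function itself only assumes what the prose states (acyclic flow, one forwarded task per processed task).

```python
from collections import defaultdict

def task_counts(root_tasks, edges):
    """Total tasks each component processes: its own root tasks plus one
    forwarded task per task processed by each predecessor (DAG assumed)."""
    succ, indeg = defaultdict(list), defaultdict(int)
    nodes = set(root_tasks)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    counts = {n: root_tasks.get(n, 0) for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:                      # Kahn-style topological sweep
        u = ready.pop()
        for v in succ[u]:
            counts[v] += counts[u]    # every processed task is forwarded
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return counts

# Hypothetical wiring consistent with the counts described for Fig. 1.
counts = task_counts({"A1": 100, "A2": 100},
                     [("A1", "A3"), ("A2", "A3"), ("A2", "A4")])
```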

3.2 Control action

A component can use one of several alternative algorithms to process a task. Different alternatives trade off CPU time against the value of the solution, with more CPU time resulting in a higher solution value. Since one can mix alternatives optimally, a component has a monotonically increasing convex function, called the value function, giving CPU time as a function of value. We call the value argument of this function the value mode, which a component can select as its decision variable. A value function is defined by the three elements ⟨f_i(v_i), v_i(min), v_i(max)⟩, as shown in Fig. 1. This function indicates that component i's expected CPU time² to process a task is f_i(v_i) for a value mode v_i with v_i(min) ≤ v_i ≤ v_i(max). We assume that components cannot change the mode for a task in process.
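The value function induced by mixing adjacent alternatives can be sketched as piecewise-linear interpolation over the alternatives' (value, CPU-time) pairs. The pairs below are invented for illustration, and the points are assumed sorted by value and already on the convex lower envelope.

```python
from bisect import bisect_left

def value_function(alternatives):
    """Return <f_i, v_i(min), v_i(max)> from (value, cpu_time) pairs.

    f_i(v) is the expected CPU time to reach value mode v by probabilistically
    mixing the two adjacent alternative algorithms."""
    vs = [v for v, _ in alternatives]
    ts = [t for _, t in alternatives]

    def f(v):
        assert vs[0] <= v <= vs[-1], "value mode outside feasible range"
        j = bisect_left(vs, v)
        if vs[j] == v:                  # exactly one of the pure algorithms
            return ts[j]
        p = (vs[j] - v) / (vs[j] - vs[j - 1])   # weight on cheaper algorithm
        return p * ts[j - 1] + (1 - p) * ts[j]

    return f, vs[0], vs[-1]

# Three hypothetical algorithms; slopes increase (1.5 then 4), so f is convex.
f, v_min, v_max = value_function([(1.0, 2.0), (3.0, 5.0), (4.0, 9.0)])
```

For example, value mode 2.0 is reached by running the first and second algorithms half the time each, at an expected cost of 3.5 CPU units.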

² The distribution of CPU time can be arbitrary, though we use only the expected CPU time.

3.3 Stress environment

Survivability stresses such as malicious attacks and accidental failures affect the system by directly consuming resources or by indirectly invoking defense mechanisms as remedies. For example, a "Denial of Service" attack consumes resources directly, while the relevant defense mechanism also consumes resources in terms of resistance, recognition, and recovery [1]. We consider both the survivability stresses and the remedies as the stress environment from the viewpoint of components. The space of stress environments is high-dimensional and also evolving [18][19]. But, as we concentrate on CPU resources, a stress environment can be considered as a set of threads residing in the machines of the network and sharing resources with the components. These threads, say stressors, may have admission to access resources or may be stealing resources without admission.

3.4 Problem definition<br />

The service provided by the network is to produce a global solution to a given problem, which<br />

is an aggregate of partial solutions of individual tasks. QoS of the network is determined by the<br />

value of global solution <strong>and</strong> the cost of completion time. The value of global solution is the<br />

summation of partial solution values, <strong>and</strong> the cost of completion time is determined by a cost<br />

function CCT(T) which is a monotonically increasing function with completion time T. We<br />

assume that the solution values <strong>and</strong> cost are represented in a common unit 3 . Consider v d i as the<br />

value mode used to process d th task by component i <strong>and</strong> e i the number of tasks processed by<br />

component i to the completion. Then, the control objective is to maximize QoS by utilizing<br />

alternative algorithms (v) as in (2). As stated earlier, we design an adaptive control mechanism to<br />

achieve the objective for supporting the survivability of large-scale information networks.<br />

arg max<br />

v<br />

e<br />

i<br />

∑∑<br />

i∈ A d = 1<br />

v<br />

d<br />

i<br />

− CCT(T )<br />

(2)<br />

3 Relative importance can be considered by scaling the functions <strong>and</strong> it results in the same function structures.
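Objective (2) can be evaluated directly once the value modes used and the completion time are known. The sketch below does exactly that; the particular cost function is an arbitrary increasing stand-in for CCT, and the numbers are invented.

```python
def qos(value_modes_used, completion_time, cct):
    """QoS from (2): sum of the value modes v_i^d used for every processed
    task, minus the cost of the completion time T.

    value_modes_used maps component -> [v_i^d for d = 1..e_i]."""
    total_value = sum(sum(vs) for vs in value_modes_used.values())
    return total_value - cct(completion_time)

# Any monotonically increasing cost function qualifies as CCT.
cct = lambda t: 0.5 * t
score = qos({"A1": [2.0, 2.0], "A2": [3.0]}, completion_time=10.0, cct=cct)
```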



4. Overall control procedure

There are two representative optimal control approaches for dynamic systems: Dynamic Programming (DP) and Model Predictive Control (MPC). Though DP gives an optimal closed-loop policy, it is inefficient for large-scale systems, especially when the systems work over a finite time horizon [20]-[22]. In MPC, for each current state, an optimal open-loop control policy is designed for a finite time horizon by solving a static mathematical programming model [23]-[26]. The design process is repeated for the next observed state feedback, forming a closed-loop policy reactive to each current system state. Though MPC does not give an absolutely optimal policy in stochastic environments, the periodic design process alleviates the impact of stochasticity, and it is easy to adapt to new contexts by explicitly handling the objective function or constraints.

Considering the characteristics of the current problem, we choose the MPC framework. Our networks are large-scale, work over a finite time horizon, and need to adapt to an unpredictable stress environment. Therefore, under the MPC framework, we develop an adaptive control mechanism as depicted in Fig. 2. First, to address adaptivity we model the stress environment by quantifying resource availability through sensors. Second, we build a mathematical programming model with the resource availability incorporated, which predicts QoS as a function of the alternative algorithms. Third, we provide an auction market as a decentralized coordination mechanism for solving the programming model. By periodically opening the auction market, the system can achieve desirable performance adaptive to the changing stress environment while assuring the scalability property. We define sensors and build a mathematical programming model in Section 5, and refine it based on stability analysis in Section 6. The refined programming model is decentralized in Section 7.
in Section 7.



[Fig. 2. Overall control procedure — the stress environment is modeled through a sensor attached to each component (sensor design), a mathematical programming model predicts QoS, and periodic auctioning through an auction market provides decentralized coordination.]

5. Mathematical programming model

In this section we define sensors and build a mathematical programming model under the MPC framework.

5.1 Sensors

Each component monitors its operating environment through a sensor. The sensor measures resource availability MRA_i(t), defined as the fraction of a resource that was available when component i requested that resource during the last control period ending at control point t. Two quantities are used to extract this measurement: request time and execution time. Request time is the duration for which the component requests the resource, or equivalently for which its queue length (including a task in service) is greater than zero. Execution time is the duration for which the component actually utilizes the resource. If the control period is SW, the resource sensor calculates MRA_i(t) as:


MRA_i(t) = execution time in (t − SW, t) / request time in (t − SW, t). (3)
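Definition (3) can be computed directly from a log of request and execution intervals. The sketch below is an illustrative assumption about how such a sensor might be coded; the `events` format and the fallback value of 1.0 when no requests occurred are our own choices, not taken from the paper.

```python
def mra(events, t, sw):
    """Measured resource availability MRA_i(t) over the window (t - sw, t].
    `events` is a list of (start, end, kind) tuples with kind 'request'
    (queue length > 0) or 'execution' (resource actually in use)."""
    def overlap(a, b):
        # portion of [a, b] that falls inside the control window
        return max(0.0, min(b, t) - max(a, t - sw))
    request = sum(overlap(a, b) for a, b, k in events if k == 'request')
    execution = sum(overlap(a, b) for a, b, k in events if k == 'execution')
    return execution / request if request > 0 else 1.0
```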

5.2 Mathematical programming model

A component can estimate its future resource availability from the resource availability observed in the past. Incorporating this estimate, the component can express its service time per task as a function of its value mode:

f_i(v_i) / MRA_i(t). (4)

Now, consider the current time t and estimate the completion time T by assuming that each component uses a mode common to all its tasks (i.e., a pure strategy). We discuss the optimality of the pure strategy later in this subsection.

In a task flow structure where each component processes only one task after its predecessors complete their tasks, the completion time is the length of the longest path (i.e., the critical path), as widely studied in the project management literature. However, as the number of tasks increases, a bottleneck component, the one with maximal total service time, comes to dominate the completion time. As each component in our networks can have a large number of tasks to process rather than just one, the completion time T can be estimated as:

T − t ≈ max_{i∈A} [R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t), (5)

in which R_i(t) denotes the remaining CPU time for the task in process and L_i(t) the number of remaining tasks excluding the task in process. After identifying its initial number of tasks L_i(0) as in (6), where pred(i) denotes the immediate predecessors of component i, each component updates L_i(t) by counting down as it processes tasks.

L_i(0) = rt_i + Σ_{a∈pred(i)} L_a(0) (6)
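Recursion (6) can be evaluated with a memoized traversal of the task-flow structure. This is a sketch under our own data-model assumptions: `root_tasks` maps a component to its rt_i, and `preds` maps a component to its immediate predecessors.

```python
def initial_task_counts(root_tasks, preds):
    """L_i(0) = rt_i + sum of L_a(0) over immediate predecessors a (eq. 6),
    computed by memoized recursion over the task-flow DAG."""
    memo = {}
    def count(i):
        if i not in memo:
            memo[i] = root_tasks.get(i, 0) + sum(count(a) for a in preds.get(i, ()))
        return memo[i]
    for i in set(root_tasks) | set(preds):
        count(i)
    return memo
```

For the serial five-component network used later in Section 6.1, every component ends up with 100 tasks.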


Given a completion time T, it is then optimal for each component i to select a mode by solving:

max_{v_i} L_i(t) v_i (7)

subject to

[R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t) ≤ T − t. (8)

Consequently, the programming model can be formulated in a straightforward way as in (9); we call it the naïve decision model. The model maximizes QoS by trading off the value of the solution against the cost of completion time.

⟨Naïve decision model⟩

max_{T, v} Σ_{i∈A} L_i(t) v_i − CCT(T)
s.t. [R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t) ≤ T − t for all i ∈ A
v_{i(min)} ≤ v_i ≤ v_{i(max)} for all i ∈ A (9)
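For small instances, model (9) can be solved by brute force: sweep candidate values of T and, for each, let every component pick its largest feasible mode as in (7)-(8). The sketch below assumes each f_i is increasing with a known inverse `finv`; the dictionary layout and grid search are our own illustration, not the decentralized method the paper develops.

```python
def solve_naive(components, t, T_grid, cct):
    """Brute-force solution of the naive decision model (9): for each
    candidate completion time T every component picks its largest feasible
    value mode per (7)-(8), and the outer loop keeps the T with best QoS.
    Each component is a dict with keys R, L, mra, vmin, vmax, finv."""
    best = None
    for T in T_grid:
        modes, qos, feasible = {}, 0.0, True
        for name, c in components.items():
            budget = (T - t) * c['mra'] - c['R']   # CPU time left for the queue
            if c['L'] > 0:
                v = min(c['vmax'], c['finv'](budget / c['L']))
                if v < c['vmin']:
                    feasible = False               # even v_min misses T
                    break
            else:
                v = c['vmax']                      # no tasks: any mode works
            modes[name] = v
            qos += c['L'] * v
        if feasible:
            qos -= cct(T)
            if best is None or qos > best[2]:
                best = (T, modes, qos)
    return best
```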

The naïve decision model maximizes QoS as if all the tasks of each component were already available in its queue at the current time t. That is, when the L_i(t) are large, a network under the naïve decision model can achieve performance close to the optimal performance of an ideal network with maximal task availability. As no mixed strategy (i.e., using different modes for different tasks) can perform better in the ideal network, due to the convexity of the value functions, it is optimal for each component to use a pure strategy. In the next section we refine the model so that it remains applicable even when the L_i(t) are small.


6. Model refinement

In this section we analyze system behavior under the naïve decision model and refine the model to eliminate undesirable behavioral properties.

6.1 Analysis of system behavior

To analyze system behavior under the naïve decision model, we conducted experiments using discrete-event simulation. There are five components in the system, linked serially as in Fig. 3. Component A_1, in the lowest position, is assigned 100 root tasks. The components have a common deterministic value function, and the cost of completion time is a linearly increasing function, as indicated in the figure. There is no stress in the system, so the components measure MRA_i(t) equal to 1 at all times. The system makes decisions every 100 time units (i.e., SW = 100) by solving the naïve decision model.

[Fig. 3. An example network for stability analysis: five components A_1 to A_5 linked serially; A_1 is assigned 100 root tasks; CCT(T) = 10T.]

Figs. 4 and 5 show the resultant behavior of the system, in which the decisions T* and v_i* are divergent. The divergent behavior indicates that there is inefficiency in the naïve decision model, and system performance can be improved if we eliminate it. The divergence is due to inaccurate prediction by the naïve decision model. In the example network the system (or A_5) can complete at T* only when A_4 completes before T*. As each component tries to complete at T* without considering its position in the task flow structure, the components cannot receive tasks in time from their predecessors. This inaccuracy leads to changed decisions at subsequent decision points, resulting in the divergent behavior.

[Fig. 4. Behavior of T* under the naïve decision model: the optimal T diverges over time (axes: Time 0-1200, Optimal T 1000-1150).]

[Fig. 5. Behavior of v_i* under the naïve decision model: the modes of A_1 to A_5 diverge over time (axes: Time 0-1200, Mode 30.0-40.0).]

6.2 Model refinement

To stabilize the system behavior we need to reinforce the naïve decision model by taking into account the components' positions in the task flow structure. For this purpose, we define the depth D_i(t) as a quantitative representation of a component's position: D_i(t) is the required time gap between the system's and the component's completion times at time t. Each component must complete at or before T − D_i(t) to keep the completion time T. Components without successors have depth 0, while components with successors have positive depths. A component a can keep its depth if each of its predecessors' depth is D_a(t) plus a's total service time for the last arriving tasks in the worst case. So the depth of a component i that keeps the depths of all its successors is the maximum of the depths required by its successors:

D_i(t) = max_{a∈succ(i)} [D_a(t) + Σ_{b∈pred(a)} f_a(v_a) / MRA_a(t)], (10)

in which succ(i) denotes the successors of component i and pred(a) the predecessors of component a.
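Because each component's depth depends only on its successors, relation (10) can be evaluated by a backward recursion from the components without successors. The sketch below is our own illustration; `service[a]` stands for the precomputed per-task service time f_a(v_a)/MRA_a(t), and `succs`/`preds` are adjacency maps of the task flow structure.

```python
def depths(succs, preds, service):
    """Depth D_i per (10): 0 for components without successors; otherwise
    the largest successor requirement D_a plus a's per-task service time,
    counted once per predecessor of a (its last-arriving tasks)."""
    memo = {}
    def depth(i):
        if i not in memo:
            if not succs.get(i):
                memo[i] = 0.0
            else:
                memo[i] = max(depth(a) + len(preds[a]) * service[a]
                              for a in succs[i])
        return memo[i]
    for i in succs:
        depth(i)
    return memo
```

For a serial chain the depths simply accumulate the downstream per-task service times.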

Though it is possible to refine the naïve decision model by incorporating the depths as variables, the model complexity would increase because each component's constraint would be intertwined with the decision variables of all connected components. So we simply estimate the components' depths from the decisions made at the last control point. At each control point each successor informs its predecessors of the depth it requires, and each predecessor takes the maximum as its depth. As a result, we can treat the depth as a constant rather than a variable, so the refined model involves no increase in complexity. We call the refined model in (11) the stable decision model. If we do not consider depth, i.e., D_i(t) = 0, the model reduces to the naïve decision model. Also, when one considers D_i(t) as a variable, the stable decision model becomes an exact CPM/PERT formulation as found in the project management literature.
variable.


⟨Stable decision model⟩

max_{T, v} Σ_{i∈A} L_i(t) v_i − CCT(T)
s.t. [R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t) ≤ T − t − D_i(t) for all i ∈ A
v_{i(min)} ≤ v_i ≤ v_{i(max)} for all i ∈ A (11)

6.3 System behavior under the stable decision model

To observe system behavior under the stable decision model, we experimented with the example network described in Fig. 3. Figs. 6 and 7 show the resultant behavior of the system, in which the decisions T* and v_i* are stable. The stability indicates that the inefficiency of the naïve decision model has been removed as a result of improved prediction accuracy.

[Fig. 6. Behavior of T* under the stable decision model (axes: Time 0-1200, Optimal T 1000-1150).]

[Fig. 7. Behavior of v_i* under the stable decision model: the modes of A_1 to A_5 remain stable (axes: Time 0-1200, Mode 30.0-31.0).]

The effects of stability on performance are shown in Table 1. QoS improves significantly under the stable decision model: improved prediction accuracy makes the system behave stably and consequently perform better.

Table 1. The effects of stability on performance

Decision model:   Naïve                    Stable
                  T      V      QoS        T      V      QoS
                  1171   15289  3583       1082   15104  4282

T: completion time, V: value of solution

7. Decentralization

The next question is how to decentralize the programming model. Centralized control mechanisms scale badly because computational and communication overheads grow rapidly with system size, and a single point of failure can bring down the complete system, making the network non-robust. Decentralization addresses these issues by distributing the computations and communications across multiple entities. In addition, decentralization yields a byproduct: information security. As discussed earlier, our effort is to support survivability, and if information is revealed directly to others, information security is compromised. In this section we decentralize the programming model through an auction market.

7.1 Two-tier auction market

There are two popular methods of decentralizing structured programming models: decomposition methods and auction/bidding algorithms. Considering the compatible structure of our programming model, we decentralize it through a non-iterative auction mechanism, the so-called multiple-unit auction with variable supply [27]. In this auction a seller may be able and willing to adjust the supply as a function of the bidding. In the programming model we have built, all components are coupled with each other; however, the objective function and constraints become separable once the single variable T is fixed. This characteristic makes it possible to solve the model through an auctioning process for T. The completion time T is an unbounded resource whose supply can be adjusted as a function of the bidding. To design the auction market we define a seller that determines T* based on the bids from the components. We call this auction market the two-tier auctioning model.

We define T_i as the resource available to component i, which is required minimally in the amount T_i(min) as in (12) and maximally T_i(max) as in (13).

T_i(min) = [R_i(t) + L_i(t) f_i(v_{i(min)})] / MRA_i(t) (12)

T_i(max) = [R_i(t) + L_i(t) f_i(v_{i(max)})] / MRA_i(t) (13)

Each component bids to the seller with its maximal value as a function of T, as in (14). The seller decides T* based on the bids, taking CCT(T) into account as in (15). After the seller broadcasts T*, each component selects its optimal value mode within the limit T*, as in (16). Though this auctioning process gives a solution equivalent to that of the centralized programming model, it gives more benefits as communications and computations are distributed to multiple market participants.

⟨Two-tier auctioning model⟩

Component's bid:
b_i(T) = −∞ if T − t − D_i(t) < T_i(min)
       = L_i(t) v_{i(max)} if T − t − D_i(t) > T_i(max)
       = L_i(t) f_i^{−1}( ((T − t − D_i(t)) MRA_i(t) − R_i(t)) / L_i(t) ) else (14)

Seller's decision:
max_T Σ_{i∈A} b_i(T) − CCT(T) (15)

Component's decision:
v_i* = v_{i(max)} if T* − t − D_i(t) > T_i(max)
     = f_i^{−1}( ((T* − t − D_i(t)) MRA_i(t) − R_i(t)) / L_i(t) ) else (16)
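One opening of the auction can be simulated directly from (14)-(16). This sketch is illustrative: it searches T* over a finite grid rather than solving (15) exactly, and the component dictionary (keys D, Tmin, Tmax, R, L, mra, vmax, finv) is our own layout.

```python
import math

def bid(c, T, t):
    """Component's bid b_i(T) as in (14)."""
    slack = T - t - c['D']
    if slack < c['Tmin']:
        return -math.inf               # T is infeasible for this component
    if slack > c['Tmax']:
        return c['L'] * c['vmax']      # value saturates at v_max
    return c['L'] * c['finv']((slack * c['mra'] - c['R']) / c['L'])

def auction_round(components, t, T_grid, cct):
    """Seller picks T* maximizing total bids minus CCT(T) as in (15);
    each component then selects its mode from the broadcast T* as in (16)."""
    T_star = max(T_grid,
                 key=lambda T: sum(bid(c, T, t) for c in components.values()) - cct(T))
    modes = {}
    for name, c in components.items():
        slack = T_star - t - c['D']
        if slack > c['Tmax']:
            modes[name] = c['vmax']
        else:
            modes[name] = c['finv']((slack * c['mra'] - c['R']) / c['L'])
    return T_star, modes
```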

7.2 Multi-tier auction market

Though the designed auctioning process is decentralized, it incorporates a centralized seller that must coordinate all the components. As the centralized auction can still exhibit problems in terms of scalability and robustness, we introduce a multi-tier auction market. Suppose there are two component groups a and b with a ⊂ b, and denote by S_a the set of optimal completion time solutions of group a and by S_b that of group b. Then the maximum of S_b is greater than or equal to the maximum of S_a, as in (17).

max S_a ≤ max S_b if a ⊂ b (17)

Proof. Suppose it is not true, that is, T_a = max S_a > T_b = max S_b. Then, since T_b is an optimal completion time for group b,

Σ_{i∈b} b_i(T_b) − CCT(T_b) ≥ Σ_{i∈b} b_i(T_a) − CCT(T_a)
≡ Σ_{i∈a} b_i(T_b) + Σ_{i∉a} b_i(T_b) − CCT(T_b) ≥ Σ_{i∈a} b_i(T_a) + Σ_{i∉a} b_i(T_a) − CCT(T_a). (18)

And, for group a, since T_a, and not the smaller T_b, is the maximal optimal completion time,

Σ_{i∈a} b_i(T_a) − CCT(T_a) > Σ_{i∈a} b_i(T_b) − CCT(T_b). (19)

Combining (18) and (19), the inequality in (20) should hold:

Σ_{i∉a} b_i(T_b) > Σ_{i∉a} b_i(T_a). (20)

But this inequality is not possible, because each b_i(T) is an increasing function of T and T_a > T_b.

<br />

Through this property the two-tier auctioning model can be transformed into a multi-tier model, in which multiple brokers arbitrate between the components and the seller. A broker bids to its superior broker (or to the seller) only for T ≥ T_s(m), as in (21), in which T_s(m) denotes the maximum of the optimal completion time solutions of group s(m), and s(m) denotes the subordinate components and brokers of broker m. In this way the search space is reduced as the bidding process moves up the hierarchy. In this multi-tier auctioning model communications and computations are more widely distributed through the brokers, overcoming the problems of the two-tier model.

⟨Multi-tier auctioning model⟩

Broker's bid:
b_m(T) = −∞ if T < max{ arg max_T Σ_{a∈s(m)} b_a(T) − CCT(T) }
       = Σ_{a∈s(m)} b_a(T) else (21)
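A broker's behavior in (21) can be sketched as a function factory: given its children's bid functions, it first finds the group's maximal optimal completion time T_s(m) (here by grid search, an illustrative simplification) and then bids −∞ below it, pruning the superior's search space.

```python
import math

def broker_bid(child_bids, cct, T_grid):
    """Build a broker's bid function per (21).  Returns (bid, T_s), where
    T_s is the largest grid point maximizing the group's objective."""
    def group_obj(T):
        return sum(b(T) for b in child_bids) - cct(T)
    best = max(group_obj(T) for T in T_grid)
    T_s = max(T for T in T_grid if group_obj(T) == best)
    def bid(T):
        # below the group's optimum the broker withholds, pruning the search
        return -math.inf if T < T_s else sum(b(T) for b in child_bids)
    return bid, T_s
```

Brokers built this way can themselves be children of higher-level brokers, giving the multi-tier hierarchy.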


8. Empirical results

We ran several experiments using discrete-event simulation to validate the designed control mechanism. Though we use a small network in the experimentation for validation purposes, the decentralized model in particular can handle much larger networks.

8.1 Experimental design

The experimental network is composed of fifteen components in a tree structure, as shown in Fig. 8. Each component in the lowest position has 200 root tasks. All the components have a common linear value function, and the cost of completion time is a linearly increasing function, as indicated in the figure.

[Fig. 8. Experimental network configuration: fifteen components A_1 to A_15 in a three-level tree rooted at A_1; each of the eight leaf components (A_8 to A_15) is assigned 200 root tasks; CCT(T) = 4T.]

We set up four different experimental conditions, as shown in Table 2. There can be stressors that share resources with components. We assign weight w_i to component i and w_i′ to a stressor sharing a resource with component i. A stressor, which has infinite work (it continuously requires the resource), can impose different levels of stress on the component directly through w_i′: when w_i′ is zero there is no stress, and as it increases the stress level increases. We implement the stress environment using weighted round-robin scheduling, in which the CPU time received by each thread in a round is equal to its assigned weight. The distribution of service time can be deterministic or stochastic; when it is stochastic we repeat each experiment 5 times.
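Under weighted round-robin, a component's long-run share of the CPU, and hence roughly the MRA_i(t) its sensor will report, follows directly from the weights. The helper below is a back-of-the-envelope sketch of this relationship, assuming the stressor is always busy; it is not taken from the paper's simulator.

```python
def expected_share(w, w_stress):
    """Long-run CPU fraction of a component with weight w competing with an
    always-busy stressor of weight w_stress under weighted round-robin."""
    return w / (w + w_stress)
```

With the Table 2 settings (w_i = 0.1 and w′_A4 = 1 during the stress interval), A_4's availability drops to about 0.09.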

Table 2. Experimental conditions

Condition   Stress       f_i(v_i)
Con1        Unstressed   Deterministic
Con2        Unstressed   Exponential
Con3        Stressed     Deterministic
Con4        Stressed     Exponential

w_i = 0.1 for all i ∈ A; w_A4′ = 1 in 500 ≤ t ≤ 1000 for Con3 and Con4.
Initial value mode: (2, 5, 5, 3 for A_4 to A_15)

We use four different control policies for each experimental condition, as shown in Table 3. FL and FH use fixed value modes over time. The AC-X policies represent the adaptive control mechanism we have designed: under AC-N the system is controlled by the naïve decision model, and under AC-S by the stable decision model. When using the adaptive policies the system makes decisions every 100 time units (i.e., SW = 100).

Table 3. Control policies used for experimentation

Control policy   Description
FL               Fixed with lowest value mode
FH               Fixed with highest value mode
AC-N             Adaptive control under naïve decision model
AC-S             Adaptive control under stable decision model

8.2 Results

Numerical results from the experimentation are summarized in Table 4. The adaptive control policies show significant advantages over the non-adaptive ones in all conditions. The benefit of AC-S, however, is not clear from the numerical results: though AC-S outperforms AC-N in deterministic environments, AC-N outperforms AC-S in stochastic environments. This means that AC-S cannot guarantee better performance, especially in stochastic environments. We can say, however, that AC-S is a robust policy that keeps the system from behaving divergently and degrading performance significantly, as shown in the earlier stability analysis.

Table 4. Experimental results

           FL                   FH                   AC-N                  AC-S
Condition  T     V      QoS     T     V      QoS     T     V      QoS     T     V      QoS
Con1       1656  13558  6934    6313  30643  5391    1663  22898  16245   1656  22884  16259
Con2       1652  13547  6942    6302  30643  5435    1723  22982  16089   1728  22959  16046
Con3       1656  13558  6934    6313  30643  5391    1966  23401  15539   1965  23403  15542
Con4       1652  13547  6942    6371  30643  5159    2024  23495  15401   2007  23406  15376

T: completion time, V: value of solution

Fig. 9 shows the behavior of T* under the adaptive control policies in unstressed environments. In the deterministic environment the system behaves stably under AC-S while diverging under AC-N. This does not hold in the stochastic environment, however, where the system appears more stable under AC-N. This may partially explain why AC-S does not perform better in stochastic environments.

[Fig. 9. Behavior of T* in unstressed environments, comparing AC-S and AC-N under Con1 and Con2 (axes: Time 0-1800, Optimal T 1640-1720).]
Fig. 9. Behavior of T * in unstressed environments


The system controlled by the adaptive policies naturally adapts to changing environments, as the components monitor their environments and incorporate the measurements into the decision process. As shown in Figs. 10 and 11 for the deterministic case and Figs. 12 and 13 for the stochastic case, when the environment changes the system adapts to the new environment.

[Fig. 10. Adaptive behavior of T* in the deterministic environment (Con3) (axes: Time 0-2000, Optimal T 0-4000).]

[Fig. 11. Adaptive behavior of v_i* in the deterministic environment (Con3): modes of A_1, A_2, A_4, A_8 (axes: Time 0-2000, Mode 1.0-6.0).]

[Fig. 12. Adaptive behavior of T* in the stochastic environment (Con4) (axes: Time 0-2000, Optimal T 0-4000).]

[Fig. 13. Adaptive behavior of v_i* in the stochastic environment (Con4): modes of A_1, A_2, A_4, A_8 (axes: Time 0-2000, Mode 1.0-6.0).]

9. Conclusions

A typical information network, emerging from automation or organizational integration, is large-scale with a distributed, component-based architecture. In this paper we developed an adaptive control mechanism that supports the survivability of such networks by utilizing alternative algorithms. We designed an auction market that coordinates the components of a network: each component bids based on its measured resource availability, and optimal decisions are made through a multi-tier auctioning process. By periodically opening the auction market, the system can achieve desirable performance, adapting to the changing stress environment while preserving scalability.

Our work can be extended by considering more general network configurations. There can be multiple components in a machine sharing resources together. In such resource-sharing environments, there is an opportunity to improve system performance by appropriately allocating resources. Though the designed control mechanism is applicable to resource-sharing environments, it would be desirable to explore a control mechanism that incorporates resource allocation as well.

References

[1] S. Jha and J. M. Wing, "Survivability analysis of networked systems," in Proc. 23rd Int. Conf. Software Engineering, 2001, pp. 307-317.
[2] R. Ellison, D. Fisher, H. Lipson, T. Longstaff, and N. Mead, "Survivable network systems: An emerging discipline," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU/SEI-97-153, 1997.
[3] J. E. Eggleston, S. Jamin, T. P. Kelly, J. K. MacKie-Mason, W. E. Walsh, and M. P. Wellman, "Survivability through market-based adaptivity: The MARX project," in Proc. DARPA Information Survivability Conference and Exposition, 2000, pp. 145-156.
[4] S. Bowers, L. Delcambre, D. Maier, C. Cowan, P. Wagle, D. McNamee, A. L. Meur, and H. Hinton, "Applying adaptation spaces to support quality of service and survivability," in Proc. DARPA Information Survivability Conference and Exposition, 2000, pp. 271-283.
[5] O. F. Rana and K. Stout, "What is scalability in multi-agent systems?," in Proc. 4th Int. Conf. Autonomous Agents, 2000, pp. 56-63.
[6] B. Meyer, "On to components," IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.
[7] P. Clements, "From subroutines to subsystems: Component-based software development," in Component Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press, pp. 3-6, 1996.
[8] M. O. McCracken, A. Snavely, and A. Malony, "Performance modeling for dynamic algorithm selection," in Proc. Int. Conf. Computational Science, 2003, pp. 749-758.
[9] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A. Quilici, D. S. Rosenblum, and A. L. Wolf, "An architecture-based approach to self-adaptive software," IEEE Intelligent Systems, vol. 14, no. 3, pp. 54-62, 1999.
[10] F. M. T. Brazier, C. M. Jonker, and J. Treur, "Principles of component-based design of intelligent agents," Data and Knowledge Engineering, vol. 41, no. 1, pp. 1-28, 2002.
[11] H. J. Goradia and J. M. Vidal, "Building blocks for agent design," in Proc. 4th Int. Workshop on Agent-Oriented Software Engineering, 2003, pp. 17-30.
[12] R. Krutisch, P. Meier, and M. Wirsing, "The AgentComponent approach, combining agents and components," in Proc. 1st German Conf. Multiagent System Technologies, 2003, pp. 1-12.
[13] D. Moore, W. Wright, and R. Kilmer, "Control surfaces for Cougaar," in Proc. First Open Cougaar Conference, 2004, pp. 37-44.
[14] W. Peng, V. Manikonda, and S. Kumara, "Understanding agent societies using distributed monitoring and profiling," in Proc. First Open Cougaar Conference, 2004, pp. 53-60.
[15] H. Gupta, Y. Hong, H. P. Thadakamalla, V. Manikonda, S. Kumara, and W. Peng, "Using predictors to improve the robustness of multi-agent systems: Design and implementation in Cougaar," in Proc. First Open Cougaar Conference, 2004, pp. 81-88.
[16] D. Moore, A. Helsinger, and D. Wells, "Deconfliction in ultra-large MAS: Issues and a potential architecture," in Proc. First Open Cougaar Conference, 2004, pp. 125-133.
[17] R. D. Snyder and D. C. Mackenzie, "Cougaar agent communities," in Proc. First Open Cougaar Conference, 2004, pp. 143-147.
[18] A. P. Moore, R. J. Ellison, and R. C. Linger, "Attack modeling for information security and survivability," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Note CMU/SEI-2001-TN-001, 2001.
[19] F. Moberg, "Security analysis of an information system using an attack tree-based methodology," M.S. thesis, Automation Engineering Program, Chalmers University of Technology, Sweden, 2000.
[20] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intelligence, vol. 72, no. 1-2, pp. 81-138, 1995.
[21] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems, vol. 12, no. 2, pp. 19-22, 1992.
[22] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[23] J. B. Rawlings, "Tutorial overview of model predictive control," IEEE Control Systems, vol. 20, no. 3, pp. 38-52, 2000.
[24] M. Morari and J. H. Lee, "Model predictive control: Past, present and future," Computers and Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999.
[25] M. Nikolaou, "Model predictive controllers: A critical synthesis of theory and industrial needs," in Advances in Chemical Engineering Series, Academic Press, 2001.
[26] S. J. Qin and T. A. Badgwell, "A survey of industrial model predictive technology," Control Engineering Practice, vol. 11, pp. 733-764, 2003.
[27] Y. Lengwiler, "The multiple unit auction with variable supply," Economic Theory, vol. 14, no. 2, pp. 373-392, 1999.



Self-Organizing Resource Allocation for<br />

Minimizing Completion Time in Large-Scale<br />

Distributed Information Networks<br />

Seokcheon Lee, Soundar Kumara, and Natarajan Gautam

Abstract—As information networks grow larger in size due to automation or organizational integration, it is important to provide simple decision-making mechanisms for individual entities or groups of entities that lead to desirable global performance. In this paper, we study a large-scale information network consisting of distributed software components linked together through a task flow structure, and we design a resource control mechanism for minimizing completion time. We define a load index that represents a component's workload. When resources are allocated locally in proportion to the load index, the network can maximize the utilization of distributed resources and achieve optimal performance in the limit of a large number of tasks. Coordinated resource allocation throughout the network emerges as a result of using the load index as global information. To clarify the obscurity of "large number of tasks," we provide a quantitative criterion for the adequacy of the proportional resource allocation for a given network. By periodically allocating resources under the framework of model predictive control, a closed-loop policy reactive to the current system state is formed. The designed resource control mechanism has several emergent properties found in many self-organizing systems, such as social or biological systems. Though it is localized and requires almost no computation, it realizes desirable global performance adaptive to changing environments.

Index Terms—Distributed information networks, emergence, resource allocation, scalability.

I. INTRODUCTION

Critical infrastructures are increasingly becoming dependent on networked systems in many domains due to automation or organizational integration. The growth in complexity and size of software systems is leading to the increasing importance of distributed and component-based architectures. Distributed computing aims at using the computing power of machines connected by a network. When a task requires intensive computation, it becomes a natural choice for achieving high performance. A component is a reusable program element. Component technology utilizes components so that developers can build the systems they need simply by defining the components' specific roles and wiring them together [1][2]. In networks with a component-based architecture, each component is highly specialized for specific tasks.

Manuscript received June 24, 2005. This work was supported in part by DARPA under Grant MDA 972-01-1-0038.

S. Lee is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (phone: 814-863-4799; fax: 814-863-4745; e-mail: stonesky@psu.edu).

S. Kumara is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (e-mail: skumara@psu.edu).

N. Gautam is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (e-mail: ngautam@psu.edu).

We study a large-scale information network (with respect to the number of components as well as machines) comprising distributed software components linked together through a task flow structure. A problem given to the network is decomposed into root tasks for some components, and those tasks are propagated through the task flow structure to other components. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. The service provided by the network is to produce a global solution to the given problem, which is an aggregation of the partial solutions of individual tasks. The Quality of Service (QoS) of the network is determined by the time to generate the global solution, i.e., the completion time. For a given topology, components share resources, and the network can control its behavior through resource allocation. Specifically, we address allocating the resources of each machine to the components residing on that machine. In this paper we develop a resource control mechanism for such networks that minimizes completion time.

Many self-organizing systems, such as social and biological systems, exhibit emergent properties. Though entities act with simple mechanisms and without central authority, these systems are adaptive, and desirable global performance can often be realized. The control mechanism designed in this paper has such properties, so it is applicable to large-scale networks working in a dynamic environment. Scalability, defined as "the ability of a solution to some problem to work when the size of the problem increases" (from the Dictionary of Computing at http://wombat.doc.ic.ac.uk), becomes a critical issue when developing practical software systems as the size of networks grows [3]. We also provide a criterion by which one can evaluate whether the emergent properties hold for a given network.

The organization of this paper is as follows. In Section II we discuss the problem domain, and in Section III we formally define the problem in detail. After designing the resource control mechanism in Sections IV and V, we show empirical results in Section VI. Finally, we conclude our work in Section VII.

II. PROBLEM DOMAIN

The networks we study represent distributed and component-based architectures for providing a solution to a given problem. A problem is decomposed into root tasks and solved by distributed components through a task flow structure. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. When the size of a problem becomes large, the size of the network as well as the number of tasks for each component can be large. One can imagine a wide range of scientific and engineering problems that can be solved with such architectures.

Cougaar (Cognitive Agent Architecture: http://www.cougaar.org), developed by DARPA (Defense Advanced Research Projects Agency), is such an architecture for building large-scale multi-agent systems. Recently, there have been efforts to combine the technologies of agents and components to improve the construction of large-scale software systems [4]-[6]. While component technology focuses on reusability, agent technology focuses on processing complex tasks as a community. Cougaar is in line with this trend: in Cougaar, a software system is composed of agents, and an agent is composed of components (called plugins). The task flow structure in these systems is that of components, as a combination of intra-agent and inter-agent task flows. As the agents in Cougaar can be distributed both in the geographical and in the information-content sense, the networks implemented in Cougaar have a distributed and component-based architecture.

UltraLog (http://www.ultralog.net) networks are military supply chain planning systems implemented in Cougaar [7]-[11]. Each agent in these networks represents an organization of a military supply chain and has a set of components specialized by functionality (allocation, expansion, inventory management, etc.) and class (ammunition, water, fuel, etc.). The objective of an UltraLog network is to provide an appropriate logistics plan for a given military operational plan. A logistics plan is a global solution that is an aggregate of the individual schedules built by components. An operational plan is decomposed into logistics requirements of each thread for each agent, and a requirement is further decomposed into root tasks (one task per day) for a designated component. As a result, a component can have hundreds of root tasks, depending on the horizon of an operation, and thousands of tasks to process as the root tasks are propagated. As the scale of operation increases, there can be thousands of agents (tens of thousands of components) on hundreds of machines working together to generate a logistics plan.

An UltraLog network performs initial planning and continuous replanning to cope with logistics plan deviations or operational plan changes. Initial planning and replanning are instances of the current research problem. The plan completion time of such networks directly affects the performance of a military operation.

III. PROBLEM SPECIFICATION

In this section we formally define the problem in a general<br />

form by detailing the network model <strong>and</strong> resource allocation.<br />

We concentrate on computational CPU resources assuming that<br />

the system is computation-bounded.<br />

A. Network Model<br />

A network is composed of a set of components A and a set of nodes (i.e., machines) N. K_n denotes the set of components that reside in node n, sharing the node's CPU resource. The task flow structure of the network, which defines the precedence relationships between components, is an arbitrary directed acyclic graph. A problem given to the network is decomposed into root tasks for some components, and those tasks are propagated through the task flow structure. Each component processes one of the tasks in its queue (which holds root tasks as well as tasks from predecessor components) and then sends it to its successor components. We denote the number of root tasks and the expected CPU time¹ per task of component i as rt_i and P_i, respectively. Fig. 1 shows an example network in which there are four components residing in three nodes. Components A1 and A2 reside in N1 and each of them has 100 root tasks. A3 in N2 and A4 in N3 have no root tasks, but each of them has 100 tasks from the corresponding predecessor, namely A1 and A2 respectively.

[Fig. 1 diagram: components A1 and A2 reside on node N1, feeding A3 on node N2 and A4 on node N3, respectively.]

Fig. 1. An example network. The network is composed of four components in three nodes, and performance can depend on the resource allocation of node N1.

B. Resource Allocation<br />

When there are multiple components in a node, the network needs to control its behavior through resource allocation. In the example network, node N1 has two components, and system performance can depend on how its resource is allocated to these two components. There are several CPU scheduling algorithms for allocating a CPU resource amongst multiple threads. Among them, proportional CPU share (PS) scheduling is known for its simplicity, flexibility, and fairness [12]. In PS scheduling, threads are assigned weights and resource shares are determined in proportion to the weights [13]. Excess CPU time from some threads is allocated fairly to other threads. There are many PS scheduling algorithms, such as Weighted Round-Robin scheduling, Lottery scheduling, and Stride scheduling [14]-[16].

¹ The distribution of CPU time can be arbitrary, though we use only the expected CPU time.
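As a concrete illustration of PS scheduling, the following minimal sketch (our own illustration, not UltraLog code; the thread names and ticket counts are assumptions) implements lottery scheduling: each quantum is awarded to a thread drawn with probability proportional to its ticket (weight) count, so observed CPU shares approach the weight proportions.

```python
import random

def lottery_schedule(tickets, quanta, seed=0):
    """Simulate lottery scheduling: each quantum goes to a thread drawn
    with probability proportional to its ticket (weight) count."""
    rng = random.Random(seed)
    threads = list(tickets)
    weights = [tickets[th] for th in threads]
    counts = dict.fromkeys(threads, 0)
    for _ in range(quanta):
        counts[rng.choices(threads, weights=weights)[0]] += 1
    # fraction of CPU quanta each thread actually received
    return {th: counts[th] / quanta for th in threads}

# With tickets 1:2, the observed shares approach 1/3 and 2/3.
shares = lottery_schedule({"A1": 1, "A2": 2}, quanta=30000)
```

Over many quanta the share of each thread converges to its ticket proportion, which is the fairness property the PS abstraction relies on.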

We adopt PS scheduling as the resource allocation scheme because of its generality, in addition to the benefits mentioned above. We define the resource allocation variable set w = {w_i(t): i∈A, t≥0}, in which w_i(t) is the non-negative weight of component i at time t. If the total managed weight of a node n is ω_n, the boundary condition for assigning weights over time can be described as:

Σ_{i∈K_n} w_i(t) = ω_n  where w_i(t) ≥ 0.  (1)

C. Problem Definition

The service provided by a network is to produce a global solution to a given problem, which is an aggregate of the partial solutions of individual tasks. QoS is determined by the completion time taken to generate the global solution. In this paper we develop a resource control mechanism to minimize the completion time T through the resource allocation w, as in (2):

arg min_w T  (2)

IV. OVERALL SOLUTION METHODOLOGY

There are two representative optimal control approaches for dynamic systems: Dynamic Programming (DP) and Model Predictive Control (MPC). Though DP gives an optimal closed-loop policy, it is inefficient for large-scale systems, especially when systems work over a finite time horizon [17]-[19]. In MPC, for each current state, an optimal open-loop control policy is designed over a finite time horizon by solving a static mathematical programming model [20]-[23]. The design process is repeated for the next observed state feedback, forming a closed-loop policy reactive to each current system state. Though MPC does not give an absolutely optimal policy in stochastic environments, the periodic design process alleviates the impact of stochasticity. Considering the characteristics of our problem, we choose the MPC framework: our networks are large-scale and work over a finite time horizon. So, we need to build a mathematical programming model.

The mathematical programming model is essentially a scheduling problem formulation. There are a variety of formulations and algorithms available for diverse scheduling problems in the context of multiprocessor systems, manufacturing, and project management. In general, a scheduling problem allocates limited resources to a set of tasks to optimize a specific objective. One widely studied objective is completion time (also called makespan in the manufacturing literature), as in the problem we have considered. Though it is not easy to find a problem exactly the same as ours, it is possible to convert our problem into one of the known scheduling problems. For example, in a job shop there are a set of jobs and a set of machines. Each job has a set of serial operations, and each operation must be processed on a specific machine. The job shop scheduling problem is to sequence the operations on each machine, subject to the job precedence constraints, such that the completion time is minimized. Our problem can be exactly transformed into such a job shop scheduling problem. However, scheduling problems are in general intractable. Though the job shop scheduling problem is polynomially solvable when there are two machines and each job has two operations, it becomes NP-hard in the number of jobs when the number of machines or operations exceeds two [24][25]. Considering that the task flow structure of our networks is arbitrary, our scheduling problem is NP-hard in the number of components in general, and the increase in the number of tasks imposes additional complexity. Moreover, there can be a large number of nodes in our networks.

Though it is possible to use some available heuristic algorithms from the job shop scheduling literature, our scheduling problem has a particular characteristic: the number of tasks for each component can be large. Though the increase in the number of tasks adds complexity, it also gives us a great opportunity to develop an efficient heuristic solution. So, we analyze the impact of this largeness on optimal scheduling in the course of developing a resource control mechanism.

V. RESOURCE CONTROL MECHANISM<br />

In this section we develop a resource control mechanism under the MPC framework. After exemplifying the effects of resource allocation, we develop the mechanism by characterizing an optimal open-loop resource allocation policy in the limit of a large number of tasks, and by providing a quantitative criterion for this largeness. For theoretical analysis, we assume a hypothetical weighted round-robin server for CPU scheduling, though this is not strictly required in practice, as will be discussed. The hypothetical server has idealized fairness: the CPU time received by each thread in a round is infinitesimal and proportional to the weight of the thread.

A. Effects of Resource Allocation<br />

The completion time T is the time taken to generate the global solution, i.e., to process all the tasks of a network. We denote by T_n the completion time taken to process all the tasks of node n, and by T_i that of component i. Then the relationships in (3) hold:

T = Max_{n∈N} T_n = Max_{i∈A} T_i,   T_n = Max_{i∈K_n} T_i.  (3)

A component's instantaneous resource availability RA_i(t) is the available fraction of a resource when the component requests the resource at time t. The service time S_i(t) is the time taken to process a task at time t and is related to RA_i(t) by:

∫_t^{t+S_i(t)} RA_i(τ) dτ = P_i.  (4)

When RA_i(t) remains constant, S_i(t) becomes:

S_i(t) = P_i / RA_i(t).  (5)

Now consider the example network in Fig. 1. In this network only N1 has the chance to allocate its resource, as it has two residing components. T_N1 is invariant to the resource allocation and equal to 300 (=100*1+100*2). But T_A1 and T_A2 can vary depending on the resource allocation of N1. When the resource is allocated equally to the components, both RA_A1(t) and RA_A2(t) are equal to 0.5 initially. As A1 completes at t=200 (=100*1/0.5), A2 starts utilizing the resource fully from then on, i.e., RA_A2(t)=1 for t≥200. So A2 completes 50 tasks by t=200 (=50*2/0.5) and the remaining 50 tasks by t=300 (=200+50*2/1). A3 completes at t=202 (=200+1*2/1) because the task inter-arrival time from A1 is equal to its service time. As A4's service time is less than the task inter-arrival time (=4) for t≤200, A4 completes 49 tasks by t=200, with one task in queue arriving at t=200. From t=200 the task inter-arrival time from A2 is reduced to 2, which is less than A4's service time. So tasks accumulate until t=300 and A4 completes at t=353 (=200+51*3/1). In this way we trace the exact system behavior under three resource allocation strategies, as shown in Fig. 2.

(a) Completion time

w_A1 : w_A2    1:1     1:2     1:4
T_A1           200     300     300
T_A2           300     300     250
T_A3           202     302     352
T_A4           353     303     302.5
T              353     303     352

(b) Resource availability of A1    (c) Resource availability of A2
[plots of RA_A1(t) and RA_A2(t) over 0-300 time units under the weight ratios 1:1, 1:2, and 1:4, with availability levels among 1/5, 1/3, 1/2, 2/3, and 4/5; not recoverable from text]

Fig. 2. Effects of resource allocation. Depending on the resource allocation of node N1, each of components A1 and A2 follows a different resource availability profile, as in (b) and (c). Consequently, the difference results in different completion times, as in (a).

The network cannot complete at less than t=300 because each of N1 and N3 requires 300 CPU time. When the resource is allocated in a 1:2 ratio, the completion time T is minimal, close to 300. The ratio is proportional to each component's total required CPU time, i.e., 1:2 ≡ 100*1 : 100*2. One interesting question is whether the proportional allocation gives the best performance even if the successors have different parameters.

The answer is yes. If component A1 is allocated more resource than under the proportional allocation, T_A3 is dominated by the maximum of T_A1 and A3's total CPU time. But the first quantity is less than T_N1 and the second quantity is invariant. So allocating more resource than the proportional allocation does not help reduce the completion time of the network. However, if a component is allocated less resource than under the proportional allocation, its successor's task inter-arrival time is stepwise decreasing. As a result, the successor underutilizes its resource and can complete later than under the proportional allocation. Therefore, the proportional allocation leads the network to utilize distributed resources efficiently and consequently helps minimize the completion time of the network, even though it is localized and independent of the successors' parameters.
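The behavior traced above can be reproduced numerically. The following time-stepped simulation is our own illustrative sketch (not the authors' implementation): it approximates idealized PS scheduling with a small step dt, so the computed completion times match the traced values only to within the step size. Component parameters follow Fig. 1.

```python
def simulate(weights, dt=0.005):
    """Time-stepped simulation of the Fig. 1 example under PS scheduling.
    weights: CPU weight per component (only node N1 is actually shared).
    Returns the approximate completion time of each component."""
    # component -> (node, CPU time per task, root tasks, successors)
    comps = {"A1": ("N1", 1.0, 100, ["A3"]),
             "A2": ("N1", 2.0, 100, ["A4"]),
             "A3": ("N2", 2.0, 0, []),
             "A4": ("N3", 3.0, 0, [])}
    queue = {c: comps[c][2] for c in comps}   # waiting tasks
    left = {c: 0.0 for c in comps}            # work left on task in service
    busy = {c: False for c in comps}
    done = {c: 0 for c in comps}
    finish, t = {}, 0.0
    while len(finish) < len(comps):
        for c in comps:                       # pull next task into service
            if not busy[c] and queue[c] > 0:
                queue[c] -= 1
                left[c] = comps[c][1]
                busy[c] = True
        for node in {v[0] for v in comps.values()}:
            active = [c for c in comps if comps[c][0] == node and busy[c]]
            wsum = sum(weights.get(c, 1.0) for c in active)
            for c in active:                  # proportional share of the node's CPU
                left[c] -= dt * weights.get(c, 1.0) / wsum
        t += dt
        for c in comps:                       # task completions feed successors
            if busy[c] and left[c] <= 1e-9:
                busy[c] = False
                done[c] += 1
                for s in comps[c][3]:
                    queue[s] += 1
                if done[c] == 100:            # every component processes 100 tasks
                    finish[c] = round(t, 2)
    return finish
```

Running `simulate({"A1": 1, "A2": 1})` reproduces the 1:1 column of Fig. 2(a) to within the step size, and `simulate({"A1": 1, "A2": 2})` reproduces the near-optimal 1:2 allocation.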

B. Optimal Open-loop Policy

To generalize the arguments to arbitrary network configurations, we define the load index LI_i, which represents component i's total CPU time required to process its tasks. As a component needs to process its own root tasks as well as incoming tasks from its predecessors, its number of tasks L_i is identified as in (6), where pred(i) denotes the set of immediate predecessors of component i. Then LI_i is given by (7):

L_i = rt_i + Σ_{a∈pred(i)} L_a  (6)

LI_i = L_i P_i  (7)

To provide a theoretical foundation for the optimal resource allocation policy, we convert a network into one whose tasks have infinitesimal processing times. Each root task is divided into r infinitesimal tasks and each P_i is replaced with P_i/r. The load index of each component is then the same as in the original network, but the tasks are infinitesimal. We denote the completion time of the network with infinitesimal tasks as T′. Also, we define a notion of task availability as an indicator of relative preference among task arrival patterns. An arrival pattern gives higher task availability than another if its cumulative number of arrived tasks is larger or equal over time. A component prefers a task arrival pattern with higher task availability, as it can then utilize more resource. Consider a network and reconfigure it such that all components have all their tasks in their queues at t=0. Each component has maximal task availability in the reconfigured network, and the completion time of the reconfigured network forms the lower bound T_LB of a network's completion time T, given by:

T_LB = Max_{n∈N} Σ_{i∈K_n} LI_i.  (8)
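The quantities in (6)-(8) are straightforward to compute. The sketch below (our own illustration, not the authors' code) derives each component's task count, load index, and the lower bound T_LB for the example network of Fig. 1, assuming the topology A1→A3, A2→A4 described earlier.

```python
def load_indices(root_tasks, cpu_per_task, preds):
    """Compute L_i (eq. 6) and LI_i (eq. 7) for components given in
    topological order over an arbitrary task-flow DAG."""
    L, LI = {}, {}
    for c in root_tasks:
        L[c] = root_tasks[c] + sum(L[a] for a in preds[c])
        LI[c] = L[c] * cpu_per_task[c]
    return L, LI

def lower_bound(LI, node_of):
    """T_LB (eq. 8): the total load index of the most loaded node."""
    totals = {}
    for c, li in LI.items():
        totals[node_of[c]] = totals.get(node_of[c], 0.0) + li
    return max(totals.values())

# Fig. 1 example, listed in topological order (A1, A2 before A3, A4)
root = {"A1": 100, "A2": 100, "A3": 0, "A4": 0}
P = {"A1": 1.0, "A2": 2.0, "A3": 2.0, "A4": 3.0}
preds = {"A1": [], "A2": [], "A3": ["A1"], "A4": ["A2"]}
node = {"A1": "N1", "A2": "N1", "A3": "N2", "A4": "N3"}
L, LI = load_indices(root, P, preds)
T_LB = lower_bound(LI, node)
# On N1 the load indices are 100 and 200, i.e. the 1:2 weight
# ratio found best in Section V-A; T_LB = 300 (nodes N1 and N3 tie).
```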

Theorem 1. T′ equals T_LB when each node allocates its resource proportional to its residing components' load indices:

w_i(t) = w_i = ω_{n(i)} · LI_i / Σ_{p∈K_{n(i)}} LI_p   for all t ≥ 0,  (9)

where n(i) denotes the node in which component i resides.

Proof. RA_i(t) is greater than or equal to the assigned weight proportion:

RA_i(t) ≥ w_i(t) / ω_{n(i)}   for t ≥ 0.  (10)

Suppose a component i receives its tasks at a constant interval of T_LB/L_i. Then, under proportional allocation, S_i(t) is less than or equal to T_LB/L_i over time, as shown in (11):

P_i = ∫_t^{t+S_i(t)} RA_i(τ) dτ ≥ ∫_t^{t+S_i(t)} [w_i(t)/ω_{n(i)}] dτ = (LI_i / Σ_{p∈K_{n(i)}} LI_p) · S_i(t)
⇒ S_i(t) ≤ P_i · Σ_{p∈K_{n(i)}} LI_p / LI_i ≤ T_LB / L_i   for t ≥ 0.  (11)

So, any component can complete by T_LB and generate tasks at a constant interval of T_LB/L_i from t=T_LB/L_i (first task generation time) under proportional allocation when it receives tasks at a constant interval of T_LB/L_i from t=0 (first task arrival time). As tasks are infinitesimal and root tasks increase task availability, each component can receive infinitesimal tasks at a constant interval in 0≤t≤T_LB or more preferably, and complete at or before T_LB. So, the network completes at T_LB under proportional allocation. □

From Theorem 1 we conjecture that a network can achieve performance close to T_LB under proportional allocation in the limit of a large number of tasks. We propose the proportional allocation as an optimal resource allocation policy. Though the proportional allocation is localized, the network can maximize the utilization of distributed resources and achieve desirable performance. Coordinated resource allocation throughout the network emerges as a result of using the load index as global information. If nodes do not follow the proportional allocation policy, some components can receive their tasks less preferably, resulting in underutilization and consequently an increased completion time, as shown in the previous subsection.

Another important property of the proportional allocation policy is that it is itself adaptive. Suppose there are stressors sharing resources with the components. We denote by ω_n^s the amount of resource taken by a stressor in node n. Then the lower bound performance T_s^LB under stress is given by:

T_s^LB = Max_{n∈N} [(ω_n + ω_n^s) / ω_n] · Σ_{i∈K_n} LI_i.  (12)

We denote the completion time under stress as T_s′.

Theorem 2. T_s′ equals T_s^LB under proportional allocation.

Proof. RA_i(t) becomes:

RA_i(t) ≥ w_i(t) / (ω_{n(i)} + ω_{n(i)}^s)   for t ≥ 0.  (13)

Then, (11) results in (14) under proportional allocation:

T_s^LB / L_i ≥ S_i(t)   for t ≥ 0.  (14)

Therefore, the network completes at T_s^LB under proportional allocation. □

Theorem 2 shows that the proportional allocation policy is optimal independent of the stress environment. Though we do not consider stressors explicitly, the policy achieves the lower bound performance adaptively. This characteristic is especially important when the system is vulnerable to unpredictable stress environments. Modern networked systems can easily be exposed to various adverse events, such as accidental failures and malicious attacks, and the space of stress environments is high-dimensional and evolving [26]-[28].

C. Adequacy Criterion

The arguments we have made hold in the limit of a large number of tasks. As the term "large" is vague, we need to give it a concrete definition. We define it via an adequacy criterion, by which one can evaluate whether the desirable properties of the proportional allocation hold for a given network. For this purpose we characterize the upper bound performance of a network under proportional allocation.

Theorem 3. Under proportional allocation a network’s upper<br />

bound T UB of completion time T is given by:<br />

T<br />

UB<br />

= T<br />

LB<br />

+ Max Max<br />

e∈E<br />

j∈Se<br />

∑<br />

i∈j<br />

[ P<br />

∑<br />

LI<br />

i<br />

p∈K<br />

n(<br />

i)<br />

p<br />

/ LI<br />

i<br />

] , (15)<br />

where E denotes a set of components which have no successor<br />

<strong>and</strong> S e a set of task paths to component e. A task path to<br />

component e is a set of components in a path from a<br />

component with no predecessor to component e <strong>and</strong> does not<br />

include component e.<br />

Proof. From (11) we can induce the lowest upper bound S i UB of<br />

S i (t) as:


6<br />

S<br />

UB<br />

i<br />

∑<br />

= P LI / LI . (16)<br />

i<br />

p∈K<br />

n(<br />

i)<br />

So, a component i can complete by T LB <strong>and</strong> generate tasks at<br />

a constant interval of T LB /L i from t=S i UB when it receives<br />

tasks at a constant interval of T LB /L i from t=0. Now, consider<br />

component i’s successor s which has only one predecessor.<br />

As the successor receives tasks at a constant interval of T LB /L s<br />

from t=S i UB or more preferably, it can complete by S i UB +T LB .<br />

So, a component e∈E (with no successor) can receive tasks at<br />

a constant interval of T LB /L e from maximal task traveling time<br />

to the component of:<br />

Max_{j&isin;S_e} &Sigma;_{i&isin;j} S_i^UB  (17)<br />

(note that a path j does not include component e) or earlier, so that its completion time T_e is bounded as:<br />

T_e &le; T_LB + Max_{j&isin;S_e} &Sigma;_{i&isin;j} S_i^UB .  (18)<br />

The upper bound of T is then the maximal of these bounds over e&isin;E.<br />

Though we formulated the upper-bound performance without considering stress environments, one can easily modify it so that it reflects the stress environments (if each &omega;_n^s is identifiable or assumable). The adequacy criterion is defined as the ratio between T_LB and T_UB, as in (19). When the criterion is close to one, a network can achieve the lower-bound performance using the proportional allocation policy. Typically, the criterion converges to one as each L_i increases. However, as the criterion approaches zero, the policy becomes more and more inadequate. The example network in Fig. 1 is quite adequate because its adequacy is 0.99 (300/303).<br />

Adequacy = T_LB / T_UB  (19)<br />
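The bound in (15) and the criterion in (19) are straightforward to compute for a concrete network. The following sketch (hypothetical dict-based inputs, not the paper's implementation) evaluates each S_i^UB per (16) and then takes the worst task-path sum per (15) with a memoized longest-path sweep over the task-flow DAG:

```python
import functools

def adequacy(T_LB, preds, node_of, P, LI):
    """Adequacy = T_LB / T_UB, with T_UB computed per (15)-(16):
    S_i_UB = P_i * (sum of LI_p over components p on i's node) / LI_i,
    and T_UB = T_LB plus the worst sum of S_i_UB along any task path
    into a sink component (the sink itself excluded)."""
    by_node = {}
    for i, n in node_of.items():
        by_node.setdefault(n, []).append(i)
    S_UB = {i: P[i] * sum(LI[p] for p in by_node[node_of[i]]) / LI[i]
            for i in P}

    succ_count = {i: 0 for i in P}
    for i in preds:
        for p in preds[i]:
            succ_count[p] += 1
    sinks = [i for i in P if succ_count[i] == 0]  # components with no successor

    @functools.lru_cache(maxsize=None)
    def worst_path(i):
        # Maximal sum of S_UB over a task path ending just before i.
        ps = tuple(preds.get(i, ()))
        if not ps:
            return 0.0
        return max(worst_path(p) + S_UB[p] for p in ps)

    T_UB = T_LB + max(worst_path(e) for e in sinks)
    return T_LB / T_UB
```

On a hypothetical two-component chain sharing one node, with T_LB = 300 and an upstream start bound S_UB of 3, this returns 300/303, the same ratio quoted in the text.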

So far, we have assumed a hypothetical weighted round-robin server, which is difficult to realize in practice. Our arguments nevertheless remain valid, because they are based on worst-case analysis and, in reality, the quantum size is infinitesimally small compared to the working horizon.<br />

D. Resource control mechanism<br />

Once a network&rsquo;s adequacy is above an appropriate level (depending on the nature of the network), the proportional allocation is deployed periodically under the MPC framework. Consider the current time as t. To update the load index as the system moves on, we slightly modify it to represent the total CPU time for the remaining tasks as:<br />

LI_i(t) = R_i(t) + L_i(t) &middot; P_i ,  (20)<br />

in which R_i(t) denotes the remaining CPU time for the task in process and L_i(t) the number of remaining tasks excluding the task in process. After identifying the initial number of tasks L_i(0) = L_i, each component updates it by counting down as it processes tasks. Periodically, the resource manager of each node collects the current LI_i(t) values from its resident components and allocates resources proportionally to the indices, as in (21). As the resource allocation policy is purely localized, there is no need for synchronization between nodes. The designed resource control mechanism is scalable, as each node can make decisions independently of the others while requiring almost no computation.<br />

w_i(t) = &omega;_{n(i)} &middot; LI_i(t) / &Sigma;_{p&isin;K_{n(i)}} LI_p(t)  (21)<br />
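A minimal sketch of one control period on a single node, following (20) and (21); the component data below is hypothetical:

```python
def node_weights(omega, comps):
    """One control period on a single node: recompute load indices per
    (20) and set weights proportional to them per (21).
    comps maps component id -> (R_t, L_t, P_i): remaining CPU time of
    the task in process, number of remaining queued tasks, and CPU
    time per task."""
    LI = {i: R + L * P for i, (R, L, P) in comps.items()}   # eq. (20)
    total = sum(LI.values())
    return {i: omega * li / total for i, li in LI.items()}  # eq. (21)
```

For example, on a node with managed weight 1.0 and two hypothetical components with load indices 42 and 40, the weights come out 42/82 and 40/82. No information from any other node is needed, which is what makes the mechanism local and scalable.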

VI. EMPIRICAL RESULTS<br />

We ran several experiments using discrete-event simulation<br />

to validate the designed resource control mechanism.<br />

A. Experimental design<br />

The experimental network is composed of eight components in four nodes, as in Fig. 3. Two components share a resource in N_3 and four components share a resource in N_4. Also, &omega;_n is 1 for all n&isin;N, and CPU is allocated using weighted round-robin scheduling in which the CPU time received by each component in a round is equal to its assigned weight.<br />

Fig. 3. Experimental network configuration: A_1 in N_1, A_2 in N_2, A_3 and A_4 in N_3, and A_5 through A_8 in N_4. The network is composed of eight components in four nodes, and the performance can depend on the resource allocation of nodes N_3 and N_4.<br />

We set up ten different experimental conditions, as shown in Table I. We vary the number of root tasks rt_i and the CPU time per task P_i, and the distribution of P_i can be deterministic or exponential. When using the stochastic distribution we repeat each experiment 5 times.<br />

We use three different resource control policies for each experimental condition; Table II shows these control policies. In the round-robin allocation policy (RR), the components in each node are assigned equal weights over time. PA-O and PA-C use the proportional allocation policy in open loop and closed loop,<br />

respectively. In PA-O, resources are allocated only at t = 0 and kept fixed over time, while in PA-C they are reallocated periodically (every 100 time units). PA-C is the resource control mechanism we have designed.<br />

B. Results<br />

TABLE I<br />
EXPERIMENTAL CONDITIONS<br />
Condition | Distribution of P_i | rt_i | P_i<br />
Con1-1 | Deterministic | [000 000 000 000 200 200 200 200] | [04 12 04 08 02 02 02 02]<br />
Con1-2 | Exponential | (same as Con1-1) | (same as Con1-1)<br />
Con2-1 | Deterministic | [100 100 100 100 200 200 200 200] | [04 12 04 10 02 02 06 06]<br />
Con2-2 | Exponential | (same as Con2-1) | (same as Con2-1)<br />
Con3-1 | Deterministic | [100 100 100 100 200 200 200 100] | [04 12 04 10 02 02 20 10]<br />
Con3-2 | Exponential | (same as Con3-1) | (same as Con3-1)<br />
Con4-1 | Deterministic | [100 100 100 100 200 200 200 200] | [04 12 04 08 02 02 10 02]<br />
Con4-2 | Exponential | (same as Con4-1) | (same as Con4-1)<br />
Con5-1 | Deterministic | [100 100 200 200 200 200 200 200] | [04 10 04 08 02 02 02 02]<br />
Con5-2 | Exponential | (same as Con5-1) | (same as Con5-1)<br />

TABLE II<br />
CONTROL POLICIES FOR EXPERIMENTATION<br />
Control policy | Description<br />
RR | Round-robin allocation<br />
PA-O | Proportional allocation, open loop<br />
PA-C | Proportional allocation, closed loop<br />

Numerical results from the experimentation are shown in Table III. Lower and upper bounds are calculated for each experimental condition. The network adequacy of each condition is close to one, so the proportional allocation policy can be used effectively for all the conditions.<br />

The proportional allocation policies (PA-O and PA-C) show significant advantages over round-robin allocation under all conditions. The completion time T under proportional allocation is bounded by T_UB and close to T_LB in all deterministic conditions (note that the performance of PA-O and PA-C is the same in deterministic environments), supporting the effectiveness of the resource allocation policy. Though T_UB is not accurate in stochastic environments, the performance improves to nearly T_LB when the proportional allocation is implemented in closed loop: the periodic redesign process alleviates the impact of stochasticity. So, we can conclude that the designed control mechanism can be used effectively, even in stochastic environments, for networks with high adequacy.<br />

The performance differences can be explained by resource utilization, as discussed earlier. The node with maximal total CPU time needs to utilize its resource almost fully to achieve a performance close to T_LB. For example, N_2 is such a node in Con1-1 (deterministic) and Con1-2 (stochastic). Resource utilization profiles of N_2 are shown in Fig. 4 for Con1-1 and Fig. 5 for Con1-2, in which a data point corresponds to the amount of resource utilized during a control period (100 time units). In the deterministic environment (Con1-1), N_2 utilizes its resource almost fully under both proportional allocation<br />


Fig. 4. Resource utilization of N_2 in Con1-1. In a deterministic environment, N_2 utilizes its resource almost fully under both proportional allocation policies (PA-O, PA-C) while underutilizing it in the initial stage under the round-robin allocation policy (RR).<br />


Fig. 5. Resource utilization of N_2 in Con1-2. In a stochastic environment, N_2 utilizes its resource more under the proportional allocation policies (PA-O, PA-C) than under the round-robin allocation policy (RR), and resource utilization under the closed-loop policy (PA-C) is larger than under the open-loop policy (PA-O).<br />


TABLE III<br />
EXPERIMENTAL RESULTS<br />
Condition | T_LB | T_UB | Adequacy | RR: T | RR: T_LB/T | PA-O: T | PA-O: T_LB/T | PA-C: T | PA-C: T_LB/T<br />
Con1-1 | 4800 | 4820 | 0.996 | 5619 | 0.854 | 4820 | 0.996 | 4820 | 0.996<br />
Con1-2 | 4800 | 4820 | 0.996 | 5618 | 0.854 | 5021 | 0.956 | 4939 | 0.972<br />
Con2-1 | 7200 | 7230 | 0.996 | 7612 | 0.946 | 7200 | 1.000 | 7200 | 1.000<br />
Con2-2 | 7200 | 7230 | 0.996 | 7679 | 0.938 | 7323 | 0.983 | 7252 | 0.993<br />
Con3-1 | 6000 | 6073 | 0.988 | 6412 | 0.936 | 6012 | 0.998 | 6012 | 0.998<br />
Con3-2 | 6000 | 6073 | 0.988 | 6408 | 0.936 | 6193 | 0.969 | 6013 | 0.998<br />
Con4-1 | 7200 | 7228 | 0.996 | 7200 | 1.000 | 7200 | 1.000 | 7200 | 1.000<br />
Con4-2 | 7200 | 7228 | 0.996 | 7231 | 0.996 | 7109 | 1.013 | 7169 | 1.004<br />
Con5-1 | 7200 | 7220 | 0.997 | 7810 | 0.922 | 7210 | 0.999 | 7210 | 0.999<br />
Con5-2 | 7200 | 7220 | 0.997 | 7979 | 0.902 | 7351 | 0.979 | 7319 | 0.984<br />
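The Adequacy column in Table III is simply (19) applied to the two bound columns; for instance, 4800/4820 rounds to 0.996. A quick consistency check over the table's bound values:

```python
# (T_LB, T_UB, reported adequacy) triples transcribed from Table III.
bounds = {
    "Con1": (4800, 4820, 0.996),
    "Con2": (7200, 7230, 0.996),
    "Con3": (6000, 6073, 0.988),
    "Con4": (7200, 7228, 0.996),
    "Con5": (7200, 7220, 0.997),
}
for name, (t_lb, t_ub, reported) in bounds.items():
    # Adequacy = T_LB / T_UB, eq. (19), rounded to three decimals.
    assert round(t_lb / t_ub, 3) == reported, name
```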



policies, while underutilizing it in the initial stage under round-robin allocation. In the stochastic environment (Con1-2), the resource utilization profiles under the two proportional allocation policies differ. Though both policies give higher utilization than round-robin allocation, resource utilization under the closed-loop policy is larger than under the open-loop policy. These differences in resource utilization produce the performance differences in Table III. The designed control mechanism helps maximize the utilization of distributed resources so as to achieve the desirable performance.<br />

VII. CONCLUSIONS<br />

A typical information network emerges as a result of automation or organizational integration, and is large-scale with a distributed, component-based architecture. In this paper we designed a resource control mechanism for minimizing the completion time of such networks. The designed resource control mechanism has several desirable properties. First, it is localized, as each node can make decisions independently of the others. Second, it requires almost no computation. Third, the network can nevertheless achieve a desirable performance. Fourth, it is itself adaptive to stress environments without explicit consideration of them. Such emergent properties can be found in many self-organizing systems, such as social or biological systems: though entities act with a simple mechanism and without a central authority, desirable global performance can often be realized. When a large-scale network works in a dynamic environment under the designed control mechanism, it is truly a self-organizing system.<br />

REFERENCES<br />

[1] B. Meyer, &ldquo;On to components,&rdquo; IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.<br />
[2] P. Clements, &ldquo;From subroutines to subsystems: Component-based software development,&rdquo; in Component-Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press, 1996, pp. 3-6.<br />
[3] O. F. Rana and K. Stout, &ldquo;What is scalability in multi-agent systems?,&rdquo; in Proc. 4th Int. Conf. on Autonomous Agents, 2000, pp. 56-63.<br />
[4] F. M. T. Brazier, C. M. Jonker, and J. Treur, &ldquo;Principles of component-based design of intelligent agents,&rdquo; Data and Knowledge Engineering, vol. 41, no. 1, pp. 1-28, 2002.<br />
[5] H. J. Goradia and J. M. Vidal, &ldquo;Building blocks for agent design,&rdquo; in Proc. 4th Int. Workshop on Agent-Oriented Software Engineering, 2003, pp. 17-30.<br />
[6] R. Krutisch, P. Meier, and M. Wirsing, &ldquo;The AgentComponent approach, combining agents and components,&rdquo; in Proc. 1st German Conf. on Multiagent System Technologies, 2003, pp. 1-12.<br />
[7] D. Moore, W. Wright, and R. Kilmer, &ldquo;Control surfaces for Cougaar,&rdquo; in Proc. First Open Cougaar Conference, 2004, pp. 37-44.<br />
[8] W. Peng, V. Manikonda, and S. Kumara, &ldquo;Understanding agent societies using distributed monitoring and profiling,&rdquo; in Proc. First Open Cougaar Conference, 2004, pp. 53-60.<br />
[9] H. Gupta, Y. Hong, H. P. Thadakamalla, V. Manikonda, S. Kumara, and W. Peng, &ldquo;Using predictors to improve the robustness of multi-agent systems: Design and implementation in Cougaar,&rdquo; in Proc. First Open Cougaar Conference, 2004, pp. 81-88.<br />
[10] D. Moore, A. Helsinger, and D. Wells, &ldquo;Deconfliction in ultra-large MAS: Issues and a potential architecture,&rdquo; in Proc. First Open Cougaar Conference, 2004, pp. 125-133.<br />
[11] R. D. Snyder and D. C. Mackenzie, &ldquo;Cougaar agent communities,&rdquo; in Proc. First Open Cougaar Conference, 2004, pp. 143-147.<br />
[12] J. Regehr, &ldquo;Some guidelines for proportional share CPU scheduling in general-purpose operating systems,&rdquo; presented as a work in progress at the 22nd IEEE Real-Time Systems Symposium, London, UK, Dec. 3-6, 2001.<br />
[13] I. Stoica, H. Abdel-Wahab, J. Gehrke, K. Jeffay, S. K. Baruah, and C. G. Plaxton, &ldquo;A proportional share resource allocation algorithm for real-time, time-shared systems,&rdquo; in Proc. 17th IEEE Real-Time Systems Symposium, 1996, pp. 288-299.<br />
[14] C. A. Waldspurger and W. E. Weihl, &ldquo;Lottery scheduling: Flexible proportional-share resource management,&rdquo; in Proc. First Symposium on Operating Systems Design and Implementation, 1994, pp. 1-11.<br />
[15] C. A. Waldspurger and W. E. Weihl, &ldquo;Stride scheduling: Deterministic proportional-share resource management,&rdquo; Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Tech. Rep. MIT/LCS/TM-528, 1995.<br />
[16] C. A. Waldspurger, &ldquo;Lottery and stride scheduling: Flexible proportional-share resource management,&rdquo; Ph.D. dissertation, Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1995.<br />
[17] A. G. Barto, S. J. Bradtke, and S. P. Singh, &ldquo;Learning to act using real-time dynamic programming,&rdquo; Artificial Intelligence, vol. 72, pp. 81-138, 1995.<br />
[18] R. S. Sutton, A. G. Barto, and R. J. Williams, &ldquo;Reinforcement learning is direct adaptive optimal control,&rdquo; IEEE Control Systems, vol. 12, no. 2, pp. 19-22, 1992.<br />
[19] L. P. Kaelbling, M. L. Littman, and A. W. Moore, &ldquo;Reinforcement learning: A survey,&rdquo; Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.<br />
[20] J. B. Rawlings, &ldquo;Tutorial overview of model predictive control,&rdquo; IEEE Control Systems, vol. 20, no. 3, pp. 38-52, 2000.<br />
[21] M. Morari and J. H. Lee, &ldquo;Model predictive control: Past, present and future,&rdquo; Computers and Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999.<br />
[22] M. Nikolaou, &ldquo;Model predictive controllers: A critical synthesis of theory and industrial needs,&rdquo; Advances in Chemical Engineering Series, Academic Press, 2001.<br />
[23] S. J. Qin and T. A. Badgwell, &ldquo;A survey of industrial model predictive control technology,&rdquo; Control Engineering Practice, vol. 11, pp. 733-764, 2003.<br />
[24] T. Gonzalez and S. Sahni, &ldquo;Flowshop and jobshop schedules: Complexity and approximation,&rdquo; Operations Research, vol. 26, pp. 36-52, 1978.<br />
[25] J. K. Lenstra, A. H. G. Rinnooy Kan, and P. Brucker, &ldquo;Complexity of machine scheduling problems,&rdquo; Annals of Discrete Mathematics, vol. 1, pp. 343-362, 1977.<br />
[26] S. Jha and J. M. Wing, &ldquo;Survivability analysis of networked systems,&rdquo; in Proc. 23rd Int. Conf. on Software Engineering, 2001, pp. 307-317.<br />
[27] A. P. Moore, R. J. Ellison, and R. C. Linger, &ldquo;Attack modeling for information security and survivability,&rdquo; Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Note CMU/SEI-2001-TN-001, 2001.<br />
[28] F. Moberg, &ldquo;Security analysis of an information system using an attack tree-based methodology,&rdquo; M.S. thesis, Automation Engineering Program, Chalmers University of Technology, Sweden, 2000.<br />


Efficient Method of Quantifying Minimal Completion Time for Component-Based Service Networks: Network Topology and Resource Allocation*<br />
Seokcheon Lee, Soundar Kumara, and Natarajan Gautam<br />
Department of Industrial and Manufacturing Engineering<br />
The Pennsylvania State University<br />
University Park, PA 16802<br />
{stonesky, skumara, ngautam}@psu.edu<br />

ABSTRACT<br />

In a grid service environment, it is important to be able to agilely quantify the quality of service achievable by each alternative composition of resources and services. This capability is an essential driver not only for efficiently utilizing the resources and services, but also for promoting the virtual economy. In this paper, we develop such a method of quantifying the minimal completion time for component-based service networks whose task flow structure is a combination of intra-service and inter-service task flows. The performance of the network is a function of network topology and resource allocation: network topology assigns components to available machines, and resource allocation allocates the resources of each machine to the resident components. Though similar problems can be found in the multiprocessor scheduling literature, our problem is different especially because a component in our networks can have multiple tasks to process, i.e., a component can process tasks in parallel with its successor or predecessor components. The designed method incorporates the fact that the components in a network can be considered independent under a certain resource allocation policy when the number of tasks of each component is large.<br />

Index Terms: Multiprocessor systems, sequencing and scheduling, network topology, modeling and prediction, optimization<br />

1. Introduction<br />

Individual systems are becoming interoperable by virtue of several enabling technologies. Grid technology provides inexpensive access to large computational resources across institutional boundaries [1]. Services can be composed over the Internet via Web Service technology, creating enormous opportunities for the automation of business processes [2]. OGSA (Open Grid Services Architecture: http://www.globus.org/ogsa/) defines a grid system<br />

* This work was supported, in part, by DARPA (Grant #: MDA972-01-1-0038) under the UltraLog program.<br />


architecture based on both the Grid and Web Service technologies. The Grid Service enables the integration of resources and services across distributed, heterogeneous, dynamic virtual organizations [3]. Cost and quality considerations may force a large number of customers to look for resources and services via such an architecture to deal with their own computing problems. Ubiquitous computing technology embeds computers in various objects and places for sensing and controlling environments [4]. As this technology becomes realized and gives rise to complex computing problems, the use of such an architecture might be inevitable.<br />

In a grid service environment, a problem is processed by composing multiple resources and services. As there can be several alternative compositions of resources and services for a given problem, virtual markets will play a critical role in coordinating a huge number of economic entities such as customers, service providers, and resource providers. Various market mechanisms, such as OCEAN [5], Compute Power Market [6], and Nimrod/G [7], have been proposed for the large-scale virtual economy. However, one essential enabler of such markets is the ability to agilely quantify the quality of service (QoS) achievable by each alternative. Without such a capability, the alternatives cannot be valuated in a timely manner, and the virtual economy will fail to efficiently utilize the resources and services.<br />

There can be various ways of defining QoS depending on the nature of the problems. We consider a class of problems whose QoS is determined by the completion time for generating a solution. The completion time (also called makespan) is one of the most widely studied objectives for diverse scheduling problems in the contexts of multiprocessing, manufacturing, and project management. Regarding the problem-solving structure, we adopt component-based architecture as a general framework. A component is a reusable program element. Component technology utilizes the components so that developers can build the systems they need by simply<br />


defining their specific roles and wiring them together [8][9]. In service networks with component-based architecture, each component is highly specialized for specific tasks, and the task flow structure between components is a combination of intra-service and inter-service task flows. A problem given to such a network is decomposed in terms of root tasks for some components, and those tasks are propagated through the task flow structure to other components. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. One can imagine a wide range of scientific and engineering problems that can be solved by such a network.<br />

In this paper, we develop an efficient method of quantifying the minimal completion time for component-based service networks. For a given set of resources and services, the performance can vary depending on the way distributed heterogeneous resources are utilized. Network topology assigns components to available machines under a set of constraints: the components of a web service may not be separable onto different machines, and a web service may be restricted to specific machines. Though mobile code provides great flexibility for creating distributed systems, there are technical challenges, such as security, to fulfilling its promise [10]-[13]. Given a network topology, there can be multiple components in a machine sharing the machine&rsquo;s resources together, so resource allocation can play an important role in controlling the performance of a network. These two control facilities determine the performance of a network, and the minimal completion time represents the QoS achievable by a set of resources and services.<br />

Similar problems can be found in the multiprocessor scheduling literature&sup1;. There is a set of components with a task flow structure between them, and each component without predecessors has one root task. Each component processes exactly one task, only after all of its predecessors complete their tasks. A multiprocessor schedule is composed of an assignment of components<br />

&sup1; We adapt the terms used in multiprocessor scheduling to our context throughout this paper.<br />


to machines (network topology) and a sequence of components for each machine (resource allocation). However, our problem is different especially because a component in our networks can have multiple tasks to process, i.e., a component can process tasks in parallel with its successors or predecessors. The easiest multiprocessor scheduling problem arises when the components are independent, i.e., when there is no task flow between components; yet even this problem is known to be NP-complete [14][15]. Considering that the task flow structure of our networks is arbitrary and each component can have multiple tasks to process, our scheduling problem is even harder.<br />

In this context, the method designed in this paper is a heuristic applicable to cases where the number of tasks to be processed by each component is large. Though the increase in the number of tasks adds complexity, it gives us a great opportunity to develop an efficient heuristic. Our method also addresses resource reservation. When different applications share resources, their performance can be guaranteed through resource reservation. The method quantifies the minimal completion time by incorporating the resource reservations of other applications, and also makes it possible to establish resource reservations for the service network under consideration.<br />

The organization of this paper is as follows. In Section 2 we formally define the problem in detail. After designing the method in Section 3, we show empirical results in Section 4. Finally, we discuss implications and possible extensions of our work in Section 5.<br />

2. Problem statement<br />

In this section we formally define the problem by detailing the component-based service network, network topology, and resource allocation. We focus on computational CPU resources, assuming that the system is computation-bound.<br />


2.1 Component-based service network<br />

A network is composed of a set I = {i: i&isin;I} of components and a task flow structure between them. The task flow structure of the network, which defines the precedence relationships between components, is an arbitrary directed acyclic graph. A problem given to a network is decomposed in terms of root tasks for some components, and those tasks are propagated through the task flow structure. Each component processes one of the tasks in its queue (which holds root tasks as well as tasks from predecessor components) and then sends it to its successor components. We denote the number of root tasks of component i as rt_i. There is a set K = {k: k&isin;K} of available machines, and P_i(k) represents the CPU time per task of component i at machine k, reflecting the computation speed differences between machines.<br />

Fig. 1 shows an example network composed of four components in three machines. In the figure, &#10216;rt_i, P_i(k)&#10217; denotes rt_i and P_i(k) at the residing machine, respectively. Components I_1 and I_2 reside in machine K_1 and each of them has 100 root tasks. I_3 in K_2 and I_4 in K_3 have no root tasks, but they have 200 and 100 tasks, respectively, from their corresponding predecessors.<br />

Fig. 1. An example network composed of four components in three machines. &#10216;rt_i, P_i(k)&#10217; denotes the number of root tasks and the CPU time per task at the residing machine.<br />

2.2 Network topology<br />

Considering that the components of a web service may not be separable onto different machines, we define a set J = {j: j&isin;J} of clusters and denote the components of cluster j as M_j. Each component is a member of exactly one cluster, and the components in a cluster should be assigned to the same machine. Each cluster can be assigned to a set of machines, and we denote the assignable machine set of cluster j as N_j. We define the topology variable set X = {x_jk: j&isin;J, k&isin;K} in which x_jk is 1 if cluster j is assigned to machine k and 0 otherwise. The constraints on the topology variables are as in (1).<br />

Network topology constraints:<br />
&Sigma;_{k&isin;N_j} x_jk = 1  for all j &isin; J<br />
&Sigma;_{k&notin;N_j} x_jk = 0  for all j &isin; J<br />
x_jk &isin; {0,1}  for all j &isin; J and k &isin; K  (1)<br />
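The constraints in (1) say that each cluster goes to exactly one machine, and only to a machine in its assignable set. A feasibility check for a candidate topology X can be sketched as follows (the cluster and machine names used in the example are illustrative):

```python
def feasible_topology(X, J, K, N):
    """Check topology variables X[j][k] against (1): every x_jk is
    binary, each cluster j is assigned to exactly one machine, and
    that machine must lie in j's assignable set N[j]."""
    for j in J:
        if any(X[j][k] not in (0, 1) for k in K):
            return False
        if sum(X[j][k] for k in K) != 1:
            return False
        if any(X[j][k] == 1 for k in K if k not in N[j]):
            return False
    return True
```

For instance, with N = {"j1": {"k1"}, "j2": {"k1", "k2"}}, assigning j1 to k1 and j2 to k2 is feasible, while assigning j1 to k2 violates the second constraint in (1).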

2.3 Resource allocation<br />

When there are multiple components in a machine, a network can control its behavior through resource allocation. In the example network, machine K_1 has two components, and the system performance depends on its resource allocation to these two components. There are several CPU scheduling algorithms for allocating a CPU resource among multiple threads. Among them, proportional CPU share (PS) scheduling is known for its simplicity, flexibility, and fairness [16]. In PS scheduling, threads are assigned weights and resource shares are determined in proportion to the weights [17]; excess CPU time from some threads is allocated fairly to the other threads. There are many PS scheduling algorithms, such as weighted round-robin scheduling, lottery scheduling, and stride scheduling [18]-[20].<br />
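As one concrete PS example, stride scheduling [20] can be sketched in a few lines. This is a generic textbook rendering under simplifying assumptions (fixed thread set, unit quanta), not code from the paper:

```python
def stride_schedule(weights, quanta):
    """Minimal stride-scheduling sketch: each thread gets a stride
    inversely proportional to its weight; at every quantum the thread
    with the smallest pass value runs, and its pass advances by its
    stride. Over time, the quanta received are proportional to the
    weights."""
    STRIDE1 = 1 << 20                     # large scaling constant
    stride = {i: STRIDE1 / w for i, w in weights.items()}
    passes = dict(stride)                 # each pass starts at its stride
    counts = {i: 0 for i in weights}
    for _ in range(quanta):
        i = min(passes, key=passes.get)   # pick the minimal-pass thread
        counts[i] += 1
        passes[i] += stride[i]
    return counts
```

With weights {a: 3, b: 1} over 400 quanta, thread a receives roughly 300 quanta and b roughly 100, i.e., a deterministic 3:1 proportional share.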

We adopt PS scheduling as the resource allocation scheme because of its generality, in addition to the benefits mentioned above. We define the resource allocation variable set w = {w_i(t): i&isin;I, t&ge;0} in which w_i(t) is a non-negative weight of component i at time t. We denote the components assigned to machine k as S_I[k] and the clusters assigned to machine k as S_J[k]. If &omega;_k^a of the total managed weight &omega;_k is available to assign in machine k (i.e., &omega;_k &minus; &omega;_k^a is reserved by other applications), the constraints on the resource allocation variables for a given topology are as in (2).<br />

Resource allocation constraints:<br />
&Sigma;_{i&isin;S_I[k]} w_i(t) &le; &omega;_k^a  for all k &isin; K  (2)<br />

2.4 Problem definition<br />

As the completion time T is a function of network topology (X) and resource allocation (w), the objective is to quantify the minimal completion time T* represented in (3), subject to the constraints (1) and (2).<br />

T* = Min_{X,w} T .  (3)<br />

3. Minimal completion time

As stated earlier, we design a method for quantifying the minimal completion time by restricting attention to cases where the number of tasks to be processed by each component is large. In this section, we first investigate the impact of this largeness on the optimal resource allocation for a given topology. We then formulate the problem by incorporating the network topology and provide a heuristic algorithm for solving the formulation.

3.1 Optimal resource allocation

For a given topology, we define the load index LI_i, which represents component i's total CPU time required to process its tasks. As a component must process its own root tasks as well as incoming tasks from its predecessors, its number of tasks L_i is given by (4), where pred(i) denotes the set of immediate predecessors of component i. Then, denoting the CPU time per task in the given topology as P_i, LI_i is given by (5).

L_i = rt_i + ∑_{a∈pred(i)} L_a   (4)

LI_i = L_i·P_i.   (5)
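Equations (4) and (5) can be evaluated in a single pass over the components in topological order. The sketch below assumes hypothetical dictionaries for root-task counts, per-task CPU times, and predecessor sets.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def load_indices(root_tasks, cpu_per_task, preds):
    """Eq. (4): L_i = rt_i + sum of L_a over immediate predecessors a.
    Eq. (5): LI_i = L_i * P_i."""
    L, LI = {}, {}
    for i in TopologicalSorter(preds).static_order():  # predecessors come first
        L[i] = root_tasks.get(i, 0) + sum(L[a] for a in preds.get(i, ()))
        LI[i] = L[i] * cpu_per_task[i]
    return L, LI

# Two root-task components feeding one downstream component.
L, LI = load_indices(
    root_tasks={"a": 100, "b": 100},
    cpu_per_task={"a": 2, "b": 2, "c": 4},
    preds={"c": {"a", "b"}},
)
print(L["c"], LI["c"])  # → 200 800
```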

To provide a theoretical foundation for optimal resource allocation, we convert a network into a network with infinitesimal tasks: each root task is divided into r infinitesimal tasks and each P_i is replaced with P_i/r. The load index of each component is then the same as in the original network, but the tasks are infinitesimal. We denote the completion time of the network with infinitesimal tasks as T′. We also define a term called task availability as an indicator of relative preference among task arrival patterns: a component's task availability for one arrival pattern is higher than for another if its cumulative number of arrived tasks is larger or equal over time. A component prefers a task arrival pattern with higher task availability, as it can utilize more of its resource. Now consider a network reconfigured such that all components have all their tasks in their queues at t=0. Each component then has maximal task availability, and the completion time of the reconfigured network forms the lower bound T_LB of the network's completion time T, given by:

T_LB = Max_{k∈K} (ω_k/ω_k^a) ∑_{i∈S_I[k]} LI_i.   (6)
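The lower bound in (6) is a per-machine sum of load indices scaled by the inverse of the available resource fraction. A small sketch follows, with hypothetical machine names and load-index values.

```python
def completion_lower_bound(LI, machine_of, omega, omega_avail):
    """Eq. (6): T_LB = max over machines k of (omega_k / omega_k^a) times the
    sum of load indices of the components residing on k."""
    load = {}
    for i, k in machine_of.items():
        load[k] = load.get(k, 0.0) + LI[i]
    return max((omega[k] / omega_avail[k]) * s for k, s in load.items())

# One machine with half of its managed weight available: the bound doubles.
LI = {"a": 200, "b": 300}
machine_of = {"a": "k1", "b": "k1"}
print(completion_lower_bound(LI, machine_of, {"k1": 1.0}, {"k1": 0.5}))  # → 1000.0
```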

Then, assuming a hypothetical weighted round-robin server² for CPU scheduling, T′ equals T_LB when each machine allocates its resource to the residing components according to (7), where k(i) denotes the machine in which component i resides.

w_i(t) ≥ ω_{k(i)}·LI_i/T_LB  for all i ∈ I and t ≥ 0.   (7)

² The hypothetical server has idealized fairness: the CPU time received by each thread in a round is infinitesimal and proportional to the weight of the thread. This assumption is reasonable because, in reality, the quantum size is infinitesimal relative to the working horizon.

Proof. A component's instantaneous resource availability RA_i(t), which is the fraction of the resource available when the component requests it at time t, is greater than or equal to its assigned weight proportion:

RA_i(t) ≥ w_i(t)/ω_{k(i)}  for t ≥ 0.   (8)

The service time S_i(t), the time taken to process a task starting at time t, is related to RA_i(t) by:

∫_t^{t+S_i(t)} RA_i(τ) dτ = P_i.   (9)

Suppose a component i receives its tasks at a constant interval of T_LB/L_i. Then, under the resource allocation in (7), S_i(t) is less than or equal to T_LB/L_i over time, as shown in (10):

P_i = ∫_t^{t+S_i(t)} RA_i(τ) dτ ≥ ∫_t^{t+S_i(t)} w_i(τ)/ω_{k(i)} dτ ≥ (LI_i/T_LB)·S_i(t)  ⇒  T_LB/L_i ≥ S_i(t).   (10)

So, under the resource allocation in (7), any component that receives its tasks at a constant interval of T_LB/L_i from t=0 (the first task arrival time) can complete by T_LB and generate tasks at a constant interval of T_LB/L_i from t=T_LB/L_i (the first task generation time). As tasks are infinitesimal and root tasks increase task availability, each component can receive its infinitesimal tasks at a constant interval in 0≤t≤T_LB, or more preferably, and complete at or before T_LB. So the network completes at T_LB. ∎

A network can therefore achieve a performance close to T_LB under this resource allocation in the limit of a large number of tasks. If machines do not follow this resource allocation, some components may receive their tasks less preferably than at a constant interval, resulting in underutilization and, consequently, increased completion time. The minimal weights required to achieve T_LB are constant over time, as in (11), and the sum of these weights in each machine forms the required amount ω_k^r of resource reservation in that machine, as in (12). Note that ω_k^r is less than or equal to ω_k^a, satisfying the resource allocation constraints in (2).

<br />

Constant resource allocation:

w_i = ω_{k(i)}·LI_i/T_LB  for all i ∈ I and t ≥ 0.   (11)

Resource reservation:

ω_k^r = (ω_k/T_LB) ∑_{i∈S_I[k]} LI_i  for all k ∈ K.   (12)
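Equations (11) and (12) translate directly into code. The example values below are hypothetical, not the paper's experimental network.

```python
def constant_allocation(LI, machine_of, omega, T_LB):
    """Eq. (11): w_i = omega_{k(i)} * LI_i / T_LB, constant over time.
    Eq. (12): the reservation omega_k^r is the per-machine sum of these weights."""
    w = {i: omega[machine_of[i]] * LI[i] / T_LB for i in LI}
    reservation = {}
    for i, k in machine_of.items():
        reservation[k] = reservation.get(k, 0.0) + w[i]
    return w, reservation

w, res = constant_allocation(
    LI={"a": 250, "b": 250}, machine_of={"a": "k1", "b": "k1"},
    omega={"k1": 1.0}, T_LB=1000.0,
)
print(w, res)  # → {'a': 0.25, 'b': 0.25} {'k1': 0.5}
```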

3.2 Optimal network topology

As the CPU time per task is machine-dependent, we rewrite the load index as a function of the machine:

LI_i(k) = L_i·P_i(k).   (13)

Considering that the components in a cluster cannot be assigned to separate machines, we define the Cluster Load Index CLI_j(k) as:

CLI_j(k) = ∑_{i∈M_j} LI_i(k).   (14)

Then, under the constant resource allocation, the completion time for a given topology can be estimated by:


Max_{k∈K} (ω_k/ω_k^a) ∑_{j∈S_J[k]} CLI_j(k).   (15)

Consequently, the minimal completion time T* can be formulated as in (16) by incorporating the topology variables and the constraints in (1).

<br />

Topology problem formulation:

T* = Min_x Max_{k∈K} (ω_k/ω_k^a) ∑_{j∈J} CLI_j(k)·x_jk

s.t.  ∑_{k∈N_j} x_jk = 1  for all j ∈ J
      ∑_{k∉N_j} x_jk = 0  for all j ∈ J
      x_jk ∈ {0,1}  for all j ∈ J and k ∈ K.   (16)

The formulation has a simple form because it is completely separated from the resource allocation variables. As a result, it can be mapped to the simplest class of multiprocessor scheduling problems: the assignment of independent clusters to machines. As discussed, this problem is NP-complete, and diverse heuristic algorithms are available in the literature. Eleven heuristics were selected and examined under various problem configurations in [21]: Opportunistic Load Balancing, Minimum Execution Time, Minimum Completion Time, Min-min, Max-min, Duplex, Genetic Algorithm, Simulated Annealing, Genetic Simulated Annealing, Tabu, and A*. Although the Genetic Algorithm always gave the best performance, when algorithm execution time is also considered, the simple Min-min heuristic was shown to perform well in comparison to the others. We therefore recommend the Min-min heuristic for solving the problem formulation. Adapted to our context, the Min-min heuristic is as follows.

Min-min heuristic algorithm

Step 1: Initialize the set of all unassigned clusters, U←J, and the current machine-level completion times, mc(k)←0 for all k∈K.
Step 2: Compute the minimal completion time after assignment for each unassigned cluster, M = { min_{k∈N_j} [ (ω_k/ω_k^a)·CLI_j(k) + mc(k) ] : j∈U }.
Step 3: Select the minimum of M, mmc←min M, and find the corresponding cluster c and machine m.
Step 4: Assign c to m and update mc(m)←mmc (the term mc(k) is already included in mmc by Step 2).
Step 5: Remove c from U.
Step 6: If U=∅, go to Step 7; otherwise, go to Step 2.
Step 7: T* ← max_{k∈K} mc(k).
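The steps above can be implemented directly. The sketch below sets mc(m) to the new machine-level completion time in Step 4 and runs on a small hypothetical instance (four clusters, two machines, full availability), not on the paper's experimental network.

```python
def min_min(clusters, machines, CLI, omega, omega_avail, assignable):
    """Min-min heuristic (Steps 1-7): repeatedly assign the unassigned cluster
    whose best achievable machine-level completion time is smallest."""
    mc = {k: 0.0 for k in machines}                      # Step 1
    assignment, unassigned = {}, set(clusters)
    while unassigned:                                    # Steps 2-6
        best = None
        for j in unassigned:                             # Step 2: candidate times
            for k in assignable[j]:
                t = (omega[k] / omega_avail[k]) * CLI[j][k] + mc[k]
                if best is None or t < best[0]:
                    best = (t, j, k)
        mmc, c, m = best                                 # Step 3: minimal of M
        assignment[c], mc[m] = m, mmc                    # Step 4
        unassigned.remove(c)                             # Step 5
    return assignment, max(mc.values())                  # Step 7

machines = ["k1", "k2"]
CLI = {j: {"k1": v, "k2": v} for j, v in {"J1": 3, "J2": 5, "J3": 4, "J4": 6}.items()}
omega = omega_avail = {"k1": 1.0, "k2": 1.0}
assignable = {j: machines for j in CLI}
print(min_min(list(CLI), machines, CLI, omega, omega_avail, assignable))  # makespan 10.0
```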

4. Empirical results

We ran several experiments using discrete-event simulation to validate the designed method. Although we have not considered stochasticity so far, this empirical study supports the effectiveness of the method even in stochastic environments.

4.1 Network description

The network is composed of eight components in four clusters, as in Table 1. The task flow structure between the components is described in Fig. 2. There are three available machines {K1, K2, K3} with ω_k = ω_k^a = 1 for all k, and each cluster is assignable to any machine.


Table 1. Experimental network parameters

Component   rt_i   P_i(k)^a   Cluster
I1            0       4        J1
I2            0      12        J2
I3            0       4        J3
I4            0       8        J3
I5          200       2        J4
I6          200       2        J4
I7          200       2        J4
I8          200       2        J4

^a for all k∈K

[Figure 2 appears here in the original.]
Fig. 2. Experimental task flow structure between the eight components. The components are members of four clusters, and each cluster is assignable to any of the three machines.

4.2 Performance evaluation

The Min-min heuristic algorithm gives T* = 4800, and the resulting topology is as in Fig. 3(b). For this experimental network, the heuristic solution is equivalent to the exact solution of (16).

[Figure 3 appears here in the original, showing panels (a) and (b).]
Fig. 3. Experimental network topologies: (a) non-optimal topology; (b) optimal topology. In (a), clusters J1 and J3 are assigned to machine K1, J2 to K2, and J4 to K3. In (b), J4 is reassigned to K1 and J3 to K3.


We set up eight different experimental conditions by combining three independent factors, as shown in Table 2. We use the two network topologies of Fig. 3: non-optimal and optimal. Two resource allocation policies are used: round-robin allocation, in which the components in each machine are assigned equal weights, and constant allocation, in which weights are assigned according to the components' load indices as in (11). To implement PS scheduling we use weighted round-robin scheduling, in which the CPU time received by each component in a round equals its assigned weight. Finally, the distribution of P_i(k) can be deterministic or stochastic (exponential); in the stochastic case we run five replications of each experiment.

Table 2. Experimental design

Condition   Topology      Resource allocation   P_i(k)
Con1        Non-optimal   Round-robin           Deterministic
Con2        Non-optimal   Round-robin           Exponential
Con3        Non-optimal   Constant              Deterministic
Con4        Non-optimal   Constant              Exponential
Con5        Optimal       Round-robin           Deterministic
Con6        Optimal       Round-robin           Exponential
Con7        Optimal       Constant              Deterministic
Con8        Optimal       Constant              Exponential

Numerical results from the experiments are shown in Table 3. The last two conditions (Con7 and Con8), which use the optimal network topology and constant resource allocation, give a performance close to T* and significantly outperform the other conditions. Also, constant allocation for both the non-optimal (Con3 and Con4) and optimal (Con7 and Con8) topologies gives a performance superior to round-robin allocation and close to the lower-bound performance T_LB, in both deterministic and stochastic environments. These facts support the optimality of the constant resource allocation and, consequently, the validity of the method of quantifying the minimal completion time.


Table 3. Experimental results

Condition   T_LB   T*     Actual T   %^a
Con1        6400   4800   7215       150.3
Con2        6400   4800   7314       152.4
Con3        6400   4800   6416       133.7
Con4        6400   4800   6404       133.4
Con5        4800   4800   5619       117.1
Con6        4800   4800   5645       117.6
Con7        4800   4800   4820       100.4
Con8        4800   4800   4899       102.1

^a Actual T / T*

4.3 Resource reservation

The resource reservations required in the optimal topology, computed from (12), are [ω_K1^r = 0.667, ω_K2^r = 1, ω_K3^r = 1]. Our argument is that the network can achieve the optimal performance T* with these reservations even when the unreserved resources are allocated to other applications. To validate this, we use eleven different reservations for machine K1, as shown in Table 4. In each condition, ω_K1^r is allocated in proportion to the load indices of the residing components, and the unreserved resources are assigned to an application with infinite work (i.e., one that continuously requires resources). The numerical results are shown in Table 4 and Figs. 4 and 5. Overall, the completion time decreases as ω_K1^r increases. However, when ω_K1^r is greater than 0.667, there is no significant further advantage in the deterministic environment; in the stochastic environment the threshold lies somewhere between 0.667 and 0.7. Considering that the other applications may not require resources continuously, such a slight difference (≤ 0.033) does not seem significant.

Table 4. The effects of resource reservation

ω_K1^r   Actual T (deterministic P_i(k))   Actual T (exponential P_i(k))
0.1      32284                             34502
0.2      16132                             16605
0.3      10746                             11069
0.4       8057                              8237
0.5       6439                              6635
0.6       5369                              5503
0.667     4829                              5135
0.7       4827                              4941
0.8       4824                              4965
0.9       4822                              4984
1.0       4820                              4946

[Figure 4 appears here in the original: Actual T versus resource reservation in K1 (deterministic environment).]
Fig. 4. The effects of resource reservation in the deterministic environment. When the resource reservation in K1 is greater than 0.667, there is no significant decrease in completion time.

[Figure 5 appears here in the original: Actual T versus resource reservation in K1 (stochastic environment).]
Fig. 5. The effects of resource reservation in the stochastic environment. When the resource reservation in K1 is greater than a threshold between 0.667 and 0.7, there is no significant decrease in completion time.


5. Conclusions

The simple Min-min heuristic algorithm was proposed as a method for quantifying the minimal completion time of component-based service networks. A network can achieve this performance under the constant resource allocation in the limit of a large number of tasks, and the performance can be guaranteed with the resource reservations we have formulated. The designed method is efficient enough to satisfy the requirements for use in a grid service environment; in spite of its simplicity, it can quantify the quality of service effectively. Virtual markets driven by such methods will make timely transactions with desirable surpluses, leading to a productive virtual economy.

Our work can be extended by taking alternative algorithms into account. Each component can have alternative algorithms for processing a task that trade off processing time against quality of solution. While network topology and resource allocation try to utilize limited resources efficiently, alternative algorithms can change the amount of resources required. As modern operating environments are highly dynamic, alternative algorithms become an important tool for achieving portable high performance [22][23]. Quality of service is determined not only by completion time but also by quality of solution; the question is how to quantify the optimal quality of service that such a network can provide.

References

[1] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. San Francisco: Morgan Kaufmann Publishers, 1999.
[2] R. Hamadi and B. Benatallah, "A Petri net-based model for web service composition," in Proc. 14th Australasian Database Conf. Database Technologies, Adelaide, Australia, 2003, pp. 191-200.
[3] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke, "Grid services for distributed system integration," IEEE Computer, vol. 35, no. 6, pp. 37-46, 2002.
[4] M. Weiser, "The computer for the 21st century," Scientific American, vol. 265, no. 3, pp. 94-104, 1991.
[5] P. Padala, C. Harrison, N. Pelfort, E. Jansen, M. P. Frank, and C. Chokkareddy, "OCEAN: The open computation exchange and arbitration network, a market approach to meta computing," in Proc. 2nd Int. Symp. Parallel and Distributed Computing, 2003, pp. 185-192.
[6] R. Buyya and S. Vazhkudai, "Compute power market: Towards a market-oriented grid," in Proc. 1st IEEE/ACM Int. Symp. Cluster Computing and the Grid, 2001, pp. 574-581.
[7] R. Buyya, D. Abramson, and J. Giddy, "Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid," in Proc. 4th Int. Conf. High Performance Computing in Asia-Pacific Region, 2000, pp. 283-289.
[8] B. Meyer, "On to components," IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.
[9] P. Clements, "From subroutines to subsystems: Component-based software development," in Component-Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press, 1996, pp. 3-6.
[10] D. B. Lange, "Mobile objects and mobile agents: The future of distributed computing?," in Proc. 12th European Conf. Object-Oriented Programming, 1998, pp. 1-12.
[11] D. Schoder and T. Eymann, "The real challenges of mobile agents," Communications of the ACM, vol. 43, no. 6, pp. 111-112, 2000.
[12] D. B. Lange and M. Oshima, "Seven good reasons for mobile agents," Communications of the ACM, vol. 42, no. 3, pp. 88-89, 1999.
[13] D. Chess, C. Harrison, and A. Kershenbaum, "Mobile agents: Are they a good idea?," in Mobile Object Systems: Towards the Programmable Internet, Lecture Notes in Computer Science, vol. 1222, J. Vitek and C. Tschudin, Eds. Springer-Verlag, 1997, pp. 25-47.
[14] O. H. Ibarra and C. E. Kim, "Heuristic algorithms for scheduling independent tasks on nonidentical processors," Journal of the Association for Computing Machinery, vol. 24, no. 2, pp. 280-289, 1977.
[15] D. Fernández-Baca, "Allocating modules to processors in a distributed system," IEEE Transactions on Software Engineering, vol. 15, no. 11, pp. 1427-1436, 1989.
[16] J. Regehr, "Some guidelines for proportional share CPU scheduling in general-purpose operating systems," presented as a work in progress at the 22nd IEEE Real-Time Systems Symposium, London, UK, Dec. 3-6, 2001.
[17] I. Stoica, H. Abdel-Wahab, J. Gehrke, K. Jeffay, S. K. Baruah, and C. G. Plaxton, "A proportional share resource allocation algorithm for real-time, time-shared systems," in Proc. 17th IEEE Real-Time Systems Symposium, 1996, pp. 288-299.
[18] C. A. Waldspurger and W. E. Weihl, "Lottery scheduling: Flexible proportional-share resource management," in Proc. 1st Symp. Operating Systems Design and Implementation, 1994, pp. 1-11.
[19] C. A. Waldspurger and W. E. Weihl, "Stride scheduling: Deterministic proportional-share resource management," Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Tech. Rep. MIT/LCS/TM-528, 1995.
[20] C. A. Waldspurger, "Lottery and stride scheduling: Flexible proportional share resource management," Ph.D. dissertation, Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1995.
[21] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund, "A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems," Journal of Parallel and Distributed Computing, vol. 61, pp. 810-837, 2001.
[22] M. O. McCracken, A. Snavely, and A. Malony, "Performance modeling for dynamic algorithm selection," in Proc. Int. Conf. Computational Science, 2003, pp. 749-758.
[23] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A. Quilici, D. S. Rosenblum, and A. L. Wolf, "An architecture-based approach to self-adaptive software," IEEE Intelligent Systems, vol. 14, no. 3, pp. 54-62, 1999.


Manuscript for IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 1<br />

MARKET-BASED MODEL PREDICTIVE CONTROL FOR LARGE-SCALE INFORMATION NETWORKS: COMPLETION TIME AND VALUE OF SOLUTION

Seokcheon Lee, Soundar Kumara, and Natarajan Gautam
Department of Industrial and Manufacturing Engineering
The Pennsylvania State University
University Park, PA 16802
{stonesky, skumara, ngautam}@psu.edu

ABSTRACT

Modern software systems share several important properties: they tend to be large-scale, with distributed and component-based architectures, and the dynamic nature of their operating environments leads them to utilize alternative algorithms. On the other hand, these same properties make it hard to provide appropriate control mechanisms because of the increased complexity. Components share resources, and each component can have alternative algorithms; as a result, the behavior of a software system can be controlled through resource allocation as well as algorithm selection. This novel control problem is worth investigating in order to fully realize the benefits of these properties. In this paper we design a scalable control mechanism for such systems. The quality of service we consider is a function of the value of the solution and the time to generate the solution for a given problem. We build a mathematical programming model that trades off these two conflicting objectives, and we decentralize the model through an auction market. By periodically opening the auction market at each current system state, a closed-loop policy is formed. We verify the designed control mechanism empirically.

Index Terms: Distributed applications, modeling and prediction, optimization, scalability



1. Introduction

The growth in the complexity and size of software systems, driven by automation and organizational integration, is making distributed and component-based architectures increasingly important. Distributed computing aims at using the computing power of machines connected by a network; when a task requires intensive computation, it becomes a natural choice for achieving high performance. A component is a reusable program element. Component technology lets developers build the systems they need simply by defining the components' specific roles and wiring them together [1][2]. In networks with a component-based architecture, each component is highly specialized for specific tasks. Another emerging technology is adaptive software [3][4]: adaptive software has alternative algorithms for the same numerical problem and a switching function for selecting the best algorithm in response to environmental changes. As modern operating environments are highly dynamic, adaptive software becomes an important tool for achieving portable high performance.

We study a large-scale information network (with respect to the number of components as well as machines) comprising distributed software components linked together through a task flow structure. A problem given to the network is decomposed into root tasks for some components, and those tasks are propagated through the task flow structure to other components. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that are independent and identical in nature. The service provided by the network is to produce a global solution to the given problem, which is an aggregate of the partial solutions of individual tasks. Each component can have alternative algorithms for processing a task that trade off processing time against the value of the partial solution. The Quality of Service (QoS) of the network is determined by the value of the global solution and the time for



generating the global solution (i.e., the completion time). For a given topology, the network can control its behavior through two kinds of control actions: algorithm selection and resource allocation. While resource allocation tries to utilize limited resources efficiently, algorithm selection can change the amount of resources required. The resource allocation we address here is the allocation of each machine's resources to the residing components for a given topology. As problems are decomposed in various ways depending on their nature and size, and their QoS functions are context-dependent, the network needs to provide adaptive solutions to given problems by utilizing such control actions.

One can imagine a wide range of scientific and engineering problems that can be solved by such a network. UltraLog (http://www.ultralog.net) networks, implemented in Cougaar (Cognitive Agent Architecture: http://www.cougaar.org), developed by DARPA (Defense Advanced Research Projects Agency), are instances [5]-[9]. Each agent in these networks represents an organization in a military supply chain and has a set of components specialized by functionality (allocation, expansion, inventory management, etc.) and class (ammunition, water, fuel, etc.). The objective of an UltraLog network is to provide an appropriate logistics plan for a given military operational plan. A logistics plan is a global solution aggregating the individual schedules built by the components. An operational plan is decomposed into logistics requirements of each thread for each agent, and a requirement is further decomposed into root tasks (one task per day) for a designated component. As a result, a component can have hundreds of root tasks, depending on the horizon of the operation, and thousands of tasks to process as the root tasks are propagated. As the scale of the operation increases, there can be thousands of agents (tens of thousands of components) on hundreds of machines working together to generate a logistics plan. The QoS of these networks is determined by the quality of the logistics plan (the value of



solution) and the (plan) completion time. These two metrics directly affect the performance of the operation.

In this paper we design a control mechanism for such novel networks. We stress scalability, with respect to computational complexity as well as communication overhead, as an important consideration for the control mechanism's practical use. The control mechanism should be able to supply an appropriate control policy in a timely manner even when the network is large. Such a property is especially important when completion time is an explicit consideration, as in our control problem. However, the property is hard to achieve in general if one pursues an exactly optimal policy. Therefore, we design a scalable control mechanism by sacrificing some amount of optimality in a systematic way, as follows.

First, we adopt Model Predictive Control (MPC) as our control framework. In MPC, for each current state, an optimal open-loop control policy is designed over a finite time horizon by solving a static mathematical programming model [10]-[13]. The design process is repeated for each newly observed state, forming a closed-loop policy that reacts to the current system state. Though MPC does not give an absolutely optimal policy in stochastic environments, the periodic redesign alleviates the impact of stochasticity. Note that techniques such as Dynamic Programming, which seek an optimal closed-loop control policy, are not efficient in terms of computational complexity. Second, within the MPC framework, we build a heuristic programming model to contain computational complexity. The heuristic model is solvable in polynomial time, and its solution converges to the solution of the exact model in the limit of a large number of tasks. Third, we provide a decentralized coordination mechanism for solving the programming model. Computations and communications are distributed to multiple entities through an auction market while giving a solution equivalent to that of the programming model.
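The receding-horizon loop described above can be sketched as follows. This is a minimal illustration, not code from the report; `plant` and `solve_open_loop` are hypothetical stand-ins for the state feedback source and the static programming model (such as (14)).

```python
# Minimal MPC (receding-horizon) loop sketch. `plant` and
# `solve_open_loop` are hypothetical stand-ins: the solver returns an
# open-loop control sequence over the horizon, and only the first
# action is applied before re-planning from the next observed state.
def mpc_loop(plant, solve_open_loop, horizon, n_steps):
    applied = []
    for _ in range(n_steps):
        state = plant.observe()                   # state feedback
        policy = solve_open_loop(state, horizon)  # static program, e.g. (14)
        plant.apply(policy[0])                    # apply first action only
        applied.append(policy[0])
    return applied
```

Re-planning at every step is what turns the sequence of open-loop designs into a closed-loop policy.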


The organization of this paper is as follows. In Section 2 we formally define the problem in detail. After designing the control mechanism in Sections 3 and 4, we show empirical results in Section 5. Finally, we discuss implications and possible extensions of our work in Section 6.

2. Problem specification

In this section we formally define the control problem by detailing the network configuration and the control actions. We focus on CPU resources, assuming that the system is computation-bound.

2.1 Network configuration

A network is composed of a set of components A and a set of nodes (i.e., machines) N. K_n denotes the set of components that reside in node n, sharing the node's CPU resource. The task flow structure of the network, which defines the precedence relationships between components, is an arbitrary directed acyclic graph. A problem given to the network is decomposed into root tasks for some components, and those tasks are propagated through the task flow structure. Each component processes one of the tasks in its queue (which holds root tasks as well as tasks from predecessor components) and then sends it to its successor components. We denote the number of root tasks of component i as rt_i. Fig. 1 shows an example network in which four components reside in three nodes. Components A_1 and A_2 reside in N_1, and each of them has 100 root tasks. A_3 in N_2 and A_4 in N_3 have no root tasks, but they have 200 and 100 tasks respectively from their corresponding predecessors.


Fig. 1. An example network
[Figure omitted: components A_1 and A_2 (100 root tasks each) in node N_1; A_3 (0 root tasks) in N_2; A_4 (0 root tasks) in N_3.]

2.2 Control actions

The network can utilize two different kinds of control actions to control its behavior: algorithm selection and resource allocation.

Algorithm selection

A component can use one of several alternative algorithms to process a task. The alternatives trade off CPU time against value of solution, with more CPU time resulting in a higher solution value. As one can form optimal mixtures of alternatives, a component has a monotonically increasing piecewise-linear convex function, the value function, giving CPU time as a function of value. We call the value in this function the value mode, which a component can select as its decision variable. A value function is defined by three elements 〈f_i(v_i), v_i(min), v_i(max)〉, as shown in Fig. 1. This function indicates that component i's expected CPU time^1 to process a task is f_i(v_i) with a value mode v_i, where v_i(min) ≤ v_i ≤ v_i(max). We assume that components cannot change the mode for a task in process.
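For illustration, a value function of this form can be stored as its breakpoints and evaluated by linear interpolation; the breakpoint numbers below are assumed for the sketch, not taken from the report.

```python
import bisect

def make_value_function(breakpoints):
    """Build f_i from (value mode, expected CPU time) breakpoints of a
    monotonically increasing piecewise-linear convex curve."""
    vs = [v for v, _ in breakpoints]
    ts = [c for _, c in breakpoints]

    def f(v):
        v = min(max(v, vs[0]), vs[-1])            # clamp to [v_min, v_max]
        k = bisect.bisect_right(vs, v) - 1
        if k == len(vs) - 1:
            return ts[-1]
        frac = (v - vs[k]) / (vs[k + 1] - vs[k])
        return ts[k] + frac * (ts[k + 1] - ts[k])

    return f, vs[0], vs[-1]

# Illustrative breakpoints with increasing slopes 0.5, 1.0, 1.5 (convex).
f, v_min, v_max = make_value_function([(2, 1.0), (3, 1.5), (4, 2.5), (5, 4.0)])
```

A higher value mode costs disproportionately more CPU time, which is exactly the convexity the pure-strategy argument of Section 3.1 relies on.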

Resource allocation

When there are multiple components in a node, the network needs to control its behavior through resource allocation. In the example network, node N_1 has two components, and the

^1 The distribution of CPU time can be arbitrary, though we use only expected CPU time.


system performance can depend on the node's resource allocation to these two components. There are several CPU scheduling algorithms for allocating a CPU resource among multiple threads. Among the scheduling algorithms, proportional CPU share (PS) scheduling is known for its simplicity, flexibility, and fairness [14]. In PS scheduling, threads are assigned weights and resource shares are determined in proportion to the weights [15]. Excess CPU time from some threads is allocated fairly to other threads. There are many PS scheduling algorithms, such as Weighted Round-Robin scheduling, Lottery scheduling, and Stride scheduling [16]-[18]. We adopt PS scheduling as the resource allocation scheme because of its generality, in addition to the benefits mentioned above. We define the resource allocation variable set w = {w_i(t): i∈A, t≥0}, in which w_i(t) is the non-negative weight of component i at time t. If the total managed weight of node n is ω_n, the boundary condition for assigning weights over time can be described as:

∑_{i∈K_n} w_i(t) = ω_n, where w_i(t) ≥ 0. (1)

2.3 Problem definition

The service provided by the network is to produce a global solution to a given problem, which is an aggregate of the partial solutions of individual tasks. The QoS of the network is determined by the value of the global solution and the cost of completion time. The value of the global solution is the sum of the partial solution values, and the cost of completion time is determined by a cost function CCT(T), which is monotonically increasing in the completion time T. We assume that the solution values and cost are expressed in a common unit^2. Let v_i^d be the value mode used to process the d-th task of component i, and e_i the number of tasks processed by component i up to completion. Then, the control objective is to maximize QoS by utilizing

^2 Relative importance can be considered by scaling the functions, and it results in the same function structures.


algorithm selection (v) and resource allocation (w) as in (2). As stated earlier, we design a scalable control mechanism to achieve this objective in the MPC framework by building a mathematical programming model and decentralizing it.

argmax_{v,w} ∑_{i∈A} ∑_{d=1}^{e_i} v_i^d − CCT(T) (2)

3. Mathematical programming model

The mathematical programming model is essentially a scheduling problem formulation. A variety of formulations and algorithms are available for diverse scheduling problems in the contexts of multiprocessing, manufacturing, and project management. In general, a scheduling problem allocates limited resources to a set of tasks so as to optimize a specific objective. One widely studied objective is completion time (also called makespan in the manufacturing literature), as in the problem we consider. Though it is not easy to find a problem exactly the same as ours, it is possible to convert our problem into one of the known scheduling problems. For example, in a job shop there are a set of jobs and a set of machines. Each job has a set of serial operations, and each operation must be processed on a specific machine. A job shop scheduling problem sequences the operations on each machine, subject to the job precedence constraints, such that the completion time is minimized. When we assign a value mode to each task, our problem can be exactly transformed into a job shop scheduling problem. However, scheduling problems are in general intractable. Though the job shop scheduling problem is polynomially solvable when there are two machines and each job has two operations, it becomes NP-hard in the number of jobs as soon as the number of machines or operations exceeds two [19][20]. Considering that the task flow structure of our networks is arbitrary, our scheduling
[19][20]. Considering that the task flow structure of our networks is arbitrary, our scheduling


problem is NP-hard in the number of components in general. The increase in the number of tasks and the consideration of alternative algorithms make the problem even harder. Moreover, there can be a large number of nodes in our networks.

Though it may be possible to adapt available heuristic algorithms from the job shop scheduling literature to account for alternative algorithms, our scheduling problem has a particular characteristic: the number of tasks for each component can be large. Though the increase in the number of tasks adds complexity, it also allows us to develop an efficient heuristic programming model. In this section, we characterize an optimal resource allocation by analyzing the impact of this largeness, and subsequently build a mathematical programming model solvable in polynomial time.

3.1 Optimal resource allocation

Consider the current time t=0 and assume that each component uses a value mode common to all of its tasks (i.e., a pure strategy). We discuss the optimality of the pure strategy later in this subsection. We define the load index LI_i, which represents component i's total CPU time required to process its tasks. As a component needs to process its own root tasks as well as the incoming tasks from its predecessors, its number of tasks L_i is identified as in (3), where Pred(i) denotes the set of immediate predecessors of component i. Then, LI_i is given by (4).

L_i = rt_i + ∑_{a∈Pred(i)} L_a (3)

LI_i = L_i f_i(v_i) (4)

To provide a theoretical foundation for the optimal resource allocation policy, we convert a network into a network whose tasks have infinitesimal processing times. Each root task is divided into r infinitesimal tasks, and each f_i(v_i) is replaced with f_i(v_i)/r. Then, the load index of each component


is the same as in the original network, but the tasks are infinitesimal. We denote the completion time of the network with infinitesimal tasks as T′. We also define a term called task availability as an indicator of the relative preference among task arrival patterns. One arrival pattern gives higher task availability than another if its cumulative number of arrived tasks is larger or equal over time. A component prefers a task arrival pattern with higher task availability, as it can then utilize more resource. Consider a network and reconfigure it such that all components have all their tasks in their queues at t=0. Each component has maximal task availability in the reconfigured network, and the completion time of the reconfigured network forms the lower bound T_LB of the network's completion time T, given by:

T_LB = Max_{n∈N} ∑_{i∈K_n} LI_i. (5)
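Equations (3)-(5) can be evaluated directly for the Fig. 1 network; the uniform per-task CPU time f_i(v_i) = 1.0 below is an assumption made only for this illustration.

```python
# Load indices and completion-time lower bound for the Fig. 1 network.
pred = {"A1": [], "A2": [], "A3": ["A1", "A2"], "A4": ["A2"]}
rt   = {"A1": 100, "A2": 100, "A3": 0, "A4": 0}       # root tasks
node = {"A1": "N1", "A2": "N1", "A3": "N2", "A4": "N3"}
f    = {i: 1.0 for i in pred}                         # assumed f_i(v_i)

L = {}
for i in ["A1", "A2", "A3", "A4"]:                    # topological order
    L[i] = rt[i] + sum(L[a] for a in pred[i])         # eq. (3)
LI = {i: L[i] * f[i] for i in pred}                   # eq. (4)
T_LB = max(sum(LI[i] for i in LI if node[i] == n)
           for n in set(node.values()))               # eq. (5)
```

With these numbers, A_3 accumulates 200 tasks and A_4 accumulates 100, matching the counts in Section 2.1, and T_LB is set by the most loaded node.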

For the theoretical analysis, we assume a hypothetical weighted round-robin server for CPU scheduling, though this is not strictly required in practice, as will be discussed. The hypothetical server has idealized fairness: the CPU time received by each thread in a round is infinitesimal and proportional to the weight of the thread.

Theorem 1. T′ equals T_LB when each node allocates its resource in proportion to its resident components' load indices, as:

w_i(t) = w_i = ω_{n(i)} LI_i / ∑_{p∈K_{n(i)}} LI_p for all i ∈ A and t ≥ 0, (6)

where n(i) denotes the node in which component i resides.

Proof. A component's instantaneous resource availability RA_i(t), which is the available fraction of the resource when the component requests it at time t, is greater than or equal to its assigned weight proportion:

RA_i(t) ≥ w_i(t)/ω_{n(i)} for t ≥ 0. (7)

The service time S_i(t) is the time taken to process a task at time t and is related to RA_i(t) by:

∫_t^{t+S_i(t)} RA_i(τ) dτ = f_i(v_i). (8)

Suppose a component i receives its tasks at a constant interval of T_LB/L_i. Then, under proportional allocation, S_i(t) is less than or equal to T_LB/L_i over time, as shown in (9).

f_i(v_i) = ∫_t^{t+S_i(t)} RA_i(τ) dτ ≥ ∫_t^{t+S_i(t)} (w_i(t)/ω_{n(i)}) dτ = (LI_i / ∑_{p∈K_{n(i)}} LI_p) S_i(t) ≥ (LI_i / T_LB) S_i(t)
⇒ T_LB/L_i ≥ S_i(t) for t ≥ 0 (9)

So, under proportional allocation, any component that receives its tasks at a constant interval of T_LB/L_i from t=0 (first task arrival time) can complete by T_LB and generates tasks at a constant interval of T_LB/L_i from t=T_LB/L_i (first task generation time). As tasks are infinitesimal and root tasks increase task availability, each component can receive infinitesimal tasks at a constant interval in 0≤t≤T_LB, or more preferably, and complete at or before T_LB. So, the network completes at T_LB under proportional allocation. □

From Theorem 1 we conjecture that a network can achieve a performance close to T_LB under proportional allocation in the limit of a large number of tasks. If nodes do not follow the proportional allocation policy, some components can receive their tasks less preferably than at a constant interval, resulting in underutilization and consequently an increased completion time. Also, it is optimal for each component to use a pure strategy: each component's optimal strategy in the network with maximal task availability is a pure strategy due to the convexity of the value functions, and the network can achieve the optimal performance under proportional allocation. Though we assumed a hypothetical weighted round-robin server, which is difficult to realize in practice, the arguments remain sound because they are based on worst-case analysis and, in reality, the quantum size is negligible compared to the working horizon.
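Continuing the Fig. 1 illustration, the proportional weight assignment of (6) is a per-node normalization of load indices; ω_n = 1 for every node is our assumption.

```python
# Eq. (6): w_i = ω_{n(i)} * LI_i / Σ_{p ∈ K_{n(i)}} LI_p.
# LI values follow the Fig. 1 illustration; ω_n = 1 is assumed.
LI    = {"A1": 100.0, "A2": 100.0, "A3": 200.0, "A4": 100.0}
node  = {"A1": "N1", "A2": "N1", "A3": "N2", "A4": "N3"}
omega = {"N1": 1.0, "N2": 1.0, "N3": 1.0}

node_load = {n: sum(LI[i] for i in LI if node[i] == n) for n in omega}
w = {i: omega[node[i]] * LI[i] / node_load[node[i]] for i in LI}
```

Only N_1 hosts two components, so only its weight actually splits; a component alone on its node simply receives the whole managed weight.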

3.2 Programming model

As discussed, each component's optimal strategy is a pure strategy, and the completion time T is close to T_LB under proportional resource allocation in the limit of a large number of tasks. Now consider the current time t. To keep the load index up to date as the system moves on, we slightly modify it to represent the total CPU time for the remaining tasks:

LI_i(t) = R_i(t) + L_i(t) f_i(v_i), (10)

in which R_i(t) denotes the remaining CPU time of a task in process and L_i(t) the number of remaining tasks excluding the task in process. After identifying the initial number of tasks L_i(0)=L_i, each component updates L_i(t) by counting down as it processes tasks.

Then, under proportional resource allocation, the completion time T can be estimated as:

T − t ≈ Max_{n∈N} ∑_{i∈K_n} [R_i(t) + L_i(t) f_i(v_i)]. (11)

This estimate leads to a programming model in a straightforward way. Given a completion time T, it is optimal for a node n to select modes by the following:

Max ∑_{i∈K_n} L_i(t) v_i (12)

subject to

∑_{i∈K_n} [R_i(t) + L_i(t) f_i(v_i)] ≤ T − t. (13)
)] ≤ T − t . (13)


Consequently, the programming model can be formulated with two sub-models: the optimization model in (14) and the resource allocation model in (15). The optimization model maximizes QoS by trading off the value of solution against the cost of completion time, and the resource allocation model allocates resources in proportion to the load indices of the resident components based on the solution of (14).

Programming model

Max_{v,T} ∑_{i∈A} L_i(t) v_i − CCT(T)
s.t. ∑_{i∈K_n} [R_i(t) + L_i(t) f_i(v_i)] ≤ T − t for all n ∈ N
v_i(min) ≤ v_i ≤ v_i(max) for all i ∈ A (14)

w_i* = ω_{n(i)} [R_i(t) + L_i(t) f_i(v_i*)] / ∑_{p∈K_{n(i)}} [R_p(t) + L_p(t) f_p(v_p*)] (15)

The optimal QoS from (14) with t=0 forms a QoS upper bound QoS_UB, and a network can achieve a performance close to QoS_UB in the limit of a large number of tasks. The programming model is efficient in terms of complexity because the two kinds of control actions are completely separated. It is solvable in polynomial time, as will be discussed in the next section.

4. Decentralization

The next question is how to decentralize the mathematical programming model. Centralized control mechanisms scale badly because computational and communication overheads increase rapidly with system size. Moreover, a single point of failure at the controller will often bring down the complete system, making the network non-robust. Decentralization can address these issues by


distributing the computations and communications to multiple entities in the system. There are two popular methods for decentralizing structured programming models: decomposition methods and auction/bidding algorithms. Considering the compatible structure of our programming model, we decentralize it through a non-iterative auction mechanism, the so-called multiple-unit auction with variable supply [21]. In this auction, a seller may be able and willing to adjust the supply as a function of the bidding.

4.1 Auction market design

In the programming model we have built, all nodes and components are coupled with each other. However, the model has a convenient structure: the objective function and constraints become separable across nodes once the single variable T is fixed. This characteristic makes it possible to solve the model through an auctioning process for T. The completion time T is an unbounded resource whose supply can be adjusted as a function of the bidding.

To design the auction market we define two types of participants in addition to the components: the Seller and the Resource Managers. There is one seller in the system, which determines T* based on the bids from the resource managers. The resource manager of each node manages the resource of that node and arbitrates between its components and the seller.

We define T_i as the resource available to component i, which is required minimally in the amount T_i(min) as in (16) and maximally T_i(max) as in (17).

T_i(min) = R_i(t) + L_i(t) f_i(v_i(min)) (16)

T_i(max) = R_i(t) + L_i(t) f_i(v_i(max)) (17)

A component i bids to its resource manager with its maximal value as a function of T_i, as in (18). The resource manager bids to the seller with the maximal total value of its components as a function of T, based on the bids from its components, as in (19). The seller decides T* from the resource managers' bids, taking into account the cost of T, as in (20). After the seller broadcasts T*, each resource manager decides T_i* and w_i* as in (21) and (22). In (21), T_i* is bounded above by the maximally required resource T_i(max), so that the resource can be allocated in proportion to the components' load indices. Each component then selects its optimal value mode within the limit of T_i*, as in (23). This auctioning process gives a solution equivalent to that of the programming model.

Auctioning model

Component's bid

b_i(T_i) = −∞ if T_i < T_i(min)
         = L_i(t) v_i(max) if T_i > T_i(max)
         = L_i(t) f_i^{−1}((T_i − R_i(t))/L_i(t)) otherwise (18)

Resource manager's bid

b_n(T) = −∞ if T < ∑_{i∈K_n} T_i(min)
       = ∑_{i∈K_n} b_i(T_i(max)) if T > ∑_{i∈K_n} T_i(max)
       = Max { ∑_{i∈K_n} b_i(T_i) : ∑_{i∈K_n} T_i ≤ T − t } otherwise (19)

Seller's decision

T* = argmax_T ∑_{n∈N} b_n(T) − CCT(T) (20)

Resource manager's decision

{T_i* : i ∈ K_n} = argmax_{{T_i : i∈K_n}} { ∑_{i∈K_n} b_i(T_i) : ∑_{i∈K_n} T_i ≤ min(T* − t, ∑_{i∈K_n} T_i(max)) } (21)


w_i* = ω_{n(i)} T_i* / ∑_{p∈K_{n(i)}} T_p* (22)

Component's decision

v_i* = f_i^{−1}((T_i* − R_i(t))/L_i(t)) (23)
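To make the message flow of (16)-(23) concrete, the following toy run uses simplifying assumptions that are ours, not the paper's: a single node, linear f_i(v) = c_i·v (so f_i^{-1}(x) = x/c_i), no task in process (R_i(t) = 0), and brute-force search over a grid of candidate times in place of the piecewise-linear greedy of Section 4.2.

```python
import itertools

# Toy walk-through of the auctioning model (18)-(23). Assumptions (ours):
# one node, linear f_i(v) = c_i * v with inverse f_i^{-1}(x) = x / c_i,
# R_i(t) = 0, and grid search instead of the greedy of Section 4.2.
comps = {                                   # illustrative data
    "A1": dict(L=100, c=0.01, vmin=2, vmax=5),
    "A2": dict(L=100, c=0.02, vmin=2, vmax=5),
}
CCT = lambda T: 1.5 * T                     # cost of completion time
t = 0.0                                     # current time
grid = [0.5 * k for k in range(41)]         # candidate T and T_i values

def t_bounds(p):                            # (16)-(17) with R_i(t) = 0
    return p["L"] * p["c"] * p["vmin"], p["L"] * p["c"] * p["vmax"]

def bid(p, Ti):                             # component's bid (18)
    lo, hi = t_bounds(p)
    if Ti < lo:
        return float("-inf")
    if Ti > hi:
        return p["L"] * p["vmax"]
    return Ti / p["c"]                      # L_i * f_i^{-1}(T_i / L_i)

def manager_bid(T):                         # resource manager's bid (19)
    allocs = (a for a in itertools.product(grid, repeat=len(comps))
              if sum(a) <= T - t)
    return max((sum(bid(p, Ti) for p, Ti in zip(comps.values(), a))
                for a in allocs), default=float("-inf"))

# Seller's decision (20): with one node, the sum over n has one term.
T_star = max(grid, key=lambda T: manager_bid(T) - CCT(T))

# Resource manager's decision (21): split T* among the components.
alloc_star = max((a for a in itertools.product(grid, repeat=len(comps))
                  if sum(a) <= T_star - t),
                 key=lambda a: sum(bid(p, Ti)
                                   for p, Ti in zip(comps.values(), a)))

# Component's decision (23): recover the value mode from T_i*.
v_star = {name: (Ti / p["L"]) / p["c"]
          for (name, p), Ti in zip(comps.items(), alloc_star)}
```

Here the per-unit value of CPU time exceeds its cost for both components, so the seller grants each component its maximal resource and both components end up at their maximal value mode.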

4.2 Analysis

The resource manager's bidding function b_n(T) in (19) can be composed by analogy with the solution algorithm for the fractional knapsack problem. In the fractional knapsack problem, there are multiple items that can be broken into fractions. Given the unit weight and unit value of each item, the problem is to determine the amount of each item so as to maximize total value subject to a weight capacity. The fractional knapsack problem is easily solved by a greedy algorithm: take as much as possible of the item that is most valuable per unit weight until the capacity is reached. Similarly, b_n(T) can be composed using a greedy algorithm. As each b_i(T_i) in (18) is a piecewise-linear increasing concave function, take the most valuable piece per unit of T_i among the first available pieces until all pieces are taken. This greedy algorithm builds the resource manager's bidding function in O(|K_n|^2), where |X| denotes the cardinality of set X. Similarly, the resource manager's decision problem in (21) can be solved in O(|K_n|^2) using the same greedy algorithm, except that (fractional) pieces are taken only until the capacity is reached. So, the complexity of all resource managers' local problems is O(|A|^2) in the worst case, which occurs when |N|=1.
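The greedy composition can be sketched as follows. Because each b_i is concave, its pieces have non-increasing slopes, so a global sort by slope never takes a later piece of a component before an earlier one. The piece tuples below are illustrative, and we assume the mandatory minima T_i(min) have already been granted out of the capacity.

```python
def greedy_allocate(pieces, capacity):
    """Greedy for (19)/(21): pieces are (component, length, slope) with
    slope = bid value per unit of T_i; fractional takes are allowed."""
    total = 0.0
    for comp, length, slope in sorted(pieces, key=lambda p: -p[2]):
        take = min(length, capacity)   # take as much of the best piece
        total += take * slope
        capacity -= take
        if capacity <= 0.0:
            break
    return total
```

Sorting gives an O(k log k) variant of the O(|K_n|^2) procedure in the text; the concavity of the bids is what makes the slope-ordered greedy exact.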

The seller's decision problem in (20) is a single-variable problem, which can be solved using diverse search methods depending on the structure of the objective function. As each b_n(T) is a piecewise-linear increasing concave function, ∑b_n(T) is also piecewise-linear increasing concave, and its number of pieces is proportional to |A|. To compose ∑b_n(T) from the individual b_n(T), sort the starting T of each piece in ascending order (O(|A|log|A|)) and, for each T, sum the b_n(T) by moving along the corresponding pieces (O(|A||N|)). So, the seller can compose ∑b_n(T) in O(|A|^2) in the worst case, which occurs when |N|=|A|. Once ∑b_n(T) is composed, the complexity of the decision problem is proportional to the number of pieces, and it is solvable in O(|A|). So, the seller's decision problem is solvable in O(|A|^2). The complexity of the other local problems, such as (18), (22), and (23), is O(|A|).

Therefore, if the auctioning model were solved in a centralized controller, it would be solvable in O(|A|^2); that is, the complexity of the programming model is O(|A|^2). The auctioning model, however, improves scalability because the computations and communications are distributed among multiple market participants. Components as well as resource managers solve their local problems in parallel rather than sequentially, which reduces the time taken to solve the programming model. In addition, the participants communicate locally in terms of bids rather than sending all details to a centralized controller.

5. Empirical results

We ran several experiments using discrete-event simulation to validate the designed control mechanism. Though we use a small network in the experimentation for validation purposes, the decentralized model in particular can handle much larger networks.

5.1 Experimental design

The experimental network is composed of sixteen components in seven nodes, as shown in Fig. 2. Each component in the lowest position has root tasks as indicated in the figure. The value function is [figure omitted] for A_7 and A_8, and [figure omitted] for the others. Also, ω_n is 1 for all n∈N and


CPU is allocated using weighted round-robin scheduling in which the CPU time received by each component in a round is equal to its assigned weight.

Fig. 2. Experimental network configuration
[Figure omitted: sixteen components A_1-A_16 over seven nodes N_1-N_7; root tasks: 200 each for A_9-A_12 and 400 each for A_13-A_16.]

We set up six experimental conditions as shown in Table 1. We vary the cost of completion time, and the distribution of CPU time is either deterministic or exponential. For conditions with a stochastic value function we repeat each experiment five times. QoS_UB is calculated from (14) with t=0 for each condition, as shown in the table.

Table 1. Experimental conditions

Condition  CCT(T)  f_i(v_i)       QoS_UB
Con1-1     0.5T    Deterministic  30000
Con1-2     0.5T    Exponential    30000
Con2-1     1.5T    Deterministic  19200
Con2-2     1.5T    Exponential    19200
Con3-1     2.5T    Deterministic  12000
Con3-2     2.5T    Exponential    12000

We use ten control policies for each experimental condition, as shown in Table 2. The first eight policies (FX-XX) use fixed value modes over time. In the predictive control policies (PC-XX), components select value modes by solving the optimization model in (14). In round-robin resource allocation (XX-RR) the components in each node are assigned equal weights, while in proportional allocation (XX-PA) the weights are proportional to the components' load indices, as in (15). PC-PA is the control policy corresponding to the programming model we have developed. The system makes decisions every 100 time units.

Table 2. Control policies used for experimentation

Control policy  Description
F2-RR   v_i = 2 for all i with round-robin allocation
F2-PA   v_i = 2 for all i with proportional allocation
F3-RR   v_i = 3 for all i with round-robin allocation
F3-PA   v_i = 3 for all i with proportional allocation
F4-RR   v_i = 4 for all i with round-robin allocation
F4-PA   v_i = 4 for all i with proportional allocation
F5-RR   v_i = 5 for all i with round-robin allocation
F5-PA   v_i = 5 for all i with proportional allocation
PC-RR   Predictive control with round-robin allocation
PC-PA   Predictive control with proportional allocation

5.2 Results

Numerical results from the experimentation are shown in Table 3. PC-PA gives the best performance, close to QoS_UB, under all conditions. As the cost of completion time increases, the system under PC-PA completes earlier as a result of the trade-off between the value of solution and the cost of completion time. In many cases the value of solution under PC-PA is even larger despite the shorter completion time, because the programming model gives the maximal value of solution for a given completion time.

Though both PC-PA and PC-RR choose value modes by solving the optimization model in (14), PC-RR gives worse performance because the optimization model is built presuming proportional resource allocation. Proportional allocation shows significant advantages over round-robin allocation in all thirty instances of comparison. This superiority supports the optimality of proportional resource allocation and, consequently, the effectiveness of the programming model.


Table 3. Experimental results

            F2-RR  F2-PA  F3-RR  F3-PA  F4-RR  F4-PA  F5-RR  F5-PA  PC-RR  PC-PA
Con1-1  T    5614   4814   8019   7219  10423   9624  12828  12028  12828  12028
        V   14400  14400  21600  21600  28800  28800  36000  36000  36000  36000
        QoS 11593  11993  17590  17990  23588  23988  29585  29985  29585  29985
        %   0.386  0.400  0.586  0.600  0.786  0.800  0.986  1.000  0.986  1.000
Con1-2  T    5592   4993   8093   7114  10356   9593  12846  11885  12846  11885
        V   14400  14400  21600  21600  28800  28800  36000  36000  36000  36000
        QoS 11604  11903  17553  18043  23622  24004  29577  30058  29577  30058
        %   0.387  0.397  0.585  0.601  0.787  0.800  0.986  1.002  0.986  1.002
Con2-1  T    5614   4814   8019   7219  10423   9624  12828  12028  11282   9742
        V   14400  14400  21600  21600  28800  28800  36000  36000  34283  33700
        QoS  5980   7179   9572  10771  13164  14364  16757  17956  17359  19087
        %   0.311  0.374  0.499  0.561  0.686  0.748  0.873  0.935  0.904  0.994
Con2-2  T    5592   4993   8093   7114  10356   9593  12846  11885  11313  10062
        V   14400  14400  21600  21600  28800  28800  36000  36000  34183  33845
        QoS  6011   6910   9460  10929  13266  14411  16731  18173  17214  18752
        %   0.313  0.360  0.493  0.569  0.691  0.751  0.871  0.947  0.897  0.977
Con3-1  T    5614   4814   8019   7219  10423   9624  12828  12028   6171   4881
        V   14400  14400  21600  21600  28800  28800  36000  36000  25354  24055
        QoS   366   2365   1553   3553   2741   4740   3928   5928   9927  11853
        %   0.031  0.197  0.129  0.296  0.228  0.395  0.327  0.494  0.827  0.988
Con3-2  T    5592   4993   8093   7114  10356   9593  12846  11885   6309   5089
        V   14400  14400  21600  21600  28800  28800  36000  36000  25593  24277
        QoS   419   1917   1367   3816   2910   4818   3886   6289   9820  11554
        %   0.035  0.160  0.114  0.318  0.243  0.402  0.324  0.524  0.818  0.963

T: Completion time, V: Value of solution, %: QoS/QoS_UB

6. Conclusions<br />

The increasing complexity of modern software systems gives rise to the need for more sophisticated yet scalable control mechanisms. In this paper we designed such a control mechanism for an emerging information network. The network is large-scale, with a distributed, component-based architecture, and its behavior can be controlled through algorithm selection and resource allocation. In the designed control mechanism, an auction market coordinates the components of the network to produce optimal decisions, and the market opens periodically for each current system state.



Our work can be extended by providing adaptivity to changing stress environments. Because modern systems are easily exposed to adverse events such as accidental failures and malicious attacks, there is a need to adapt to such environments. Since adverse events affect the system by limiting its available resources, such environments could be modeled by quantifying the system's resource availability through appropriate sensors.

Acknowledgements<br />

The authors acknowledge the support for this research provided by <strong>DARPA</strong> (Grant#:<br />

MDA972-01-1-0038) under the UltraLog program.<br />

References<br />

[1] B. Meyer, “On to components”, IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.<br />

[2] P. Clements, “From subroutine to subsystems: Component-based software development,” in<br />

Component Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press,<br />

1996, pp. 3-6.<br />

[3] M. O. McCracken, A. Snavely, <strong>and</strong> A. Malony, “Performance modeling for dynamic<br />

algorithm selection,” in Proc. Int. Conf. Computational Science, 2003, pp. 749-758.<br />

[4] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A.<br />

Quilici, D. S. Rosenblum, <strong>and</strong> A. L. Wolf, “An architecture-based approach to self-adaptive<br />

software,” IEEE Intelligent Systems, vol. 14, no. 3, pp. 54-62, 1999.<br />

[5] D. Moore, W. Wright, <strong>and</strong> R. Kilmer, “Control surfaces for Cougaar,” in Proc. First Open<br />

Cougaar Conference, 2004, pp. 37-44.<br />

[6] W. Peng, V. Manikonda, <strong>and</strong> S. Kumara, “Underst<strong>and</strong>ing agent societies using distributed



monitoring <strong>and</strong> profiling,” in Proc. First Open Cougaar Conference, 2004, pp. 53-60.<br />

[7] H. Gupta, Y. Hong, H. P. Thadakamalla, V. Manikonda, S. Kumara, <strong>and</strong> W. Peng, “Using<br />

predictors to improve the robustness of multi-agent systems: Design <strong>and</strong> implementation in<br />

Cougaar,” in Proc. First Open Cougaar Conference, 2004, pp. 81-88.<br />

[8] D. Moore, A. Helsinger, <strong>and</strong> D. Wells, “Deconfliction in ultra-large MAS: Issues <strong>and</strong> a<br />

potential architecture,” in Proc. First Open Cougaar Conference, 2004, pp. 125-133.<br />

[9] R. D. Snyder <strong>and</strong> D. C. Mackenzie, “Cougaar agent communities,” in Proc. First Open<br />

Cougaar Conference, 2004, pp. 143-147.<br />

[10] J. B. Rawlings, “Tutorial overview of model predictive control,” IEEE Control Systems, vol.<br />

20, no. 3, pp. 38-52, 2000.<br />

[11] M. Morari <strong>and</strong> J. H. Lee, “Model predictive control: Past, present <strong>and</strong> future,” Computers<br />

<strong>and</strong> Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999.<br />

[12] M. Nikolaou, “Model predictive controllers: A critical synthesis of theory <strong>and</strong> industrial<br />

needs,” Advances in Chemical Engineering Series, Academic Press, 2001.<br />

[13] S. J. Qin <strong>and</strong> T. A. Badgwell, “A survey of industrial model predictive technology,” Control<br />

Engineering Practice, vol. 11, pp. 733-764, 2003.<br />

[14] J. Regehr, “Some guidelines for proportional share CPU scheduling in general-purpose<br />

operating systems,” Presented as a work in progress at 22nd IEEE Real-Time Systems<br />

Symposium, London, UK, Dec. 3-6, 2001.<br />

[15] I. Stoica, H. Abdel-Wahab, J. Gehrke, K. Jeffay, S. K. Baruah, and C. G. Plaxton, “A

proportional share resource allocation algorithm for real-time, time-shared systems,” in<br />

Proc. 17th IEEE Real-Time Systems Symposium, 1996, pp. 288-299.<br />

[16] C. A. Waldspurger <strong>and</strong> W. E. Weihl, “Lottery scheduling: Flexible proportional-share



resource management,” in Proc. First Symposium on Operating System Design <strong>and</strong><br />

Implementation, 1994, pp. 1-11.<br />

[17] C. Waldspurger <strong>and</strong> W. Weihl, “Stride scheduling: Deterministic proportional-share<br />

resource management,” Lab. for Computer Science, Massachusetts Institute of Technology,<br />

Cambridge, MA, Tech. Rep. MIT/LCS/TM-528, 1995.<br />

[18] C. Waldspurger, “Lottery <strong>and</strong> stride scheduling: Flexible proportional share resource<br />

management,” Ph.D. dissertation, Lab. for Computer Science, Massachusetts Institute of<br />

Technology, Cambridge, MA, 1995.<br />

[19] T. Gonzalez <strong>and</strong> S. Sahni, “Flowshop <strong>and</strong> jobshop schedules: Complexity <strong>and</strong><br />

approximation,” Operations Research, vol. 26, pp. 36-52, 1978.<br />

[20] J. K. Lenstra, A. H. G. Rinnooy Kan, and P. Brucker, “Complexity of machine scheduling problems,”

Annals of Discrete Mathematics, vol. 1, pp. 343-362, 1977.<br />

[21] Y. Lengwiler, “The multiple unit auction with variable supply,” Economic Theory, vol. 14,<br />

no. 2, pp. 373-392, 1999.


Coordinating Control Decisions of Software Agents for Adaptation to Dynamic<br />

Environments<br />

Y. Hong 1 , S. R. T. Kumara 1<br />

1 Harold <strong>and</strong> Inge Marcus Department of <strong>Industrial</strong> <strong>and</strong> <strong>Manufacturing</strong> Engineering<br />

The Pennsylvania State University, University Park, PA, 16802, USA<br />

Abstract<br />

We suggest a design for an infrastructure-level load control mechanism for a multiagent system, Cougaar. The purpose of control is to strengthen the robustness of a software multiagent system with respect to load balancing, so that the system can keep working without disastrous performance degradation even under occasionally harsh running environments. Resource control in multiagent systems is carried out mainly through agents' self-control, which makes the control problem very difficult. We suggest a hierarchical control structure that reduces the complexity of control while inducing coherent movement of agents.

Keywords:<br />

load balancing, hierarchical control, multi-agent system<br />

1 INTRODUCTION<br />

Multiagent systems have significant advantages in the development of complex distributed software systems [1]. Agents are naturally matched to components in complex systems, so complicated interactions among subcomponents can be represented as agent interactions. Due to the modularity and autonomy of agents, an application can be composed by assembling agents. Multiagent systems are also flexible in design: partial changes to the system can be localized to a few agents without affecting the rest of the system. Thus, constructing or altering a large software system can become easier with agent technology.

In addition to these advantages in designing and constructing a large system, robustness is also important if multiagent systems are to be a good software construction technology. Robustness of software is "the ability of software to react appropriately to abnormal circumstances" [2]. Like many biological or man-made systems, a software system can cope with uncertainties in dynamic environments through feedback controls and redundancy of components (agents), improving its robustness at the expense of increased complexity [3][4].

Time-varying computational load is one threat to robustness. A sudden excessive workload can degrade performance to the point where the system cannot meet minimum requirements on response time. This is especially critical for real-time applications. Because agent systems are distributed and decentralized, it is hard to build a control mechanism by which agents can adapt to changing environments effectively and coherently. To address this problem, we suggest an infrastructure-level load control mechanism for a multiagent system, Cougaar. We consider an infrastructure-level control mechanism because it greatly reduces the effort application developers must spend to secure the robustness of software with respect to load control. Multiagent systems such as Cougaar [5] and Jade [6] provide many infrastructure-level services, which save application developers the effort required to build basic functions of the multiagent system. A load control function can be included in the infrastructure, and its necessity has been emphasized [7]. The infrastructure can hide the complexity of controlling resource allocation, so that application developers tune performance using high-level abstract parameters for load control.

2. LOAD BALANCING IN MULTIAGENT SYSTEMS<br />

In multiagent systems, system functions are decomposed<br />

into software agents. Agents carry out system functions by<br />

exchanging services with each other [7]. Agents have their<br />

own work <strong>and</strong> specialize in a specific service. Agents<br />

request some service from another agent who is<br />

specialized in that service. Providing the service requires<br />

the use of some computational resource such as CPU<br />

time. Agents are distributed on multiple machines, which<br />

are connected through communication networks. More<br />

than one agent can be on a machine <strong>and</strong> share the CPU<br />

time. The frequency of each agent's service requests varies over time, depending on the real-world process the application deals with.

Considerable research has been done on dynamic load balancing for computer clusters. However, it cannot be applied directly to a multiagent system [7]. As noted by Chow and Kwok [7], multiagent systems (MAS) differ from computer clusters with respect to load balancing. First, in MAS, agents run continuously, while in computer clusters, jobs submitted by users are killed after completion. Second, communications between agents in a multiagent system are highly variable, whereas communications between jobs usually have static patterns. Another difference, not pointed out by Chow and Kwok, is that agents can proactively manage their workload.


Load balancing has not received much attention in MAS studies [7][8], and there are few papers on multiagent load balancing. Schaerf et al. [9] study how an agent can adapt to its environment. They separate the resources from the agents; in their model, agents assign their jobs to these resources. Using reinforcement learning, they show that agents can adapt to each other under fixed or even dynamic loads. Chow and Kwok [7] devise an agent reallocation algorithm, called the 'Comet' algorithm, which selects agents to be moved to other machines. Agents are distributed on multiple machines, and the Comet algorithm chooses agents based on credits, which are continuously evaluated for each agent; the agent with low credit is moved. The credit decreases as the agent's workload increases or as the agent communicates more with agents on other machines. We consider an agent system environment similar to Chow and Kwok's; however, we add agents' self-regulation of their workload.

3 QUEUEING MODEL FOR WORKLOAD DYNAMICS<br />

We conjecture that workload dynamics can be modeled as a queueing system. A service request from outside or from other agents can be seen as a customer in a queueing system. While a request is being served, later incoming service requests wait in the queue. We consider a situation in which agents have multiple alternative algorithms for providing their service. These algorithms trade off computation time against solution quality. Thus, depending on the workload in the queue, an agent can choose an optimal algorithm to improve the overall performance measure. This is similar to anytime algorithm composition [10]. With an anytime algorithm, one must determine the time duration within which the algorithm solves the problem. Here, we assume that the problem-solving time is not predetermined; instead, it is a statistical characteristic of the algorithm. From the queueing model perspective, this can be seen as a service rate control problem [11].
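One concrete form such service rate control could take is a one-step-lookahead rule that switches to a faster, lower-quality algorithm when the queue grows. The costs and mean service times below are invented for illustration; this is a heuristic sketch, not an optimal policy:

```python
def choose_service_mode(queue_length, holding_cost=1.0, penalty_fast=4.0):
    """Pick 'slow' (high quality) or 'fast' (low quality, penalized).

    One-step lookahead: compare the holding cost accrued while serving one
    request against the quality penalty charged for the fast algorithm.
    All parameter values are illustrative assumptions.
    """
    modes = {"slow": 3.0, "fast": 1.0}   # hypothetical mean service times
    def cost(mode):
        penalty = penalty_fast if mode == "fast" else 0.0
        return holding_cost * queue_length * modes[mode] + penalty
    return min(modes, key=cost)
```

With these numbers, a short queue keeps the slow, high-quality algorithm, while a backlog of five or more requests makes the fast algorithm's penalty worth paying.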

The multiagent system infrastructure can have a facility whereby each machine works as a server by assigning computational resources (run-rights) to the agents for CPU time-sharing. We call this server a node. This can be seen as a polling model, which has been used to model time-sharing in computer operating systems and link sharing in communication networks. The node can give priority to a certain agent by visiting that agent more frequently. The node can monitor the amount of workload or the arrival rate of service requests through the agents and, based on the detected changes, adjust an agent's priority.

Imbalance among machines can be controlled by reallocating agents from a highly loaded machine to a lightly loaded machine for better performance. In this paper, however, we consider only agent- and machine-level control.

4 DECENTRALIZED CONTROL<br />

In view of the workload dynamics described above, load control in multiagent systems can be seen as a decentralized stochastic control problem [12]. A decentralized control system consists of multiple control posts that locally sense and control the part of the system they take charge of; however, their controls collectively influence the system dynamics. Thus, to operate the system optimally, the decisions taken by the controllers must be compatible and coherent. Information sharing between agents is the main issue: for the controllers to obtain globally optimal control decisions, exchanging all local information is inevitable, and each controller must make its decision by solving a larger problem that accounts for the other agents' movements. This is unrealistic for multiagent systems because of long communication times, and finding optimal controls may be very difficult due to the size of the problem. It is therefore difficult, or almost impossible, to find a purely decentralized optimal control policy for a multiagent system in this way. There are a few systems in which locally made decisions are globally compatible [13], but this holds only for specific problems. Thus, we need a control structure in which each control component (for example, an agent) makes control decisions by communicating only with closely connected components, while well-coordinated decisions can still be generated. In this paper, we suggest a hierarchical control structure that aims to achieve these goals.

5 HIERARCHICAL CONTROL<br />

5.1 General Description<br />

To manage the complexity of large-scale problems, hierarchical control approaches have been studied in various areas [14][15]. For multiagent systems, hierarchical control has been adopted as an intermediate form between centralized and decentralized control, trading off the advantages of the two approaches [14]. Hierarchical control requires less computation than centralized control for gathering information and finding an optimal control; on the other hand, it has better coordination capabilities than decentralized control.

We consider three levels of hierarchy: the entire system, nodes, and agents. There are usually multiple subcomponents under a higher-level controller, i.e., multiple agents under a node and multiple nodes under a top controller. Agents and their node controller have a direct communication connection through which they share information; information sharing is restricted to components that are connected in the hierarchy. In this case, an agent reports its workload and performance (see 5.2), and the node controller announces control (the visit order and frequency) or state information, such as estimates of environment parameters that can be observed more effectively by the node than by the agents. Similar information exchanges occur between nodes and the top controller: nodes report their node-level workload trends to the top controller, and the top controller can announce system-level environment parameters and order agents to move from one node to another.
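The upward reports and downward announcements in this hierarchy can be sketched as two plain record types. The class and field names are hypothetical, chosen for illustration; Cougaar does not define these messages:

```python
from dataclasses import dataclass

@dataclass
class AgentReport:
    """Sent upward: an agent's workload and performance summary."""
    agent_id: str
    queue_length: int          # current backlog of waiting work
    avg_service_time: float    # recent mean service time

@dataclass
class NodeDirective:
    """Sent downward: control and state information for one agent."""
    agent_id: str
    visit_frequency: float     # run-right share granted to the agent
    est_arrival_rate: float    # node's estimate of an environment parameter
```

A node controller would aggregate `AgentReport`s to detect load trends and answer with per-agent `NodeDirective`s, keeping the exchanged state small and local as the text describes.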

We assume the control frequency differs across levels: it is higher at lower levels than at higher levels. Control decisions on service rate are more frequent than changes to the CPU-time assignment policy at the node level; for a given arrival rate and configuration, the node-level CPU-time assignment policy will not change until the arrival rate or configuration changes. At higher levels, events are less frequent and the time intervals between events are longer, so the system can be said to operate on multiple time scales depending on the level [16]. Our problem differs from other multi-time-scale problems in that there are multiple components at the lower levels. In addition, it is reasonable to assume that the environment does not change so frequently that the system cannot estimate environment parameters and exercise control.

A higher-level controller has coarser information than the lower-level subsystems. It can have better global information over its territory because it collects information from its subcomponents, but it does not use the gathered information directly as a system state; it works at a coarser scale, neglecting a certain range of fluctuations in the measurements from subcomponents. In this framework, controls at one level do not affect the higher level, but higher-level controls constrain the working conditions of lower-level components and thus reduce the degrees of freedom of the lower-level control problem.

We now describe the optimal control problem in which the features of the hierarchical control structure are reflected.

5.2 Optimal Control Problem<br />

The load control problem is to find optimal control policies for each component that minimize the long-run cost of the overall system while it is subject to time-varying computational workload. The cost function we optimize through load control is a multi-objective function of holding cost and a penalty cost for service quality. Performance is measured for each agent independently, and the performance of the overall system is assumed to be the sum of the individual agents' performance.

Each agent can have different algorithms for a service, depending on the service type. For simplicity, we assume that every agent has two algorithms, called Level 6 and Level 2. The Level 6 algorithm produces higher-quality solutions on average, while the Level 2 algorithm requires less computation time. The default is the Level 6 algorithm, and we impose a penalty cost for using the Level 2 algorithm. Thus, when the system is congested, using the Level 2 algorithm can be helpful even though it incurs the penalty cost.

There can be various stresses, such as (1) an unexpected increase in service requests and (2) loss of CPU time to other applications. Stress (1) can be seen as an arrival rate change in the queueing model; in agent systems, arrivals of service requests can be time-varying and bursty. Stress (2) can be seen as the server in a polling system serving an imaginary additional queue (an agent) at random times and for random durations. The sources of both stresses are environmental factors that we cannot control, so we model them as random events, and the controller chooses control actions considering estimates of these events.

An increasing arrival rate can be modeled using a Markov-modulated Poisson process (MMPP). An MMPP describes an arrival rate that changes depending on the state of a source: if the source state is i, the arrival process is Poisson with rate λ_i. The source state is modeled as a continuous-time Markov chain [16].
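A two-state MMPP is straightforward to simulate: a continuous-time Markov chain toggles the source state, and arrivals within each sojourn are Poisson with that state's rate. The parameter values used below are illustrative assumptions, not values from the paper:

```python
import random

def simulate_mmpp(rates, switch_rates, horizon, seed=0):
    """Sample arrival times from a two-state Markov-modulated Poisson process.

    While the modulating chain is in state i, arrivals are Poisson with rate
    rates[i]; the chain leaves state i at exponential rate switch_rates[i].
    """
    rng = random.Random(seed)
    t, state, arrivals = 0.0, 0, []
    while t < horizon:
        sojourn_end = min(t + rng.expovariate(switch_rates[state]), horizon)
        while True:                      # Poisson arrivals within this sojourn
            t += rng.expovariate(rates[state])
            if t >= sojourn_end:
                break
            arrivals.append(t)
        t = sojourn_end
        state = 1 - state                # toggle the two-state chain
    return arrivals
```

Calling `simulate_mmpp([0.5, 5.0], [0.1, 0.1], 100.0)` produces a bursty trace: long quiet stretches in the low-rate state punctuated by dense clusters in the high-rate state.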

The problem definitions differ by level. From the agent's perspective, the problem is to find an optimal service rate in a situation with random server vacations. A server vacation means that other agents hold the run-right and process their work; the frequency of visits to the agent is controlled by the node. The agent should find an optimal service control policy for the given expected vacation time. To the best of our knowledge, service rate control in server vacation problems has not been studied; usually, server vacation control problems study the optimal service beginning time or the optimal time to add an additional server (the removable server case) [18].

A node faces the problem of finding a polling policy. The run-right assignment frequency will not change often, because we assume that the circumstances around agents change infrequently. Thus, for a given node state, the problem is to find a static polling table [19]; whenever the node state changes, the node picks another appropriate polling table.
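A static polling table can be sketched by spreading each agent's visits evenly over one cycle, where the weight (visits per cycle) encodes the priority the node grants the agent. This is an illustrative construction of our own, not the table-design method of [19]:

```python
def build_polling_table(weights):
    """Build one polling cycle from {agent: visits_per_cycle}.

    Visit k of an agent with weight w is placed at fractional position
    (k + 0.5) / w, which spreads that agent's visits evenly through the
    cycle; sorting the positions yields the visit order.
    """
    slots = []
    for agent, w in weights.items():
        slots += [((k + 0.5) / w, agent) for k in range(w)]
    return [agent for _, agent in sorted(slots)]
```

For example, weights `{"A": 2, "B": 1, "C": 1}` yield the cycle `["A", "B", "C", "A"]`: agent A is visited twice per cycle, at evenly spaced points, rather than twice in a row.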

The problem is fairly complex because the self-regulating agents share a single resource, the CPU, and are therefore interdependent: one agent's decision can affect other agents' waiting times for the CPU. This feature makes our problem significantly different from other queueing models.

6 HIERARCHICAL CONTROL FOR COUGAAR<br />

In this section, we show how to apply our hierarchical<br />

control ideas to a multiagent system, Cougaar [5]. We use<br />

Cougaar version 10.2.1.<br />

6.1 The Cougaar Infrastructure<br />

Distinguishing features of Cougaar are blackboard communication and plugins [5]. An agent consists of plugins that implement the agent's functions; in Cougaar, it is recommended that an agent's functions be divided into sufficiently small program modules, the plugins. Plugins communicate by posting and reading messages on the blackboard that exists in every agent, and communication between agents is also conducted through the blackboard.

Cougaar runs its code (plugins and some infrastructure-level modules) via shared threads. The total number of shared threads in a node is limited, which means the number of simultaneously running plugins cannot exceed this upper limit. The number of threads available to the node is fixed at the initial loading stage.

Cougaar also provides mobile agent functions: agents can move from one machine to another. This function can be used for load control as well, by moving agents from highly loaded machines to less loaded machines.

6.2 Queueing Model in Cougaar<br />

An agent usually has many plugins in Cougaar applications [5]. From the infrastructure's perspective, a plugin can be seen as a unit of workload, i.e., a job: if the agent does not get a thread for a plugin, the plugin is put in the queue until the agent obtains a thread.

A service request is processed in an agent through a sequence of plugins. The service request is represented as an object called a Task. As Tasks go through the plugins, they are expanded or aggregated and finally allocated to assets, which represent actual physical resources. Each plugin repeats a series of steps: retrieving Tasks, processing them, and publishing another Task. For example, if the application is a planning system in which re-planning is triggered whenever there are discrepancies between the plan and the real world, we would see a continuous arrival of plugins to the queues in the agents. This phenomenon can be naturally described by a queueing model.

6.3 Control Using Thread Services<br />

Although the shared threads are managed by the thread service, they are not currently used for control purposes: the existing infrastructure simply assigns threads to agents in a round-robin fashion. We build a control structure on top of the thread services so that agent- and node-level controls are feasible in the hierarchical control structure described above. Through this structure, agents assign threads to plugins, and nodes assign threads to agents, according to a predetermined scheme via the thread service.

After a plugin finishes its work, it releases the run-right, so the run-right can be reassigned to other plugins. A node, in turn, can dynamically change the limit on the number of run-rights of each agent: if a certain agent has a high workload, the node can reduce the number of run-rights of other agents so that the loaded agent gets more opportunities to run its work.
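A node-level policy of this kind can be sketched as follows: each agent keeps at least one run-right, and the spare rights are re-split in proportion to the backlogs the agents report. This is an illustrative policy of our own, not Cougaar's actual RightsSelector logic:

```python
def rebalance_run_rights(backlogs, total_rights):
    """Split total_rights among agents: one guaranteed right each, the rest
    proportional to reported backlog (largest-remainder rounding).
    Assumes total_rights >= number of agents."""
    n = len(backlogs)
    spare = total_rights - n
    total = sum(backlogs.values()) or 1
    quota = {a: spare * b / total for a, b in backlogs.items()}
    rights = {a: 1 + int(q) for a, q in quota.items()}
    # hand out the leftover rights to the largest fractional remainders
    leftover = total_rights - sum(rights.values())
    for a in sorted(quota, key=lambda k: quota[k] - int(quota[k]),
                    reverse=True)[:leftover]:
        rights[a] += 1
    return rights
```

For backlogs `{"A": 8, "B": 1, "C": 1}` and 8 rights, agent A receives 5 rights while B and C keep their guaranteed minimum plus the remainder, matching the text's intent of shifting run-rights toward the loaded agent.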

Figure 1: Infrastructure-level control structure within a node. The node's scheduler holds a resource allocator (RightsSelector) at the root of a TreeNode structure; each agent's scheduler has a sensor (ThreadListener) and a DynamicSortedQueue.

Figure 1 shows a schematic representation of the hierarchical control structure within a node. In Cougaar, nodes and agents have their own schedulers, organized in a tree data structure. Cougaar does not provide a direct communication channel between agents and nodes, so we modified the schedulers to exchange feedback reports and control messages. Each agent can monitor every plugin's arrival, service start, and service end through the ThreadListener. Plugins have algorithms for processing Tasks, and an agent can choose a plugin's algorithm for service rate control through its DynamicSortedQueue, which every agent has; a Java interface to Plugin was added to let the agent set the algorithm in a plugin. The node can assign the run-right (thread) to a specific agent using the RightsSelector, and we let the scheduler hold a set of control policies. Experiments on a small example agent society showed that the modified infrastructure can effectively control plugins and run-right assignment.

7 CONCLUDING REMARKS<br />

Load balancing in multiagent systems differs from other load balancing problems because of agents' self-regulation and the highly dynamic communication load between agents [7]. This paper discussed a hierarchical control structure that helps agents and nodes make control decisions based on local information while achieving overall optimal system performance. In addition, the higher-level controller's estimates of changes in system parameters can help agents adapt to a changing environment.

Agent’s self-regulation makes load-balancing problem<br />

significantly difficult. Each agent wants to use more CPU<br />

time. However, if every agent uses CPU time greedily,<br />

overall system performance may not be optimal. This could<br />

be seen as a Social dilemma. Use of Game theory is a<br />

promising approach in finding equilibrium among agents.<br />

8 ACKNOWLEDGMENTS

The work described here was performed under DARPA UltraLog Grant # MDA972-1-1-0038. The authors wish to acknowledge DARPA for its generous support.

9 REFERENCES

[1] Jennings, N.R., 2001, An agent-based approach for building complex software systems, Communications of the ACM, 44/4:35-41.

[2] Meyer, B., 1997, Object-Oriented Software Construction, Second Edition, Upper Saddle River, N.J., Prentice Hall.

[3] Csete, M.E. and Doyle, J.C., 2002, Reverse Engineering of Biological Complexity, Science, 295:1664-1669.

[4] Huhns, M.N. and Holderfield, V.T., 2002, Robust Software, IEEE Internet Computing, March/April:80-82.

[5] Cougaar Open Source Site. http://www.cougaar.org

[6] Java Agent DEvelopment Framework (JADE). http://sharon.cselt.it/projects/jade/

[7] Chow, K. and Kwok, Y., 2002, On Load Balancing for Distributed Multiagent Computing, IEEE Transactions on Parallel and Distributed Systems, 13/8:787-801.

[8] Lee, L.C., Nwana, H.S., Ndumu, D.T. and De Wilde, P., 1998, The stability, scalability and performance of multiagent systems, BT Technology Journal, 16/3:94-103.

[9] Schaerf, A., Shoham, Y. and Tennenholtz, M., 1995, Adaptive Load Balancing: A Study in Multiagent Learning, Journal of Artificial Intelligence Research, 2:475-500.

[10] Zilberstein, S. and Russell, S., 1996, Optimal composition of real-time systems, Artificial Intelligence, 82:181-213.

[11] George, J.M. and Harrison, J.M., 2001, Dynamic control of a queue with adjustable service rate, Operations Research, 49/5:720-731.

[12] Ooi, J.M., Verbout, S.M., Ludwig, J.T. and Wornell, G.W., 1997, A Separation Theorem for Periodic Sharing Information Patterns in Decentralized Control, IEEE Transactions on Automatic Control, 32/2:1546-1550.

[13] Yao, D.D. and Schechner, Z., 1989, Decentralized Control of Service Rates in a Closed Jackson Network, IEEE Transactions on Automatic Control, 42/11:236-240.

[14] Lygeros, J., Godbole, D.N. and Sastry, S., 1997, A Design Framework for Hierarchical, Hybrid Control, California PATH Research Report, UCB-ITS-PRR-97-24.

[15] Gershwin, S.B., 1989, Hierarchical Flow Control: A Framework for Scheduling and Planning Discrete Events in Manufacturing Systems, Proceedings of the IEEE, 77/1:195-209.

[16] Chang, H.S., Fard, P.J., Marcus, S.I. and Shayman, M., 2003, Multitime Scale Markov Decision Processes, IEEE Transactions on Automatic Control, 48/6:976-987.

[17] Gusella, R., 1991, Characterizing the variability of arrival processes with indexes of dispersion, IEEE Journal on Selected Areas in Communications, 9/2:203-211.

[18] Zhang, R., Phillis, Y.A. and Zhu, X., 1998, Fuzzy Control of Queueing Systems with Removable Servers, IEEE International Conference on Systems, Man, and Cybernetics, 3:2160-2165.

[19] Levy, H. and Sidi, M., 1990, Polling Systems: Applications, Modeling, and Optimization, IEEE Transactions on Communications, 38/10:1750-1760.


Understanding Agent Societies Using Distributed Monitoring and Profiling

†Wilbur Peng, †Vikram Manikonda, and ‡Soundar Kumara

†Intelligent Automation Incorporated
7519 Standish Place, Suite 200, Rockville, MD 20855
{wpeng, vikram}@i-a-i.com

‡Industrial and Manufacturing Engineering
310 Leonhard Building, The Pennsylvania State University, University Park, PA 16802
{skumara}@psu.edu

Abstract

In this paper, we describe methodologies for understanding large-scale agent societies using Castellan, a distributed profiling and logging system developed for Cougaar. Castellan enables detailed, efficient logging of blackboard plan activity. We describe the design, functionality, and use of the Castellan tool, along with a number of its applications, including a visualization and data mining tool based on a flexible algorithm for finding subgraph isomorphisms. By mapping "equivalent" graph nodes and edges to representative subgraph elements, the graph reduction approach reduces plan graphs of hundreds of thousands to millions of nodes to meaningful, understandable clusters and graph nodes. The algorithm is demonstrated on event traces obtained by running Castellan within a military logistics planning society. In addition to providing data for static analysis after planning and execution, the Castellan approach is also useful for on-line analysis of active, running agent systems. We also describe a number of other potential applications of distributed monitoring for modeling, control, load balancing and analysis.

1. Introduction

Distributed agent systems pose significant challenges for debugging, testing, profiling and tuning. Agent societies consist of distributed, state-encapsulated entities that run concurrently. State encapsulation means that no agent has direct access to the state of other agents; instead, agents interact solely through message passing. Within an agent, different functions can interact by sharing state.

The Cougaar agent infrastructure supports an approach to distributed planning in which tasks are created and expanded into subtasks by agents, and can in turn be forwarded to other agents. The planning process creates a plan graph that spans multiple agents and can grow very large, to hundreds of thousands or millions of elements. Adding to the complexity, the plan graph generated by the agents can be dynamically modified during the planning and execution phases of the society. As Cougaar agent societies increase in size and scope, understanding the distributed execution of the system becomes increasingly difficult, and the ability to trace time-evolving, event-driven behavior across agents in running societies becomes increasingly important.

In this paper, we discuss methods for understanding, analyzing and controlling Cougaar agent societies through distributed profiling. Section 1.1 covers background concepts in distributed planning as used by Cougaar. Section 2 introduces the Castellan profiling and logging system and describes its design and implementation. Section 3 presents in detail an application of Castellan to data mining and visualization using a plan graph reduction algorithm. Finally, Section 4 discusses potential applications of Castellan.

1.1 Distributed plan graphs in Cougaar societies

In this section, we review some basic concepts of planning in the Cougaar context. Additional details about plan representation can be found in [3].

In Cougaar applications such as logistics planning and execution, agents generate plans by decomposing tasks into subtasks, aggregating tasks, and forwarding tasks to other organizational entities, which are in turn represented by other agents.
represented by other agents.


In the Cougaar planning model, the basic element is a task. Each task has a unique identifier (UID) and a set of fields including the task verb (e.g. "Supply", "Project", "Transport") and the direct object (e.g. the UID of the Asset on which the task acts).

Each task must be associated with a plan element during the distributed planning process. Plan elements include:

• Allocations. An allocation is an assignment of a task to a particular asset. The asset can be locally represented (e.g. an inventory) or an organizational asset (e.g. a customer organization allocates a task T to a Supplier asset; here, the Supplier asset represents an actual agent to which T will be forwarded).

• Expansions. Decompositions of tasks into subtasks.

• Aggregations. These collect multiple tasks into a single task.
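The task and plan-element model above can be sketched as plain Java types. This is an illustration only; the class and field names are assumptions for exposition, not Cougaar's actual API:

```java
// Illustrative model of the Cougaar planning elements described above;
// names are assumptions, not the real Cougaar classes.
import java.util.List;

public class PlanModel {
    /** A task: unique identifier plus verb and direct-object asset UID. */
    public record Task(String uid, String verb, String directObjectUid) {}

    /** An allocation assigns a task to an asset (local or organizational). */
    public record Allocation(String taskUid, String assetUid) {}

    /** An expansion decomposes a task into subtasks. */
    public record Expansion(String parentTaskUid, List<String> subtaskUids) {}

    /** An aggregation collects multiple tasks into a single task. */
    public record Aggregation(List<String> sourceTaskUids, String combinedTaskUid) {}
}
```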

Each agent can therefore be modeled as taking a set of tasks as input, generating local blackboard tasks and plan elements, and producing as output a set of tasks to be forwarded to the representative agent(s). (In Cougaar, tasks are forwarded to another agent by the logic provider if they are allocated to the local organization asset that represents the target agent.) All the blackboard and task elements are assumed to be persistent, unique objects.

The result of a single planning run is logically a connected, distributed plan graph that spans multiple agents and multiple nodes. In addition, the plan graph may evolve and change during replanning as tasks are rescinded, modified and replanned.

2. Castellan System Design and Implementation

The primary distinguishing aspect of Castellan is that it provides the ability to observe the time-evolving state of the distributed agent blackboards rather than only the final state after planning has completed.

The Castellan system has two parts: the client implementation, which monitors planning at the agents, and the server implementation, which collects the logs accumulated by the client side. Figure 1 shows the concept of operations: a set of agents is monitored and sends event traces to a server application, which in turn can log them to a database or feed them directly to monitoring and analysis applications. In the current implementation of Castellan, the server application is itself implemented as a plugin that can be embedded in a Cougaar agent.

Castellan has evolved to support the following modes of operation on the client side:

• Plugin-based execution. A monitoring client plugin loaded in each agent subscribes to all modifications to the blackboard.

• Logic-provider-based execution. A Castellan logic provider is attached to each agent and monitors changes to the blackboard through the logic provider interface.

The primary difference between the two approaches is that the latter can also observe the source of each change to the blackboard and the number of execute cycles associated with each change. This makes it useful in debugging, since it shows which plugins execute and how many execute cycles each plugin loaded for an agent consumes. These features support debugging and detailed performance analysis of agents.

Figure 1. Castellan System Concept (monitored agents with sensors stream events over the event protocol to the Castellan server, which feeds an event database and plan analysis applications)

As agent execution proceeds, the client implementation generates a stream of events for each task and plan element added to, changed on, or removed from the system blackboard. The event trace and logging protocol extracts a subset of the data encapsulated by the tasks, assets, and plan elements, sufficient to reconstitute the entire plan graph. This subset includes:

• The unique identifier, encoded as a symbol id rather than a string.

• The timestamps associated with the blackboard action (both simulation and wall-clock time).

• For tasks, the verb, encoded as a symbol id.

• For tasks, the UID of the direct object, also encoded as a symbol.

• For allocations, the allocation results observed.
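The fields above suggest an event record along the following lines. This is an assumed shape for illustration, not the actual Castellan wire format; note that symbol ids stand in for the strings they compress:

```java
// Illustrative event record for the trace protocol described above;
// the field set and names are assumptions, not the real Castellan format.
public record PlanEvent(
        int uidSymbol,          // unique identifier, as a symbol id
        long simTime,           // simulation timestamp of the blackboard action
        long wallTime,          // wall-clock timestamp of the blackboard action
        int verbSymbol,         // for tasks: the verb, as a symbol id
        int directObjectSymbol  // for tasks: the direct-object UID, as a symbol id
) {}
```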

A design objective of Castellan was to reduce the amount of bandwidth consumed by the event traces while retaining key data and minimizing CPU consumption. This was accomplished through a variety of approaches, including:

• Compressing UIDs and other symbols using a space-efficient symbol-id-based protocol.

• Detecting and transmitting only changes in the allocation results for each task.

• Batching mechanisms, e.g. serializing batches of messages rather than individual messages.
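The first bullet, symbol-id compression, amounts to interning: each repeated string is transmitted once, and subsequent events carry only a small integer id. A minimal sketch of the idea (an assumed design, not the actual Castellan protocol):

```java
// Sketch of the symbol-id idea described above: each symbol string is
// assigned a small integer id on first sight; the sender transmits
// (id, string) once and just the id thereafter.
import java.util.HashMap;
import java.util.Map;

public class SymbolTable {
    private final Map<String, Integer> ids = new HashMap<>();

    /** Returns the id for a symbol, assigning the next free id on first sight. */
    public int intern(String symbol) {
        Integer id = ids.get(symbol);
        if (id == null) {
            id = ids.size();        // ids are dense: 0, 1, 2, ...
            ids.put(symbol, id);
        }
        return id;
    }

    public int size() { return ids.size(); }
}
```

Because UIDs and verbs recur heavily in a plan trace, replacing each repeated string with a few bytes of id is where most of the savings comes from.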

The generated stream of events can be delivered to the Castellan server implementation during planning. The message transport between the client and server implementations can be varied depending on the application. Currently, a buffered relay is used to transmit events in batches of serialized events. (A relay is a point-to-point communications mechanism between Cougaar agents.) This form of "in-band" communication shares the communications channel with other Cougaar message traffic and hence is best suited to distributed control applications in which agents need to be aware of the detailed planning status of other agents within the society. Alternative message transport implementations (e.g. using a separate communications backplane) can provide non-intrusive analysis for debugging and testing.

The scalability of Castellan as a real-time monitoring tool is limited primarily by the following factors:

• The bandwidth of the links between the monitored agents and the server. For example, a 160,000-element event trace from a 30-agent, 60-minute planning run consumes approximately 8 MB of bandwidth in total. This is not significant for a LAN test environment but may be an issue in real-world operating environments.

• The CPU overhead on each agent being monitored. In a small Cougaar society of ~30 agents, this was measured to increase total planning time by approximately 5%.

• The processing bottleneck at the server agent. While receiving the data is not very expensive, inserting the results into a SQL database at run time tends to overwhelm most typical processors.
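As a back-of-the-envelope check on the first bullet's figures (using only the numbers reported above), 8 MB spread over 160,000 events averages roughly 50 bytes per event, and over a 60-minute run the average rate is only a couple of KB per second:

```java
// Rough bandwidth arithmetic from the figures reported above.
public class BandwidthCheck {
    /** Average serialized size of one event. */
    public static long bytesPerEvent(long totalBytes, long events) {
        return totalBytes / events;
    }

    /** Average transfer rate over the run, in KB/s. */
    public static double kbPerSecond(long totalBytes, long seconds) {
        return totalBytes / 1024.0 / seconds;
    }
}
```

At an average of a few KB/s per monitored society, the bandwidth concern is less the mean rate than the bursts during intense planning activity.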

In large agent societies, we do not expect that all agents can be monitored by a single server, due to the volume of data generated. Instead, Castellan can be configured to monitor any desired subset of agents, e.g. a community or an enclave. For debugging and profiling purposes, the data can then be merged after the run is complete.

3. Graph Reduction Using Subgraph Isomorphism

This section describes a significant application of the Castellan system that enables real-time and off-line analysis of the planning functions of a distributed agent system. Existing graph clustering and reduction approaches have been used in data mining applications to find meaningful repeated subgraphs within a larger graph [1][2].

The plan graphs generated during a single planning run of a Cougaar agent society can be extremely large. For example, the event traces generated from a relatively small logistics-planning test society of more than thirty agents comprised over one hundred sixty thousand events and a correspondingly large plan graph with thousands of individual tasks, plan elements and assets. In this section, a general graph reduction algorithm is described that can be used to reduce the size of the plan graph in a manner tailored to specific applications.

We define a plan graph P = {N, E} as a set of nodes N = {T, A} and a set of directed, attributed edges E = {L, X, G}. Here, the nodes N consist of a set of tasks T and a set of assets A. The attributes associated with each task t ∈ T are expressed as a tuple (V, Uid, Agent) and can include other properties depending on the amount of detail collected within the event trace. The plan graph P is a directed acyclic graph (DAG), since no cycles can exist under the current Cougaar task grammar. The output of the algorithm is a reduced graph R = {N', E'}.

The set of edges E consists of a set of allocations L, a set of expansions X, and a set of aggregations G. The nature of the Cougaar task grammar also dictates a number of additional constraints on the plan graph:

• Each task node t1 ∈ T can be connected either to an asset node a ∈ A through an allocation edge l ∈ L or to another task node t2 through an expansion or aggregation edge e ∈ {X, G}.

• Each task node t ∈ T has exactly one incoming edge (and hence one parent node), except for those task nodes which are connected by a set of aggregation edges G' ⊂ G.

• The set of aggregation edges G is subdivided into disjoint subsets which share the same destination node.

• The set of expansion edges X is subdivided into disjoint subsets which share the same source node.

• A task t is connected to exactly one task by an edge e unless e is a member of the set of expansion edges X.

The basic principle behind the graph reduction approach is as follows. For each type of reduced graph mapping R, we define equivalence criteria between nodes in the graph to find abstract nodes. An example criterion C1 would be "all tasks at the same agent with parent tasks originating from another agent." In applying the graph reduction algorithm, all nodes that satisfy this criterion are aggregated into a single node.

Also, for every graph reduction mapping R, we define an equivalence criterion for abstract edges. Abstract edges aggregate equivalent subgraphs into a single representative edge.

Continuing the example, consider a criterion C2 that states "subgraphs that connect aggregate nodes satisfying C1 to organizational assets associated with the same agent as C1." Also, we define a second node equivalence criterion C3 as "all tasks at the same agent which are allocated to assets representing external organizations."

We apply the criteria C1, C2 and C3 to a subgraph t1(→x1)t2(→x2)t3(→l2)a1 associated with agent A. This subgraph represents a task t1 expanded to a task t2, which is in turn expanded to a task t3 and allocated to an asset a1. Here, we assume that t1 has a parent external to A, and that the asset a1 is an organizational asset representing agent B. The task node t1 satisfies C1 and hence is aggregated into an abstract node n1. Similarly, the task node t3 satisfies C3 and is aggregated into a node n2. The subgraph (→x1)t2(→x2) therefore matches C2 and is associated with a single edge e1 ∈ E' connecting the two abstract nodes n1, n2 ∈ N'.

The computational complexity of the approach varies with the difficulty of finding and matching isomorphic subgraphs. Cougaar task graphs are generally well structured, and for most of the equivalence criteria described below, subgraphs can be matched using a (worst-case) O(n) graph traversal, where n is the size of the subgraph. For the equivalence criteria described in the next section, all subgraphs fall within a single agent's plan graph, bounding the size of the matched subgraphs. If m is the total number of abstract nodes discovered, the total computational complexity is O(m * n).

3.1 Algorithm implementation and applications

The implementation of the graph reduction algorithm within Castellan allows generation of the task graph from an arbitrary stream of events. Except for asset and organizational information, a complete plan graph is not necessary to use this approach to derive reduced graphs.

The following types of reduced graphs were found to be useful for understanding Cougaar societies and are implemented within the Castellan system.

Aggregate task graphs are defined using an equivalence criterion that maps tasks of the same type with identical verbs to single nodes. This equivalence criterion also requires a strict ordering in depth between tasks which are aggregated: in order to map a node n to an abstract node n', the node n's parent (and all of its ancestors by implication) must map to an abstract node n2' which is an ancestor of n'. This requirement is imposed to prevent cycles from appearing. Figure 2 shows a conceptual representation of task aggregation in which multiple "similar" subgraphs are collapsed into a single aggregate subgraph. An example of a task aggregate plan graph is shown in Figure 3.

Asset dependency graphs treat the assets (both organizational and physical) as abstract nodes. In this case, no aggregation of assets is performed, as all assets are considered unique. The criteria for abstract nodes and edges are as follows:

• All assets are mapped to abstract nodes. (Optionally, additional asset matching criteria can be introduced to aggregate assets.)

• In addition, all agents that generate tasks are designated as "Source" abstract nodes. (These serve as the roots of the reduced DAG.)

• All tasks and allocations that form plan graph dependencies between different assets are mapped to a single abstract edge.

Asset dependency graphs are useful for finding both organizational and physical dependencies within a distributed plan.

Workflow graphs characterize the input/output relationships between agents and are particularly useful for tracking the dependencies between agents incurred during distributed planning under a particular society configuration. The equivalence criteria for abstract nodes and edges are defined as follows:

• Task nodes that are on the boundary (i.e. have a parent from another agent) and have identical verbs are considered equivalent.

• All task nodes with identical verbs allocated to another agent are equivalent.

• All subgraphs linking boundary nodes of the two types described above are mapped to the same abstract edges.


An example of a workflow graph is shown in Figure 4, with a detailed blowup in Figure 5. This particular graph was extracted from an event trace comprising more than 160,000 events and more than thirty agents; even so, it clearly shows the input/output relationships between agents and the number of each type of task transmitted between agents. Each box contains the abstract nodes belonging to a single agent. The hexagonal nodes depict sets of tasks with identical verbs which are generated in a specific agent and subsequently forwarded to other agents for planning; the light-colored boxes depict abstract nodes representing tasks which are inputs to the agent; and the ovals represent tasks allocated to assets within the agent. This type of graph is useful for deriving the dependencies between agents by finding the types of tasks that are inputs to and outputs from agents while filtering out the internal details of the plan graph within each agent. A potential application would be a smart load balancer that anticipates the generation and allocation of tasks and assigns higher priorities accordingly.

Figure 2. Example of Task Aggregation

In summary, each of these reduced graphs provides a different logical view of the overall task graph, useful for different purposes.

Figure 3. Example Task Aggregate Graph Reduction



capture the detailed interactions that span multiple nodes, applications and plugins, nor can they track the dynamic evolution of system state during planning and execution. Moreover, agent systems may be non-deterministic, producing different results on each run. In the absence of such tools, understanding and debugging agent systems becomes exceedingly difficult.

Castellan can be used to analyze global agent society<br />

behavior. The plan graph reduction algorithms can be<br />

used to evaluate the completeness of the plan <strong>and</strong> to<br />

confirm whether or not the patterns of plan generation are<br />

correct. The approach can also identify groups of tasks,<br />

which are not complete, i.e. which have not been<br />

associated with any plan element.<br />

Profiling tools are also often useful to increase <strong>and</strong><br />

optimize performance. Although distributed agent<br />

systems can benefit from parallelism, often serial<br />

bottlenecks may be present, e.g. planning/execution may<br />

be depending on single agents within the system that<br />

constrain the rest of the planning process.<br />

Figure 4. Example Workflow Graph<br />

1-6-INFBN<br />

Transport(2)<br />

47-FSB<br />

Supply(338)<br />

Supply(692)<br />

Withdraw(692)<br />

ProjectSupply(96)<br />

ProjectSupply(249)<br />

ProjectWithdraw(357)<br />

ProjectWithdraw(33)<br />

Transport(2)<br />

1-35-ARBN<br />

The Castellan event traces can provide useful information<br />

by capturing the time dependent evolution of the plan<br />

rather than capturing a single snapshot at the end of<br />

planning. Moreover, the event traces measure the time to<br />

perform planning actions that may be time consuming,<br />

enabling the identification of hotspots <strong>and</strong> bottlenecks.<br />

4.2 On-Line Control <strong>and</strong> Monitoring<br />

Applications<br />

Supply(354)<br />

ProjectSupply(153)<br />

Transport(2)<br />

ProjectWithdraw(42)<br />

ProjectSupply(85)<br />

Supply(375)<br />

In the current version of Castellan, applications such as<br />

workflow analysis, visualization <strong>and</strong> data mining that<br />

used static event trace databases have been supported.<br />

However, the concepts <strong>and</strong> approaches used in Castellan<br />

can be applied to on-line analysis as well.<br />

Figure 5 Example Workflow Graph (Detail)<br />

4. Discussion<br />

Applications of distributed logging <strong>and</strong> monitoring<br />

applications within large-scale agent societies include<br />

both offline static analysis <strong>and</strong> on-line sensors <strong>and</strong><br />

monitoring.<br />

4.1 Profiling <strong>and</strong> Debugging Applications<br />

Conventional debugging tools are inadequate to h<strong>and</strong>le<br />

large-scale agent based systems. They cannot easily<br />

Sensors <strong>and</strong> control strategies that require prediction of<br />

system performance can benefit from distributed<br />

monitoring. These include:<br />

• Falling behind sensors. We have used Castellan to extract data streams to build falling behind sensors that can predict whether the society as a whole is falling behind due to excessive CPU load. In this case, the data from Castellan was used to train various neural network systems that would inform agents within the system when the society was in danger of falling behind.

• Load balancing sensors. Based on the workflow analysis, it is possible to determine dynamically, at runtime, the flow of tasks between multiple agents within the monitored enclave. With such a model present, it becomes possible to identify the processing requirements of tasks as they flow through the agent society and hence allocate resources accordingly as planning/execution progresses.
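As a concrete illustration of the falling-behind idea, the sketch below substitutes a hand-weighted linear score for the trained neural networks described above; the window size, feature names (CPU load, task backlog) and weights are all invented for illustration, not taken from the actual sensors:

```python
from collections import deque

class FallingBehindSensor:
    """Sketch of a falling-behind predictor over a Castellan-style event stream.

    The real system trained neural networks; a hand-set linear score over
    two hypothetical features (CPU load, task backlog) stands in for them.
    """
    def __init__(self, window=5, w_cpu=2.0, w_backlog=0.05, bias=-2.0):
        self.cpu = deque(maxlen=window)
        self.backlog = deque(maxlen=window)
        self.w_cpu, self.w_backlog, self.bias = w_cpu, w_backlog, bias

    def observe(self, cpu_load, backlog):
        # Append one observation extracted from the event stream.
        self.cpu.append(cpu_load)
        self.backlog.append(backlog)

    def falling_behind(self):
        # Score the recent window; positive score means "in danger".
        if not self.cpu:
            return False
        avg_cpu = sum(self.cpu) / len(self.cpu)
        avg_backlog = sum(self.backlog) / len(self.backlog)
        score = self.w_cpu * avg_cpu + self.w_backlog * avg_backlog + self.bias
        return score > 0.0

sensor = FallingBehindSensor()
for cpu, backlog in [(0.4, 10), (0.5, 12), (0.95, 60), (0.97, 80)]:
    sensor.observe(cpu, backlog)
print(sensor.falling_behind())  # -> True
```

In the deployed system the prediction would be published back to the agents so they can shed load before deadlines are missed.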

5. Conclusions<br />

Tools for monitoring agent systems have been noticeably missing from many agent infrastructures. As Cougaar has evolved and been applied to increasingly large societies and complex applications, the need for systems such as Castellan that provide detailed run-time event information will increase, both for off-line analysis of static event traces and for on-line monitoring applications. Castellan also provides a general-purpose graph reduction algorithm that enables a wide variety of approaches to analyzing and understanding large distributed plan graphs.

6. Acknowledgements<br />

This research was performed under the DARPA Ultralog effort and was supported by DARPA grant MDA972-1-1-0038 and Contract 2087-IAI-ARPA-0038. We would like to thank Dr. Mark Greaves, Marshall Brinn and Beth DePass for their support, comments and insightful discussions.

7. References<br />

[1] Emden R. Gansner and Stephen C. North. "An open graph visualization system and its applications to software engineering", Software Practice and Experience, pp. 1–5, 1999.
[2] Jonyer, L. B. Holder, and D. J. Cook. "Graph-Based Hierarchical Conceptual Clustering in Structural Databases", in Proceedings of the Seventeenth National Conference on Artificial Intelligence, 2000.
[3] Cougaar Developers Guide, Version 11.0, http://www.cougaar.org.


Reliable MAS Performance Prediction Using Queueing Models<br />

Nathan Gnanasambandam, Seokcheon Lee, Natarajan Gautam, Soundar R.T. Kumara
Pennsylvania State University
State College, PA 16801
{gsnathan, stonesky, ngautam, skumara}@psu.edu

Wilbur Peng, Vikram Manikonda
Intelligent Automation Inc.
Rockville, MD 20855
{wpeng, vikram}@i-a-i.com

Marshall Brinn
BBN Technologies
10 Moulton Street, Cambridge, MA 02138
mbrinn@bbn.com

Mark Greaves
DARPA IXO
3701 North Fairfax Drive, Arlington, VA 22203-1714
mgreaves@darpa.mil

Abstract<br />

In this paper, we model a multi-agent system (MAS) in military logistics based on the systemic specifications of the capabilities and attributes of individual agents (TechSpecs). Assuring the survivability of a MAS that implements distributed planning and execution is a significant design-time and run-time challenge. Dynamic battlefield stresses in military logistics range from heavy computational loads (information warfare) to destruction of infrastructure. In order to sustain and recover from damage and continuously deliver performance, a mechanism that distributes knowledge about the capabilities and strategies of the system is crucial. Using a queueing model to represent the network of distributed agents, strategies are developed for a prototype military logistics system. The TechSpecs contain the capabilities of the agents, playbooks or rules, quantities to monitor, types of information flow (input/output), measures of performance (Quality of Service) and their computation methods, measurement points, defenses against stresses, and configuration details (reflecting the command and control structure as well as task flow). With these details, models can be dynamically developed and analyzed in real time to fine-tune the system. Using a Cougaar (DARPA agent framework) based model for initial parameter estimation and analysis, we obtain an analytical and a simulation model and extract generic results. Results indicate strong correlation between experimental and actual events in the agent society.

0-7803-8799-6/04/$20.00 ©2004 IEEE.<br />

Keywords: Multi-agent systems, Survivability, Queueing<br />

network models, Technical specifications<br />

1. Introduction<br />

Multi-agent systems that implement distributed planning and execution are highly complex systems to design and model. In this research, we model a survivable multi-agent system (MAS) based on the systemic specifications (TechSpecs) of the capabilities and attributes of individual agents. The MAS under consideration is exposed to significant stresses because it operates in highly unpredictable, battlefield-like environments. Even under such hostile conditions, the stated goal of this survivable MAS based logistics system is to deliver robustness, security and performance. Hence, performance prediction using suitable models is vital to being able to tune the actual performance delivered by the MAS.

Within the research domain of military logistics, we are conducting our studies using a continuous planning and execution (CPE) agent society. The CPE society is constructed using the Cougaar MAS development platform developed under DARPA's leadership [2]. From the modeling perspective, the CPE society is essentially a collection of distributed agents that lends itself to representation by a network of queues. With this motivation, we analytically modeled the CPE society using queueing theory. In doing so, we realized that if the TechSpecs were suitably specified, the generation of the queueing model could be

accomplished with less human intervention.

Figure 1. Agent Hierarchy in CPE Society

The primary

function of the model is to help evaluate the performance of the MAS and provide alternatives to steer the agent society towards optimal regions of operation, boosting performance in a distributed environment. Therefore the main focus of this research lies in specifying the MAS in a systematic fashion, so that queueing models can be derived from the specification.

1.1 Continuous Planning and Execution Society Overview

The CPE society comprises agents and a world model. Agents in the CPE society assume a combination of command and control and customer-supplier roles, as required in a military logistics scenario. The world model is an artificial source that provides the agents with external stimuli. Figure 1 represents the superior-subordinate and the customer-supplier relations between the brigade (BDE), battalion (BN), company (CPY) and supplier (SUPP) agents as modeled in this research. Each agent in the society constantly performs one or more of the following tasks: 1) evaluates its own perception of the world state through local sensors and remote inputs; 2) performs planning, replanning, plan reconciliation and plan refinement; 3) executes plans, either through local actuators or by sending messages to other agents; 4) adapts to the environment, e.g., centralizing or decentralizing planning as computational resources permit.
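The four recurring activities can be sketched, under heavy simplification, as a single agent loop; all class and method names below are invented for illustration and do not correspond to actual Cougaar plugin APIs:

```python
from dataclasses import dataclass, field

@dataclass
class CPEAgent:
    """Minimal sketch of the four recurring CPE agent activities (hypothetical)."""
    name: str
    world_view: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)

    def sense(self, local_obs, remote_msgs):
        # 1) Evaluate perception of world state from local sensors and remote inputs.
        self.world_view.update(local_obs)
        for msg in remote_msgs:
            self.world_view.update(msg)

    def plan(self):
        # 2) Planning/replanning: here, simply order known targets by position.
        return sorted(self.world_view.get("targets", []))

    def execute(self, plan):
        # 3) Execute by queueing messages to subordinates or actuators.
        for step in plan:
            self.outbox.append((self.name, "engage", step))

    def adapt(self, cpu_load, threshold=0.8):
        # 4) Adapt: centralize planning only while resources permit.
        return "centralized" if cpu_load < threshold else "decentralized"

agent = CPEAgent("CPY-1")
agent.sense({"targets": [5, 2, 9]}, [{"fuel": 70}])
agent.execute(agent.plan())
print(agent.adapt(0.9))  # -> decentralized
```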

1.2 Definitions

The following definitions apply to the system under consideration.

Stresses occur due to the operation of the MAS in battlefield environments, where events such as permanent infrastructure damage and information attacks adversely affect overall system performance.

Based on the planning activity in CPE, we base our measures of performance (MOPs) on the timeliness or freshness of a plan at its point of usage and on the quality of the plan. Based on the requirements of Ultra*Log [3], a broad series of performance measures categorized according to timeliness, completeness, correctness, accountability and confidentiality is available, but is outside the requirements of CPE. Some insights about these MOPs can be gained from [6]. The MOPs are the components of the quality of service (QoS) expected from the system.

Survivability of a distributed agent based system (or otherwise) is the extent to which the quality of service (QoS) of the system is maintained under stress [6].

Although we consider a survivable MAS, we concern ourselves only with performance analysis in this work. We assume that a global controller exists that coordinates between threads relating to performance, robustness and security.

The contents of this paper are organized as follows. In Section 2, we introduce the concept of TechSpecs based design and some of the benefits associated with this approach. We then discuss the components of the CPE society in detail and organize the TechSpecs for CPE into various categories in Section 3. The discussion on TechSpecs leads us further in the direction of how to utilize them to form models. We discuss some models we created in Section 4, where we provide two analytical methods using queueing networks to model a small example in CPE and verify our models using a simulation. Finally, in Section 5 we discuss our conclusions and some possible directions for future research.

2. The Concept of TechSpecs Based Design<br />

Technical Specifications (or TechSpecs) refer to component-wise, static information relating to agent input/output behavior, operating requirements, control actions and their consequences for adaptivity [7]. In addition to outlining a comprehensive set of functionalities, the TechSpecs are responsible for the definition of domain MOPs, their respective computational methodologies and QoS measurement points. The construction of TechSpecs helps us proceed in the following direction:

1. Use the specs to ensure a close mapping between MAS functionality and an abstracted model. An apparent choice here is a queueing model, because of the similarities between multi-class traffic in queueing networks and the different types of flows in CPE.

2. Establish the parameters of the queueing model, from TechSpecs directly (e.g., the update rate at a node) as well as by collecting empirical data from sample runs (e.g., processing times).



3. As the queueing model provides an indication of system performance for a given configuration, use it to quickly explore options for control (choices resulting from adjusting (queueing) parameters or configurations). Once a suitable candidate is obtained, this choice is translated back into the application-level knob settings (for control) to yield better QoS for the MAS.

Figure 2. TechSpecs based MAS Design

The direction that TechSpecs motivates us to take is illustrated in Figure 2. Figure 2 indicates that we could use the specs in an online or offline fashion. Because the functionality is clearly defined using TechSpecs, offline analysis can be carried out independently to remove instabilities from the MAS design. Assuming automatic conversion from a TechSpec to a model is feasible, TechSpecs have a real-time use as well, i.e., using the specs as a template to derive the model. As noted above, the candidate parameters from the queueing model (parameters that may lead to performance improvement) cannot be used directly. Reconverting these choices to actual control knob settings may be handled by a separate global controller. We allude to this in Section 3.2.

It can be noted that the idea of TechSpecs bears analogy to conventional control problems in the electronic or hardware realms, where a technical specification or rating can be leveraged to effect better design and control. This was one of the motivating factors for TechSpecs based design for MAS.

Benefits of TechSpecs

The advantage of establishing comprehensive TechSpecs is that it leads to the codification of requirements, functionalities, measurements and responses to situations. Further, it enhances the potential to aid MAS configuration (what nodes to put agents on) both statically and dynamically. An incomplete list of potential benefits of using a TechSpecs based approach to MAS design is provided below:

• Enhancement of the MAS Design: Since TechSpecs impose the requirement of predictability, the MAS components must be built with fidelity.

• Distribution of Knowledge: TechSpecs carry with them the idea of being composable. By using the TechSpecs of smaller components as building blocks, we can build the TechSpecs of larger systems as the system expands.

• Concurrent Analysis: Model building can be concurrent with actual MAS design. This provides a look-ahead capability to avoid regions of instability or bottlenecks (especially from queueing analysis).
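The idea of using the queueing model to quickly explore control options can be sketched as follows, with M/M/1 sojourn times standing in for the full model; the arrival rate, candidate service rates and the 1.0 s delay target are illustrative assumptions, not values from the CPE society:

```python
def mm1_wait(lam, mu):
    """Mean time in system for an M/M/1 queue; None if the node is unstable."""
    return None if lam >= mu else 1.0 / (mu - lam)

def best_knob(lam, candidate_mus, target=1.0):
    """Pick the cheapest (smallest) service-rate 'knob' meeting a delay target.

    Illustrative only: real control knobs (plan depth, update rate, ...)
    would be reconverted from the chosen queueing parameter by a controller.
    """
    feasible = [(mu, mm1_wait(lam, mu)) for mu in candidate_mus
                if mm1_wait(lam, mu) is not None]
    # Keep candidates whose mean delay is under the target, prefer lowest mu.
    ok = [(mu, w) for mu, w in feasible if w < target]
    return min(ok)[0] if ok else None

print(best_knob(lam=2.0, candidate_mus=[1.5, 2.5, 3.5, 5.0]))  # -> 3.5
```

The sweep is cheap enough to run online, which is what makes an analytical model attractive for real-time control.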

3. CPE Society TechSpecs<br />

In this section we discuss the formulation of TechSpecs. In order to build TechSpecs, the functionalities of the components of the CPE society are defined as described in Section 3.1. We then categorize the capabilities of CPE components in a manner that lends itself to easy translation into the queueing models. We then show through examples how the mapping between a TechSpec and a queueing model can be interpreted. This enables us to analyze the MAS using the models we develop in Section 4.

3.1 Description of CPE Society Components<br />

The World Model: The world model refers to the<br />

conceptual set-up that provides the agents with external<br />

stimuli. It captures a military engagement scenario using a<br />

2-dimensional model of the world. As shown in Figure 3,<br />

CPY agents moving along the x-axis engage an unlimited<br />

supply of targets that move along the y-axis. The targets<br />

move at a fixed rate but engagement slows them down.<br />

While a probabilistic model is chosen for creating and engaging targets, a deterministic model is chosen for fuel consumption (which depends on the distance moved). A logistics model for resupplying the units with fuel or ammunition is based on the demand generated from maneuver plans. Currently, the world model is also implemented as an agent.
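A minimal sketch of one tick of such a world model, with probabilistic target creation and deterministic fuel consumption, might look as follows; the target probability and fuel rate are invented parameters, not values from the CPE society:

```python
import random

def world_step(state, moves, p_target=0.3, fuel_per_unit=2.0, rng=None):
    """One tick of a 2-D world-model sketch (all parameters illustrative).

    Targets appear probabilistically along the y-axis; fuel consumption is
    deterministic in the distance moved, mirroring the models above.
    """
    rng = rng or random.Random(0)
    # Probabilistic target creation.
    if rng.random() < p_target:
        state["targets"].append({"y": 0})
    # Targets advance at a fixed rate unless engagement slows them.
    for t in state["targets"]:
        t["y"] += 0 if state.get("engaging") else 1
    # Deterministic fuel consumption proportional to distance moved.
    for unit, dx in moves.items():
        state["fuel"][unit] -= fuel_per_unit * abs(dx)
    return state

state = {"targets": [], "fuel": {"CPY-1": 100.0}, "engaging": False}
state = world_step(state, {"CPY-1": 3})
print(state["fuel"]["CPY-1"])  # -> 94.0
```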



Figure 3. The World Model<br />

CPY Agent: Each CPY unit is designated a target area for engaging in combat actions. These actions require a superior agent (BN) to supply a maneuver plan to each of the CPY agents. This plan enables the CPY agent to move along the x-axis and engage the enemy by firing. Each of these agents simulates sensors and actuators. The CPY agents consume resources and subsequently forward the demand to SUPP agents. The current status is reported to superior agents to enable replanning.

BN Agent: The BN agent maintains situational awareness of all the agents under its direct command and performs (re)planning for them using a consistent set of observations that is collected continuously. The BN agent executes a branch and bound algorithm of a specified planning depth and breadth to generate a maneuver plan for its subordinates. The BN agent also serves as a medium for transferring orders from superiors to subordinates.

BDE Agent: The BDE agent is responsible for generating maneuver plans for the BN and CPY agents, although this implementation does not empower the BDE with that functionality.

SUPP Agent: SUPP agents represent an abstracted set of supply, inventory and sustainment services. These agents take maneuver plans from the CPY agents and supply them with fuel or ammunition. It is currently assumed that the SUPP units have infinite inventory. Projected and actual consumption depend on the sustainment plan generated from orders and on the presence of enemy targets.

3.1.1 TechSpec Organization<br />

Right at the outset, our goal is to embed enough transparency in the TechSpecs to allow the generation of models (queueing models). Hence, we extract the input/output behavior, state, actions and QoS for each entity within CPE and form the following categories within the TechSpecs:

• Internal State of an Agent: Corresponds to continuously updated variables or data structures reflecting the actual working of the agent.

• Inputs: Distinct classes of information received by an agent.

• Outputs: Information provided to other agents.



• Actions: Determines the actions that need to be taken<br />

as a result of state changes or the dependencies introduced<br />

by input/output operations.<br />

• Operating Modes: The fidelity or the rate at which outputs are sent may relate to the operating mode of an agent. Switching operating modes may be necessary to alter QoS requirements or as a counter-measure to stress.

• QoS Measurement (QoS Measurement Points): Indicates<br />

the measure of performance that needs to be<br />

monitored or measured in order to compute the QoS<br />

at the designated measurement point. For example,<br />

when we consider queueing models, we would be interested<br />

in measuring the average waiting times at different<br />

agents to compute a quantity such as the freshness<br />

of the maneuver plan.<br />

Table 1. TechSpec Categories: Application Perspective

• Tradeoffs: While these may not pertain to every agent, some agents have the capability to trade off a certain measure of performance to gain another. These are specified explicitly in TechSpecs.
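The categories above can be collected into a simple record; the sketch below is a hypothetical layout populated with example entries for a CPY agent, not the actual TechSpec format used in UltraLog:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TechSpec:
    """Sketch of the TechSpec categories listed above (field names invented)."""
    agent: str
    internal_state: List[str] = field(default_factory=list)
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    actions: Dict[str, str] = field(default_factory=dict)    # trigger -> action
    operating_modes: List[str] = field(default_factory=list)
    qos_points: List[str] = field(default_factory=list)      # what to measure, where
    tradeoffs: Dict[str, str] = field(default_factory=dict)  # give up -> gain

cpy_spec = TechSpec(
    agent="CPY",
    internal_state=["position", "fuel", "ammunition"],
    inputs=["external stimuli", "maneuver plan"],
    outputs=["update", "demand"],
    actions={"stimuli received": "generate update task"},
    operating_modes=["high fidelity", "low fidelity"],
    qos_points=["waiting time of plan tasks at CPY"],
    tradeoffs={"update rate": "CPU load"},
)
print(cpy_spec.outputs)  # -> ['update', 'demand']
```

A composable record like this is what would let the TechSpecs of smaller components be assembled into the TechSpecs of larger systems.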

This categorization facilitates the delineation of specific flows of jobs between agents. For example, consider the following flow: external stimuli at CPY get converted to update tasks at CPY, delivered to BN as updates, converted to a maneuver plan at BN, delivered to CPY and then forwarded to SUPP for sustainment. From a queueing theory perspective, the update tasks that originate at CPY and end up at BN for the purpose of planning could constitute a class of traffic, with CPY and BN acting as servers that process these tasks. Similarly, consider the flow where external stimuli received at CPY end up as updates at BDE through BN. This could be regarded as another class of traffic. At this point it is important to notice that classes of traffic can be derived from the input/output details embedded within the TechSpecs. We describe how we handle these flows in the queueing network formulation in Section 4.
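Such flows might be written down as routes of (node, task type) pairs, with class switching detected wherever the task type changes along a route; the node and class names follow the text, but the representation itself is invented for illustration:

```python
# Sketch of the flows described above as queueing traffic classes.
FLOWS = {
    # class 1: updates entering at CPY, forwarded to BDE through BN
    1: [("CPY", "update"), ("BN", "update"), ("BDE", "update")],
    # class 2: updates that class-switch into maneuver plans at BN
    2: [("CPY", "update"), ("BN", "plan")],
    # class 3: maneuver plans reaching SUPP through CPY
    3: [("BN", "plan"), ("CPY", "plan"), ("SUPP", "plan")],
}

def class_switches(route):
    """Return the nodes at which the task type changes along a route."""
    return [node for (prev, (node, kind)) in zip(route, route[1:])
            if prev[1] != kind]

print(class_switches(FLOWS[2]))  # -> ['BN']
```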

Another example of how we could describe something in the application domain (say, a QoS metric) with the queueing model is as follows. If one is interested in how fresh a maneuver plan is at its usage point (i.e., CPY), the model could describe it in terms of the queueing delays for a particular class of traffic. In our application, this very quantity happens to be a QoS metric called maneuver plan freshness. In the actual MAS, this metric is calculated directly from the timestamps that are tagged to the tasks.

3.1.2 TechSpec Representation<br />

Although an elaborate discussion of the format of TechSpec representation is outside the scope of this paper, we present some aspects of the specification directly relating to the application, as well as some infrastructural requirements that need to be part of the specification.

Table 1 presents some TechSpec categories specific to this application. Simply speaking, this is a tabular representation of the information contained in Section 3.1, organized using the aforementioned categories. From Table 1 one can understand that an output called update originates from the CPY agent and travels up to BN, because BN is CPY's superior. Similarly, an output called maneuver plan would reach CPY from BN. One assumption being made here is that updates travel up the hierarchy and plans travel downward. These outputs form part of the different classes of traffic when observed from a queueing perspective. Another example would be that the plan action in the BN agent relates to a functionality in the MAS domain and would simply be abstracted by a processing time in the queueing domain.

In addition to the above specification, static requirements of the agents in terms of infrastructure are also embedded in the TechSpecs. Some of these requirements for the BDE, BN, CPY and SUPP agents are shown in Table 2.

3.2 Translating TechSpecs to the Queueing Domain

In order to translate the specs into queueing models, we first use the following rules:

1. Inputs and outputs are regarded as tasks;

2. The rate at which external stimuli are received is captured by the arrival rate (λ);

3. Actions take time to perform, so they are abstracted by processing times (µ_i);



4. QoS metrics such as freshness are expressed in terms of average waiting times at several nodes (∑ W_ij, where i is the node and j is the class of traffic);

5. If tasks follow a particular route (or flow, as described in Section 3.1.1), then that route is associated with a class of traffic;

6. If a particular task goes into a node and gets converted to another task, we say class switching has occurred. For example, in our application update tasks go to BN and get converted to plan tasks;

7. If a connection exists between two nodes, it is converted to a transition probability p_ij, where i is the source and j is the target node.

Table 2. TechSpecs: Infrastructure Perspective

Using the above rules as well as the aforementioned representations of TechSpecs, we develop a mapping between the TechSpecs and a queueing model. Although the current procedure is manual, in theory it could be automated. Such an automatic capability for translating TechSpecs would prove very beneficial for predicting the performance of the MAS in real time. Table 3 captures the queueing model abstraction from TechSpecs for the CPY agents. Similarly, we can establish the mapping for other agents as well. Some useful guidelines that were followed in order to translate the TechSpecs into models are as follows:

• Identify the flows of traffic: Trace the route followed by each type of packet completely within the system boundary, i.e., from its entry into the system until it exits the system. These routes subsequently form the classes of traffic in the queueing model. Care has to be taken to note any class switching.

• Identify the network type: The network could be closed (fixed number of tasks) or open. The CPE is an open system because tasks constantly enter and exit the system.

• Does any parameter of the model require empirical data from the actual society?

Although some aspects in this research are currently being resolved, the following observations can be made:

• Who does the TechSpecs translation? Where does the model run? In our case the translation is done manually at present. The model would run at a place visible to the controller (possibly as a separate agent at the highest level). The controller we refer to here is the actual effector of control actions throughout the CPE society and is separate from all we have discussed so far. The role of the controller is also to balance between other threads such as robustness and security.

• The identification of control alternatives is currently centralized. However, we visualize a decentralized, hierarchical controller for effecting the changes.
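A minimal sketch of rules 1-3 and 7 applied to a toy, hand-written TechSpec dictionary is shown below; the dictionary layout and all rates are illustrative assumptions, not the real TechSpec schema or measured values:

```python
# Toy TechSpec fragment: stimuli rates, action times, and connections.
techspec = {
    "CPY": {"stimulus_rate": 2.0, "action_time": 0.10,
            "sends_to": {"BN": 1.0}},
    "BN":  {"stimulus_rate": 0.0, "action_time": 0.25,
            "sends_to": {"CPY": 0.6, "BDE": 0.4}},
}

def to_queueing_model(spec):
    """Stimuli -> arrival rates (rule 2), actions -> service rates (rule 3),
    connections -> transition probabilities p_ij (rule 7)."""
    lam = {a: s["stimulus_rate"] for a, s in spec.items()}
    mu = {a: 1.0 / s["action_time"] for a, s in spec.items()}
    p = {(a, b): q for a, s in spec.items()
         for b, q in s["sends_to"].items()}
    return lam, mu, p

lam, mu, p = to_queueing_model(techspec)
print(mu["BN"], p[("BN", "BDE")])  # -> 4.0 0.4
```

Automating exactly this kind of translation is what would allow the queueing model to be regenerated in real time as the TechSpecs change.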

4. Queueing Network Models (QNMs)<br />

A complex logistics system such as the CPE society has numerous interactions. Yet, if the functionalities are abstracted to capture application level specifics in terms of queueing model elements (for example, as shown in Table 3), analytical predictions of the behavior of the MAS can be made. Analytical models are good candidates for enforcing adaptive control quickly and in real time. Each agent behaves like a server that processes jobs waiting in line; hence, the mapping between an agent and a server with a queue is easily established. Because of the task flow structure and the superior-subordinate relationships in the TechSpecs, queues can be connected in tandem, with jobs entering and exiting the system. This results in the formation of an open queueing network.

We conducted initial experiments using an actual Cougaar based MAS, an analytical formulation and an Arena simulation. We used this experiment to bootstrap our modeling process in terms of parameter estimation and calibration. However, working with the MAS was time-consuming, as our goal was to identify modeling alternatives and control ramifications. Hence we continued our experimentation with a scaled-up queueing model and simulation, using the insight gained from working with the actual society. Thus the open queueing network's parameters were carefully chosen and tasks sub-divided into multiple classes to denote particular tasks within the MAS. The TechSpecs clearly delineate the input and output tasks, facilitating the

60


Table 3. Queuing Model Abstraction from<br />

TechSpecs for CPY Agent<br />

Figure 4. Task Flow in the MAS<br />

mapping to arrivals <strong>and</strong> services in a queueing network. Application<br />

level QoS measures of the MAS are calcuated in<br />

terms of the waiting times (or other equivalent perfromance<br />

measures) at the individual nodes of the QNM.<br />

Figure 4 is a representation of the CPE society from a queueing perspective. We show two types of tasks flowing in the network, namely the plan (denoting maneuver and sustainment) and the update tasks. These tasks can be divided further into three classes of traffic. The first class refers to update packets entering at the CPY nodes and proceeding further as updates to BDE through BN. Class 2 relates to those update packets that are converted to plan tasks. There is class-switching at nodes 2 and 3, and we introduce approximations to deal with this later in the paper. The third class relates to the maneuver plan tasks that reach the SUPP nodes through CPY. Although we know multiple task types exist in the MAS, by making the simplifying assumption of treating all job classes alike we analyze the MAS using Jackson networks [5] in Section 4.1. We further analyze the system taking into account multiple classes of traffic, as discussed in Section 4.2. We compare the two analytical approaches with a simulation model.

4.1 Jackson Network Model

We apply a single-class Jackson network [5] formulation for open queueing networks to our example by choosing a weighted-average service time for nodes with multiple classes. The nine agents of the MAS considered here can then be assumed to be M/M/1 systems. The arrival rates of the open network can be computed by solving the traffic equations. Assuming the load is balanced to start with, the routing probabilities are also known. If each node of the system is ergodic, we can calculate the steady-state probabilities and performance measures of the entire network by computing these measures for every agent exactly as in an M/M/1 system.

We consider a simple example. For this queueing model, we assume all tasks are of a single type and do not distinguish between classes, as shown in Figure 4. Let $\lambda_{0i}$ and $\lambda_{i0}$ be the rates of arrival into and exit from the $i$th node, respectively. Since the routing probabilities are known, we can calculate the arrival rates $\lambda_i$ of each of the nodes of the open network by solving the following traffic equations:

$$\lambda_i = \lambda_{0i} + \sum_{j=1}^{9} \lambda_j p_{ji}, \qquad i = 1, \ldots, 9.$$

The routing probabilities ($p_{ji}$: probability of routing from node $i$ (column index) to node $j$ (row index)) for the balanced case are as follows:

0 1/5 1/5 0 0 0 0 0 0
0 0 0 1/4 1/4 1/4 1/4 0 0
0 0 0 1/4 1/4 1/4 1/4 0 0
0 1/5 1/5 0 0 0 0 0 0
0 1/5 1/5 0 0 0 0 0 0
0 1/5 1/5 0 0 0 0 0 0
0 1/5 1/5 0 0 0 0 0 0
0 0 0 1/4 1/4 1/4 1/4 0 0
0 0 0 1/4 1/4 1/4 1/4 0 0

Note that a customer exits from node $i$ with probability $1 - \sum_j p_{ji}$. Once the arrival rates are known, we can calculate the average waiting times at the nodes using the following formula:

$$W_i = \frac{1/\mu_i}{1 - (\lambda_i/\mu_i)}, \qquad i = 1, \ldots, 9.$$

The QoS metrics, namely maneuver plan freshness (MPF) and sustainment plan freshness (SPF), are calculated in terms of the average waiting times of the nodes at each level ($W_{CPY}$, $W_{BN}$, $W_{SUPP}$) as follows:

$$MPF = 2W_{CPY} + W_{BN},$$
$$SPF = 2W_{CPY} + W_{BN} + W_{SUPP}.$$

If the load is not balanced and the waiting times differ across branches, the QoS measures are calculated accordingly. It can be observed that two methods of control are immediately apparent: 1) adjust the $\mu_i$ so that we can process faster where possible; 2) alter the transition probabilities $p_{ji}$ to divert traffic to nodes that are less loaded. Although we allude to some control methods, these are outside the scope of this paper.
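As an illustration of the computation above, the following sketch solves the traffic equations and the M/M/1 waiting-time formula for the balanced routing matrix, then evaluates MPF and SPF. The node numbering (1 = BDE, 2-3 = BN, 4-7 = CPY, 8-9 = SUPP) is read off the matrix; the external arrival rate of 2 per CPY node and the uniform service rate of 10 are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

# Balanced routing matrix from the text: P[i, j] is the probability that
# a task leaving node j+1 is routed to node i+1 (0-based arrays; nodes
# 1 = BDE, 2-3 = BN, 4-7 = CPY, 8-9 = SUPP).
P = np.zeros((9, 9))
P[0, 1] = P[0, 2] = 1/5          # BN -> BDE (updates)
P[1, 3:7] = P[2, 3:7] = 1/4      # CPY -> BN
P[3:7, 1] = P[3:7, 2] = 1/5      # BN -> CPY (plans)
P[7, 3:7] = P[8, 3:7] = 1/4      # CPY -> SUPP

lam0 = np.zeros(9)
lam0[3:7] = 2.0                  # assumed external arrivals at the CPY nodes
mu = np.full(9, 10.0)            # assumed uniform service rates

# Traffic equations lambda = lambda0 + P @ lambda  =>  (I - P) lambda = lambda0.
lam = np.linalg.solve(np.eye(9) - P, lam0)

# M/M/1 average time in system at each ergodic node.
assert np.all(lam < mu), "every node must be ergodic (lambda_i < mu_i)"
W = (1 / mu) / (1 - lam / mu)

# QoS metrics; the load is balanced, so one node per level is representative.
W_CPY, W_BN, W_SUPP = W[3], W[1], W[7]
MPF = 2 * W_CPY + W_BN
SPF = 2 * W_CPY + W_BN + W_SUPP
```

Because the load is balanced, any CPY, BN or SUPP node is representative when forming MPF and SPF; in the unbalanced case the per-branch waiting times would be combined instead.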

4.2 BCMP Network Model

We apply the Baskett, Chandy, Muntz and Palacios (BCMP) algorithm [5], with a small modification, to the above example. The network considered here consists of nine nodes and three classes of traffic. The first class corresponds to the stream that enters the CPY nodes and is sent to BDE through BN as updates. The second class corresponds to the tasks that enter the CPY nodes and are sent to the BN nodes for planning. The second class is converted to a plan and fed back to the CPY nodes. As class-switching occurs here, we make a first-order approximation and feed this as an independent class back at the CPY nodes as tasks of the third class. Since most tasks are of the update type, it makes sense to serve the latest update first, and hence we follow the LCFS-PR (last come, first served with preemptive resume) scheme wherever there are multiple classes. This allows us to assume the service rates to be exponential. Since all tasks arrive from the environment, we assume the arrival process to be a Poisson process.

If $\lambda_{ir}$ is the arrival rate of the $r$th class at the $i$th node, $\lambda_{0,ir}$ is the external arrival rate of the $r$th class at the $i$th node, and $p_{js,ir}$ is the probability that a task of class $s$ at the $j$th node is transferred to a task of class $r$ at the $i$th node, then the arrival rates for each class at the individual nodes can be calculated using the following traffic equations:

$$\lambda_{ir} = \lambda_{0,ir} + \sum_{j=1}^{9} \sum_{s=1}^{3} \lambda_{js}\, p_{js,ir}, \qquad i = 1, \ldots, 9.$$

The routing probabilities ($p_{ji}$: probability of routing from node $i$ (column index) to node $j$ (row index)) for the class 1 tasks (the portion of update tasks that go to BDE) are as follows:

0 1 1 0 0 0 0 0 0
0 0 0 1/2 1/2 1/2 1/2 0 0
0 0 0 1/2 1/2 1/2 1/2 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

The routing probabilities ($p_{ji}$: probability of routing from node $i$ (column index) to node $j$ (row index)) for the class 2 tasks (the portion of update tasks that leave at node 2 or 3) are as follows:

0 0 0 0 0 0 0 0 0
0 0 0 1/2 1/2 1/2 1/2 0 0
0 0 0 1/2 1/2 1/2 1/2 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

The routing probabilities ($p_{ji}$: probability of routing from node $i$ (column index) to node $j$ (row index)) for the class 3 tasks (the portion of tasks that enter node 4, 5, 6 or 7 and proceed to node 8 or 9) are as follows:

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 1/2 1/2 1/2 1/2 0 0
0 0 0 1/2 1/2 1/2 1/2 0 0
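A sketch of how the multi-class traffic equations can be solved as one linear system over the 27 (node, class) pairs. The class-1, class-2 and class-3 routing entries follow the matrices above; the class-2-to-class-3 switch probabilities at the BN nodes and the split of external arrivals between classes 1 and 2 are assumptions standing in for the paper's first-order feedback approximation.

```python
import numpy as np

n_nodes, n_classes = 9, 3

def idx(node, cls):
    # 1-based (node, class) pair -> flat index into the stacked system.
    return (node - 1) * n_classes + (cls - 1)

P = np.zeros((n_nodes * n_classes, n_nodes * n_classes))
for bn in (2, 3):
    P[idx(1, 1), idx(bn, 1)] = 1.0        # class 1: BN -> BDE
    for cpy in (4, 5, 6, 7):
        P[idx(bn, 1), idx(cpy, 1)] = 1/2  # class 1: CPY -> BN
        P[idx(bn, 2), idx(cpy, 2)] = 1/2  # class 2: CPY -> BN
        # assumed class switch p_{js,ir}: class 2 output of BN returns to
        # the CPY nodes as class 3 plan tasks, split evenly
        P[idx(cpy, 3), idx(bn, 2)] = 1/4
for supp in (8, 9):
    for cpy in (4, 5, 6, 7):
        P[idx(supp, 3), idx(cpy, 3)] = 1/2  # class 3: CPY -> SUPP

lam0 = np.zeros(n_nodes * n_classes)
for cpy in (4, 5, 6, 7):
    lam0[idx(cpy, 1)] = 1.5  # assumed external update arrivals
    lam0[idx(cpy, 2)] = 0.5  # assumed share later converted to plan tasks

# Stacked traffic equations: lambda = lambda0 + P @ lambda.
lam = np.linalg.solve(np.eye(n_nodes * n_classes) - P, lam0)
```

Solving the stacked system at once avoids iterating the class-switching feedback by hand; the per-(node, class) rates can then be fed directly into the waiting-time formula of the next subsection.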

Once the arrival rates for the different classes at all nodes are known, the waiting time ($W_{ir}$ or $W_{i,r}$) at node $i$ for class $r$ is calculated as follows:

$$W_{ir} = \frac{\lambda_{ir}/\mu_{ir}}{\left(1 - \sum_{r=1}^{3} \lambda_{ir}/\mu_{ir}\right)\mu_{ir}}.$$

The application-level QoS measures were calculated in terms of the node-level average waiting times of the different classes of the BCMP network as follows:

$$MPF = W_{CPY,2} + W_{BN,2} + W_{CPY,3},$$
$$SPF = W_{CPY,2} + W_{BN,2} + W_{CPY,3} + W_{SUPP,3}.$$
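The per-node, per-class waiting-time formula above can be evaluated directly once the class arrival rates are known; in the sketch below, the per-class arrival and service rates at a CPY node are illustrative assumptions, not measured values.

```python
import numpy as np

def bcmp_waiting_times(lam, mu):
    """Per-class waiting times at one node, following the formula in the
    text: W_ir = (lam_ir/mu_ir) / ((1 - sum_r lam_ir/mu_ir) * mu_ir)."""
    lam, mu = np.asarray(lam), np.asarray(mu)
    rho = lam / mu                       # per-class utilizations
    assert rho.sum() < 1, "node must be stable across all classes"
    return rho / ((1 - rho.sum()) * mu)

# Illustrative per-class rates at one CPY node (assumed): class 1 updates,
# class 2 tasks headed for planning, class 3 returned plan tasks.
lam_cpy = np.array([1.0, 0.5, 0.5])
mu_cpy = np.array([10.0, 4.0, 4.0])
W_cpy = bcmp_waiting_times(lam_cpy, mu_cpy)
```

Repeating the call with the rates at the BN and SUPP nodes yields the $W_{BN,2}$ and $W_{SUPP,3}$ terms needed to assemble MPF and SPF.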



Figure 5. Maneuver Plan Freshness using Jackson Network

Figure 6. Maneuver Plan Freshness using BCMP Network

4.3 Discussion

We assume that the load is initially balanced. In the unbalanced case, waiting times for the different branches can be calculated separately.

We studied the impact of changing the processing rates at the nodes to illustrate the benefit of deriving an online queueing model that could form an integral part of a controller. Three methods were followed: 1) the Jackson network model, 2) the BCMP network model, and 3) a discrete-event simulation model in Arena [1]. We compute the maneuver plan and sustainment plan freshness from the average waiting times of the individual nodes. We assume the processing rate for class 1 tasks, $\mu_{update\_tasks} = 10$ Mb/s, at all the nodes. We assume that the overall arrival rate from the environment follows a Poisson process with $\lambda = 2$ Mb/s. We vary the processing rates for the class 2 tasks at BN and CPY and observe the impact on maneuver plan freshness, as shown in Figure 5 and Figure 6. The low value of the processing rate at the BN agent for class 2 tasks is in line with reality, wherein the BN agent implements a search procedure that is more time-consuming than processing class 1 tasks, which are updates meant for superiors in the chain of command. We found that the Jackson network matched the simulation results reasonably well. The multi-class BCMP method performed better than the Jackson network because it was able to capture more of the MAS's characteristics using different classes of traffic. This can be observed by comparing Figure 5 and Figure 6 with Figure 7.

We consider only two parameters (the processing rates for class 2 tasks at BN and CPY) for variation and nine experiments for each method. We do this to keep the calculations simple. It can be observed from Figure 5 and Figure 6 that adjusting the processing rate at BN impacts the QoS significantly, as opposed to altering the processing rates at CPY. Hence, to increase performance, the controller may have to adjust the application-level knobs to provide a greater processing rate for the planning tasks. Similarly, other trends can be observed by adjusting other parameters.
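The sensitivity experiment described here can be mimicked in a few lines: hold the arrival rates fixed, sweep an assumed class-2 processing rate at BN, and recompute MPF from the M/M/1 waiting times. The rates below are illustrative assumptions, not the values behind Figure 5.

```python
# Sweep an assumed BN processing rate and recompute maneuver plan
# freshness MPF = 2*W_CPY + W_BN with fixed arrival rates (arrival
# values taken from the balanced single-class example; all illustrative).
lam_cpy = lam_bn = 10 / 3
mu_cpy = 10.0

def w_mm1(lam, mu):
    # Mean M/M/1 time in system W = (1/mu) / (1 - lam/mu), valid for lam < mu.
    assert lam < mu, "node must be stable"
    return (1 / mu) / (1 - lam / mu)

mpfs = {}
for mu_bn in (4.0, 6.0, 8.0):
    mpfs[mu_bn] = 2 * w_mm1(lam_cpy, mu_cpy) + w_mm1(lam_bn, mu_bn)

# Speeding up planning at BN shortens MPF markedly, echoing the trend
# observed in the text.
for mu_bn, mpf in mpfs.items():
    print(f"mu_BN = {mu_bn:4.1f}  ->  MPF = {mpf:.3f}")
```

Sweeping the CPY rate instead produces much flatter curves, which is the asymmetry the discussion attributes to the BN planning bottleneck.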

With these models, we believe it is possible to identify unstable regions and steer the MAS towards regions providing better QoS. The running time of these models in Matlab is less than one second per iteration. If embedded within the system, several alternate and feasible system configurations can be simulated to identify candidate choices for performance improvement.

5. Results and Future Directions

The hierarchy within the MAS, the specification of static attributes and the similarity between a distributed MAS-based planning procedure and a queueing network with multiple classes facilitate the performance modeling of the MAS using QNMs. TechSpecs are a structured method to encapsulate static data and distribute them, because agent-based planning applications are inherently distributed. From TechSpecs, queueing models (offline and online) can be developed for a cluster of nodes. The QNM then serves as a performance analysis tool for that cluster of nodes.



The main contribution of this work is the identification of TechSpecs as a template that can guide MAS design and model development in a concurrent fashion. We have codified the static attributes of the MAS in such a way that QNMs may be constituted from distributed information, especially in real time. This technique for adaptivity, using a model on demand to predict trends in QoS, may be helpful in building survivable systems.

Currently, work is ongoing to identify an appropriate method of representation of TechSpecs that would have some reasoning and deduction capabilities, such as OWL [4]. A module that could automatically convert this representation of TechSpecs into queueing models would be useful in this endeavor. An approach that would identify alternate choices for performance improvement is also necessary. Finally, a controller that actually uses the analysis from the QNMs to optimize the global utility is also being pursued.

Figure 7. Maneuver Plan Freshness using Simulation

Acknowledgements

The work described here was performed under DARPA UltraLog Grant # MDA972-1-1-0038. The authors wish to acknowledge DARPA for their generous support.

References

[1] Arena. www.arenasimulation.com. Rockwell Automation.
[2] Cougaar open source site. http://www.cougaar.org. DARPA.
[3] UltraLog program site. http://www.ultralog.net. DARPA.
[4] Web-Ontology (WebOnt) Working Group. http://www.w3.org/2001/sw/WebOnt/.
[5] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley and Sons, Inc., 1998.
[6] M. Brinn and M. Greaves. Leveraging agent properties to assure survivability of distributed multi-agent systems. Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems (Poster Session), 2003.
[7] A. Cassandra, D. Wells, M. Nodine, and P. Pazandak. TechSpecs: Content, issues and nomenclature. Technical Report, Telcordia Inc. and OBJS Inc., 2003.



Supply Chain Network: A Complex Adaptive Systems Perspective

AMIT SURANA†, SOUNDAR KUMARA‡*, MARK GREAVES**, USHA NANDINI RAGHAVAN

In this era, information technology is revolutionizing almost every domain of technology and society while, at a silent pace, the "complexity revolution" is occurring in science. In this paper we look at the impact of the two in the context of supply chain networks. With the advent of information technology, supply chains have acquired a complexity almost equivalent to that of biological systems. However, one of the major challenges we face in supply chain management is the deployment of coordination strategies that lead to adaptive, flexible and coherent collective behavior in supply chains. The main hurdle has been the lack of principles governing how supply chains with complex organizational structure and function arise and develop, and what organization and functionality are attainable given specific kinds of lower-level constituent entities. The study of Complex Adaptive Systems (CAS) has been a research effort attempting to find common characteristics and/or formal distinctions among complex systems arising in diverse domains (such as biology, social systems, ecology and technology) that might lead to a better understanding of how complexity occurs, whether it follows any general scientific laws of nature, and how it might be related to simplicity. In this paper we argue that supply chains should be treated as CAS. With this recognition, we propose how various concepts, tools and techniques used in the study of CAS can be exploited to characterize and model supply chain networks. These tools and techniques are based on the fields of nonlinear dynamics, statistical physics and information theory.

1. Introduction

A supply chain is a complex network with an overwhelming number of interactions and interdependencies among different entities, processes and resources. The network is highly nonlinear, shows complex multi-scale behavior, has structure spanning several scales, and evolves and self-organizes through a complex interplay of its structure and function. This sheer complexity of supply chain networks, with the inevitable lack of predictability, makes them difficult to manage and control. Furthermore, changing organizational and market trends require supply chains to be highly dynamic, scalable, reconfigurable, agile and adaptive: the network should sense and respond effectively and efficiently to satisfy customer demand. Supply chain management necessitates that decisions made by business entities take more global factors into consideration. The successful integration of the entire supply chain process now depends heavily on the availability of accurate and timely information that can be shared by all members of the supply chain. Information technology, with its capability of setting up dynamic information exchange networks, has been a key enabling factor in shaping supply chains to meet such requirements. A major obstacle remains, however, in the deployment of coordination and decision technologies to achieve complex, adaptive, and flexible collective behavior in the network. This is due to our lack of understanding of the organizational, functional and evolutionary aspects of supply chains. A key realization in tackling this problem is that supply chain networks should not just be treated as a "system", but as a "Complex Adaptive System" (CAS). The study of CAS augments systems theory and provides a rich set of tools and techniques to model and analyze the complexity arising in systems across science and technology. In this paper we take this perspective in dealing with supply chains and show how various advances in the realm of CAS provide novel and effective ways to characterize, understand and manage their emergent dynamics.

A similar viewpoint has been emphasized in (Choi et al. 2001). The focus of Choi et al. was to demonstrate how supply networks should be managed once we recognize them as CAS. The concept of CAS allows one to understand how supply networks, as living systems, co-evolve with the rugged and dynamic environment in which they exist, and to identify patterns that arise in such an evolution. The authors conjecture various propositions stating how the patterns of behavior of individual agents in a supply network can be related to the emergent dynamics of the network. One of the important deductions made is that when managing supply networks, managers must appropriately balance how much to control and how much to let emerge. However, no concrete framework has been suggested under which such conjectures can be verified and generalized. The aim of this paper is to show how the theoretical advances made in the realm of CAS can be used to study such issues systematically and formally in the context of supply chain networks.

† Department of Mechanical Engineering, The Massachusetts Institute of Technology, Cambridge, MA 02139, email: surana@mit.edu
‡ The Harold and Inge Marcus Department of Industrial & Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, email: skumara@psu.edu. * Corresponding Author
** IXO, DARPA, 3701 North Fairfax Drive, Arlington, VA 22203-1714, email: mgreaves@darpa.mil

This paper is divided into eight sections, including the introduction. In Section 2, we give a brief introduction to complex adaptive systems, discussing the architecture and characteristics of complex systems in diverse areas encompassing biology, social systems, ecology and technology. In Section 3 we discuss the characteristics of supply chain networks and argue that they should be understood in terms of CAS. We also present some emerging trends in supply chains and, in the light of these trends, the increasingly critical role of information technology in supply chain management. In Section 4 we give a brief overview of the main techniques that have been used for the modeling and analysis of supply chains, and then discuss how the science of complexity provides a genuine extension and reformulation of these approaches. As with any CAS, the study of supply chains should involve a proper balance of simulation and theory. System dynamics based and, more recently, agent based simulation models (inspired by complexity theory) have been used extensively to make theoretical investigations of supply chains feasible and to support decision-making in real-world supply chains. The system dynamics approach often leads to models of supply chains that can be described in the form of a dynamical system. Dynamical systems theory provides a powerful framework for rigorous analysis of such models and thus can be used to supplement the system dynamics simulation approach. We illustrate this in Section 5, using some nonlinear models that consider the effect of priority, heterogeneity, feedback, delays and resource sharing on the performance of a supply chain. Furthermore, the large volumes of data generated from simulations can be used to understand and comprehend the emergent dynamics of supply chains. Even though an exact understanding of the dynamics is difficult in complex systems, archetypal behavior patterns can often be recognized using techniques from complexity theory such as Nonlinear Time Series Analysis and Computational Mechanics, which are discussed in Section 6. The benefits of integrated supply chain concepts are widely recognized, but the analytical tools that can exploit those benefits are scarce. In order to study supply chains as a whole, it is critical to understand the interplay of the organizational structure and functioning of supply chains. Network dynamics, an extension of nonlinear dynamics to networks, provides a systematic framework to deal with such issues and is discussed in Section 7. We conclude in Section 8 with recommendations for future research.

2. Complex Adaptive Systems

Many natural systems, and increasingly many artificial (man-made) systems as well, are characterized by apparently complex behaviors that arise as the result of nonlinear spatiotemporal interactions among a large number of components or subsystems. We use the terms agent and node interchangeably to refer to these components or subsystems. Examples of such natural systems include immune systems, nervous systems, multi-cellular organisms, ecologies, insect societies and social organizations. However, such systems are not confined to biology and society. Engineering theories of controls, communications and computing have matured in recent decades, facilitating the creation of various large-scale systems, which have turned out to possess bewildering complexity, almost equivalent to that of biological systems. Systems sharing this property include parallel and distributed computing systems, communication networks, artificial neural networks, evolutionary algorithms, large-scale software systems, and economies. Such systems have commonly been referred to as Complex Systems (Baranger, Flake 1998, Adami 1998, Bar-Yam 1997). However, at the present time, the notion of a complex system is not precisely delineated.

The most remarkable phenomenon exhibited by complex systems is the emergence of highly structured collective behavior over time from the interaction of simple subsystems without any centralized control. Their typical characteristics include: dynamics involving interrelated spatial and temporal effects; correlations over long length and time scales; strongly coupled degrees of freedom; non-interchangeable system elements; existence in quasi-equilibrium; and a combination of regularity and randomness (i.e., an interplay of chaos and non-chaos). Such systems have structures spanning several scales and show emergent behavior. Emergence is generally understood to be a process that leads to the appearance of structure not directly described by the defining constraints and instantaneous forces that control a system. The combination of structure and emergence leads to self-organization, which is what happens when an emerging behavior has the effect of changing the structure or creating a new structure. Complex Adaptive Systems are a special category of complex systems, introduced to accommodate living beings. As the name suggests, they are capable of changing themselves to adapt to a changing environment. In this regard many artificial systems, like those stated earlier, can be considered CAS, due to their capability of evolving. The coexistence of competition and cooperation is another dichotomy exhibited by CAS.

A CAS can be considered a network of dynamical elements in which the states of both the nodes and the edges can change, and the topology of the network itself often evolves in time in a nonlinear and heterogeneous fashion. A dynamical system can be considered as simply behaving: "obeying the laws of physics". From another perspective, it can be viewed as processing information: how systems get information, how they incorporate that information into models of their surroundings, and how they make decisions on the basis of these models determines how they behave (Lloyd and Slotine 1996). This leads to one of the more heuristic definitions of a complex system: one that "stores, processes and transmits information" (Sawhill 1995). From a thermodynamic viewpoint, such systems have an unknown total energy (or its analogue), yet something is known about the internal state structure. In these large open systems (which do not possess well-defined boundaries), energy enters at low entropy and is dissipated. Open systems organize largely due to the reduction in the number of active degrees of freedom caused by dissipation. Not all behaviors or spatial configurations can be supported. The result is a limitation of the collective modes, cooperative behaviors, and coherent structures that an open system can express. A central goal of the sciences of complex systems is to understand the laws and mechanisms by which complicated, coherent global behavior can emerge from the collective activities of relatively simple, locally interacting components.

Complexity arises in natural system thorough evolution, while design plays an analogous role<br />

for the complex engineering systems. Convergent evolution/design leads to remarkable<br />

similarities at higher level of organization, though at the molecular or device level natural <strong>and</strong><br />

man-made systems differ significantly. Complexity in both cases is driven far more by the need<br />

for robustness to uncertainty in the environment <strong>and</strong> component parts than by basic functionality.<br />

Through design/evolution, such systems develop highly structured, elaborate internal<br />

configurations, with layers of feedback <strong>and</strong> signaling. It is the protocols that organize highly<br />

structured <strong>and</strong> complex modular hierarchies to achieve robustness, but also create fragilities to


are or ignored perturbations. The evolution of protocols can lead to a<br />

robustness/complexity/fragility spiral where complexity added for robustness also adds new<br />

fragilities, which in turn leads to new <strong>and</strong> thus spiraling complexities (Csete <strong>and</strong> Doyle 2002).<br />

However all this complexity remains largely hidden in normal operation becoming conspicuous<br />

acutely when contributing to rare cascading failures or chronically through fragility/complexity<br />

evolutionary spirals. Highly Optimized Tolerance (HOT) (Carlson <strong>and</strong> Doyle 1999) has been<br />

introduced recently to focus on the "robust, yet fragile" nature of complexity. It is also becoming<br />

increasingly clear that robustness <strong>and</strong> complexity in biology, ecology, technology, <strong>and</strong> social<br />

systems are so intertwined that they must be treated in a unified way. Given the diversity of<br />

systems falling into this broad class, the discovery of any commonalities or “universal” laws<br />

underlying such systems requires a very general theoretical framework.<br />

The scientific study of CAS has been attempting to find common characteristics <strong>and</strong>/or formal<br />

distinctions among complex systems that might lead to a better understanding of how complexity<br />

develops, whether it follows any general scientific laws of nature, <strong>and</strong> how it might be related to<br />

simplicity. The attractiveness of the methods developed in this research effort for general-purpose<br />

modeling, design and analysis lies in their ability to produce complex emergent<br />

phenomena out of a small set of relatively simple rules, constraints <strong>and</strong> the relationships couched<br />

in either quantitative or qualitative terms. We believe that the tools and techniques developed in<br />

the study of CAS offer a rich potential for the design, modeling and analysis of large-scale systems<br />

in general <strong>and</strong> supply chains in particular.<br />

3. Supply Chain Networks as Complex Adaptive Systems<br />

A supply chain network is a system in which information, products and finances flow between<br />

various suppliers, manufacturers, distributors, retailers <strong>and</strong> customers. A supply chain is<br />

characterized by a forward flow of goods <strong>and</strong> a backward flow of information. Typically a supply<br />

chain is comprised of two main business processes: material management <strong>and</strong> physical<br />

distribution (Min and Zhou 2002). Material management supports the complete cycle of<br />

material flow from the purchase <strong>and</strong> internal control of production material to the planning <strong>and</strong><br />

control of work-in-process, to the warehousing, shipping, <strong>and</strong> distribution of finished products.<br />

On the other h<strong>and</strong>, physical distribution encompasses all the outbound logistics activities related<br />

to providing customer services. Combining the activities of material management <strong>and</strong> physical<br />

distribution, a supply chain does not merely represent a linear chain of one-on-one business<br />

relationships, but a web of multiple business networks <strong>and</strong> relationships.<br />

A supply chain network is an emergent phenomenon. From the view of each individual entity, the<br />

supply chain is self-organizing. Although the totality may be unknown, individual entities partake<br />

in the grand establishment of the network by engaging in localized decision-making, i.e., in<br />

doing their best to select capable suppliers <strong>and</strong> ensure on-time delivery of products to their<br />

buyers. The network is characterized by nonlinear interactions <strong>and</strong> strong interdependencies<br />

between the entities. In most circumstances, order and control in the network are emergent, as<br />

opposed to predetermined. Control is generated through nonlinear though simple behavioral rules<br />

that operate based on local information. We argue that a supply chain network forms a complex<br />

adaptive system:<br />

• Structures spanning several scales: The supply chain network is a bi-level hierarchical<br />

<strong>and</strong> heterogeneous network where at the higher level each node represents an individual<br />

supplier, manufacturer, distributor, retailer or customer. However, at the lower level the<br />

nodes represent the physical entities that exist inside each node in the upper level. The<br />

heterogeneity of most networks is a function of various technologies being provided by<br />

whatever vendor could supply them at the time their need was recognized.<br />

• Strongly coupled degrees of freedom <strong>and</strong> correlations over long length <strong>and</strong> time<br />

scales: Different entities in a supply chain typically operate autonomously with different<br />

objectives and subject to different sets of constraints. However, when it comes to<br />

improving due-date performance, increasing quality or reducing costs, they become highly<br />

inter-dependent. It is the flow of material, resources, information <strong>and</strong> finances that<br />

provides the binding force. The welfare of any entity in the system directly depends on<br />

the performance of the others <strong>and</strong> their willingness <strong>and</strong> ability to coordinate. This leads to<br />

correlations between entities over long length <strong>and</strong> time scales.<br />

Figure 1. Supply Chain Network<br />

• Coexistence of Competition <strong>and</strong> Cooperation: The entities in a supply chain often have<br />

conflicting objectives. Competition abounds in the form of sharing <strong>and</strong> contention of<br />

resources. Global control over nodes is an exception rather than a rule; more likely is a<br />

localized cooperation out of which a global order emerges, which is itself unpredictable.<br />

• Nonlinear dynamics involving interrelated spatial <strong>and</strong> temporal effects: Supply<br />

chains have wide geographic distribution. Customers can initiate transactions at any time<br />

with little or no regard for existing load, thus contributing to a dynamic <strong>and</strong> noisy<br />

network character. The characteristics of a network tend to drift as workloads <strong>and</strong><br />

configuration change, producing a non-stationary behavior. The coordination protocols<br />

attempt to arbitrate among entities with resource conflicts. Arbitration is not perfect,<br />

however; hence over- and under-corrections contribute to the nonlinear character of the<br />

network.<br />

• Quasi-Equilibrium and combination of regularity and randomness (i.e., interplay of<br />

chaos and non-chaos): The general tendency of a supply chain is to maintain a stable and<br />

prevalent configuration in response to external disturbances. However, they can undergo a<br />

radical structural change when they are stretched from equilibrium. At such a point a<br />

small event can trigger a cascade of changes that eventually can lead to system wide<br />

reconfiguration. In some situations unstable phenomena can arise, due to feedback<br />

structure, inherent adjustment delays and nonlinear decision-making processes that go on in<br />

the nodes. One of the causes of unstable phenomena is that the information feedback in<br />

the system is slow relative to the rate of changes that occur in the system. The first mode<br />

of unstable behavior to arise in nonlinear systems is usually a simple one-cycle self-sustained<br />

oscillation. If the instability drives the system further into the nonlinear<br />

regime, more complicated temporal behavior may be generated. The route to chaos<br />

through subsequent period-doubling bifurcations, as certain parameters of the system are<br />

varied, is generic to a large class of systems in physics, chemistry, biology, economics and<br />

other fields. Functioning in a chaotic regime precludes long-term predictions<br />

about the behavior of the system, while short-term predictions may sometimes be<br />

possible. As a result, control and stabilization of such a system become very difficult.<br />

• Emergent behavior <strong>and</strong> Self-Organization: With the individual entities obeying a<br />

deterministic selection process, the organization of the overall supply chain emerges<br />

through a natural process of order and spontaneity. This emergence of highly structured<br />

collective behavior over time, from the interaction of simple entities, leads to the<br />

fulfillment of customer orders. Demand amplification and inventory swings are other,<br />

undesirable emergent phenomena that can also arise. For instance, the decisions and<br />

delays downstream in a supply chain often lead to amplified, undesirable effects<br />

upstream, a phenomenon commonly known as the "bullwhip" effect.<br />

• Adaptation and Evolution: A supply chain both reacts to and creates its environment.<br />

Generally speaking a supply chain interacts with almost every other conceivable network.<br />

Operationally, the environment depends on the chosen scale of analysis; e.g., it can be<br />

taken as the customer market. Typically, significant dynamism exists in the environment<br />

which necessitates constant adaptation of the supply network. However, the<br />

environment is highly rugged, making co-evolution difficult. The individual entities<br />

constantly observe what emerges from a supply network <strong>and</strong> make adjustments to<br />

organizational goals <strong>and</strong> supporting infrastructure. Another common way of adaptation is<br />

through altering boundaries of the network. The boundaries can change as a result of<br />

including or excluding a particular entity and by adding or eliminating connections among<br />

entities, thereby changing the underlying pattern of interaction. As we discuss next,<br />

supply chain management plays a critical role in making the network evolve in a<br />

coherent manner.<br />
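The "bullwhip" amplification described above can be sketched with a toy multi-echelon model. This is a hedged illustration with an assumed ordering rule (trend extrapolation) and assumed parameters, not a model taken from this report; it shows how a simple local rule amplifies order variance stage by stage upstream.

```python
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def simulate_bullwhip(stages=4, periods=500, theta=0.5, seed=7):
    """Toy multi-echelon chain: each stage extrapolates the trend in the
    orders it receives (order_t = d_t + theta*(d_t - d_{t-1})), a simple
    local decision rule that amplifies order variance at every echelon.
    Returns the order variance at each echelon, customer end first."""
    random.seed(seed)
    # Customer demand: noisy around a mean of 10 units per period.
    demand = [10 + random.gauss(0, 1) for _ in range(periods)]
    series = [demand]
    for _ in range(stages):
        d = series[-1]
        placed = [d[0]] + [d[t] + theta * (d[t] - d[t - 1])
                           for t in range(1, len(d))]
        series.append(placed)
    return [variance(s) for s in series]

if __name__ == "__main__":
    # Variance grows monotonically toward the upstream stages.
    print([round(v, 2) for v in simulate_bullwhip()])
```

The rule order = demand + θ·(change in demand) gives each stage the filter (1+θ)d_t − θd_{t−1}, whose output variance exceeds its input variance for any θ > 0, so the amplification compounds across echelons.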

3.1 Supply Chain Management<br />

Supply chain management is defined as the integration of key business processes from end-users<br />

through original suppliers that provide products, services, and information and add value for<br />

customers and other stakeholders (Cooper et al. 1997). It involves balancing reliable customer<br />

delivery with manufacturing and inventory costs. It evolves around a customer-focused<br />

corporate vision, which drives changes throughout a firm’s internal <strong>and</strong> external linkages <strong>and</strong><br />

then captures the synergy of inter-functional, inter-organizational integration <strong>and</strong> coordination.<br />

Due to the inherent complexity, it is a challenge to coordinate the actions of entities across<br />

organizational boundaries so that they perform in a coherent manner.<br />

An important element in managing an SCN is to control the ripple effect of lead-time so that the<br />

variability in the supply chain can be minimized. Demand forecasting is used to estimate demand for<br />

each stage, <strong>and</strong> the inventory between stages for the network is used for protecting against<br />

fluctuations in supply <strong>and</strong> dem<strong>and</strong> across the network. Due to the decentralized control properties<br />

of the SCN, controlling the ripple effect requires coordination between entities in performing their<br />

tasks. The problem of coordination has taken on another dimension due to other trends in<br />

current supply chains.<br />

Two important organizational and market trends under way have been the<br />

atomization of markets as well as that of organizational entities (Balakrishnan et al. 1999). In<br />

such a scenario, the product realization process involves the customer continuously in all phases,<br />

from design to delivery. Customization is not only limited to selecting from pre-determined<br />

model variants; rather, product design, process plans, <strong>and</strong> even the supply chain configuration<br />

have to be tailored for each customer. The product realization organization has to be formed on


the fly, as a consortium of widely dispersed organizations catering to the needs of a single<br />

customer. Thus organizations consist of a series of opportunistic alliances among several focused<br />

organizational entities to address particular market opportunities. For manufacturing<br />

organizations to operate effectively in this environment of dynamic, virtual alliances, products<br />

must have modular architectures, processes must be well characterized <strong>and</strong> st<strong>and</strong>ardized,<br />

documentation must be digitized <strong>and</strong> widely accessible, <strong>and</strong> systems must be interoperable.<br />

Automation and intelligent information processing are vital for diagnosing problems during<br />

product realization and usage, coordinating design and production schedules, and searching for<br />

relevant information in multimedia databases. These trends exacerbate the challenges of<br />

coordination and collaboration as the number of product realization networks increases, and so<br />

does the number of partners in each network.<br />

Inventory is an unwise approach to dealing with rapidly changing market demand and short<br />

product life cycles. Information is an appropriate substitute for inventory. Information about the<br />

material lead-time from different suppliers can be used for planning the material arrival, instead<br />

of building up inventory. The demand information can be transmitted to the manufacturers on a<br />

timely basis, so that orders can be fulfilled with lower inventory costs. In fact, it is widely<br />

recognized that the successful integration of the entire supply chain process depends heavily on the<br />

availability of accurate <strong>and</strong> timely information that can be shared by all members of the supply<br />

chain. Supply chain management now increasingly relies on Information Technology as discussed<br />

below.<br />
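The substitution of information for inventory can be made concrete with the standard textbook order-up-to formula for i.i.d. normal per-period demand. This is a hedged, generic illustration (the formula is classical inventory theory, and the numbers are assumed, not taken from this report):

```python
from math import sqrt

def order_up_to_level(mean_demand, sd_demand, lead_time, z=1.65):
    """Classical order-up-to level: expected demand over the lead time
    plus a safety stock scaled by the service-level factor z
    (z = 1.65 corresponds to roughly a 95% cycle service level)."""
    return mean_demand * lead_time + z * sd_demand * sqrt(lead_time)

# Better demand information shrinks the forecast error (sd_demand),
# which shrinks only the safety-stock term, not the pipeline stock.
no_sharing = order_up_to_level(100, 20, 4)    # sd = 20 without data sharing
with_sharing = order_up_to_level(100, 10, 4)  # sd = 10 with shared demand data
print(no_sharing, with_sharing)
```

Here the halved forecast error (an assumed improvement from timely demand sharing) directly halves the safety stock that each node must carry against demand fluctuations.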

3.2 Information Technology in Supply Chain Management<br />

Information technology, with its capability of providing global reach and a wide range of<br />

connectivity, enterprise integration, micro-autonomy and intelligence, object- and network-oriented<br />

computing paradigms, and rich media support, has been a key enabler for the management<br />

of future manufacturing enterprises. It is vital for reducing collaboration and coordination<br />

costs and for permitting rapid setup of dynamic information exchange networks. Connectivity<br />

permits involvement of customers <strong>and</strong> other stakeholders in all aspects of manufacturing.<br />

Enterprise integration facilitates seamless interaction among global partners. Micro autonomy <strong>and</strong><br />

intelligence permit atomic tracking <strong>and</strong> remote control. New software paradigms enable<br />

distributed, intelligent <strong>and</strong> autonomous operations. Distributed computing facilitates quick<br />

localized decisions without losing vast data-gathering potential and powerful computing<br />

capabilities. Rich media support, which includes capabilities like digitization, visualization tools<br />

and virtual reality, facilitates collaboration and immersion.<br />

Many improvements have occurred in supply chain management because IT enables changes to<br />

be made in inventory management and production dynamically. It assists managers in coping<br />

with uncertainty and lead-time through improved collection and sharing of information<br />

between supply chain nodes. The success of an enterprise is now largely dependent on how its<br />

information resources are designed, operated and managed, especially with information<br />

technology emerging as a critical input to be leveraged for significant organizational productivity.<br />

However, the difficulty arises when trying to design an information system that can h<strong>and</strong>le the<br />

information needs of supply chain nodes to allow efficient, flexible <strong>and</strong> decentralized supply<br />

chain management. The main hurdle in efficiently using information technology is our lack of<br />

understanding of the organizational, functional and evolutionary principles of supply chains.<br />

Recognizing supply chains as CAS can, however, lead to novel and effective ways to<br />

understand their emergent dynamics. It has been found that many diverse-looking CAS<br />

share similar characteristics and problems and thus can be tackled through similar approaches.<br />

While at present networks are largely controlled by humans, the complexity, diversity and<br />

geographic distribution of the networks make it necessary that the networks maintain<br />

themselves in a sort of evolutionary sense, just as biological organisms do (Maxon 1990).<br />

Similarly, the problem of coordination, which is a challenge in supply chains, has been routinely


solved by biological systems for literally billions of years. We believe that the complexity,<br />

flexibility <strong>and</strong> adaptability in the collective behavior of the supply chains can be accomplished<br />

only by importing the mechanisms that govern these features in nature. Along with these robust<br />

design principles, we require equally sound techniques for modeling <strong>and</strong> analysis of supply<br />

chains. This forms the focus of this paper. We first give a brief overview of the main<br />

techniques that have been used for modeling <strong>and</strong> analysis of supply chains <strong>and</strong> then discuss how<br />

the science of complexity provides a genuine extension <strong>and</strong> reformulation of these approaches.<br />

4. Modeling <strong>and</strong> Analysis of Supply Chain Networks<br />

As pointed out, the key challenge in designing supply chain networks, or for that matter any<br />

large-scale system, is the difficulty of reverse engineering, i.e., determining what individual agent<br />

strategies lead to the desired collective behavior. Due to this difficulty in understanding the effect<br />

of individual characteristics on the collective behavior of the system, simulations have been the<br />

primary tools for designing and optimizing such systems. Simulation makes investigations<br />

possible and useful when real-world experimentation would be too costly or infeasible for<br />

ethical reasons, or where the decisions and their consequences are well separated in<br />

space <strong>and</strong> time. It seems at present that large-scale simulations of future complex processes may<br />

be the most logical, <strong>and</strong> perhaps, an important vehicle to study them objectively (Ghosh, 2002).<br />

Simulation in general helps one detect design errors in a cost-effective manner, prior to<br />

developing a prototype. Secondly, simulation of system operations may identify potential problems that<br />

might occur during actual operation. Thirdly, extensive simulation may potentially detect<br />

problems that are rare <strong>and</strong> otherwise elusive. Fourthly, hypothetical concepts that do not exist in<br />

nature, even those that defy natural laws, may be studied. The increased speed <strong>and</strong> precision of<br />

today’s computers promise the development of high fidelity models of physical <strong>and</strong> natural<br />

processes, ones that yield reasonably accurate results, quickly. This in turn would permit system<br />

architects to study the performance impact of wide variations of key parameters, quickly and in<br />

some cases, even in real time. Thus a qualitative improvement in system design may be achieved.<br />

In many cases, unexpected variations in external stress can be simulated quickly to yield<br />

appropriate system parameter values, which are then adopted into the system to enable it to<br />

successfully counteract the external stress.<br />

Mathematical analysis, on the other hand, has to play a critical role because it alone can enable<br />

us to formulate rigorous generalizations or principles. Neither physical nor computer-based<br />

experiments on their own can provide such generalizations. Physical experiments usually<br />

are limited to supplying inputs <strong>and</strong> constraints for rigorous models, because experiments<br />

themselves are rarely described in a language that permits deductive exploration. Computer-based<br />

experiments or simulations have rigorous descriptions, but they deal only in specifics. A well-designed<br />

mathematical model, on the other hand, generalizes the particulars revealed by<br />

physical experiments, computer based models <strong>and</strong> any interdisciplinary comparisons. Using<br />

mathematical analysis we can study the dynamics, predict long term behavior, gain insights into<br />

system design: e.g., what parameters determine group behavior, how individual agent<br />

characteristics affect the system, and whether the proposed agent strategy leads to the desired group<br />

behavior. In addition, mathematical analysis may be used to select parameters that optimize the<br />

system's collective behavior, prevent instabilities, etc.<br />

It seems that successful modeling efforts for large-scale systems like supply chain networks,<br />

large-scale software systems, communication networks, biological ecosystems, food webs, social<br />

organizations, etc. would require a solid empirical base. Pure abstract mathematical<br />

contemplation is unlikely to lead to useful models. The discipline of physics provides an<br />

appropriate parallel; advances in theoretical physics are more often than not inspired by<br />

experimental findings. The study of supply chain networks should therefore involve an amalgam<br />

of both simulation <strong>and</strong> analytical techniques.


Considering the broad spectrum of a supply chain, no model can capture all the aspects of<br />

supply chain processes. The modeling proceeds at three levels:<br />

• Competitive strategic analysis, which includes location-allocation decisions, demand<br />

planning, distribution channel planning, strategic alliances, new product development,<br />

outsourcing, IT selection, pricing, <strong>and</strong> network structuring.<br />

• Tactical problems like inventory control, production/distribution coordination, material<br />

handling, and layout design.<br />

• Operational-level problems, which include routing/scheduling, workforce scheduling<br />

<strong>and</strong> packaging.<br />

The models in supply chains can be categorized into four classes (Min <strong>and</strong> Zhou 2002):<br />

• Deterministic: single objective <strong>and</strong> multiple objective models.<br />

• Stochastic: optimal control theoretic <strong>and</strong> dynamic programming models.<br />

• Hybrid: with elements of both deterministic <strong>and</strong> stochastic models <strong>and</strong> includes inventory<br />

theoretic and simulation models.<br />

• IT driven: models that aim to integrate <strong>and</strong> coordinate various phases of supply chain<br />

planning on a real-time basis using application software like ERP.<br />

Mathematical programming techniques and simulation have been the two primary approaches for<br />

the analysis and study of supply chain models. Mathematical programming mainly takes<br />

into consideration the static aspects of a supply chain. Simulation, on the other hand, studies<br />

dynamics in supply chains and generally proceeds based on “system dynamics” and<br />

“agent-based” methodologies. System dynamics is a continuous simulation methodology that uses<br />

concepts from engineering feedback control to model <strong>and</strong> analyze dynamic socioeconomic<br />

systems (Forrester, 1961). The mathematical description is realized with the help of ordinary<br />

differential equations. An important advantage of system dynamics is the possibility of deducing the<br />

occurrence of a specific behavior mode because the structure that leads to the system dynamics is<br />

made transparent. We present some nonlinear models in Section 5 which are useful for<br />

underst<strong>and</strong>ing the complex interdependencies, effects of priority, nonlinearities, delays,<br />

uncertainties <strong>and</strong> competition/cooperation for resource sharing in supply chains. The drawback of<br />

system dynamics model is that the structure has to be determined before starting the simulation.<br />

Agent-based modeling (a technique from complexity theory), on the other hand, is a “bottom-up”<br />

approach that simulates the underlying processes believed responsible for the global pattern,<br />

<strong>and</strong> allows us to evaluate what mechanisms are most influential in producing that emergent<br />

pattern. In (Schieritz <strong>and</strong> Grobler, 2003) a hybrid modeling approach has been presented that<br />

intends to make the system dynamics approach more flexible by combining it with the discrete<br />

agent-based modeling approach. Such large-scale simulations with their many degrees of freedom<br />

raise serious technical problems about the design of experiments <strong>and</strong> the sequence in which they<br />

should be carried out in order to obtain the maximum relevant information. Furthermore, in order<br />

to analyze data from such large-scale simulations we require systematic analytical <strong>and</strong> statistical<br />

methods. In Section 8, we describe two such techniques: Nonlinear Time Series Analysis and<br />

Computational Mechanics.<br />

A useful paradigm for modeling a supply chain, taking into consideration the detailed pattern of<br />

interaction, is to view it as a network. A network is essentially anything that can be represented by<br />

a graph: a set of points (also generically called nodes or vertices), connected by links (edges, ties)<br />

representing some relationship. Networks are inherently difficult to underst<strong>and</strong> due to their<br />

structural complexity, evolving structure, connection diversity, dynamical complexity of nodes,<br />

node diversity <strong>and</strong> meta–complication where all these factors influence each other. Queuing<br />

theory has primarily been used to address the steady-state operation of a typical network. On the<br />

other h<strong>and</strong> techniques from mathematical programming have been used to solve the problem of<br />

resource allocation in networks. This is meaningful when dynamic transients can be disregarded.<br />

However, present-day supply chain networks are highly dynamic, reconfigurable, intrinsically<br />

non-linear and non-stationary. New tools and techniques are required for their analysis, such that<br />

the structure, function <strong>and</strong> growth of networks can be considered simultaneously. In this regard<br />

we discuss “Network Dynamics” in Section 9, which deals with such issues <strong>and</strong> can be used to<br />

study the structure of supply chains and its implications for their functionality. Understanding the<br />

behavior of large complex networks is the next logical step for the field of nonlinear dynamics,<br />

because they are so pervasive in the real world. We begin with a brief introduction to dynamical<br />

systems theory, in particular nonlinear dynamics, in the next section.<br />
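To make the network view concrete, the sketch below builds a small supply chain graph as an adjacency list and computes the downstream reach of a node, the kind of structural query that network analysis of supply chains relies on. The node names and topology are hypothetical, chosen only for illustration:

```python
from collections import defaultdict

# Hypothetical supply chain graph: directed edges follow the flow of goods.
edges = [
    ("supplier_1", "plant_A"), ("supplier_2", "plant_A"),
    ("supplier_2", "plant_B"), ("plant_A", "dc_east"),
    ("plant_B", "dc_east"), ("plant_B", "dc_west"),
    ("dc_east", "retailer_1"), ("dc_east", "retailer_2"),
    ("dc_west", "retailer_3"),
]

adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)

def downstream_reach(node):
    """All nodes reachable from `node` along the material flow (DFS)."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        for nxt in adjacency[n]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# A disruption at plant_B can propagate to every node in its downstream reach.
print(sorted(downstream_reach("plant_B")))
```

The same graph representation supports the structural questions raised above, such as degree heterogeneity across nodes or how adding and removing edges changes the pattern of interaction.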

5 Dynamical Systems Theory<br />

Many physical systems that produce a continuous-time response can be modeled by a set of<br />

differential equations of the form:<br />

dy/dt = f(y, a),     (I)<br />

where y = (y1(t), y2(t), ..., yn(t)) represents the state of the system and may be thought of as a<br />

point in a suitably defined space S, which is known as the phase space, and<br />

a = (a1(t), a2(t), ..., am(t)) is a parameter vector. The dimensionality of S is the number of<br />

a priori degrees of freedom in the system. The vector field f(y,a) is in general a non-linear operator<br />

acting on points in S. If f(y,a) is locally Lipschitz, the above equation defines an initial value problem<br />

in the sense that a unique solution curve passes through each point y in the phase space. Formally<br />

we may write the solution at time t given an initial value y0 as y(t) = ϕ_t y0. ϕ_t represents a<br />

one-parameter family of maps of the phase space into itself. We can perceive the solutions to all<br />

possible initial value problems for the system by writing them collectively as ϕ_t S. This may be<br />

thought of as a flow of points in the phase space. Initially the dimension of the set ϕ_t S will be<br />

that of S itself. As the system evolves, however, it is generally the case for the so-called<br />

dissipative system that the flow contracts onto a set of lower dimension known as attractor. The<br />

attractors can vary from simple stationary, limit cycle, quasi-periodic to complicated chaotic ones<br />

(Strogatz 1994, Ott 1996). The nature of the attractor changes as the parameters (a) are varied, a<br />

phenomenon studied in bifurcation analysis. Typically a nonlinear system is chaotic for<br />

some range of parameters. Chaotic attractors have a structure that is not simple; they are often not<br />

smooth manifolds, <strong>and</strong> frequently have a highly fractured structure, which is popularly referred to<br />

as Fractals (self–similar geometrical objects having structure at every scale). On this attractor,<br />

stretching and folding characterize the dynamics; the former phenomenon causes the divergence<br />

of nearby trajectories and the latter constrains the dynamics to a finite region of the state space. This<br />

accounts for the fractal structure of attractors and the extreme sensitivity to changes in initial<br />

conditions, which is the hallmark of chaotic behavior. A system under chaos is unstable everywhere,<br />

never settling down, producing irregular and aperiodic behavior which leads to a continuous<br />

broadband spectrum. While this feature can be used to distinguish chaotic behavior from<br />

stationary, limit-cycle and quasi-periodic motions using standard Fourier analysis, it makes it<br />

difficult to separate it from noise, which also has a broadband spectrum. It is this “deterministic<br />

r<strong>and</strong>omness” of chaotic behavior, which makes st<strong>and</strong>ard linear modeling <strong>and</strong> prediction<br />

techniques unsuitable for analysis.<br />
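Period doubling under parameter variation and sensitive dependence on initial conditions can both be illustrated with the logistic map, a standard one-dimensional example from the nonlinear dynamics literature rather than a supply chain model:

```python
def logistic_orbit(r, x0, n, discard=200):
    """Iterate the logistic map x -> r*x*(1-x), optionally discarding
    an initial transient so the orbit settles onto its attractor."""
    x = x0
    for _ in range(discard):
        x = r * x * (1 - x)
    orbit = []
    for _ in range(n):
        x = r * x * (1 - x)
        orbit.append(x)
    return orbit

# Period doubling: at r = 3.2 the attractor is a 2-cycle (two values).
cycle = sorted(set(round(x, 6) for x in logistic_orbit(3.2, 0.4, 50)))
print(cycle)

# Sensitivity: at r = 4.0 two orbits starting 1e-6 apart diverge to
# order one within a few dozen iterations.
a = logistic_orbit(4.0, 0.400000, 30, discard=0)
b = logistic_orbit(4.0, 0.400001, 30, discard=0)
print(max(abs(x - y) for x, y in zip(a, b)))
```

The exponential divergence of the two orbits is exactly the "stretching" described above, and it is why only short-term prediction is possible in the chaotic regime.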

5.1 Nonlinear Models for Supply Chains

Understanding the complex interdependencies, effects of priority, nonlinearities, delays, uncertainties, and competition/cooperation for resource sharing is fundamental for the prediction and control of supply chains. The system dynamics approach often leads to models of supply chains that can be described in the form of equation (I). Dynamical systems theory provides a powerful framework for the rigorous analysis of such models and thus can be used to supplement the system dynamics approach. We next describe some nonlinear models and their detailed analysis. These models can be used either to represent entities in a supply chain or as macroscopic models that capture collective behavior. The models reiterate the fact that simple rules can lead to complex behavior, which in general is difficult to predict and control.

5.1.1 Preemptive Queuing Model with Delays

Priority and heterogeneity are fundamental to any logistics planning and scheduling. Tasks have to be prioritized in order to do the most important things first. This arises naturally as we try to optimize an objective and assign the tasks their "importance." Priorities may also arise due to the non-homogeneity of the system, where the "knowledge" level of one agent differs from another's. In addition, in all logistics systems resources are limited, both in time and space. Temporal dependence plays an important role in logistics planning (interdependency). Priorities can also arise from physical constraints, when different stages of processing must satisfy certain temporal relationships.

The considerations regarding the generality of the assumptions, and the clear one-to-one correspondence between physical logistics tasks and the model parameters described in (Erramilli and Forys 1991), led us to apply this queuing model in the context of supply chains (Kumara et al. 2003). The queuing system considered here has two queues (A and B) and a single server, with the following characteristics:

• Once served, a class A customer returns as a class B customer after a constant interval of time.
• Class B has non-preemptive priority over class A, i.e., the class A queue does not get served until the class B queue is emptied.
• The schedules are organized every T units of time, i.e., if the low-priority queue is emptied within time T, the server remains idle for the remainder of the interval.
• Finally, the higher-priority class B has a lower service rate than the lower-priority class A.

Figure 2. Preemptive Queuing Model<br />

Suppose the system is sampled at the end of every schedule cycle, and the following quantities are observed at the beginning of the kth interval:

A_k: queue length of the low-priority queue
B_k: queue length of the high-priority queue
C_k: outflow from the low-priority queue in the kth interval
D_k: outflow from the high-priority queue in the kth interval
λ_k: inflow to the low-priority queue from the outside in the kth interval

The system is characterized by the following parameters:

µ_a: rate, per unit of the schedule cycle, at which the low-priority queue can be served
µ_b: rate, per unit of the schedule cycle, at which the high-priority queue can be served
l: the feedback interval, in units of the schedule cycle

The following four equations then completely describe the evolution of the system:

A_{k+1} = A_k + λ_k − C_k    (1)
C_k = min(A_k + λ_k, µ_a(1 − D_k/µ_b))    (2)
B_{k+1} = B_k + C_{k−l} − D_k    (3)
D_k = min(B_k + C_{k−l}, µ_b)    (4)

Equations (1) and (3) are merely conservation rules, while equations (2) and (4) model the constraints on the outflows and the interaction between the queues. This model, while conceptually simple, exhibits surprisingly complex behaviors.
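Equations (1)–(4) are straightforward to simulate. The sketch below is a minimal illustration; the parameter values (λ, µ_a, µ_b, l) are assumed for illustration and are not taken from the report. High-priority outflow D_k is computed first, since the remaining service capacity for the low-priority queue depends on it.

```python
from collections import deque

def simulate(lam=0.6, mu_a=2.0, mu_b=1.0, l=2, steps=200):
    """Iterate the two-queue schedule-cycle model of equations (1)-(4); l >= 1."""
    A = B = 0.0
    past_C = deque([0.0] * l, maxlen=l)   # holds C_{k-l}, ..., C_{k-1}
    trace = []
    for _ in range(steps):
        C_lag = past_C[0]                            # C_{k-l}, delayed feedback
        D = min(B + C_lag, mu_b)                     # eq. (4): high-priority outflow
        C = min(A + lam, mu_a * (1.0 - D / mu_b))    # eq. (2): leftover capacity serves A
        A = A + lam - C                              # eq. (1): conservation, queue A
        B = B + C_lag - D                            # eq. (3): conservation, queue B
        past_C.append(C)                             # oldest entry drops out automatically
        trace.append((A, B))
    return trace

queues = simulate()
print(max(b for _, b in queues))   # peak high-priority backlog over the run
```

Sweeping λ past µ_b/2 in such a simulation is one way to observe the onset of the oscillatory modes discussed below.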

Dynamical Behavior

The analytic approach to solving the flow model under constant arrivals (i.e., λ_k = λ for all k) shows several classes of solutions. The system is found to batch its workload even for perfectly smooth arrival patterns. The behavior of the system has the following characteristics:

1) Above a threshold arrival rate (λ ≥ µ_b/2), a momentary overload can send the system into a number of stable modes of oscillation.
2) Each mode of oscillation is characterized by distinct average queuing delays.
3) The extreme sensitivity to parameters, and the existence of chaos, imply that at a given time the system may be in any one of a number of distinct steady-state modes.

The batching of the workload can cause significant queuing delays even at moderate occupancies. Such oscillatory behavior also significantly lowers the real-time capacity of the system. For details of the application of this model in a supply chain context, refer to (Kumara et al. 2003).

5.1.2 Managerial Systems

Decision-making is another typical activity in which the entities in a supply chain are continuously engaged. Entities make decisions to optimize their self-interest, often based on local, delayed, and imperfect information.

To illustrate the effects of decisions on the dynamics of the supply chain as a whole, we consider a managerial system that allocates resources to its production and marketing departments in accordance with shifts in inventory and/or backlog (Rasmussen and Mosekilde 1988). It has four level variables: resources in production, resources in sales, inventory of finished products, and number of customers. To represent the time required to adjust production, a third-order delay is introduced between the production rate and inventory. The sum of the two resource variables is kept constant. The rate of production is determined from the resources in production through a nonlinear function, which expresses a decreasing productivity of additional resources as the company approaches maximum capacity. The sales rate, on the other hand, is determined by the number of customers and by the average sales per customer-year. Customers are mainly recruited through visits by the company's salesmen. The rate of recruitment depends upon the resources allocated to marketing and sales, and again it is assumed that there is a diminishing return to increasing sales activity. Once recruited, customers are assumed to remain with the company for an average period AT, the association time.

A difference between production and sales causes the inventory to change. The company is assumed to respond to such changes by adjusting its resource allocation: when the inventory is higher than desired, resources are shifted from production to sales; when the inventory is lower than desired, resources are redirected from sales to production. A certain minimum of resources is always maintained in both production and sales. In the model, this is secured by means of two limiting factors, which reduce the transfer rate when a resource floor is approached. Finally, the model assumes that there is feedback from inventory to the customer defection rate. If the inventory of finished products becomes very low, the delivery time is assumed to become unacceptable to many customers. As a consequence, the defection rate is enhanced by a factor 1+H.

Figure 3. Managerial System<br />

Dynamical Behavior

The managerial system described is controlled by two interacting negative feedback loops. Combined with the delays involved in adjusting production and sales, these loops create the potential for oscillatory behavior. If the transfer of resources is fast enough, this behavior is destabilized and the system starts to perform self-sustained oscillations. The amplitude of these oscillations is ultimately limited by the various nonlinear restrictions in the model, particularly by the reduction of the resource transfer rate as the lower limits on resources in production or in sales are approached.

A series of abrupt changes in the system behavior is observed as the competition between the basic growth tendency and the nonlinear limiting factors is shifted. The simple one-cycle attractor corresponding to H=10 becomes unstable for H=13, and a new stable attractor with twice the original period arises. If H is increased to 28, the stable attractor attains a period of 4. As H is further increased, the period-doubling bifurcations continue until, at H=30, the threshold to chaos is exceeded. The system then behaves in an aperiodic and apparently random manner. Hence the system reaches chaotic behavior through a series of period-doubling bifurcations.
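The period-doubling route to chaos described here is generic, so it can be demonstrated with any one-parameter map in the same universality class. The sketch below uses the textbook logistic map as a stand-in (it is not the Rasmussen–Mosekilde model; the parameter values belong to the logistic map, not to H):

```python
def attractor_period(r, x0=0.5, transient=2000, sample=64, tol=1e-6):
    """Estimate the period of the logistic map attractor at parameter r."""
    x = x0
    for _ in range(transient):          # discard the transient iterates
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(sample):
        x = r * x * (1.0 - x)
        orbit.append(x)
    # the smallest p with x_{k+p} ~ x_k for all sampled k is the period
    for p in range(1, sample):
        if all(abs(orbit[i] - orbit[i - p]) < tol for i in range(p, sample)):
            return p
    return None                          # no short period found: chaotic regime

print(attractor_period(3.2))   # period-2 cycle
print(attractor_period(3.5))   # period-4 cycle
print(attractor_period(3.9))   # None: chaotic
```

The same period-counting approach, applied to a simulation of the managerial model while sweeping H, would trace out the 1 → 2 → 4 → … → chaos cascade described above.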

5.1.3 Deterministic Queuing Model

In this section we consider an alternative discrete-time deterministic queuing model for studying decision making at the entity level in supply chains. The model consists of one server and two queues (X and Y), each representing some activity (Feichtinger et al. 1994). The input rates of both queues are constant and their sum equals the server capacity. In each time period the server has to decide how much time to spend on each of the two activities.

The following quantities can be defined:

α: constant input rate for activity X
β: constant input rate for activity Y
Φ_X: time spent on activity X
Φ_Y: time spent on activity Y
x_k: queue length of X
y_k: queue length of Y

Figure 4. Deterministic Queuing Model

The amounts of time Φ_X and Φ_Y that will be spent on activities X and Y in period k+1 are determined by an adaptive feedback rule depending on the difference of the queue lengths x_k and y_k. The decision rule, or policy function, says that longer queues are served with higher priority.

Two possibilities are considered:

1) All-or-nothing decision: the server decides to spend all its time on the activity corresponding to the longer queue. Hence Φ is a Heaviside function given by

Φ(x − y) = 1 if x ≥ y,
Φ(x − y) = 0 if x < y.

2) Mixed solutions: the server decides to spend most of its time on the activity corresponding to the longer queue. For this decision function an S-shaped logistic function is used, given by

Φ(x − y) = 1 / (1 + e^(−k(x − y))).

The parameter k tunes the "steepness" of the S-shape.

With these decision functions the new queue lengths x_{k+1} and y_{k+1} are given by the equations

x_{k+1} = x_k + α − Φ(x_k − y_k),
y_{k+1} = y_k + β − [1 − Φ(x_k − y_k)].

Using the constraints α + β = 1 and Φ_X + Φ_Y = 1, it is sufficient to consider the dynamics of the one-dimensional map

f(x) = x + α − Φ(2x − 2)

in order to study the behavior of the system.
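A short simulation makes the two decision rules concrete. The sketch below iterates the pair of queue-length equations directly; the parameter values and initial queue lengths are illustrative assumptions, and Φ_Y is taken as 1 − Φ_X, consistent with the constraint Φ_X + Φ_Y = 1.

```python
import math

def heaviside(d):
    """All-or-nothing rule: all server time goes to the longer queue."""
    return 1.0 if d >= 0 else 0.0

def logistic(d, k=10.0):
    """Mixed rule: S-shaped split of server time, steepness k."""
    return 1.0 / (1.0 + math.exp(-k * d))

def iterate(phi, alpha=0.45, x0=1.3, y0=0.7, steps=100):
    """Iterate x_{k+1} = x_k + alpha - Phi, y_{k+1} = y_k + beta - (1 - Phi)."""
    beta = 1.0 - alpha                  # alpha + beta = 1: input matches capacity
    x, y = x0, y0
    trace = []
    for _ in range(steps):
        share = phi(x - y)              # fraction of the period spent on X
        x, y = x + alpha - share, y + beta - (1.0 - share)
        trace.append((x, y))
    return trace

for rule in (heaviside, logistic):
    xs = [x for x, _ in iterate(rule)]
    print(rule.__name__, min(xs), max(xs))   # range visited by queue X
```

Because total inflow equals total capacity, x_k + y_k is conserved along the iteration, which is exactly what permits the reduction to the one-dimensional map f.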

Dynamical Behavior<br />

For 0


sustained in a supply chain. Resources can be of various types: physical resources, manpower, information, and money. With the IT architectures being developed to realize supply chains, the sharing of computational resources (such as CPU, memory, bandwidth, databases, etc.) is also becoming a critical issue. It is through resource sharing that interdependencies arise between different entities. This leads to a complex web of interactions in supply chains, just as in, for example, a food web or an ecosystem. As a result, such systems can be referred to as "computational ecosystems" (Hogg and Huberman 1988), in analogy with biological ecosystems.

"Computational ecosystems" is a generic model of the dynamics of resource allocation among agents trying to solve a problem collectively. The model captures the following features: distributed control, asynchrony in execution, resource contention and cooperation among agents, and the concomitant problems of incomplete knowledge and delayed information. The behavior of each agent is modeled using a payoff function whose nature determines whether the agent is cooperative or competitive. The agent here can be any entity in a supply chain, such as a distributor or retailer, or a software agent in an e-commerce scenario. The state of the system is represented by the average number of entities using the different resources and follows a delay differential equation under a mean-field approximation. The resources can be physical or computational, as discussed before. For example, in the case of two resources and n identical agents, the law governing the rate of change of the occupation of a resource is given by:

change of occupation of a resource is given by:<br />

d n1<br />

( t)<br />

= α ( n ρ − n1<br />

( t)<br />

)<br />

dt<br />

where,<br />

n1 ( t)<br />

= Expected no. of agents using resource 1 at given instant of time t.<br />

α : Expected no. of choices made by an agent per unit time<br />

ρ : A r<strong>and</strong>om variable that denotes that resource1 will be perceived to have a higher payoff than<br />

res ource 2 <strong>and</strong> ρ gives its expected value.<br />

Figure 5. Computational Ecosystems (τ = time delay; σ = standard deviation of ρ)
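Under assumed payoff functions, the mean-field law above can be integrated numerically; with a finite information delay τ it becomes a delay differential equation. The sketch below uses a simple Euler scheme in the occupancy fraction f1 = n1/n. The linear payoffs and the logistic smoothing standing in for the agents' uncertainty σ are our own illustrative assumptions, not the model of Hogg and Huberman.

```python
import math

def rho_bar(f1_delayed, sigma=0.2):
    """Expected probability that resource 1 looks better, given the delayed
    occupancy fraction. Payoff functions here are illustrative stand-ins."""
    g1 = 4.0 - 7.0 * f1_delayed            # payoff of resource 1 falls with crowding
    g2 = 4.0 - 3.0 * (1.0 - f1_delayed)    # payoff of resource 2
    # logistic smoothing stands in for uncertainty sigma in the payoff estimates
    return 1.0 / (1.0 + math.exp(-(g1 - g2) / sigma))

def simulate(alpha=1.0, tau=2.0, dt=0.01, t_end=40.0):
    """Euler-integrate df1/dt = alpha * (rho_bar(f1(t - tau)) - f1(t))."""
    lag = int(tau / dt)
    f1 = [0.5] * (lag + 1)                 # constant history as the initial state
    for _ in range(int(t_end / dt)):
        f_delayed = f1[-(lag + 1)]         # occupancy seen tau time units ago
        f1.append(f1[-1] + dt * alpha * (rho_bar(f_delayed) - f1[-1]))
    return f1

f1 = simulate()
tail = f1[len(f1) // 2:]
print(min(tail), max(tail))   # sustained swing: the delay destabilizes the fixed point
```

With these settings the delayed feedback destabilizes the equilibrium and the occupancy fraction settles into sustained oscillations, the first stage of the progression toward chaos described next.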

The global performance of the ecosystem can be obtained from the above equation. Under different conditions of delay, uncertainty, and cooperation/competition, the system shows a rich panoply of behaviors ranging from stability through sustained oscillations to intermittent chaos and finally to fully developed chaos. Furthermore, the following generic deductions can be made from this model (Kephart et al. 1989). While information delay has an adverse impact on system performance, uncertainty has a profound effect on the stability of the system. One can deliberately increase the uncertainty in the agents' evaluation of the merits of choices to make the system stable, but at the expense of performance degradation. A second possibility is a very slow reevaluation rate for the agents, which however makes them non-adaptive. Heterogeneity in the nature of the agents can lead to more stability than in the homogeneous case, but the system loses its ability to cope with unexpected changes, such as new task requirements. On the other hand, poor performance can be traced to the fact that non-predictive agents do not take the information delay into account.

If the agents were able to make accurate predictions of the system's current state, the information delay could be overcome and the system would perform well. This results in a "co-evolutionary" system in which all of the individuals are simultaneously trying to adapt to one another. In such a situation agents can act like technical analysts or system analysts (Kephart et al. 1990). Agents acting as technical analysts (like those in market behavior) use either linear extrapolation or cyclic trend analysis to estimate the current state of the system. Agents acting as system analysts, on the other hand, have knowledge both of the individual characteristics of the other agents in the system and of how those characteristics are related to the overall system dynamics. Technical analysts are responsive to the behavior of the system, but suffer from an inability to take into account the strategies of other agents. Moreover, a good predictive strategy for a single agent may be disastrous if applied on a global scale. System analysts perform extremely well when they have very accurate information about the other agents in the system, but can perform very poorly when their information is even slightly inaccurate. They take into account the strategies of other agents, but pay no heed to the actual behavior of the system. This suggests combining the strengths of both methods to form a hybrid, the adaptive system analyst, which modifies its assumptions about other agents in response to feedback about the success of its own predictions. The resulting hybrid is able to perform well.

In order to avoid chaos while maintaining high performance and adaptability to unforeseen changes, more sophisticated techniques are required. One such technique is a reward mechanism (Hogg and Huberman 1991), whereby the relative number of computational agents following effective strategies is increased at the expense of the others. This procedure, which generates the right mix of a diverse population out of an essentially homogeneous one, is able to control chaos through a series of bifurcations into a stable fixed point.

In the above description each agent chooses among the different resources according to its perceived payoff, which depends on the number of agents already using the resource. Even an agent with predictive ability is myopic in its view, as it considers only its current estimate of the system state, without regard to the future. Expectations come into play if agents use past and present global behavior in estimating the expected future payoff of each resource. A dynamical model of collective action that includes expectations can be found in (Glance 1993).

6. Models from Observed Data

One of the central problems in a supply chain, closely related to modeling, is that of demand forecasting: given the past, how can we predict future demand? The classic approach to forecasting is to build an explanatory model from first principles and measure the initial conditions. Unfortunately, this has not been possible in systems like supply chains, for two reasons. First, we still lack the general "first principles" for demand variation in supply chains that are necessary to build good models. Second, due to the distributed nature of supply chains, the initial data or conditions are often difficult to obtain.

Due to these factors, the modern theory of forecasting, as it has been used in supply chains, views a time series x(t) as a realization of a random process. This is appropriate when the effective randomness arises from complicated motion involving many independent, irreducible degrees of freedom. An alternative cause of randomness is chaos, which can occur even in very simple deterministic systems, as we discussed in the earlier sections. While chaos places a fundamental limit on long-term prediction, it suggests possibilities for short-term prediction: random-looking data may contain only a few irreducible degrees of freedom. Time traces of the state variables of such chaotic systems display behavior intermediate between regular periodic or quasiperiodic motion and unpredictable, truly stochastic behavior. Chaos has long been seen as a form of "noise" because the tools for its analysis were couched in a language tuned to linear processes. The main such tool is Fourier analysis, which is precisely designed to extract the composition of sines and cosines found in an observation x(t). Similarly, standard linear modeling and prediction techniques, such as autoregressive moving average (ARMA) models, are not suitable for nonlinear systems.

With advances in IT and the science of complexity, both of these challenges for forecasting can be addressed. Large-scale simulation and micro autonomy (Section 2) enable tracking of the detailed interactions between the different entities in a supply chain. The large volumes of data so generated can be used to understand demand patterns in particular, and to comprehend the emergence of other characteristics in general. Even though an exact prediction of future behavior is difficult, archetypal behavior patterns can often be recognized using these data. Techniques from complexity theory such as nonlinear time series analysis and computational mechanics are appropriate for this purpose.

6.1 Nonlinear Time Series Analysis

The need to extract interesting physical information about the dynamics of observed systems operating in a chaotic regime has led to the development of nonlinear time series analysis techniques. Systematically, the study of potentially chaotic systems may be divided into three areas: identification of chaotic behavior; modeling and prediction; and control. The first area shows how chaotic systems may be separated from stochastic ones and, at the same time, provides estimates of the degrees of freedom and the complexity of the underlying chaotic system. Based on such results, identification of a state space representation allowing for subsequent predictions may be carried out. The last stage, if desired, involves control of the chaotic system.

Given the observed behavior of a dynamical system as a one-dimensional time series x(n), we want to build models for prediction. The most important task in this process is phase space reconstruction, which involves building a topologically and geometrically equivalent attractor. In general, the steps in nonlinear time series analysis can be summarized as follows (Abarbanel 1996):

• Signal separation (finding the signal): separation of the broadband signal from broadband "noise" using the deterministic nature of the signal.

• Phase space reconstruction (finding the space): using the method of delays, one can construct a series of vectors that is diffeomorphically equivalent to the attractor of the original dynamical system and, at the same time, distinguish the series from a stochastic one. The basis for this is Takens' embedding theorem (Takens 1981). Time-lagged variables are used to construct vectors in a phase space of dimension d_E:

y(n) = [x(n), x(n + T), ..., x(n + (d_E − 1)T)]

The time lag T can be determined using mutual information (Fraser and Swinney 1983) and d_E using the false nearest neighbors test (Kennel et al. 1992).


• Classification of the signal: system identification in nonlinear chaotic systems means establishing a set of invariants for each system of interest and then comparing observations to that library of invariants. The invariants are properties of the attractor and are independent of any particular trajectory on the attractor. Invariants can be divided into two classes: fractal dimensions (Farmer et al. 1983) and Lyapunov exponents (Sano and Sawada 1985). Fractal dimensions characterize the geometrical complexity of the dynamics, i.e., how the sample of points along a system orbit is distributed spatially. Lyapunov exponents, on the other hand, describe the dynamical complexity, i.e., the "stretching and folding" in the dynamical process.

• Making models and prediction: this step involves determining the parameters of the assumed model of the dynamics

y(n) → y(n + 1)
y(n + 1) = F(y(n), a_1, a_2, ..., a_p)

which is consistent with the invariant classifiers (Lyapunov exponents, dimensions). The functional forms F(·) often used include polynomials, radial basis functions, etc. The local false nearest neighbors test (Abarbanel and Kennel 1993) is used to determine how many dimensions are locally required to describe the dynamics generating the time series, without knowing the equations of motion, and hence gives the dimension for the assumed model. The methods for building nonlinear models can be classified as global and local (Farmer and Sidorowich 1987; Casdagli 1989). By definition, local methods vary from point to point in the phase space, while global models are constructed once and for all in the whole phase space. Models based on machine learning techniques such as radial basis functions or neural networks (Powell 1987) and support vector machines (Mukherjee et al. 1997) carry features of both. They are usually used as global functional forms, but they clearly demonstrate localized behavior too.
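The reconstruction step can be sketched in a few lines. The example below (our own illustration) embeds a scalar observable from the Hénon map, a standard chaotic test system; the choices d_E = 3 and T = 1 are placeholders for values that would come from the false-nearest-neighbors and mutual-information tests, respectively.

```python
import numpy as np

def delay_embed(x, d_E, T):
    """Method of delays: build vectors y(n) = [x(n), x(n+T), ..., x(n+(d_E-1)T)]."""
    n_vectors = len(x) - (d_E - 1) * T
    return np.array([x[n:n + (d_E - 1) * T + 1:T] for n in range(n_vectors)])

# scalar observable from the Henon map, a standard chaotic test system
a, b = 1.4, 0.3
u, v = 0.0, 0.0
series = []
for _ in range(1000):
    u, v = 1.0 - a * u * u + v, b * u
    series.append(u)

Y = delay_embed(np.array(series), d_E=3, T=1)
print(Y.shape)   # (998, 3): 998 reconstructed state vectors in 3 dimensions
```

Each row of Y is one reconstructed state vector y(n); invariants and predictive models are then computed on these vectors rather than on the scalar series.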

The techniques of nonlinear time series analysis are well suited for modeling the nonlinearities in supply chains. For an application of nonlinear time series analysis to supply chains, the reader is referred to Lee et al., 2002. Using it, one can deduce that the time series is deterministic, so that it should be possible in principle to build predictive models. The invariants can be used to characterize the complex behavior effectively. For example, the largest Lyapunov exponent gives an indication of how far into the future reliable predictions can be made, while the fractal dimensions give an indication of how complex a model should be chosen to represent the data. These models then provide the basis for systematically developing control strategies. It should be noted that the functional forms used for modeling in the last step above are continuous in their arguments. This approach builds models by viewing a dynamical system as obeying the laws of physics. From another perspective, a dynamical system can be considered as processing information, so an alternative class of discrete "computational" models, inspired by the theory of automata and formal languages, can also be used for modeling the dynamics (Marcus 1996). "Computational mechanics" takes this viewpoint and describes the system behavior in terms of its intrinsic computational architecture, i.e., how it stores and processes information.

6.2 Computational Mechanics

Computational mechanics is a method for inferring the causal structure of stochastic processes from empirical data or arbitrary probabilistic representations. It combines ideas and techniques from nonlinear dynamics, information theory, and automata theory, and is, as it were, an "inverse" to statistical mechanics. Instead of starting with a microscopic description of particles and their interactions and deriving macroscopic phenomena, it starts with observed macroscopic data and infers the simplest causal structure, the "ε-machine," capable of generating the observations. The ε-machine in turn describes the system's intrinsic computation, i.e., how it stores and processes information. This is developed using the statistical mechanics of orbit ensembles, rather than focusing on the computational complexity of individual orbits. By not requiring a Hamiltonian, computational mechanics can be applied in a wide range of contexts, including those, such as supply chains, where an energy function for the system may not be evident. Notions of complexity, emergence, and self-organization have also been formalized and quantified in terms of various information measures (Shalizi 2000).

Given a time series, the (unknowable) exact states of an observed system are translated into a sequence of symbols via a measurement channel (Crutchfield 1992). Two histories (i.e., two series of past data) carry equivalent information if they lead to the same (conditional) probability distribution over the future (i.e., if it makes no difference which of the two data series is observed). Under these circumstances, i.e., when the effects of the two series are indistinguishable, they can be lumped together. This procedure identifies the causal states, as well as the structure of connections or succession among the causal states, and creates what is known as an "ε-machine." The ε-machines form a special class of deterministic finite state automata (DFSA) with transitions labeled with conditional probabilities, and hence can also be viewed as Markov chains. However, a process may fail to admit a finite-memory model such as an ε-machine, implying that the number of causal states could turn out to be infinite. In this case, a more powerful model than a DFSA needs to be used. One proceeds by trying the next most powerful model class in the hierarchy of machines known as the causal hierarchy (Crutchfield 1994), in analogy with the Chomsky hierarchy of formal languages. While "ε-machine reconstruction" refers to the process of constructing the machine given an assumed model class, "hierarchical machine reconstruction" describes a process of innovation to create a new model class. It detects regularities in a series of increasingly accurate models, and the inductive jump to a higher computational level occurs by taking those regularities as the new representation.

ε -machines reflect a balanced utilization of deterministic <strong>and</strong> r<strong>and</strong>om information processing<br />

<strong>and</strong> this is discovered automatically during ε -machine reconstruction. These machines are<br />

unique <strong>and</strong> optimal in the sense that they have maximal predictive power <strong>and</strong> minimum model<br />

size (hence satisfy Principle of Occam Razor i.e. causes should not be multiplied beyond<br />

necessity). ε -machine provides a minimal description of the pattern or regularities in a system in<br />

the sense that the pattern is the algebraic structure determined by the causal states <strong>and</strong> their<br />

transitions. ε -machines are also minimally stochastic. Hence computational mechanics acts as a<br />

method for automatic pattern discovery.<br />
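The causal-state idea can be illustrated with a toy calculation. The Python sketch below is not from the original work: the history length `k` and merging tolerance `tol` are illustrative assumptions, and the grouping is a crude stand-in for the statistical equivalence test used in real reconstruction algorithms such as Shalizi's CSSR. It groups fixed-length histories of a binary series by their empirical next-symbol distributions:

```python
from collections import defaultdict

def causal_states(series, k=2, tol=0.05):
    """Group length-k histories whose empirical next-symbol
    distributions agree within `tol` -- a crude stand-in for
    the causal-state equivalence test."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(series) - k):
        hist = tuple(series[i:i + k])
        counts[hist][series[i + k]] += 1
    # empirical P(next = 1 | history)
    p1 = {h: c[1] / (c[0] + c[1]) for h, c in counts.items()}
    states = []  # each state: [representative probability, [histories]]
    for h, p in sorted(p1.items()):
        for s in states:
            if abs(s[0] - p) <= tol:
                s[1].append(h)
                break
        else:
            states.append([p, [h]])
    return states

# Period-2 sequence 0101...: the histories (0,1) and (1,0) have
# deterministic but different futures, so two causal states emerge.
seq = [0, 1] * 50
print(len(causal_states(seq)))
```

For the period-2 sequence the two distinct histories cannot be merged, so two causal states are recovered; a constant sequence yields a single state.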

The ε-machine is the organization of the process, or at least of the part of it relevant to our measurements. Being a model of the observed time series from a system, the ε-machine can be used to define and calculate macroscopic or global properties that reflect the characteristic average information-processing capabilities of the system. These include the entropy rate, the excess entropy and the statistical complexity (Feldman and Crutchfield 1998, Crutchfield and Feldman 2001). The entropy rate indicates how predictable the system is. The excess entropy, on the other hand, measures the apparent memory stored in a spatial configuration and represents how difficult prediction is. ε-machine reconstruction leads to a natural measure of the statistical complexity of a process, namely the amount of information needed to specify the state of the ε-machine, i.e., its Shannon entropy. Statistical complexity is distinct from, and dual to, information-theoretic entropies and dimension (Crutchfield and Young 1989). The existence of chaos shows that there is a rich variety of unpredictability spanning two extremes: periodic and random behavior. Behavior between these extremes, while of intermediate information content, is more complex in that its most concise description (model) is an amalgam of regular and stochastic processes. An information-theoretic description of this spectrum in terms of dynamical entropies measures the raw diversity of temporal patterns. The dynamical entropies, however, do not directly measure the computational effort required to model the complex behavior, which is what statistical complexity captures.
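For a process whose ε-machine is a small Markov chain, these quantities reduce to simple sums over the transition probabilities. The sketch below is an illustrative calculation, not code from the report; it assumes a two-state chain and computes the entropy rate h = -Σᵢ πᵢ Σⱼ Tᵢⱼ log₂ Tᵢⱼ and the statistical complexity (Shannon entropy of the stationary causal-state distribution):

```python
import math

def stationary(T):
    """Stationary distribution of a 2-state Markov chain by power iteration."""
    pi = [0.5, 0.5]
    for _ in range(1000):
        pi = [pi[0] * T[0][0] + pi[1] * T[1][0],
              pi[0] * T[0][1] + pi[1] * T[1][1]]
    return pi

def entropy_rate(T):
    """h = -sum_i pi_i sum_j T_ij log2 T_ij (bits per symbol)."""
    pi = stationary(T)
    return -sum(pi[i] * T[i][j] * math.log2(T[i][j])
                for i in range(2) for j in range(2) if T[i][j] > 0)

def statistical_complexity(T):
    """C = Shannon entropy of the causal-state distribution."""
    pi = stationary(T)
    return -sum(p * math.log2(p) for p in pi if p > 0)

# Golden-mean-like process: in state A the next symbol is free (0.5/0.5);
# state B forces a return to A with probability 1.
T = [[0.5, 0.5], [1.0, 0.0]]
print(entropy_rate(T), statistical_complexity(T))  # 2/3 bits, ~0.918 bits
```

Here the stationary distribution is (2/3, 1/3), giving an entropy rate of 2/3 bits per symbol and a statistical complexity of about 0.918 bits.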


Computational mechanics sets limits on how well processes can be predicted and shows how, at least in principle, those limits can be attained. ε-machines are what any prediction method would build, if only it could. Similar to ε-machine reconstruction, techniques exist for discovering causal architecture in memoryless transducers, transducers with memory, and spatially extended systems (Shalizi 2000). Computational mechanics can be used for modeling and prediction in supply chains in the following way:

• In systems like supply chains, it is difficult to define analogs of thermodynamic quantities such as energy, temperature and pressure, as we can for physical systems. Each component in the network has cognition, which is absent in the components of physical systems, say the molecules of a gas. Because of such difficulties, statistical mechanics cannot be applied directly to build prediction models for supply chains. As discussed previously, by not requiring a Hamiltonian (the energy-like function), computational mechanics remains applicable to supply chains.

• ε-machines can be built to discover patterns in the behavior of various quantities in supply chains, such as inventory levels and demand fluctuations.

• ε-machines can be used for prediction through a process known as "synchronization" (Crutchfield and Feldman 2003).

• ε-machines can be used to calculate various global properties, such as the entropy rate, excess entropy and statistical complexity, that reflect how the system stores and processes information. The significance of these quantities has been discussed earlier.

• We can also quantify notions of complexity, emergence and self-organization in terms of various information measures derived from ε-machines. By evaluating such quantities we can compare the complexity of different supply chains and quantify the extent to which a network shows emergence. We can also infer when a supply chain is undergoing self-organization and to what extent. Such quantification can help us compare precisely which policies or cognitive capabilities possessed by individual agents lead to different degrees of emergence and self-organization. Hence we can decide to what extent we wish to enforce control and to what extent we want to let the network emerge.

7. Network Dynamics

The ubiquity of networks in the social, biological and physical sciences and in technology leads naturally to an important set of common problems, which are currently being studied under the rubric of "Network Dynamics" (Strogatz 2001). Structure always affects function, and it is important to consider dynamical and structural complexity together in the study of networks. For instance, the topology of social networks affects the spread of information and disease, and the topology of the power grid affects the robustness and stability of power transmission. The different problem areas in network dynamics are discussed below.

One area of research in this field has been concerned primarily with the dynamical complexity in regular networks, without regard to other network topologies. While the collective behavior depends on the details of the network, some generalizations can still be drawn (Strogatz 2001). For instance, if the dynamical system at each node has stable fixed points and no other attractors, the network tends to lock into a static fixed pattern. If the nodes have competing interactions, the network may become frustrated and display an enormous number of locally stable equilibria. In the intermediate case where each node has a stable limit cycle, synchronization and patterns like traveling waves can be observed. For non-identical oscillators, a temporal analogue of a phase transition can be seen, with the coupling coefficient as the control parameter. At the opposite extreme, if each node has an identical chaotic attractor, the nodes of the network can synchronize their erratic fluctuations. For a wide range of network topologies, synchronized chaos requires that the coupling be neither too weak nor too strong; otherwise spatial instabilities are triggered. A related line of research, dealing with networks of identical chaotic maps, is that of coupled map lattices (Kaneko and Tsuda 1996) and cellular automata (Wolfram 1994). However, these systems have been used mainly as test-beds for exploring spatio-temporal chaos and pattern formation in the simplest mathematical settings, rather than as models of real systems.

The second area in network dynamics is concerned with characterizing network structure. Network topologies can vary from completely regular (chains, grids, lattices and fully connected graphs) to completely random. Moreover, the graphs can be directed or undirected, and cyclic or acyclic. To characterize the topological properties of graphs, various statistical quantities have been defined. The most important of these include the average path length, the clustering coefficient, the degree distribution, the size of the giant component and various spectral properties. A review of the main models and analytical tools, covering regular graphs, random graphs, generalized random graphs, small-world and scale-free networks, as well as the interplay between topology and a network's robustness against failures and attacks, can be found in (Albert and Barabasi 2002, Dorogovtsev and Mendes 2002).
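The two statistics used most often below are straightforward to compute on a graph given as an adjacency list. The following pure-Python sketch is illustrative (a real analysis would use a graph library); it computes the average shortest-path length by breadth-first search and the average clustering coefficient:

```python
from collections import deque

def avg_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for n, d in dist.items() if n != s)
        pairs += len(dist) - 1
    return total / pairs

def clustering_coeff(adj):
    """Average over nodes of (links among neighbours) / (possible links)."""
    cs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # clustering undefined for degree < 2
        links = sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
                    if b in adj[a])
        cs.append(2 * links / (k * (k - 1)))
    return sum(cs) / len(cs) if cs else 0.0

# Triangle plus one pendant node: high clustering, short paths.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(avg_path_length(g), clustering_coeff(g))  # 4/3 and 7/9
```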

The classic random graphs were introduced by Erdos and Renyi (Bollobas 1985) and have been the most thoroughly studied models of networks. Such graphs have a Poisson degree distribution and statistically uncorrelated vertices. At large N (the total number of nodes in the graph) and large enough p (the probability that two arbitrary vertices are connected), a giant connected component appears in the network, a process known as percolation. Random graphs exhibit a low average path length and a low clustering coefficient. Regular networks, on the other hand, show a high clustering coefficient and a greater average path length than random graphs of similar size. The networks found in the real world, however, are neither completely regular nor completely random. This has recently been recognized through the discovery of "small world" and "scale free" characteristics in many real networks: social networks, the Internet, the WWW, power grids, collaboration networks, and ecological and metabolic networks, to name a few.

To describe the transition from a regular network to a random network, Watts and Strogatz introduced the so-called small-world graphs as models of social networks (Watts and Strogatz 1998, Newman 2000). This model exhibits a high degree of clustering, as in a regular network, and a small average distance between vertices, as in the classic random graphs. A feature this model shares with the random-graph model is that the connectivity distribution of the network peaks at an average value and decays exponentially. Such an exponential network is homogeneous in nature: each node has roughly the same number of connections. Due to the high degree of clustering, models of dynamical systems with small-world coupling display enhanced signal-propagation speed, rapid disease propagation, and synchronizability (Watts and Strogatz 1998).
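The Watts-Strogatz construction interpolates between the regular and random regimes by rewiring a ring lattice. The sketch below is illustrative rather than the canonical algorithm (duplicate-edge handling is simplified, and the parameters are arbitrary): each edge of a ring lattice is redirected to a random target with probability beta.

```python
import random

def watts_strogatz(n, k, beta, seed=0):
    """Ring lattice of n nodes, each linked to its k nearest neighbours on
    one side; every edge is rewired to a random target with probability beta."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for j in range(1, k + 1):
            edges.add((i, (i + j) % n))  # regular ring-lattice edges
    rewired = set()
    for (u, v) in edges:
        if rng.random() < beta:
            w = rng.randrange(n)
            # avoid self-loops and duplicates among rewired edges
            while w == u or (u, w) in rewired or (w, u) in rewired:
                w = rng.randrange(n)
            rewired.add((u, w))
        else:
            rewired.add((u, v))
    return rewired

g = watts_strogatz(20, 2, 0.1)
print(len(g))
```

Even a small rewiring probability creates a few long-range shortcuts, which collapse the average path length while leaving the clustering of the lattice largely intact.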

Another significant recent discovery in the field of complex networks is the observation that the connectivity distributions of a number of large-scale complex networks, including the WWW, the Internet and metabolic networks, have the power-law form P(k) ≈ k^(−γ), where P(k) is the probability that a node in the network is connected to k other nodes and γ is a positive real number (Barabasi et al. 2000, Barabasi 2000). Since power laws are free of characteristic scale, such networks are called "scale-free networks". A scale-free network is inhomogeneous in nature: most nodes have few connections, but a small (yet statistically significant) number have many. The average path length is smaller in a scale-free network than in a random graph, indicating that the heterogeneous scale-free topology is more efficient in bringing nodes closer together than the homogeneous topology of random graphs. The clustering coefficient of a scale-free network is about five times higher than that of a random graph, and this factor slowly increases with the number of nodes. It has been shown that it is practically impossible to achieve synchronization in a nearest-neighbor coupled network (regular connectivity) if the network is sufficiently large. However, it is quite easy to achieve synchronization in a scale-free dynamical network no matter how large the network is (Wang and Chen 2002). Moreover, the synchronizability of a scale-free dynamical network is robust against random removal of nodes, but fragile to targeted removal of the most highly connected nodes.
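Scale-free degree distributions arise naturally from growth with preferential attachment (the Barabási-Albert mechanism). The sketch below is illustrative (parameters and the standard repeated-node-list trick for degree-proportional sampling are my own choices): each new node attaches m edges to existing nodes chosen with probability proportional to their degree, producing hubs whose degree far exceeds the mean.

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow a graph by preferential attachment; returns the degree of
    each node. New nodes attach m edges to existing nodes chosen with
    probability proportional to degree."""
    rng = random.Random(seed)
    targets = list(range(m))   # first new node links to the m seed nodes
    repeated = []              # each node appears once per unit of degree
    degree = {i: 0 for i in range(m)}
    for new in range(m, n):
        degree[new] = 0
        for t in set(targets):
            degree[new] += 1
            degree[t] += 1
            repeated.extend([new, t])
        # degree-proportional sampling for the next node's targets
        targets = [rng.choice(repeated) for _ in range(m)]
    return degree

deg = barabasi_albert(2000, 2)
mean = sum(deg.values()) / len(deg)
print(mean, max(deg.values()))  # hubs: max degree far above the mean
```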

The scale-free property and a high degree of clustering (the small-world effect) are, however, not mutually exclusive: a large number of real networks exhibit both. Yet most models proposed to describe the topology of complex networks have difficulty capturing these two features simultaneously. It has been shown in (Ravasz and Barabasi 2003) that the two features are the consequence of a hierarchical organization present in the networks. This argument agrees with that of Herbert Simon (Simon 1997), who argues: "we could expect complex systems to be hierarchies in a world in which complexity has to evolve from simplicity. In their dynamics, hierarchies have a property, near decomposability, that greatly simplifies their behavior. Near decomposability also simplifies the description of complex systems and makes it easier to understand how the information needed for the development of the system can be stored in reasonable compass". Indeed, many networks are fundamentally modular: one can easily identify groups of nodes that are highly interconnected with each other but have only a few or no links to nodes outside the group to which they belong. This clearly identifiable modular organization is the origin of the high clustering coefficient. These modules can, in turn, be organized in a hierarchical fashion into increasingly large groups, giving rise to "hierarchical networks" while still maintaining a scale-free topology. Thus modularity, scale-free character and a high degree of clustering can be brought under a common roof. Moreover, in hierarchical networks the degree of clustering characterizing the different groups follows a strict scaling law, which can be used to identify the presence of hierarchical structure in real networks.

The mathematical theory of graphs with arbitrary degree distributions, known as "generalized random graphs", can be found in (Newman et al. 2001) and (Newman 2003). Using the "generating function formulation", the authors solve the percolation problem (i.e., they find conditions for predicting the appearance of a giant component) and obtain formulae for the clustering coefficient and average path length of generalized random graphs. The authors also propose and study models of the propagation of diseases, failures and fads, and of synchronization, on such graphs, and extend their results to bipartite and directed graphs.

Network dynamics, though in its infancy, promises a formal framework for characterizing the organizational and functional aspects of supply chains. With changing trends in supply chains, many new issues have become critical: organizational resistance to change, inter-functional or inter-organizational conflicts, relationship management, and consumer and market behavior. Such problems are ill-structured and behavioral, and cannot generally be addressed by analytical tools such as mathematical programming. Successful supply chain integration depends on the supply chain partners' ability to synchronize and share real-time information, and the establishment of collaborative relationships among partners is a prerequisite to information sharing. As a result, successful supply chain management relies on systematically studying questions such as: 1) what are the robust architectures for collaboration, and what coordination strategies lead to such architectures; 2) if different entities decide whether or not to cooperate on the basis of imperfect information about the group activity, and incorporate expectations of how their decisions will affect other entities, can overall cooperation be sustained for long periods of time; 3) how do expectations, group size and diversity affect coordination and cooperation; and 4) which kinds of organizations are most able to sustain ongoing collective action, and how might such organizations evolve over time? Network dynamics addresses many of these questions and should be explored in the context of supply chains.

8. Conclusions and Future Work

The idea of managing the whole supply chain and transforming it into a highly autonomous, dynamic, agile, adaptive and reconfigurable network certainly provides an appealing vision for managers. The infrastructure provided by information technology has made this vision partially realizable. But the inherent complexity of supply chains makes the efficient utilization of information technology an elusive endeavor. Tackling this complexity has been beyond existing tools and techniques, which require revival and extension.

As a result, we emphasized in this paper that in order to effectively understand a supply chain network, it should be treated as a CAS. We laid down some initial ideas for extending the modeling and analysis of supply chains using the concepts, tools and techniques arising in the study of CAS. As future work, we need to verify the feasibility and usefulness of the proposed techniques in the context of large-scale supply chains.

Acknowledgements

The authors wish to acknowledge DARPA (Grant #MDA972-1-1-0038 under the UltraLog Program) for its generous support of this research. In addition, the partial support provided by NSF (Grant #DMII-0075584) to Professor Kumara is greatly appreciated.

References

Abarbanel, H. D. I., 1996, The Analysis of Observed Chaotic Data, Springer-Verlag, New York.

Abarbanel, H. D. I. and Kennel, M. B., 1993, Local False Nearest Neighbors and Dynamical Dimensions from Observed Chaotic Data, Phys. Rev. E, 47, 3057-3068.

Adami, C., 1998, Introduction to Artificial Life, Springer-Verlag.

Albert, R. and Barabasi, A. L., 2002, Statistical Mechanics of Complex Networks, Reviews of Modern Physics, 74, 47.

Albert, R., Barabási, A. L., Jeong, H. and Bianconi, G., 2000, Power-law distribution of the World Wide Web, Science, 287, 2115.

Albert, R., Jeong, H. and Barabasi, A. L., 2000, Error and attack tolerance of complex networks, Nature, 406, 378-382.

Balakrishnan, A., Kumara, S. and Sundaresan, S., 1999, Exploiting Information Technologies for Product Realization, Information Systems Frontiers, A Journal of Research and Innovation, 1(1), 25-50.

Barabasi, A. L., July 2000, The Physics of the Web, Physics Web.

Barabasi, A. L., Albert, R. and Jeong, H., 2000, Scale-free characteristics of random networks: The topology of the World Wide Web, Physica A, 281, 69-77.

Baranger, M., Chaos, Complexity, and Entropy: A physics talk for non-physicists, http://necsi.org/projects/baranger/cce.pdf.

Bar-Yam, Y., 1997, Dynamics of Complex Systems, Addison-Wesley, Reading, MA.

Bollobas, B., 1985, Random Graphs, Academic Press, London.

Callaway, D. S., Newman, M. E. J., Strogatz, S. H. and Watts, D. J., 2000, Network robustness and fragility: Percolation on random graphs, Phys. Rev. Lett., 85, 5468-5471.

Carlson, J. M. and Doyle, J., 1999, Highly optimized tolerance: a mechanism for power laws in designed systems, Physical Review E, 60(2), 1412-1427.

Casdagli, M., 1989, Nonlinear prediction of chaotic time series, Physica D, 35, 335-356.

Choi, T. Y., Dooley, K. J. and Rungtusanatham, M., 2001, Supply networks and complex adaptive systems: control versus emergence, Journal of Operations Management, 19(3), 351-366.

Cooper, M. C., Lambert, D. M. and Pagh, J. D., 1997, Supply chain management: More than a new name for logistics, The International Journal of Logistics Management, 8(1), 1-13.

Crutchfield, J. P., 1992, Knowledge and Meaning … Chaos and Complexity, in Modeling Complex Systems, L. Lam and H. C. Morris, editors, Springer-Verlag, Berlin, 66-101.

Crutchfield, J. P., 1994, The Calculi of Emergence: Computation, Dynamics and Induction, Physica D, 75, 11-54.

Crutchfield, J. P. and Young, K., 1989, Inferring Statistical Complexity, Physical Review Letters, 63, 105-108.

Crutchfield, J. P. and Feldman, D. P., 2001, Synchronizing to the Environment: Information Theoretic Constraints on Agent Learning, Advances in Complex Systems, 4, 251-264.

Crutchfield, J. P. and Feldman, D. P., 2003, Regularities Unseen, Randomness Observed: Levels of Entropy Convergence, Chaos (submitted).

Csete, M. E. and Doyle, J., 2002, Reverse Engineering of Biological Complexity, Science, 295, 1664.

Dorogovtsev, S. N. and Mendes, J. F. F., 2002, Evolution of networks, Advances in Physics, 51, 1079-1187.

Erramilli, A. and Forys, L. J., 1991, Oscillations and Chaos in a Flow Model of a Switching System, IEEE Journal on Selected Areas in Communications, 9(2), 171-178.

Farmer, J. D., Ott, E. and Yorke, J. A., 1983, The dimension of chaotic attractors, Physica D, 7, 153-180.

Farmer, J. D. and Sidorowich, J. J., 1987, Predicting chaotic time series, Physical Review Letters, 59(8), 845-848.

Feichtinger, G., Hommes, C. H. and Herold, W., 1994, Chaos in a Simple Deterministic Queueing System, ZOR - Mathematical Methods of Operations Research, 40, 109-119.

Feldman, D. P. and Crutchfield, J. P., Discovering Noncritical Organization: Statistical Mechanical, Information Theoretic, and Computational Views of Patterns in One-Dimensional Spin Systems, Santa Fe Institute Working Paper 98-04-026.

Flake, G. W., 1998, The Computational Beauty of Nature, MIT Press.

Forrester, J. W., 1961, Industrial Dynamics, MIT Press, Cambridge, MA.

Fraser, A. M. and Swinney, H. L., 1986, Independent coordinates for strange attractors from mutual information, Phys. Rev. A, 33(2), 1134-1140.

Ghosh, S., 2002, The Role of Modeling and Asynchronous Distributed Simulation in Analyzing Complex Systems of the Future, Information Systems Frontiers, A Journal of Research and Innovation, 4(2), 166-171.

Glance, N. S., 1993, Dynamics with Expectations, PhD Thesis, Physics Department, Stanford University.

Hogg, T. and Huberman, B. A., 1988, The Behavior of Computational Ecologies, in The Ecology of Computation, North-Holland, 77-116.

Hogg, T. and Huberman, B. A., 1991, Controlling Chaos in Distributed Systems, IEEE Trans. on Systems, Man and Cybernetics, 21, 1325-1332.

Kaneko, K. and Tsuda, I., 1996, Complex Systems: Chaos and Beyond, Springer-Verlag.

Kennel, M., Brown, R. and Abarbanel, H. D. I., 1992, Determining embedding dimension for phase-space reconstruction using a geometrical construction, Phys. Rev. A, 45(6), 3403-3411.

Kephart, J. O., Hogg, T. and Huberman, B. A., 1989, Dynamics of Computational Ecosystems, Physical Review A, 40(1), 404-421.

Kephart, J. O., Hogg, T. and Huberman, B. A., 1990, Collective Behavior of Predictive Agents, Physica D, 42, 48-65.

Kumara, S., Ranjan, P., Surana, A. and Narayanan, V., Decision Making in Logistics: A Chaos Theory Based Analysis, Annals of the International Institution for Production Engineering Research (Annals of CIRP) (accepted, to appear).

Lee, S., Gautam, N., Kumara, S., Hong, Y., Gupta, H., Surana, A., Narayanan, V., Thadakamalla, H., Brinn, M. and Greaves, M., 2002, Situation Identification Using Dynamic Parameters in Complex Agent-Based Planning Systems, Intelligent Engineering Systems Through Artificial Neural Networks, 12, 555-560.

Lloyd, S. and Slotine, J. J. E., 1996, Information theoretic tools for stable adaptation and learning, Int. Journal of Adaptive Control and Signal Processing, 10, 499-530.

Maxion, R. A., Toward Diagnosis as an Emergent Behavior in a Network Ecosystem, Physica D, 42, 66-84.

Min, H. and Zhou, G., 2002, Supply chain modeling: past, present and future, Computers and Industrial Engineering, 43, 231-249.

Mukherjee, S., Osuna, E. and Girosi, F., 1997, Nonlinear Prediction of Chaotic Time Series Using Support Vector Machines, IEEE Workshop on Neural Networks for Signal Processing VII, 511-519.

Newman, M. E. J., 2000, Models of the small world, J. Stat. Phys., 101, 819-841.

Newman, M. E. J., 2002, The spread of epidemic disease on networks, Phys. Rev. E, 66.

Newman, M. E. J., 2003, Random graphs as models of networks, in Handbook of Graphs and Networks, S. Bornholdt and H. G. Schuster (eds.), Wiley-VCH, Berlin.

Newman, M. E. J., Strogatz, S. H. and Watts, D. J., 2001, Random graphs with arbitrary degree distributions and their applications, Phys. Rev. E, 64.

Ott, E., 1996, Chaos in Dynamical Systems, Cambridge University Press.

Powell, M. J. D., 1987, Radial basis function approximation to polynomials, preprint, University of Cambridge.

Rasmussen, D. R. and Mosekilde, E., 1988, Bifurcations and chaos in a generic management model, European Journal of Operational Research, 35, 80-88.

Ravasz, E. and Barabasi, A. L., 2003, Hierarchical organization in complex networks, Physical Review E, 67.

Sano, M. and Sawada, Y., 1985, Measurement of the Lyapunov Spectrum from a Chaotic Time Series, Phys. Rev. Lett., 55, 1082-1084.

Sawhill, B. K., 1993, Self-Organized Criticality and Complexity Theory, in Lectures in Complex Systems, L. Nadel and D. L. Stein (eds.), Addison Wesley Longman, 143-170.

Schieritz, N. and Grobler, A., 2003, Emergent Structures in Supply Chains - A Study Integrating Agent-Based and System Dynamics Modeling, paper presented at the 36th Annual Hawaii International Conference on System Sciences, Big Island.

Shalizi, C. R. and Crutchfield, J. P., Computational Mechanics: Pattern and Prediction, Structure and Simplicity, SFI Working Paper 99-07-044.

Shalizi, C. R., 2001, Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata, http://www.santafe.edu/~shalizi/thesis.

Simon, H. A., 1997, The Sciences of the Artificial, 3rd Edition, MIT Press, Cambridge, MA.

Strogatz, S. H., 1994, Nonlinear Dynamics and Chaos, Addison-Wesley, Reading, MA.

Strogatz, S. H., 2001, Exploring complex networks, Nature, 410, 268-276.

Takens, F., 1981, in Dynamical Systems and Turbulence, Warwick 1980, D. Rand and L. S. Young (eds.), Lecture Notes in Mathematics No. 898, Springer, Berlin, 366.

Wang, X. F. and Chen, G., 2002, Synchronization in Scale-Free Dynamical Networks: Robustness and Fragility, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 49(1), 54-62.

Watts, D. J. and Strogatz, S. H., 1998, Collective dynamics of 'small-world' networks, Nature, 393, 440-442.

Wolfram, S., 1994, Cellular Automata and Complexity: Collected Papers, Addison-Wesley, Reading, MA.


Decision Making in Logistics: A Chaos Theory Based Analysis

S. R. T. Kumara, P. Ranjan, A. Surana, V. Narayanan
The Pennsylvania State University
310 Leonhard Building, University Park, PA 16802

Abstract

Logistics in general is a complex system. In this paper we investigate the existence of chaos in logistics systems. Such an investigation is necessary in order to use appropriate and correct methods for further analysis, as linear systems techniques will not be useful. If a system exhibits chaos, decision-making should consider the system characterization parameters from a chaos theory perspective. In this paper, we consider a non-preemptive queuing model and its extensions to the logistics domain. A prototypical supply chain example is used and the resulting behavior is characterized. At certain input values the behavior of the logistics system exhibits chaos. This information is useful for further analysis for prediction and control. The working prototype is implemented in the DARPA Cougaar agent architecture.

Keywords:
Nonlinear Dynamics, Production, Distributed

1 INTRODUCTION

In a logistics system, one of the most fundamental questions is the analysis of system behavior. We define the system as the entities (software and hardware) along with their interconnections (the network). A typical logistics system is characterized by a supply chain. Our hypothesis is that these systems are nonlinear, dynamic and, specifically, chaotic. That is, the time evolution of the system behavior (measured by certain behavioral parameters of the system) is chaotic. The questions now are how we characterize this time evolution and how we can use the insights obtained from such an analysis. This paper deals with these questions. We first give a brief explanation of the notions of nonlinear dynamics and chaos and then continue the discussion.

2 NONLINEAR DYNAMICS, CHAOS AND FRACTALS

In this section we present a concise description of nonlinear dynamics, chaos and fractals. During the past decade, chaos theory has elicited a great deal of interest among scientists and researchers. As a result, its ideas are beginning to be applied in many scientific and engineering disciplines, especially where nonlinear models are relevant [1].

Many physical systems that produce a continuous-time response may be modeled by a set of differential equations of the form

$$\frac{d\vec{x}(t)}{dt} = F(\vec{x}(t)) \qquad (1)$$

where F is generally a nonlinear vector field. The solution to this results in a trajectory

$$x(t) = f(x(0), t) \qquad (2)$$

where f : M → M represents the flow that determines the evolution of x(t) for a particular initial condition x(0). If the system is dissipative, then as the system evolves from different initial conditions the solutions usually shrink asymptotically to a compact subset of the whole state space M. This compact subset is called an attracting set. Every attracting disjoint subset of an attracting set is called an attractor [1].

In dissipative systems the overall volume of the state space shrinks with time. However, there may be some directions along which the state space actually expands. That is, the system trajectories tend to move apart along certain directions and shrink along the others. However, as the attractors usually remain bounded, the flow exhibits a horseshoe-type pattern [2]. Because of this, trajectories starting from nearby points within an attractor may separate exponentially as the system evolves. This condition is known as sensitive dependence on initial conditions (SDIC), and an attractor exhibiting SDIC is called a strange attractor.

A flow f, for a particular initial condition, is said to be chaotic if the trajectories in an attractor exhibit sensitive dependence on initial conditions while remaining bounded, show irregular and aperiodic behavior, and have a continuous broadband spectrum.
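As a quick illustration of SDIC, consider the logistic map x_{n+1} = 4x_n(1 - x_n), a standard textbook chaotic system (not one of the models studied in this paper): two initial conditions differing by 10^-10 diverge to an order-one separation within a few dozen iterations.

```python
# SDIC demo on the logistic map x -> 4x(1-x), a textbook chaotic system.
# This is an illustrative aside, not a model from the paper.

def logistic_orbit(x0, steps):
    """Iterate the logistic map at r = 4 and return the full orbit."""
    xs = [x0]
    for _ in range(steps):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_orbit(0.2, 60)
b = logistic_orbit(0.2 + 1e-10, 60)   # perturbed by 1e-10
gap = [abs(u - v) for u, v in zip(a, b)]
# the gap stays tiny at first, then grows to order one
```

The exponential growth rate of such gaps is what the Lyapunov exponent quantifies.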

The irregular and aperiodic response of chaotic systems usually betrays a special property of self-similarity, or scale invariance: the response appears similar over multiple scales of observation. Scale-invariant mathematical entities are commonly known as fractals. The analytical techniques used to deduce the characteristics of nonlinear systems collectively constitute fractal analysis. The main objectives of fractal analysis can be broadly categorized by end-purpose as follows: identification of the presence of chaos from the system response; establishing the invariants of the system dynamics, for system identification or indirect state estimation; and chaos modeling, when the end-purpose is to capture and later reproduce the system dynamics. A more detailed description of these concepts may be found in [3]. For applications of nonlinear dynamics in the modeling and control of complex production systems, refer to [4][5].


In the rest of this paper we report a queuing model that is useful in supply chain analysis. We explain our rationale for selecting this model for adaptation to the logistics domain; for some of the other models in the literature, refer to [6][7]. We extend the queuing model, apply it to a logistics scenario in the Cougaar architecture, and discuss the results. We raise the fundamental question of how these results can be used for further analysis and control of a logistics system.

3 SUPPLY CHAIN AND NONLINEARITY

The notion of evolution over time falls into the realm of what physicists call dynamics. Logistics systems are dynamic, and their behavior can be nonlinear; therefore we can model a logistics system using the principles of nonlinear dynamics. A supply chain is an example of a logistics system. A typical supply chain exhibits stable behavior with damped oscillations in response to external disturbances. Unstable phenomena can arise, however, due to feedback structure, inherent adjustment delays [6], nonlinear decision-making [7] and the interactions that go on in a supply chain. One cause of unstable phenomena is information feedback that is slow relative to the rate of changes occurring in the system. Nonlinearity is inherent in a supply chain. The first mode of unstable behavior to arise in nonlinear systems is usually simple one-cycle self-sustained oscillation. If the instability drives the system further into the nonlinear regime, more complicated temporal behavior may be generated. The route to chaos through subsequent period-doubling bifurcations, as certain parameters of the system are varied, is generic to a large class of systems in physics, chemistry, biology, economics and other fields.

Functioning in a chaotic regime deprives us of the ability to make long-term predictions about the behavior of the system, although short-term predictions may sometimes be possible. As a result, control and stabilization of such a system become almost impossible. Here we investigate such dynamical behaviors as they arise in models that represent some of the components of a supply chain.

3.1 Non-preemptive Queuing Model with Delays

The queuing system [8] considered here has two queues (A and B) and a single server, with the following characteristics:

• Once served, a class A customer returns as a class B customer after a constant interval of time.
• Class B has non-preemptive priority over class A, i.e., the class A queue does not get served until the class B queue is emptied.
• Schedules are organized every T units of time, i.e., if the low-priority queue is emptied within time T, the server remains idle for the remaining time interval.
• Finally, the higher-priority class B has a lower service rate than the lower-priority class A.

Suppose the system is sampled at the end of every schedule cycle, and the following quantities are observed at the beginning of the kth interval: A_k, the queue length of the low-priority queue; B_k, the queue length of the high-priority queue; C_k, the outflow from the low-priority queue in the kth interval; and D_k, the outflow from the high-priority queue in the kth interval. In the model, λ_k denotes the arrival rate, µ_a is the service rate of the lower-priority queue, µ_b is the service rate of the higher-priority queue, and l is the feedback interval in units of the schedule cycle.

The following four equations then completely describe the evolution of the system:

$$A_{k+1} = A_k + \lambda_k - C_k \qquad (3)$$

$$C_k = \min\left(A_k + \lambda_k,\; \mu_a\left(1 - \frac{D_k}{\mu_b}\right)\right) \qquad (4)$$

$$B_{k+1} = B_k + C_{k-l} - D_k \qquad (5)$$

$$D_k = \min\left(B_k + C_{k-l},\; \mu_b\right) \qquad (6)$$

Equations (3) and (5) are merely conservation rules, while equations (4) and (6) model the constraints on the outflows and the interaction between the queues. This model, while conceptually simple, exhibits surprisingly complex behavior. The dynamical behavior reported in [8] is summarized below.

Figure 1: Non-preemptive queuing model

Dynamical behavior: The analytic approach to solving the flow model under constant arrivals (i.e., λ_k = λ for all k) shows several classes of solutions. The system is found to batch its workload even for such perfectly smooth arrival patterns. The behavior of the system has the following characteristics:

• Above a threshold arrival rate (λ ≥ µ_b/2), a momentary overload can send the system into one of a number of stable modes of oscillation.
• Each mode of oscillation is characterized by distinct average queuing delays.
• Extreme sensitivity to parameters, and the existence of chaos, imply that the system at a given time may be in any one of a number of distinct steady-state modes.

The batching of the workload can cause significant queuing delays even at moderate occupancies. Such oscillatory behavior also significantly lowers the real-time capacity of the system.
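To make the model concrete, the map defined by equations (3)-(6) can be iterated directly. The following Python sketch does so (the paper's own computations were done in Matlab; the parameter values in the usage lines below are illustrative assumptions, not taken from [8]):

```python
# Minimal sketch of the flow model of [8], eqs. (3)-(6).
# Note the evaluation order: D_k (eq. 6) must be computed before C_k,
# since eq. (4) uses D_k as the server capacity already consumed.
from collections import deque

def simulate_queue_map(lam, mu_a, mu_b, l=1, steps=200, A0=0.0, B0=0.0):
    """Iterate the non-preemptive two-queue map; return (A_k, B_k) per cycle."""
    A, B = A0, B0
    delayed_C = deque([0.0] * l, maxlen=l)  # holds C_{k-l}, ..., C_{k-1}
    history = []
    for _ in range(steps):
        C_lag = delayed_C[0]                           # C_{k-l}
        D = min(B + C_lag, mu_b)                       # eq. (6)
        C = min(A + lam, mu_a * (1.0 - D / mu_b))      # eq. (4)
        A = A + lam - C                                # eq. (3)
        B = B + C_lag - D                              # eq. (5)
        delayed_C.append(C)                            # C_k enters the delay line
        history.append((A, B))
    return history

# Illustrative parameters (assumptions, not from the paper): with light
# load the queues drain to a fixed point; above the overload threshold
# (lam >= mu_b / 2) an initial backlog can trigger oscillatory batching.
light = simulate_queue_map(1.0, 4.0, 2.0, steps=50)
heavy = simulate_queue_map(1.2, 4.0, 2.0, steps=50, A0=5.0)
```

The `deque` acts as the l-cycle feedback delay: appending C_k drops C_{k-l} after it has been used once.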

4 APPLICATION OF THE QUEUING MODEL TO A LOGISTICS SCENARIO

The assumptions in the model proposed by [8] are generic, in the sense that priorities are widely observed in large systems due to economic and administrative compulsions. Sometimes they can also arise from physical facts, when two different stages of processing have a certain temporal constraint. Priorities may also arise from the non-homogeneity of the system, where the "knowledge" level of one agent differs from that of another.

Varying service times likewise follow from physical constraints on the tasks. For example, in a simple logistics scenario, tasks like unpacking, shipping, logging and dispatching may take different times. These time scales can vary widely depending on the nature and physical characteristics of the tasks.

These considerations regarding the generality of the assumptions, and the clear one-to-one correspondence between the physical logistics tasks and the model parameters described in [8], led us to apply the queuing model to a simple yet realistic logistics scenario.

4.1 Example Logistics Scenario

The example scenario consists of two stages, each modeled by the non-preemptive queuing formalism. We take a simple battle-front scenario (this could be any materials-supply context, not necessarily a battle front). During the first stage, supplies are processed by a node (agent). This involves two tasks: unpacking (Task A) and shipping (Task B). Our assumptions are that shipping takes more resources than unpacking, that shipping gets non-preemptive priority, and that resources are common to both tasks.

The second stage consists of disbursement of supplies. The output of the first stage feeds into the second stage (as arrivals). The two associated tasks are maintaining an inventory (Task A) and disbursing the supply to the troops (Task B). The assumptions at stage two are that disbursing takes more resources than maintaining inventory, that disbursing has non-preemptive priority, and that resources are common to both tasks.

Figure 1 shows the queuing model; this figure is reproduced from [8]. It must be noted that the rules are very simple and generic. Priority and heterogeneity are fundamental to any logistics planning and scheduling. Tasks have to be prioritized in order to do the most important thing first; this comes naturally as we try to optimize an objective and assign the tasks their "importance." In addition, in all logistics systems resources are limited, both in time and space. The temporal constraints considered in the example are realistic, in the sense that you cannot disburse supplies without unpacking them. Temporal dependence plays an important role in logistics planning (interdependency). This simple example also simulates the effect of arbitrary but bounded initial conditions.

Cougaar (Cognitive Agent Architecture) was developed under the DARPA Advanced Logistics Program (ALP); the survivability of Cougaar is addressed in the DARPA UltraLog program. In the above example each stage is modeled as an agent, and the activities are modeled as agent processes. We do not discuss the Cougaar architecture in this paper; details can be found at http://www.couggar.org.

4.2 Analysis

One of the hallmarks of chaos is sensitive dependence on initial conditions (SDIC). The external environment (the world in which the logistics scenario resides) changes, thereby changing the initial conditions and the parameters. The following affect the initial conditions and parameters of the agents (and thereby the initial conditions of the queuing model): changes in the arrival rate of supplies (inputs to the agents), changes in the resources (assets) available to each agent, and delays in the processing of tasks.

The internal states of the two agents are characterized by: supplies waiting to be shipped (X_1), supplies waiting to be unpacked (X_2), supplies actually shipped, supplies waiting to be inventoried (X_3), supplies waiting to be disbursed to the troops (X_4), and supplies actually disbursed. We have observed the behavior of these variables, and characterizing this behavior leads to some interesting inferences.

We simulated the queuing models in each agent with the following model parameters. There are 162 personnel in each agent, who can be allocated to either task. We assume that task A takes 1 unit of time and one person, and that task B takes one unit of time and 2 people. Each item therefore consumes three person-time-units, which defines the capacity (maximum sustainable arrival rate) as 162/3 = 54 items per unit time; the arrival rate can thus range from 0 to 54 per unit time. We assume the initial conditions X_1(0) = 131, X_2(0) = 201, X_3(0) = 151 and X_4(0) = 29.
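As an illustration only, the two-stage scenario can be sketched in Python (the paper's computations used Matlab). The service rates µ_a = 162 and µ_b = 81 below are our reading of the stated personnel figures (one person per task A item, two per task B item), and the delay l = 1 schedule cycle is assumed; neither is quoted by the paper.

```python
# Hedged sketch of the two-stage logistics scenario of Sec. 4.1: two copies
# of the non-preemptive queue map (eqs. 3-6) in series, with stage-1
# shipments feeding stage-2 arrivals. mu_a=162, mu_b=81 and l=1 are our
# assumptions from the stated 162-person workforce, not the paper's values.

def run_two_stages(lam, steps=200, l=1,
                   X0=(131.0, 201.0, 151.0, 29.0),   # paper's initial conditions
                   mu_a=162.0, mu_b=81.0):
    """Return the time series of (X1, X2, X3, X4) queue lengths."""
    X1, X2, X3, X4 = X0
    # A-queues (low priority): X2 unpacking, X3 inventory.
    # B-queues (high priority): X1 shipping, X4 disbursing.
    C1d = [0.0] * l        # delayed task-A outflow, stage 1
    C2d = [0.0] * l        # delayed task-A outflow, stage 2
    hist = []
    for _ in range(steps):
        # --- stage 1: unpack (A) then ship (B) ---
        D1 = min(X1 + C1d[0], mu_b)                     # eq. (6)
        C1 = min(X2 + lam, mu_a * (1.0 - D1 / mu_b))    # eq. (4)
        X2 = X2 + lam - C1                              # eq. (3)
        X1 = X1 + C1d[0] - D1                           # eq. (5)
        C1d = C1d[1:] + [C1]
        # --- stage 2: inventory (A) then disburse (B); arrivals = shipments D1 ---
        D2 = min(X4 + C2d[0], mu_b)
        C2 = min(X3 + D1, mu_a * (1.0 - D2 / mu_b))
        X3 = X3 + D1 - C2
        X4 = X4 + C2d[0] - D2
        C2d = C2d[1:] + [C2]
        hist.append((X1, X2, X3, X4))
    return hist
```

Sweeping `lam` from 40 toward 54 is the experiment reported below; plotting the four series against time reproduces the kind of behavior shown in Figure 2.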

Figure 2: Plots for arrival rate = 53. (a) Time evolution of the system states X_1 and X_2; (b) power spectrum of state X_1; (c) state-space plot (x_1, x_3 vs. x_2, x_4; trajectories for center 1 in red and center 2 in blue); (d) multi-stability: bifurcation diagram for the system.

We used Matlab for the computations and experimented with several arrival rates and delays, observing the state-space structure (time evolution) of the arrival rates at all the queues, the time series of various parameters, and the power spectrum.

At an arrival rate of 40 the system has period 1; at 50, period 2; at 52, period 4; and at 53 the system shows seemingly random behavior, with relatively irregular dynamics and several distinct peaks in the power spectrum. The bifurcation diagram shows that at an arrival rate of 53 the system is chaotic. Illustrative plots are shown in Figure 2. The time evolution (2a) clearly shows the presence of many periods, indicating the possible existence of chaotic behavior.

4.3 Discussion

We were able to show, for certain initial conditions, the existence of chaos in this simple yet realistic logistics system. At an arrival rate of 53, the underlying queuing model leads to chaotic behavior in the number of jobs waiting to be processed. The bifurcation diagram shows that the X_j (for j = 1, 2, 3, 4) exhibit aperiodic behavior. The physical implication is that the resources needed vary from time to time, so the logistics system will exhibit nervousness, which is an undesirable property. We have also observed a cascading effect: when one agent enters chaotic behavior, the connected agent also tends to exhibit chaos. As a result, the planning of later stages faces much more uncertainty than that of the first stage, even for simple fixed deterministic arrivals. We have also observed increased average delay: delay increases by 25% once the system starts batching the load. From our analysis we conclude that if the two agents start load batching, the inventory requirement may grow to 200%, as is evident from the plots.

In this case it is necessary to keep the arrival rate below 53, thereby enforcing control policies that keep the system stable or quasi-stable. If the system does become chaotic, further analysis can be performed to study its characteristics and use them to control the behavior in the short term. We also computed the average mutual information, global dimension, local dimension, correlation dimension and largest Lyapunov exponent; these computed values also indicate the existence of chaos in this logistics system.
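The computation details behind these invariants are not given here. As a toy illustration only, the largest Lyapunov exponent of the single-stage queue map can be roughly estimated by the direct method: evolve two nearby trajectories, average the logarithmic growth of their separation, and renormalize at each step. This is a crude sketch, not the authors' procedure.

```python
# Crude direct-method estimate of the largest Lyapunov exponent for the
# queue map (eqs. 3-6, delay l = 1). Illustrative only; all parameter
# values are assumptions, not the paper's.
import math

def queue_step(state, lam, mu_a, mu_b):
    """One iteration of the map; state = (A, B, C_prev)."""
    A, B, C_prev = state
    D = min(B + C_prev, mu_b)
    C = min(A + lam, mu_a * (1.0 - D / mu_b))
    return (A + lam - C, B + C_prev - D, C)

def largest_lyapunov(lam, mu_a, mu_b, steps=2000, eps=1e-8):
    x = (5.0, 0.0, 0.0)
    y = (5.0 + eps, 0.0, 0.0)      # nearby trajectory
    total = 0.0
    for _ in range(steps):
        x = queue_step(x, lam, mu_a, mu_b)
        y = queue_step(y, lam, mu_a, mu_b)
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        if d == 0.0:
            d = eps                # trajectories merged; keep estimate finite
        total += math.log(d / eps)
        # pull y back to distance eps from x along the separation direction
        y = tuple(a + eps * (b - a) / d for a, b in zip(x, y))
    return total / steps

# Values near zero or below indicate regular behavior; clearly positive
# values are consistent with chaos.
```

For serious estimates from measured time series, standard algorithms on an embedded attractor (see [9]) would be used instead.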

5 SUMMARY

Chaotic behavior in deterministic dynamical systems is an intrinsically nonlinear phenomenon. We were able to show that a simple example logistics system is chaotic. A characteristic feature of chaotic systems is extreme sensitivity to changes in initial conditions, while the dynamics, at least for so-called dissipative systems, is still confined to a finite region of state space called an attractor. In such cases, Fourier analysis and ARMA models may not be useful for studying the time traces of supply chain systems. The need to extract physically interesting information about the dynamics of observed systems operating in a chaotic regime has led to the development of nonlinear time series analysis techniques. Systematically, the study of potentially chaotic systems may be divided into three areas: identification of chaotic behavior; modeling and prediction; and control. The first area shows how chaotic systems may be separated from stochastic ones and, at the same time, provides estimates of the degrees of freedom and the complexity of the underlying chaotic system. Based on such results, a state space representation allowing for subsequent predictions may be identified. The last stage, if desired, involves control of the chaotic system. In this short paper we have concentrated on the first area, i.e., identification of chaotic behavior. In general, if we consider this step in the spatio-temporal regime, the following tasks need to be accomplished [9]:

1. Signal separation (finding the signal): separation of the broadband signal from broadband "noise," using the deterministic nature of the signal.
2. Phase space reconstruction (finding the space): time-lagged variables are used to form coordinates for a phase space in the embedding dimension. The embedding dimension can be determined using the false-nearest-neighbors test, and the time lag using mutual information.
3. Classification of the signal: determination of invariants of the system, such as Lyapunov exponents and various fractal dimensions.
4. Making models and prediction: determination of the parameters of an assumed model that are consistent with the invariant classifiers (such as Lyapunov exponents and dimensions).
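As an illustration of step 2, the embedding time lag is commonly taken at the first minimum of the average mutual information (AMI) between x_t and x_{t+τ}. The histogram-based Python sketch below shows the idea; the bin count and test series are illustrative choices, not the paper's.

```python
# Histogram-based average mutual information I(x_t; x_{t+lag}), and a lag
# chooser based on its first minimum. Sketch only; 16 bins is an
# illustrative choice.
import math

def average_mutual_information(x, lag, bins=16):
    """Estimate I(x_t; x_{t+lag}) in nats from a joint histogram."""
    n = len(x) - lag
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0            # guard against a constant series
    idx = [min(int((v - lo) / width), bins - 1) for v in x]
    joint, px, py = {}, {}, {}
    for i in range(n):
        a, b = idx[i], idx[i + lag]
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    ami = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), with p = count / n
        ami += (c / n) * math.log(c * n / (px[a] * py[b]))
    return ami

def first_minimum_lag(x, max_lag=20):
    """Smallest lag at which the AMI stops decreasing (a common lag choice)."""
    prev = average_mutual_information(x, 1)
    for lag in range(2, max_lag + 1):
        cur = average_mutual_information(x, lag)
        if cur > prev:
            return lag - 1
        prev = cur
    return max_lag

xs = [math.sin(0.3 * t) for t in range(2000)]   # illustrative test series
lag_choice = first_minimum_lag(xs)
```

The same idea, with the false-nearest-neighbors test for the embedding dimension, fixes the reconstruction of step 2.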

In this paper the non-preemptive queuing model was applied in detail to part of a supply chain: two agents interacting in a military logistics scenario. The queuing model forms the processing component of the logistics agents implemented in the Cougaar architecture. One manifestation of complexity is the onset of chaos, and our analysis shows a cascading effect of chaos. This supports the conjecture that the supply chain as a whole may exhibit chaotic behavior. The underlying motivation of our study is to build control models; our next step in this research is to build adaptive prediction and control models for larger supply chain networks from the insights derived from the current analysis.

6 ACKNOWLEDGMENTS

The authors acknowledge DARPA for its support (Grant # MDA 972-01-1-0563) of this research under the UltraLog program. The help of Seokcheon Lee, Yunho Hong and Hariprasad T. is greatly appreciated.

7 REFERENCES

[1] Isham, V., 1993, Statistical Aspects of Chaos: A Review, in Networks and Chaos - Statistical and Probabilistic Aspects, Barndorff-Nielsen et al. (editors).
[2] Wiggins, S., 1990, Introduction to Applied Nonlinear Dynamical Systems and Chaos, Springer-Verlag, New York.
[3] Bukkapatnam, S.T.S., Kumara, S., and Lakhtakia, A., 2000, Fractal Estimation of Flank Wear in Turning, ASME Journal of Dynamic Systems, Measurement, and Control, 122:89-94.
[4] Reiter, S. R., Freitag, M. and Schmieder, A., 2002, Modeling and Control of Production Systems Based on Nonlinear Dynamics Theory, Annals of the CIRP, 51/1:375-378.
[5] Wiendahl, H.P. and Scheffczyk, H., 1999, Simulation Based Analysis of Complex Production Systems with Methods of Nonlinear Dynamics, Annals of the CIRP, 48/1:357-360.
[6] Rasmussen, R. D. and Mosekilde, E., 1988, Bifurcations and Chaos in a Generic Management Model, European Journal of Operational Research, 35:80-88.
[7] Feichtinger, G., Hommes, C. H. and Herold, W., 1994, Chaos in a Simple Deterministic Queuing System, ZOR - Mathematical Methods of Operations Research, 40:109-119.
[8] Erramilli, A. and Forys, L. J., 1991, Oscillations and Chaos in a Flow Model of a Switching System, IEEE Journal on Selected Areas in Communications, 9(2):171-178.
[9] Abarbanel, H.D.I., 1996, Analysis of Observed Chaotic Data, Springer-Verlag, New York.
