DARPA ULTRALOG Final Report - Industrial and Manufacturing ...
Ultra*Log
PSU/IAI Final Report for Ultra*Log
Document Revision Number: 2.0
Date: 09/01/2005

Prepared for:
Defense Advanced Research Projects Agency
Information Systems Office
3701 North Fairfax Drive
Arlington, VA 22203-1714

Prepared by:
The Pennsylvania State University
Intelligent Automation, Inc.

Contact Persons:
Soundar Kumara (PSU)
skumara@psu.edu (814) 863-2359
Vikram Manikonda (IAI)
vikram@i-a-i.com (301) 294-5045
Wilbur Peng (IAI)
wpeng@i-a-i.com (301) 294-5045
Document History

Revision | Date | Revised By | Comments | Date Reviewed with Team | Approved*
 | 08/24/05 | Soundar Kumara | Initial creation | |

* Document will be approved as part of the CCB process.

Software Release History

Date | Comments

LEGEND (example; legend/copyright statement optional)
Use, duplication, or disclosure by the Government is as set forth in the Rights in Technical Data - Noncommercial Items clause, DFARS 252.227-7013, and the Rights in Noncommercial Computer Software and Noncommercial Computer Software Documentation clause, DFARS 252.227-7014.
© Copyright 2001
Contents

Executive Summary
1 Introduction
2 Design and survivability of distributed multi-agent systems
  2.1 Designing a Network Infrastructure for Survivability of Multi-Agent Systems
  2.2 Survivability of Multi-agent based Supply Networks: A Topological perspective
  2.3 Survivability of a distributed multi-agent application – A performance control perspective
  2.4 Survivability through Implementation Alternatives in Large-scale Information Networks with Finite Load
3 Monitoring, situation identification and pattern extraction
  3.1 Situation Identification Using Dynamic Parameters in Complex Agent-Based Planning Systems
  3.2 Estimating global stress environment through local behavior in a multiagent-based planning system
  3.3 Using Predictors to Improve the Robustness of Multi-Agent Systems: Design and Implementation in Cougaar
  3.4 Survivability of Complex System – Support Vector Machine Based Approach
4 Control
  4.1 A Framework for Performance Control of Distributed Autonomous Agents
  4.2 An Autonomous Performance Control Framework for Distributed Multi-Agent Systems: A Queueing Theory Based Approach
  4.3 Adaptive control for large-scale information networks through alternative algorithms to support survivability
  4.4 Self-organizing resource allocation for minimizing completion time in large-scale distributed information networks
  4.5 Efficient method of quantifying minimal completion time for component-based service networks: Network topology and resource allocation
  4.6 Market-based model predictive control for large-scale information networks: Completion time and value of solution
  4.7 Coordinating control decisions of software agents for adaptation to dynamic environments
5 CPE society modeling and performance analysis
  5.1 Understanding agent societies using distributed monitoring and profiling
  5.2 Reliable MAS Performance Prediction Using Queueing Models
6 Characterization and analysis of supply chains from complex systems perspective
  6.1 Supply Chain Network: A Complex Adaptive Systems Perspective
  6.2 Decision Making in Logistics: A Chaos Theory Based Approach
Executive Summary

Ultra*Log is a Defense Advanced Research Projects Agency (DARPA) sponsored research project focused on creating a distributed agent-based architecture that is inherently survivable and capable of operating effectively in very chaotic environments. The project is pursuing the development of technologies to enhance the security, robustness, and scalability of large-scale, distributed agent-based systems operating in chaotic wartime environments. Ultra*Log's goal is to operate with up to 45% information infrastructure loss in a very chaotic environment, with no more than 20% capabilities degradation and no more than 30% performance degradation, for a period representing 180 days of sustained military operations in a major regional contingency.

To achieve these goals, we concentrate on complexity studies for analysis, estimation, and control. The efforts are geared towards realizing a robust theory for analyzing and controlling complexity in distributed multi-agent systems. This would in turn help define the theoretical and application grounds for adaptivity in distributed systems. The application area is military logistics, where the studies span from sensing to logistics in a network-centric warfare environment. The research under Ultra*Log is expected to lay the foundation for the next generation of logistics.

In this document we discuss the significant accomplishments of PSU/IAI as part of the Ultra*Log project. We present the results in the form of papers published in or submitted to refereed conferences and journals. During the course of the Ultra*Log project, we have proposed three different research and development areas, namely:

1. Research in design and survivability of distributed multi-agent systems
2. Research in monitoring, situation identification and pattern extraction
3. Research in characterization, analysis and control of complex adaptive systems.

We discuss the details of several results and findings related to these research areas in this document.

The total period of the project is four and a half years from the start of the project. The team members include The Pennsylvania State University (PSU) and Intelligent Automation, Inc. (IAI), Rockville, MD, a subcontractor to PSU. The regular team members from PSU include graduate students (Y. Hong, S. Lee, H. P. Thadakamalla, A. Surana, V. Narayanan, H. Gupta, N. Gnanasambandam, K. Tang, X. Ding, E. Pinto and U. N. Raghavan). These students worked under the direct supervision of Professor Kumara; most of them are Ph.D. students. In addition, Professors G. Natarajan and C. R. Rao* participated in the project. From IAI, V. Manikonda, W. Peng and H. Gupta were the participants.

* Winner of the National Medal of Science for Mathematics, presented by President Bush in 2002.
1 Introduction

The main focus of the proposed research is breakthrough technology development based on chaos theory, knowledge mining, queueing theory, and market-based control-theoretic principles to improve the scalability, robustness, and survivability of the Cougaar architecture. Specifically, our focus was on adaptive logistics. Such development will introduce new operational capabilities in Cougaar in terms of:

• Dynamic fault isolation and recovery,
• Dynamic adaptation to the environment, and
• Variable fidelity in adaptive processes.

We have published or submitted many papers to refereed conferences and journals that address these principles. We have classified these papers into five sections, as given below:

• Design and survivability of distributed multi-agent systems
• Monitoring, situation identification and pattern extraction
• Control
• CPE society modeling and performance analysis
• Characterization and analysis of supply chains from a complex systems perspective

Please note that the papers are hyperlinked.
2 Design and survivability of distributed multi-agent systems

It is extremely important to design a multi-agent system architecture that is survivable even in wartime or critical situations. Survivability can be improved from both functional and topological perspectives. We published the following papers, which discuss different ways of improving survivability for distributed multi-agent systems.

2.1 Surana, A., Gautam, N., Kumara, S. R. T., and Greaves, M., "Designing a Network Infrastructure for Survivability of Multi-Agent Systems", IASTED Conference on Parallel and Distributed Computing and Networks, 2005.

2.2 Thadakamalla, H. P., Raghavan, U. N., Kumara, S. R. T. and Albert, R., "Survivability of Multi-agent based Supply Networks: A Topological perspective," IEEE Intelligent Systems, Vol. 19, No. 5, 2004.
2.3 Gnanasambandam, N., Lee, S., Kumara, S. R. T., Gautam, N., Peng, W., Manikonda, V., Brinn, M. and Greaves, M., "Survivability of a distributed multi-agent application – A performance control perspective", IEEE Symposium on Multi-agent Security and Survivability (MAS&S 2005), Philadelphia, 2005.

2.4 Lee, S., and Kumara, S. R. T., "Survivability through Implementation Alternatives in Large-scale Information Networks with Finite Load," Proceedings of Open Cougaar Conference, July 2004.
3 Monitoring, situation identification and pattern extraction

An essential task for control is sensing. We have developed different tools to monitor and sense distributed systems. With the help of these tools, we devised many situation identification and pattern extraction algorithms based on chaos theory, knowledge mining, and Kalman filtering principles. The following are the papers published in this field of research.

3.1 Lee, S., Gautam, N., Kumara, S. R. T., Surana, A., Gupta, H., Hong, Y., Narayanan, V., and Thadakamalla, H. P., "Situation Identification Using Dynamic Parameters in Complex Agent-Based Planning Systems," Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 12, 2002.

3.2 Lee, S., and Kumara, S. R. T., "Estimating global stress environment through local behavior in a multiagent-based planning system," IEEE Conference on Automation Science and Engineering (CASE 05), Edmonton, Canada, August 2005.

3.3 Gupta, H., Hong, Y., Thadakamalla, H. P., Manikonda, V., Kumara, S. R. T. and Peng, W., "Using Predictors to Improve the Robustness of Multi-Agent Systems: Design and Implementation in Cougaar", Proceedings of Open Cougaar Conference, July 2004.

3.4 Hong, Y., Gautam, N., Kumara, S. R. T., Surana, A., Gupta, H., Lee, S., Narayanan, V., and Thadakamalla, H. P., "Survivability of Complex System – Support Vector Machine Based Approach," Intelligent Engineering Systems Through Artificial Neural Networks, Vol. 12, 2002.
4 Control

The heart of adaptivity is control. In our work, we therefore build different control frameworks and methods for distributed systems. The following are the papers published in this research area.

4.1 Gnanasambandam, N., Lee, S., Kumara, S. R. T. and Gautam, N., "A Framework for Performance Control of Distributed Autonomous Agents," Industrial Engineering Research Conference (IERC), Atlanta, August 2005.

4.2 Gnanasambandam, N., Lee, S., Gautam, N., Kumara, S. R. T., Peng, W., Manikonda, V., Brinn, M. and Greaves, M., "An Autonomous Performance Control Framework for Distributed Multi-Agent Systems: A Queueing Theory Based Approach," Autonomous Agents and Multi-Agent Systems (AAMAS), Utrecht, Netherlands, July 2005.

4.3 Lee, S. and Kumara, S. R. T., "Adaptive control for large-scale information networks through alternative algorithms to support survivability", submitted to IEEE Transactions on Automatic Control.

4.4 Lee, S., Kumara, S. R. T. and Gautam, N., "Self-organizing resource allocation for minimizing completion time in large-scale distributed information networks", submitted to IEEE Transactions on Systems, Man, and Cybernetics.

4.5 Lee, S., Kumara, S. R. T. and Gautam, N., "Efficient method of quantifying minimal completion time for component-based service networks: Network topology and resource allocation", submitted to IEEE Transactions on Computers.

4.6 Lee, S., Kumara, S. R. T. and Gautam, N., "Market-based model predictive control for large-scale information networks: Completion time and value of solution", submitted to IEEE Transactions on Parallel and Distributed Systems.

4.7 Hong, Y. and Kumara, S. R. T., "Coordinating control decisions of software agents for adaptation to dynamic environments," 37th CIRP International Seminar on Manufacturing Systems (ISMS-2004), Budapest, Hungary, May 2004.
5 CPE society modeling and performance analysis

We have built a demo society, "CPEDemo", for identifying the key aspects of a continuous planning and execution scenario. This will help us identify and demonstrate key concepts in the argument for, and the concept of, "design for survivability". The following are the papers published related to the CPE society.

5.1 Peng, W., Manikonda, V. and Kumara, S. R. T., "Understanding agent societies using distributed monitoring and profiling," Proceedings of Open Cougaar Conference, July 2004.

5.2 Gnanasambandam, N., Lee, S., Gautam, N., Kumara, S. R. T., Peng, W., Manikonda, V., Brinn, M. and Greaves, M., "Reliable MAS Performance Prediction Using Queueing Models," IEEE Symposium on Multi-agent Security and Survivability (MAS&S 2004), Philadelphia, PA, 2004.
6 Characterization and analysis of supply chains from a complex systems perspective

With the advent of information technology, supply chains have acquired a complexity almost equivalent to that of biological systems. In the following papers, we argue why supply chains should be treated as complex systems and propose how various concepts, tools, and techniques from the complex adaptive systems literature can be exploited to characterize and analyze supply chain networks.

6.1 Surana, A., Kumara, S. R. T., Greaves, M. and Raghavan, U. N., "Supply Chain Network: A Complex Adaptive Systems Perspective", International Journal of Production Research (to be published), 2005.

6.2 Kumara, S. R. T., Ranjan, P., Surana, A. and Narayanan, V., "Decision Making in Logistics: A Chaos Theory Based Approach", CIRP Annals, p. 381, 2003.
DESIGNING A NETWORK INFRASTRUCTURE FOR SURVIVABILITY OF MULTI-AGENT SYSTEMS

A. Surana
MIT, Cambridge, MA 02139
email: surana@mit.edu

N. Gautam, S.R.T. Kumara
Penn. State University, University Park, PA 16802
email: {ngautam, skumara}@psu.edu

M. Greaves
DARPA, 3701 Fairfax Drive, Arlington, VA 22203-1714
email: mgreaves@darpa.mil
ABSTRACT

In this paper we consider a society of agents whose interactions are known. Our objective is to solve a strategic network infrastructure design problem to determine: (i) the number of nodes (usually computers or servers) and their processing speeds, (ii) the set of links between nodes and their bandwidths, and (iii) the assignment of agents to nodes. From a performance standpoint, on one hand all the agents can reside on a single node, thereby stressing the processor; on the other hand, the agents can be distributed so that there is at most one agent per node, thereby increasing communication cost. From a robustness standpoint, since nodes and links can fail (possibly due to attacks), we would like to build a network that is least disruptive to the multi-agent system functionality. Although we do not explicitly consider tactical issues such as moving agents to different nodes upon failure, we would like to design an infrastructure that facilitates such agent migrations. We formulate and solve a mathematical program for the network infrastructure design problem by minimizing a cost function subject to satisfying quality of service (QoS) as well as robustness requirements. We test our methodology on Cougaar multi-agent societies.

KEY WORDS
network design, QoS, robustness, optimization.
1 Introduction

As the number of applications requiring distributed multi-agent systems (MAS) continuously grows, it becomes extremely important to build a network infrastructure that can guarantee a survivable MAS architecture. By survivable we mean a system that is robust and secure, and able to provide excellent quality of service (QoS) even when stressed. For example, Brinn and Greaves [6] state that the Cougaar MAS in Ultra*Log [14] would be considered survivable if it maintained at least x% of system capabilities and y% of system performance in the face of z% infrastructure loss and wartime loads (where x, y, and z are provided by the system users).

In order to build such a survivable system, several decisions need to be made at different time granularities. These can be broken down into strategic (once a year or just one time), tactical (once a week to once a day, depending on how often the configuration changes), and operational (typically milliseconds to seconds, depending on the granularity of information exchange) decisions. The strategic decisions typically involve designing the network infrastructure (in terms of both number and capacity) for the MAS, such as computers, servers, cables, etc. The tactical decisions include where to migrate the agents if a node fails or is cut off from the other nodes. Operational decisions include adaptive control methods for deciding which agent should process a task, the fidelity with which to process a task, etc.
There has been a lot of research related to (a) software technology, such as agent architecture, communication, migration, adaptation, learning, etc., and (b) networking, such as QoS provisioning, fault tolerance, dependability, robustness, etc. However, there is very little research that combines the two and studies them from a systems engineering viewpoint. In this research we address that shortcoming. We focus on the strategic problem, stated in the previous paragraph, of designing a network infrastructure in terms of hardware to support a given society of agents and their interactions. This is with the understanding that, in order to solve the tactical and operational problems effectively, the strategic problem must favor a network design that eases tactical and operational decisions.

We now present some of the related research, with the understanding that due to space restrictions it is difficult to cite all relevant articles in the literature. Andreoli et al. [2] consider a distributed software network infrastructure for agents performing search tasks (such as search engines). Optimization issues for MAS at the software level, such as load balancing using non-linear programming techniques, are studied in Aiello et al. [1] for a given hardware topology. In Hofmann et al. [9], a mobile intelligent agent system is built under conditions of low bandwidth to show that it could improve the efficiency of military tactical operations and that mobile agents would outperform static agents. Multi-agent hybrid systems that combine computational hardware and large-scale software residing on it, with an application to air-traffic management, are studied in Tomlin et al. [13]. In Kephart et al. [11], one of the emerging research areas, namely considering a distributed information system as analogous to biological ecosystems and social systems, is presented in order to study their survivability. Cancho and Sole [7] consider a complex network and show that simultaneously optimizing the link density and path distance in a graph (with a fixed set of nodes) leads to a scale-free topology that is robust to random attacks.
The remainder of this paper is organized as follows. In Section 2 we describe the strategic problem in detail. Then in Section 3 we formulate the problem as a mathematical program. We discuss various methods to solve the mathematical program in Section 4. Then we describe numerical examples and results in Section 5. Finally, we present our concluding remarks and directions for future work in Section 6.
2 Problem Description

We now present details of the strategic problem of designing a network infrastructure for a MAS. Distributed information systems (DIS) can be viewed as a reconfigurable network with (i) a computational infrastructure forming the backbone, and (ii) agents residing on it and moving around, consuming resources and providing services under uncertain and often hostile conditions. Each agent has access to different information and makes its own local decisions, but must work together with other agents toward the achievement of a common, system-wide goal. In this research we consider a MAS and an underlying network infrastructure that can be modeled as a DIS.

One of the key inputs to the network design problem is the agent interaction pattern. A typical agent interaction tree is depicted in Figure 1. In the figure, the agents are nodes, and if there are arcs connecting two nodes, then the corresponding agents interact. The agents also specify the bandwidth requirement for their interaction. Besides the bandwidth (and interaction graph), another input to the design problem is resource requirements, such as CPU and memory from the host computer or server. Although the figure suggests a hierarchical network, the model does not require that. In addition, some or all of the agents can be identical (in terms of what they can do).

Figure 1. Example of an agent logical network
Given the inputs mentioned above and other inputs based on survivability requirements, the output of the design problem is a physical network of nodes and arcs, where nodes signify processors such as computers and servers, and arcs signify links (it is not required that there be a single link between two nodes; however, we use the capacity of the bottleneck link and treat it as a single link). The two extremes of design are placing all agents on one node and placing one agent on each node. If all agents are on one node, the bottleneck would be resources, i.e., whether the CPU and memory requirements of all agents can be met. If, however, there is one agent per node, a lot of time would be spent in communication between the agents. We assume that if two agents are on a single node, the available bandwidth for their interaction is infinite. In Figure 2, we take the logical network of agents (described in Figure 1) and build four nodes and three arcs to house the agents in a physical infrastructure.

Figure 2. Agent logical network residing in a physical network
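The logical-to-physical mapping described above can be sketched in code. The following is an illustrative example (agent names, bandwidths, and the placement are made up, not taken from the paper): given the interaction graph and an agent-to-node assignment, it computes which physical links are induced and the bandwidth each must carry, treating on-node communication as free, as assumed in the paper.

```python
# Hypothetical sketch of mapping a logical agent network onto physical nodes.
# Agent names, bandwidths, and the assignment below are illustrative only.

def required_links(interactions, assignment):
    """Given agent interactions {(a, b): bandwidth} and an agent->node
    assignment, return the physical links needed and the bandwidth each
    must carry. Agents sharing a node need no link (on-node bandwidth is
    assumed infinite, as in the paper)."""
    links = {}
    for (a, b), bw in interactions.items():
        na, nb = assignment[a], assignment[b]
        if na != nb:
            key = tuple(sorted((na, nb)))
            links[key] = links.get(key, 0) + bw
    return links

# A Figure 1-style logical network: 7 agents in a tree, with bandwidth needs.
interactions = {("A", "B"): 10, ("A", "C"): 10, ("B", "D"): 5,
                ("B", "E"): 5, ("C", "F"): 5, ("C", "G"): 5}
# A Figure 2-style placement on 4 nodes.
assignment = {"A": 0, "B": 1, "C": 2, "D": 1, "E": 3, "F": 2, "G": 2}

print(required_links(interactions, assignment))
```

With this placement only three arcs are needed, mirroring the four-node, three-arc example of Figure 2; the pairs B-D, C-F, and C-G are co-located and need no physical link.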
3 Robust Design: Problem Formulation

In this section we formulate a robust design problem for DIS. As discussed in Section 2, a DIS consists of two critical components: computational hardware (processors and communication links) and a MAS residing on it. With this viewpoint we have the following robust design problem: given the software agents (MAS) with their interaction pattern and computational resource requirements, we want to decide (i) how much processing power to start with, i.e., how many processors and the capacity of each processor (to be selected from a given set of processors); (ii) how to lay out the physical network structure, i.e., how to connect the processors and what the capacity of each link should be (to be selected from a given set of bandwidths); and (iii) how to distribute the agents on this network. The above decisions are to be made so that we can meet the survivability requirements and at the same time minimize the information infrastructure cost. We translate the survivability requirements into the following "specifications" for the design problem: (a) sufficient computational resources to start with, and their balanced utilization; (b) small average path length and diameter, measuring the connectivity; (c) resilience to complete node and link failures.
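The connectivity measures in specification (b) can be computed directly. Below is a minimal sketch (illustrative, not the paper's code) that evaluates the average path length l_avg(G) and diameter D(G) of an undirected graph by breadth-first search, returning infinity for disconnected graphs, consistent with the convention used later in Section 3.1.

```python
# Minimal sketch of the connectivity measures in specification (b):
# average path length l_avg(G) and diameter D(G) of an undirected graph
# given as an adjacency dict. Disconnected graphs yield infinity.
from collections import deque
import math

def bfs_dists(adj, src):
    """Hop distances from src to every reachable vertex."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def path_metrics(adj):
    """Return (l_avg, diameter); (inf, inf) if G is disconnected."""
    nodes = list(adj)
    n = len(nodes)
    total, diam = 0, 0
    for s in nodes:
        dist = bfs_dists(adj, s)
        if len(dist) < n:               # some vertex unreachable
            return math.inf, math.inf
        total += sum(dist.values())
        diam = max(diam, max(dist.values()))
    l_avg = total / (n * (n - 1))       # average over ordered pairs
    return l_avg, diam

# Star network on 4 nodes: hub 0 connected to 1, 2, 3.
print(path_metrics({0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}))
```

A star topology keeps both measures small (here l_avg = 1.5, D = 2) but is fragile to hub failure, which is exactly the tension specification (c) guards against.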
Given the above specifications, it is clear that a robust<br />
design for DIS would be one with maximum possible computational<br />
resources <strong>and</strong> a fully connected backbone network.<br />
However, this would incur a very large infrastructure<br />
cost. This leads to the problem of optimally designing the<br />
backbone network <strong>and</strong> distributing agents on it such that it<br />
is fairy robust <strong>and</strong> at the same time cost effective. In order<br />
to systematically pose this trade-off as a mathematical<br />
programming model, we first give a formal description of<br />
various entities involved in the model.<br />
3.1 Agent Society, Nodes <strong>and</strong> Links<br />
The MAS or the agent society is described by the computational<br />
resource each agent consumes <strong>and</strong> their interaction<br />
pattern. Let N A be the total number of agents in the society,<br />
indexed as {1, 2, · · · , N A }. Let for an i th agent P a i<br />
denote the computational resource (CPU <strong>and</strong> Memory) it<br />
consumes <strong>and</strong> let Ba ij denote the b<strong>and</strong>width it uses, if it<br />
interacts with agent j.<br />
A node represents a computer with a given processing<br />
power (power can refer to CPU, Memory, etc.). Each agent<br />
in the given society has to be assigned to a node. As a result<br />
each node can be assigned one or more agents. For the<br />
agents which reside on the same node, the communication<br />
requirement is automatically satisfied. Let N max be the<br />
total number of nodes numbered {1, 2, · · · N max }, that is<br />
initially chosen to distribute the agents on. Let N i denote<br />
the decision variable such that,<br />
{ 1, if node i is selected from Nmax nodes<br />
N i =<br />
0, otherwise,<br />
for 1 ≤ i ≤ N_max. Let P = {P_1, P_2, ..., P_{N_p}} denote the set of available processing powers for nodes, with an associated cost set C(P) = {C_p1, C_p2, ..., C_pN_p}, and let Pn_ij be a decision variable such that

    Pn_ij = 1 if the i-th node uses a processor with power P_j, and 0 otherwise,

for 1 ≤ i ≤ N_max and 1 ≤ j ≤ N_p. Furthermore, let A_d = [A_ij] denote the matrix of the distribution of agents on the nodes, where

    A_ij = 1 if agent i resides on node j, and 0 otherwise,
for 1 ≤ i ≤ N_A and 1 ≤ j ≤ N_max. It is assumed that there are no multiple links and no self-loops when the nodes are connected with communication links. Let X_ij be the decision variable such that

    X_ij = 1 if there is a link from node i to node j and i ≠ j, and 0 otherwise,

for 1 ≤ i, j ≤ N_max. The matrix X = [X_ij] is symmetric, as the links connecting the nodes form the communication pathways and hence are undirected.
Consider the set V = {N_i | N_i ≠ 0, 1 ≤ i ≤ N_max} of occupied nodes and the corresponding index set I = {i | N_i ≠ 0, 1 ≤ i ≤ N_max}. Let E = {X_ij | X_ij ≠ 0 and i, j ∈ I}. We shall denote by G = (V, E) the graph with V as the set of vertices and E as the set of undirected edges. Let l_avg(G) be the average path length and D(G) the diameter of G. Note that if G consists of disconnected components, then l_avg(G) → ∞ and D(G) → ∞. Furthermore, we are only allowed to choose the capacity of links from an available set of bandwidths B = {B_1, B_2, ..., B_{N_b}} with an associated cost set C(B) = {C_b1, C_b2, ..., C_bN_b}. Let Br_ijl be the decision variable that is 1 if link X_ij uses capacity B_l and 0 otherwise, for 1 ≤ i, j ≤ N_max and 1 ≤ l ≤ N_b.
3.2 Problem Statement
With the notation given in the previous section, let D = {N_max, N_i, Pn_ij, A_ij, X_ij, Br_ijl} denote the set of decision variables (all of which are binary). We can state the network design problem as follows.

Objective: Let C denote the infrastructure cost. We wish to

    min C = Σ_{i=1}^{N_max} Σ_{j=1}^{N_p} C_pj Pn_ij + Σ_{i=1}^{N_max} Σ_{j>i}^{N_max} Σ_{l=1}^{N_b} C_bl Br_ijl,    (1)

subject to the following constraints:

1. Resource Choice Constraints

    Σ_{j=1}^{N_p} Pn_ij = N_i,    1 ≤ i ≤ N_max    (2)

    Σ_{l=1}^{N_b} Br_ijl = X_ij,    1 ≤ i ≤ N_max and i < j ≤ N_max    (3)

Constraints (2) and (3) restrict each node to one type of processor and each link to one type of bandwidth capacity, respectively.
2. Agent Distribution Constraints

    Σ_{j=1}^{N_max} A_ij = 1,    1 ≤ i ≤ N_A    (4)

    Σ_{i=1}^{N_A} A_ij Pa_i + Δ_1(j) ≤ Σ_{l=1}^{N_p} Pn_jl P_l,    1 ≤ j ≤ N_max    (5)

    Σ_{l=1}^{N_A} Σ_{k=1}^{N_A} A_li A_kj (Ba_lk + Ba_kl) + Δ_2(i, j) ≤ Σ_{t=1}^{N_b} Br_ijt B_t    (6)

for 1 ≤ i ≤ N_max, i < j ≤ N_max, where Δ_1(j) ≥ 0 and Δ_2(i, j) ≥ 0 are given constants, which can vary with the node and the link, respectively.
Constraints (4) force each agent to be assigned to exactly one node. Constraints (5) guarantee that agents are assigned only to nodes that have been selected, and that the processing capacity chosen for a node meets the computational (CPU) requirements of all the agents assigned to it. Note that this constraint also leads, to begin with, to a well-balanced utilization of CPU by the agents. Similarly, constraints (6) ensure that the communication requirements, in terms of the bandwidth of the links between nodes, are met. Each of the constraints (6) also forces a direct communication link to exist between two nodes whenever two agents that communicate with each other reside on those separate nodes. The constants Δ_1(j) and Δ_2(i, j) provide for additional or redundant CPU and bandwidth in the network. This redundancy accounts for the additional computational resources that may be required due to factors such as variability in the agents' computational resource requirements, complete or partial loss of resources at nodes and links, and migration of agents between nodes. It should be noted that the effect of these constants can be absorbed into the processing (Pa_i) and bandwidth (Ba_ij) requirements of the agents; hereafter we assume that this has been done.
3. Connectivity Constraints

    X_ij ≤ N_i,    1 ≤ i ≤ N_max and 1 ≤ j ≤ N_max    (7)

    l_avg(G) ≤ l_max    (8)

    D(G) ≤ D_max,    (9)

where l_max is the maximum allowable average path length and D_max is the maximum allowable diameter of the network considered. Constraints (7) enforce that links exist only between nodes that have been selected. Constraints (8) and (9) are related to network performance and also guarantee that G is connected. In general, constraints (8) and (9) cannot be expressed explicitly as equations in the decision variables and have to be verified algorithmically.
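Since constraints (8) and (9) cannot be written in closed form, a candidate topology must be checked algorithmically. The following sketch is our own illustration, not part of the original formulation: a plain breadth-first search over a hypothetical adjacency list, rejecting disconnected graphs (for which l_avg(G) and D(G) diverge).

```python
from collections import deque

def path_lengths(adj, src):
    """Shortest-path lengths (in hops) from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def connectivity_constraints_ok(adj, l_max, d_max):
    """Check l_avg(G) <= l_max and D(G) <= D_max for an undirected graph."""
    nodes = list(adj)
    total = diameter = pairs = 0
    for u in nodes:
        dist = path_lengths(adj, u)
        if len(dist) < len(nodes):      # some node is unreachable from u
            return False
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
        pairs += len(nodes) - 1
    return total / pairs <= l_max and diameter <= d_max

# Hypothetical 4-node ring: l_avg = 4/3 and D = 2, so both checks pass.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(connectivity_constraints_ok(ring, l_max=2, d_max=2))  # True
```

Inside a GA, such a check is typically applied as a feasibility filter or penalty on each candidate solution.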
4 Solution Methodology
The problem discussed in the previous section is similar in many respects to problems that often arise in the design of telecommunication networks [3], [4], [5]. For example, in [5] the authors consider the problem of "survivable network design" (SND), which seeks a minimum-cost network over a given set of nodes and a set of possible edges between them, such that a connectivity requirement (specified as the minimum number of edge-disjoint paths needed between different nodes) is satisfied. The major distinction of our model from such formulations is that we consider the infrastructure design and the distribution of agents on this network simultaneously in the strategic design phase. Note that:

• The maximum number of nodes needed satisfies N_max ≤ N_A; otherwise the optimization problem has no feasible solution, as constraints (5) cannot be satisfied. Hence, we can always take N_max = N_A.
• Our problem is a generalization of the "bin-packing" problem [12]. The following distinctions from bin-packing can be noted. There are two types of bins, the processors and the network links, and the agents are the objects to be packed. The capacity of both types of bins is variable and can be selected from a given set, rather than being fixed. There is a coupling between filling the two types of bins: as we fill the processors with agents, the bins corresponding to the links connecting the processors also get filled, based on the agent distribution. In addition, the constraints (7)-(9) on the diameter and average path length must be satisfied.
• Consider a special case of our optimization problem in which agents do not interact with each other, i.e., Ba_ij = 0 for 1 ≤ i, j ≤ N_A; there is only one processor type, with capacity P and unit cost; and there are no constraints on l_avg(G) and D(G), i.e., D_max → ∞ and l_max → ∞. Under these conditions the problem reduces to the usual bin-packing problem, as follows:

    min C = Σ_{i=1}^{N_max} N_i,    (10)

subject to

    Σ_{j=1}^{N_A} A_ij = 1,    1 ≤ i ≤ N_A,    (11)

    Σ_{i=1}^{N_A} A_ij Pa_i ≤ N_j P,    1 ≤ j ≤ N_A.    (12)
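Although even this special case is NP-hard, classical packing heuristics give quick approximate solutions to it. As an illustration only (the loads and capacity below are hypothetical, and this heuristic is not the method used in the paper), the first-fit decreasing heuristic for the reduction above can be sketched as:

```python
def first_fit_decreasing(loads, capacity):
    """Pack agent loads Pa_i onto as few equal-capacity nodes as possible."""
    free = []          # remaining capacity of each opened node
    assignment = {}    # agent index -> node index
    # Sort agents by decreasing load; place each on the first node that fits.
    for agent, load in sorted(enumerate(loads), key=lambda x: -x[1]):
        if load > capacity:
            raise ValueError(f"agent {agent} exceeds node capacity")
        for j, slack in enumerate(free):
            if load <= slack:
                free[j] -= load
                assignment[agent] = j
                break
        else:
            free.append(capacity - load)      # open a new node
            assignment[agent] = len(free) - 1
    return len(free), assignment

# Hypothetical loads Pa_i with node capacity P = 10.
n_nodes, assignment = first_fit_decreasing([7, 5, 4, 3, 1], capacity=10)
print(n_nodes)  # 2
```

First-fit decreasing is known to use at most roughly 11/9 of the optimal number of bins, which is why such heuristics make useful baselines for the GA results reported later.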
The bin-packing problem is known to be NP-hard in the strong sense [12]. Since our problem is a generalization of the bin-packing problem, it is also NP-hard in the strong sense. Given this, we either need to develop heuristics or use evolutionary algorithms to obtain solutions.
We have used Genetic Algorithms (GAs) with the following important features:

• Rather than using a binary encoding, we used a scheme of integer coding of the decision variables.
• The initial population is generated randomly, seeded with one feasible solution. The feasible solution can be obtained as follows. Start with N_A nodes, assign each agent to a separate node, and choose the lowest processing capacity from the available set P such that the processing requirement of each agent is satisfied. Connect the nodes whose agents interact with each other, and assign to each such link the lowest bandwidth from the available set B such that the communication requirements are satisfied.
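This seeding step can be sketched directly. The helper below is our own illustration (the function name and agent data are hypothetical); it assumes the capacity sets P and B are given as sorted lists, using the values of Table 1:

```python
def feasible_seed(proc_req, bw_req, capacities, bandwidths):
    """One agent per node; pick the smallest sufficient resource everywhere.

    proc_req:   Pa_i for each agent
    bw_req:     {(i, j): Ba_ij + Ba_ji} for each interacting agent pair
    capacities: available processing powers P, sorted ascending
    bandwidths: available link capacities B, sorted ascending
    """
    node_power = [min(c for c in capacities if c >= p) for p in proc_req]
    link_bw = {pair: min(b for b in bandwidths if b >= w)
               for pair, w in bw_req.items()}
    return node_power, link_bw

# Capacity sets from Table 1: P = {5, 7, 9}, B = {3, 6}; agent data is made up.
powers, links = feasible_seed(proc_req=[4, 6, 8],
                              bw_req={(0, 1): 2, (1, 2): 5},
                              capacities=[5, 7, 9],
                              bandwidths=[3, 6])
print(powers)  # [5, 7, 9]
print(links)   # {(0, 1): 3, (1, 2): 6}
```

The resulting layout is feasible by construction, which gives the GA at least one valid point from which to improve.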
• We used NSGA-II [8, 10] as the GA solver. It can handle constraints automatically, and it uses mean-centric crossover and uniform bounded mutation operators for real-coded strings.
5 GA Application Examples and Results

5.1 Inputs
The following inputs are based on the notation of Section 3.2:

1. The cost structure for processing power and bandwidth used in the examples is shown in Table 1.

Table 1. Processing Cost and Bandwidth Cost

    i      1   2   3        i      1   2
    P_i    5   7   9        B_i    3   6
    C_pi   1   2   3        C_bi   1   2

2. The agent societies in the examples considered were generated randomly. The processing (Pa_i) and bandwidth (Ba_ij) requirements for the agents were sampled from a uniform distribution. However, the structure of the agent societies was in all cases restricted to be hierarchical. This is motivated by the fact that most organizations, such as those in Command and Control or in society at large, have a hierarchical structure. Note that the optimization problem and formulation we have considered are general enough to be applied to an agent society with any underlying structure. The agent societies differ in the number of agents (Table 2), the nature of branching in the hierarchical structure, and the variation in the processing and bandwidth requirements of the agents. In Figure 3, each node in the tree is labeled by an agent number A_i and its processing requirement Pa_i, whereas each link between agents A_i and A_j (if they interact) is labeled by (Ba_ij + Ba_ji), their communication requirement.

Figure 3. A military supply chain as an agent society

3. The restrictions on the maximum allowable diameter D_max and the average path length l_max for the DIS are listed in Table 2.

5.2 Output: GA Results
The costs C obtained by running the GA are tabulated in Table 2, while the DIS layout is shown in Figure 4, next to the corresponding agent society. In the figure, each node of the graph is labeled by the processing power P_i followed by the agents assigned to that node, while each link is labeled by the bandwidth B_i chosen for it. It should be noted that many agents can be assigned to the same node. For example, in Figure 4, agents A12 and A13 are both assigned to the node labeled P9: A12 A13.

Figure 4. GA Result: DIS Layout for MSC, C = 130
Table 2. GA Results

    Agent Society No.   N_A   D_max, l_max   GA C   Ratio C/N_A
    1                     4   2, 2             15   3.75
    2                     7   2, 2             27   3.86
    3                    10   5, 4             30   3.00
    4                    12   5, 4             33   2.75
    5                    15   5, 4             50   3.33
    6                    19   5, 4             61   3.21
    7                    24   5, 4             87   3.63
    8                    33   5, 4            108   3.27
    9                    40   5, 4            164   4.1
    10                   50   5, 4            188   3.76
Table 2 shows that the optimal cost-per-agent ratio (C/N_A) is fairly constant, with a small variation. This may be a result of the cost structure assumed and the particular instances of the agent societies considered. This observation, however, can have the following implication: given a very large agent society, say with N_A = 5,000 agents, we can decompose it into smaller agent societies, solve the optimization problem for each of the sub-societies, and then combine the solutions to solve the overall problem. Due to the constancy of the ratio C/N_A, this heuristic should lead to solutions that are fairly close to optimal.
As a final example we consider one of the realistic agent societies that has been developed in the Ultra*Log Program [14], [6]. The society is shown in Figure 3 and represents a typical military supply chain (MSC). The structure of the society is an exact replica of the true society; however, the processing and bandwidth requirements for the agents have been assigned randomly. The result obtained by the GA is shown in Figure 4.
6 Conclusion and Future Work
In this paper we have systematically studied the issue of survivability of DIS. Based on this study, we formulated a robust design problem for DIS. We showed that this problem is NP-hard in the strong sense and used GAs to obtain solutions for a number of example agent societies. We also considered a realistic agent society representing a military supply chain, and showed that our robust design problem formulation results in a fairly survivable DIS.
Survivability of DIS is an emerging area, and future research is possible in a number of varied directions. One direction is refining the robust design problem we have posed and developing heuristic solution methodologies for it. Further research is also required to build mechanisms for survivability against other types of attacks, such as security and DOS attacks. Most of the above problems are nothing new for biological systems, which have routinely solved them for literally millions of years. Can we draw inspiration from the structures discovered in biology to solve problems of distributed systems? We believe that the quest for "open-ended" survivability for DIS can be achieved only by exporting biological mechanisms into software systems.
Acknowledgements
The authors acknowledge DARPA (Grant #MDA972-1-1-0038 under the Ultra*Log Program) and NSF (Grant #ANI-0219747 under the ITR program) for their generous support of this research. Special thanks to the anonymous reviewers for their comments and suggestions.
References
[1] W. Aiello, B. Awerbuch, B. M. Maggs, and S. Rao. Approximate load balancing on dynamic and asynchronous networks. In ACM Symposium on Theory of Computing, pages 632–641, 1993.
[2] J. Andreoli, U. Borghoff, R. Pareschi, S. Bistarelli, U. Montanari, and F. Rossi. Constraints and agents for a decentralized network infrastructure. In AAAI Workshop, Menlo Park, California, USA: AAAI Press, pages 39–44, 1997.
[3] A. Balakrishnan and K. Altinkemer. Using a hop-constrained model to generate alternative communication network designs. ORSA Journal on Computing, 4(2), 1992.
[4] A. Balakrishnan, T. L. Magnanti, and P. Mirchandani. A dual-based algorithm for multi-level network design. Management Science, 40(5):567–581, 1994.
[5] A. Balakrishnan, T. L. Magnanti, and P. Mirchandani. Connectivity-splitting models for survivable network design. Submitted, 2003.
[6] M. Brinn and M. Greaves. Leveraging agent properties to assure survivability of distributed multi-agent systems. https://docs.ultralog.net/dscgi/ds.py/Get/File-4088/AA03-SurvivabilityOfDMAS.pdf, 2002.
[7] R. F. Cancho and R. V. Sole. Optimization in complex networks. http://arxiv.org/abs/cond-mat/0111222, 2001.
[8] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multi-objective genetic algorithm: NSGA-II. http://www.iitk.ac.in/kangal/pub.htm, 2000.
[9] M. O. Hofmann, A. McGovern, and K. R. Whitebread. Mobile agents on the digital battlefield. In Proceedings of the 2nd International Conference on Autonomous Agents (Agents'98), pages 219–225, 1998.
[10] KanGAL. Kanpur Genetic Algorithms Laboratory. http://www.iitk.ac.in/kangal/pub.htm.
[11] J. O. Kephart, T. Hogg, and B. A. Huberman. Collective behavior of predictive agents. Physica D, 42:48–65, 1990.
[12] S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley and Sons, 1990.
[13] C. Tomlin, G. Pappas, and S. Sastry. Conflict resolution for air traffic management: A case study in multi-agent hybrid systems. IEEE Transactions on Automatic Control, 43(4), 1998.
[14] ULTRALOG. A DARPA program on logistics information system survivability. http://www.ultralog.net/.
Dependable Agent Systems

Survivability of Multiagent-Based Supply Networks: A Topological Perspective

Hari Prasad Thadakamalla, Usha Nandini Raghavan, Soundar Kumara, and Réka Albert, Pennsylvania State University
You can improve a multiagent-based supply network's survivability by concentrating on the topology and its interplay with functionalities.

Supply chains involve complex webs of interactions among suppliers, manufacturers, distributors, third-party logistics providers, retailers, and customers. Although fairly simple business processes govern these individual entities, real-time capabilities and global Internet connectivity make today's supply chains complex. Fluctuating demand patterns, increasing customer expectations, and competitive markets also add to their complexity.
Supply networks are usually modeled as multiagent systems (MASs). 1 Because supply chain management must effectively coordinate among many different entities, a multiagent modeling framework based on explicit communication between these entities is a natural choice. 1 Furthermore, we can represent these multiagent systems as a complex network with entities as nodes and the interactions between them as edges. Here we explore the survivability (and hence dependability) of these MASs from the viewpoint of these complex supply networks.
Today's supply networks aren't dependable, or survivable, in chaotic environments. For example, Figure 1 shows how poorly a typical supply network reacts to a node or edge failure compared to a network with built-in redundancy.
Survivability is a critical factor in supply network design. Specifically, supply networks in dynamic environments, such as military supply chains during wartime, must be designed more for survivability than for cost effectiveness. The more survivable a network is, the more dependable it will be.
We present a methodology for building survivable large-scale supply network topologies that can extend to other large-scale MASs. Building survivable topologies alone doesn't, however, make an MAS dependable. To create survivable, and hence dependable, multiagent systems, we must also consider the interplay between network topology and node functionalities.
A topological perspective
To date, the survivability literature has emphasized network functionalities rather than topology. To be survivable, a supply network must adapt to a dynamic environment, withstand failures, and be flexible and highly responsive. These characteristics depend not only on node functionality but also on the topology in which the nodes operate.
The components of survivability
From a topological perspective, the following properties encompass survivability, and we denote them as survivability components.
The first is robustness. A robust network can sustain the loss of some of its structure or functionalities and maintain connectedness under node failures, whether the failure is random or a targeted attack. We measure robustness as the size of the network's largest connected component, in which a path exists between any pair of nodes in that component.

1541-1672/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society. IEEE INTELLIGENT SYSTEMS
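As an illustrative aside (not from the article), this robustness measure can be computed with a plain breadth-first search: remove the failed nodes and report the size of the largest connected component of what remains. The hub-and-spoke example graph is ours.

```python
from collections import deque

def largest_component(adj):
    """Size of the largest connected component of an undirected graph."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        q, size = deque([start]), 0
        seen.add(start)
        while q:
            u = q.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        best = max(best, size)
    return best

def robustness_after_failure(adj, failed):
    """Largest component remaining after the given nodes fail."""
    failed = set(failed)
    reduced = {u: [v for v in nbrs if v not in failed]
               for u, nbrs in adj.items() if u not in failed}
    return largest_component(reduced)

# Hypothetical hub-and-spoke network: losing the hub shatters it.
hub = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(largest_component(hub))              # 5
print(robustness_after_failure(hub, [0]))  # 1
```

The same routine exposes why redundancy, as in Figure 1b, matters: extra edges keep the largest component large after a failure.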
The second is responsiveness. A responsive network provides timely services and effective navigation. A low characteristic path length (the average of the shortest path lengths from each node to every other node) leads to better responsiveness, which determines how quickly commodities or information proliferate throughout the network.
The third is flexibility. This property depends on the presence of alternate paths. Good clustering properties ensure alternate paths to facilitate dynamic rerouting. The clustering coefficient, defined as the ratio between the number of edges among a node's first neighbors and the total possible number of edges between them, characterizes the local order in a node's neighborhood.
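As an illustrative aside (not from the article), the clustering coefficient as defined above can be computed by counting the edges among a node's first neighbors; the triangle-with-a-tail graph below is our own example.

```python
def clustering_coefficient(adj, node):
    """Edges among node's neighbors divided by the maximum possible number."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0          # fewer than two neighbors: no possible triangle
    links = sum(1 for i, u in enumerate(nbrs)
                for v in nbrs[i + 1:] if v in adj[u])
    return links / (k * (k - 1) / 2)

# Hypothetical graph: a triangle 0-1-2 with a tail 2-3.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(clustering_coefficient(g, 0))  # 1.0 (neighbors 1 and 2 are linked)
print(clustering_coefficient(g, 2))  # ~0.333 (1 of 3 possible links)
```

A high coefficient means a node's neighbors are linked to each other, so traffic can be rerouted around that node if it fails.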
The fourth is adaptivity. An adaptive network can rewire itself efficiently, that is, restructure or reorganize its topology on the basis of environmental shifts, to continue providing efficient performance. For example, if a supplier can't reliably meet a customer's demands, the customer should be able to choose another supplier.
A typical supply chain with a tree-like or hierarchical structure lacks these four properties: the clustering coefficient is nearly zero, and the characteristic path length scales linearly with the number of nodes (or agents) N. In designing complex agent networks with built-in survivability, conventional optimization tools won't work because of the problem's extremely large scale. When networks were smaller, we could understand their overall behavior by concentrating on the individual components' properties. But as networks expand, this becomes impossible, so we shift focus to the statistical properties of the collective behavior.
Using topologies
Studying complex networks such as protein interaction networks, regulatory networks, social networks of acquaintances, and information networks such as the Web is illuminating the principles that make these networks extremely resilient to their respective chaotic environments. The core principles extracted from this exploration will prove valuable in building robust models for survivable complex agent networks.
Complex-network theory currently offers random-graph, small-world, and scale-free network topologies as likely candidates for survivable networks (see the sidebar "Complex Networks" for more on this topic). Evaluating these for survivability (see Figure 2), we find that no one topology consistently outperforms the others. For example, while small-world networks have better clustering properties, scale-free networks are significantly more robust to random attacks. So, we can't directly use these
topologies to build supply networks. We can, however, use their evolution principles to build supply chain networks that perform well in all respects of the survivability components.

Figure 1. How redundancy affects survivability. (a) A part of the multiagent system for military logistics modeled using the UltraLog (www.ultralog.net) program. This example models each entity, such as a main support battalion (MSB), forward support battalion (FSB), and battalion, as a software agent. (We've changed the agents' names for security reasons.) In the current scenario, MSBs send the supplies to the FSBs, who in turn forward these to battalions. (b) A modified military supply chain with some redundancy built into it. This network performs much better in the event of node failures and hence is more dependable than the first network.
Researchers have studied complex networks in part to find ways to design evolutionary algorithms for modeling networks

SEPTEMBER/OCTOBER 2004 www.computer.org/intelligent 25
Complex Networks
Social scientists, among the first to study complex networks extensively, focused on acquaintance networks, where nodes represent people and edges represent the acquaintances between them. Social psychologist Stanley Milgram posited the "six degrees of separation" theory: that in the US, a person's social network has an average acquaintance path length of six. 1 This turns out to be a particular instance of the small-world property found in many real-world networks, which, despite their large size, have a relatively short path between any two nodes.
An early effort to model complex networks introduced random graphs for modeling networks with no obvious pattern or structure. 2 A random graph consists of N nodes, and each pair of nodes is connected with a connection probability p. Random graphs are statistically homogeneous: most nodes have a degree (that is, the number of edges incident on the node) close to the graph's average degree, and significantly small and large node degrees are exponentially rare.
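As an illustrative aside (not part of the sidebar), such a random graph is straightforward to generate: keep each of the N(N − 1)/2 possible edges independently with probability p. The parameters below are arbitrary.

```python
import random

def random_graph(n, p, seed=None):
    """Erdos-Renyi G(n, p): each possible edge appears with probability p."""
    rng = random.Random(seed)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

g = random_graph(1000, p=0.01, seed=42)
degrees = [len(nbrs) for nbrs in g.values()]
# The mean degree concentrates near p * (n - 1), here about 10.
print(sum(degrees) / len(degrees))
```

Plotting a histogram of `degrees` shows the peaked, homogeneous distribution the sidebar describes.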
However, studying the topologies of diverse large-scale networks found in nature reveals a more complex and unpredictable dynamic structure. Two measures quantifying network topology that are found to differ significantly in real networks are the degree distribution (the fraction of nodes with degree k) and the clustering coefficient. Later modeling efforts focused on
trying to reproduce these properties. 3,4 Duncan Watts and Steven Strogatz introduced the concept of small-world networks to explain the high degree of transitivity (order) in complex networks. 5 The Watts-Strogatz model starts from a regular 1D ring lattice on L nodes, where each node is joined to its first K neighbors. Then, with probability p, each edge is rewired, with one end remaining the same and the other end chosen uniformly at random, without allowing multiple edges (more than one edge joining a pair of vertices) or loops (edges joining a node to itself). The resulting network is a regular lattice when p = 0 and a random graph when p = 1, because all edges are rewired. This network class displays a high clustering coefficient for most values of p, but as p → 1, it behaves like a random graph.
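As an illustrative aside (not part of the sidebar), the Watts-Strogatz construction can be sketched as follows. The parameters are arbitrary, and the rewiring step avoids multiple edges and loops as described above.

```python
import random

def watts_strogatz(L, K, p, seed=None):
    """Ring lattice of L nodes, each joined to its K nearest neighbors
    (K even); each edge is then rewired with probability p."""
    rng = random.Random(seed)
    # Each undirected lattice edge appears exactly once in this set.
    edges = {(i, (i + d) % L) for i in range(L) for d in range(1, K // 2 + 1)}
    adj = {i: set() for i in range(L)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in list(edges):
        if rng.random() < p:
            # Candidate endpoints: no self-loop, no duplicate edge.
            choices = [w for w in range(L) if w != u and w not in adj[u]]
            if choices:
                w = rng.choice(choices)     # rewire (u, v) -> (u, w)
                adj[u].discard(v)
                adj[v].discard(u)
                adj[u].add(w)
                adj[w].add(u)
    return adj

g = watts_strogatz(L=20, K=4, p=0.1, seed=1)
print(sum(len(nbrs) for nbrs in g.values()) // 2)  # 40: rewiring preserves the edge count
```

For small p, most edges remain local (high clustering) while the few shortcuts sharply reduce the characteristic path length.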
Albert-László Barabási <strong>and</strong> Réka Albert later proposed an<br />
evolutionary model based on growth <strong>and</strong> preferential attachment<br />
leading to a network class, scale-free networks, with<br />
power law distribution. 6 Many real-world networks’ degree<br />
distribution follows a power law, fundamentally different<br />
from the peaked distribution observed in r<strong>and</strong>om graphs <strong>and</strong><br />
small-world networks. Barabási <strong>and</strong> Albert argued that a<br />
static r<strong>and</strong>om graph of the Watts-Strogatz model fails to capture<br />
two important features of large-scale networks: their<br />
constant growth <strong>and</strong> the inherent selectivity in edge creation.<br />
Complex networks such as the Web, collaboration networks,<br />
or even biological networks are growing continuously with<br />
the creation of new Web pages, the birth of new individuals,<br />
<strong>and</strong> gene duplication <strong>and</strong> evolution. Moreover, unlike r<strong>and</strong>om<br />
networks where each node has the same chance of<br />
acquiring a new edge, new nodes entering the scale-free network<br />
don’t connect uniformly to existing nodes but attach<br />
preferentially to higher-degree nodes. This reasoning led<br />
Barabási <strong>and</strong> Albert to define two mechanisms:<br />
• Growth: Start with a small number of nodes—say, m0—and assume that every time a node enters the system, m edges are pointing from it, where m < m0.
• Preferential attachment: Every time a new node enters the system, each edge of the newly connected node preferentially attaches to a node i with degree k_i with the probability

Π_i = k_i / Σ_j k_j
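Both mechanisms can be sketched as follows (an illustration only, not code from the article; sampling a random endpoint of a random existing edge picks nodes in proportion to their degree, which realizes Π_i without computing the sum explicitly):

```python
import random

def barabasi_albert(n, m0, m, seed=0):
    """Growth plus preferential attachment: start from a ring of m0 nodes
    (m0 >= 3 assumed); each new node attaches m edges (m < m0) to existing
    nodes with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = [(i, (i + 1) % m0) for i in range(m0)]
    degree = {i: 2 for i in range(m0)}
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:
            # A random endpoint of a random edge is selected with
            # probability proportional to its degree, i.e. k_i / sum_j k_j.
            targets.add(rng.choice(rng.choice(edges)))
        degree[new] = 0
        for t in targets:
            edges.append((new, t))
            degree[new] += 1
            degree[t] += 1
    return edges, degree
```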
Research has shown that this mechanism, combined with growth, leads to a network with power-law degree distribution P(k) ~ k^(–γ) with exponent γ = 3. Barabási and Albert dubbed these networks
“scale free” because they lack a characteristic degree <strong>and</strong> have<br />
a broad tail of degree distribution. Following the proposal of<br />
the first scale-free model, researchers have introduced many<br />
more refined models, leading to a well-developed theory of<br />
evolving networks. 7<br />
Protein-to-protein interactions in metabolic <strong>and</strong> regulatory<br />
networks <strong>and</strong> other biological networks also show a striking<br />
ability to survive under extreme conditions. Most of these<br />
networks’ underlying properties resemble the three most<br />
familiar networks found in the literature (see Figure 1 in the<br />
article).<br />
Complex networks are also vulnerable to node or edge<br />
losses, which disrupt the paths between nodes or increase<br />
their length <strong>and</strong> make communication between them harder.<br />
In severe cases, an initially connected network breaks down<br />
into isolated components that can no longer communicate.<br />
Numerical <strong>and</strong> analytical studies of complex networks indicate<br />
that a network’s structure plays a major role in its response to<br />
node removal. For example, scale-free networks are more<br />
robust than r<strong>and</strong>om or small-world networks with respect to<br />
r<strong>and</strong>om node loss. 8 Large scale-free networks will tolerate the<br />
loss of many nodes yet maintain communication between<br />
those remaining. However, they’re sensitive to removal of the<br />
most-connected nodes (by a targeted attack on critical nodes,<br />
for example), breaking down into isolated pieces after losing<br />
just a small percentage of these nodes.<br />
References<br />
1. S. Milgram, “The Small World Problem,” Psychology Today, vol. 2,<br />
May 1967, pp. 60–67.<br />
2. P. Erdős and A. Rényi, “On Random Graphs I,” Publicationes Mathematicae, vol. 6, 1959, pp. 290–297.
3. S.N. Dorogovtsev <strong>and</strong> J.F.F. Mendes, “Evolution of Networks,”<br />
Advances in Physics, vol. 51, no. 4, 2002, pp. 1079–1187.<br />
4. M.E.J. Newman, “The Structure <strong>and</strong> Function of Complex Networks,”<br />
SIAM Rev., vol. 45, no. 2, 2003, pp. 167–256.<br />
5. D.J. Watts <strong>and</strong> S.H. Strogatz, “Collective Dynamics of ‘Small-World’<br />
Networks,” Nature, vol. 393, June 1998, pp. 440–442.<br />
6. A.-L. Barabási <strong>and</strong> R. Albert, “Emergence of Scaling in R<strong>and</strong>om<br />
Networks,” Science, vol. 286, Oct. 1999, pp. 509–512.<br />
7. R. Albert <strong>and</strong> A.-L. Barabási, “Statistical Mechanics of Complex Networks,”<br />
Reviews of Modern Physics, vol. 74, Jan. 2002, pp. 47–97.
8. R. Albert, H. Jeong, and A.-L. Barabási, “Error and Attack Tolerance of Complex Networks,” Nature, vol. 406, July 2000, pp. 378–382.
26 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS
with distinct properties found in nature. A<br />
network’s evolutionary mechanism is designed so that the network’s inherent properties emerge from that mechanism. For
example, small-world networks were designed<br />
to explain the high clustering coefficient<br />
found in many real-world networks,<br />
while the “rich get richer” phenomenon used<br />
in the Barabási-Albert model explains the<br />
scale-free distribution. 2<br />
Similarly, we seek to design supply networks<br />
with inherent survivability components<br />
(see Figure 3), obtaining these components by<br />
devising appropriate growth mechanisms. Of
course, having all the aforementioned properties<br />
in a network might not be practically feasible—we’d<br />
likely have to negotiate trade-offs<br />
depending on the domain. Also, domain specificities<br />
might make it inefficient to incorporate<br />
all properties. For instance, in a supply<br />
network, we might not be able to rewire the<br />
edges as easily as we can in an information<br />
network, so we would concentrate more on<br />
obtaining other properties such as low characteristic<br />
path length, robustness to failures<br />
<strong>and</strong> attacks, <strong>and</strong> high clustering coefficients.<br />
So, the construction of these networks is<br />
domain specific.<br />
Establishing edges between network nodes<br />
is also domain specific. For instance, in a supply<br />
network, a retailer would likely prefer to<br />
have contact with other geographically convenient<br />
nodes (distributors, warehouses, <strong>and</strong><br />
other retailers). At the same time, nodes in a<br />
file-sharing network would prefer to attach to<br />
other nodes known to locate or hold many<br />
shared files (that is, nodes of high degree).<br />
Obtaining the survivability<br />
components<br />
While evolving the network on the basis<br />
of domain constraints, we need to incorporate<br />
four traits into the growth model for<br />
obtaining good survivability components.<br />
The first is low characteristic path length.<br />
During network construction, establish a few long-range connections between nodes that would otherwise require many steps to reach each other.
The second is good clustering. When two<br />
nodes A <strong>and</strong> B are connected, new edges<br />
from A should prefer to attach to neighbors<br />
of B, <strong>and</strong> vice versa.<br />
Figure 2. Comparing the survivability components of random, small-world, and scale-free networks:

Component        Random                    Small-world                    Scale-free
Degree           Poisson, peaked at        Peaked                         Power law
distribution     the mean degree ⟨k⟩
Characteristic   Scales as log(N)          Scales linearly with N for     Scales as log(N)/log(log(N))
path length                                small p; for higher p,
                                           scales as log(N)
Clustering       p (the connection         High, but as p → 1 it          ((m – 1)/2)(log(N)/N), where m
coefficient      probability)              behaves like a random graph    is the number of edges with
                                                                          which a node enters
Robustness       Similar responses to      Similar response as random     Highly resilient to random
to failures      both random and           networks, because the degree   failures, but very sensitive
                 targeted attacks          distribution is similar        to targeted attacks

Figure 3. The transition from supply chain to a survivable supply network. (The figure shows manufacturer, warehouse, and retailer nodes, with a failed node, a failed edge, and the resulting alternate path marked.)

SEPTEMBER/OCTOBER 2004 www.computer.org/intelligent

Dependable Agent Systems

Figure 4. Snapshots of the modeled networks during their growth, each with 70 nodes (left to right: preferential attachment, random attachment, and the proposed attachment rules). MSBs are green, FSBs are red, and battalions are blue.

The third is robustness to random and targeted failure. Preferential attachment—where new nodes entering the network don’t connect uniformly to existing nodes but attach preferentially to higher-degree nodes (see the sidebar for more details)—leads to scale-free networks with very few critical and many not-so-critical nodes. Here we measure a node’s criticality in terms of the number of edges incident on it. So, these networks are robust to random failures (the probability that a critical node fails is very small) but not to targeted attacks (attacking the very few critical nodes would devastate the network). Also, it’s not practically feasible to have all nodes play an equal role in the system—that is, be equally critical. Thus, the network should have a good balance of critical, not-so-critical, and noncritical nodes.
The fourth is efficient rewiring. Rewiring<br />
edges in a network might or might not be feasible,<br />
depending on the domain. But where<br />
it is feasible, it should preserve the other<br />
three traits.<br />
Although complete graphs come equipped<br />
with good survivability components, they<br />
clearly aren’t cost effective. Allowing every<br />
agent in an agent system to communicate<br />
with every other agent uses system b<strong>and</strong>width<br />
inefficiently <strong>and</strong> could completely bog<br />
down the system. So the amount of redundancy<br />
results from a trade-off between cost<br />
<strong>and</strong> survivability.<br />
An illustration<br />
Suppose we want to build a topology for a<br />
military supply chain that must be survivable<br />
in wartime. First, we broadly classify the network<br />
nodes into three types:<br />
• Battalions prefer to attach to a highly connected<br />
node so that the supplies from different<br />
parts of the network will be transported<br />
to them in fewer steps. Battalions<br />
also require quick responses, so they prefer<br />
the subsequent links to attach to nodes at<br />
convenient shorter distances (in our model<br />
we considered a fixed distance of two).<br />
• A forward support battalion prefers to<br />
attach to highly connected nodes so that<br />
its supplies proliferate faster in the network.<br />
The supply range from an FSB goes<br />
up to a particular distance (at most three<br />
in our model).<br />
• A main support battalion also prefers to<br />
attach to a highly connected node to<br />
enable its supplies to proliferate faster in<br />
the network. We assume an unrestricted<br />
supply reach from an MSB, thus facilitating<br />
some long-range connections.<br />
In a conventional logistics network, the<br />
MSBs supply commodities (such as ammunition,
food, <strong>and</strong> fuel) to the FSBs, who in<br />
turn forward them to the battalions. Our<br />
approach doesn’t restrict node functionalities<br />
as such—for example, we assume that<br />
even a battalion can supply commodities to<br />
other battalions if necessary.<br />
Figure 5. How our proposed network performed: (a) the log-log plot of the degree distribution for all three networks; (b) the characteristic path length of the proposed network against the log of the number of nodes.
Growth mechanisms<br />
Start with a small number of nodes—say, m0—and assume that every time a node enters the system, m edges are pointing from it, where m < m0. Battalions, FSBs, and MSBs enter the system in a certain ratio l:m:n, where l > m > n:

• A battalion has one edge pointing from it and a second edge added with a probability p.
• An FSB has three edges pointing from it.
• An MSB has five edges pointing from it.
The attachment rules applied depend on<br />
which node type enters the system:<br />
• For a battalion, the first edge attaches to a node i of degree k_i with the probability

Π_i = k_i / Σ_j k_j.
The second edge, which exists with a<br />
probability p, attaches to a r<strong>and</strong>omly chosen<br />
node at a distance of two.<br />
• For an FSB, the first edge attaches to a node i of degree k_i with the probability

Π_i = k_i / Σ_j k_j.
The subsequent edges attach to a r<strong>and</strong>omly<br />
chosen node at a distance of at most three.<br />
• For an MSB, each edge attaches preferentially to a node i with degree k_i with the probability

Π_i = k_i / Σ_j k_j.

Table 1. Simulation results.

                             Model 1 (random)   Model 2 (preferential)   Model 3 (proposed)
Clustering coefficient       0.0038–0.0039      0.013–0.019              0.35–0.39
Characteristic path length   5.26–5.36          4.09–4.25                4.69–4.79
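In Python, these growth rules can be sketched as follows (our own illustration rather than the UltraLog implementation; the three-node seed core, the helper names, and measuring distance by breadth-first search from the entering node are assumptions):

```python
import random
from collections import deque

def link(adj, a, b):
    """Add an undirected edge."""
    adj[a].add(b)
    adj[b].add(a)

def preferential(adj, rng):
    """Pick an existing node with probability proportional to its degree."""
    pool = [u for u in adj for _ in adj[u]]
    return rng.choice(pool)

def near(adj, src, dmax, rng):
    """A random node at distance 2..dmax from src (BFS), or None."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if dist[u] < dmax:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    cand = [v for v, d in dist.items() if d >= 2]
    return rng.choice(cand) if cand else None

def grow(n, p=0.5, seed=0):
    """Grow the proposed network: battalions (B), FSBs (F), and MSBs (M)
    enter in the ratio 25:4:1, each applying its own attachment rule."""
    rng = random.Random(seed)
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}  # small seed core (an assumption)
    kinds = ['B'] * 25 + ['F'] * 4 + ['M']   # l:m:n = 25:4:1
    for new in range(3, n):
        kind = kinds[(new - 3) % len(kinds)]
        first = preferential(adj, rng)       # first edge: preferential
        adj[new] = set()
        link(adj, new, first)
        if kind == 'B':
            if rng.random() < p:             # second edge with probability p,
                t = near(adj, new, 2, rng)   # to a random node at distance two
                if t is not None:
                    link(adj, new, t)
        elif kind == 'F':
            for _ in range(2):               # three edges total; the rest go to
                t = near(adj, new, 3, rng)   # random nodes at distance <= three
                if t is not None:
                    link(adj, new, t)
        else:                                # MSB: five preferential edges
            for _ in range(4):               # (may add fewer on duplicate draws)
                t = preferential(adj, rng)
                if t != new and t not in adj[new]:
                    link(adj, new, t)
    return adj
```

Running grow(1000) with these defaults mirrors the 25:4:1 entry ratio and p = 1/2 used in the simulation below.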
Simulation <strong>and</strong> analysis<br />
Using this method, we built a network of<br />
1,000 nodes with l, m, <strong>and</strong> n being 25, 4, <strong>and</strong><br />
1 (we obtained these values from the current<br />
configuration of the military logistics network<br />
used in the UltraLog program) <strong>and</strong><br />
p = 1/2. We compared this network’s survivability<br />
with that of two other networks built<br />
using similar mechanisms except that one<br />
used purely preferential attachment rules<br />
(similar to scale-free networks) <strong>and</strong> the other<br />
used purely r<strong>and</strong>om attachment rules (similar<br />
to r<strong>and</strong>om networks) (see Figure 4). All<br />
three networks had an equal number of edges<br />
<strong>and</strong> nodes to ensure fair comparison.<br />
We refer to the networks built from r<strong>and</strong>om,<br />
preferential, <strong>and</strong> proposed attachment<br />
rules as Models 1, 2, <strong>and</strong> 3, respectively. As<br />
we noted earlier, a typical military supply<br />
chain (see Figure 1a) with a tree-like or hierarchical<br />
structure has deficient survivability<br />
components, making it vulnerable to both<br />
r<strong>and</strong>om <strong>and</strong> targeted attacks. Models 1, 2,<br />
<strong>and</strong> 3 outperform the typical supply network<br />
in all survivability components.<br />
Figure 5a shows the three models’ degree<br />
distribution. As expected, the preferential-attachment network has a heavier tail than
the other two networks. We measured survivability<br />
components for all three networks.<br />
The clustering coefficient for Model 3 was<br />
the highest (see Table 1). The Model 3 attachment<br />
rules, especially those for battalions <strong>and</strong><br />
FSBs, contribute implicitly to the clustering<br />
coefficient, unlike the attachment rules in the<br />
other models.<br />
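The clustering coefficient itself is straightforward to measure (a sketch; we average the local coefficient over nodes of degree at least two, one common convention):

```python
def clustering_coefficient(adj):
    """Average local clustering: the fraction of each node's neighbor
    pairs that are themselves connected, averaged over nodes with
    degree >= 2 (an assumed convention)."""
    vals = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        vals.append(2 * links / (k * (k - 1)))
    return sum(vals) / len(vals) if vals else 0.0
```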
The proposed network model’s characteristic<br />
path length measured between 4.69 <strong>and</strong> 4.79<br />
despite the network’s large size (1,000 nodes).<br />
This value puts it between the preferential <strong>and</strong><br />
r<strong>and</strong>om attachment models. Also, as Figure 5b<br />
shows, the characteristic path length increases<br />
in the order of log(N) as N increases. Model 3<br />
clearly displays small-world behavior.<br />
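The characteristic path length can be measured with an all-pairs breadth-first search (a sketch; it averages over connected pairs only, which matters once attacks disconnect the network):

```python
from collections import deque

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected node pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float('inf')
```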
To measure network robustness, we removed<br />
a set of nodes from the network <strong>and</strong><br />
evaluated its resilience to disruptions. We<br />
considered two attack types: random and targeted.
To simulate r<strong>and</strong>om attacks, we removed<br />
a set of r<strong>and</strong>omly chosen nodes; for<br />
targeted attacks, we removed a set of nodes<br />
selected strictly in order of decreasing node<br />
degree. To determine robustness, we measured<br />
how the size of each network’s largest<br />
connected component, characteristic path<br />
length, <strong>and</strong> maximum distance within the<br />
largest connected component changed as a<br />
function of the number of nodes removed. We<br />
expect that in a robust network the size of the<br />
largest connected component is a considerable<br />
fraction of N (usually O(N)), <strong>and</strong> the distances<br />
between nodes in the largest connected<br />
component don’t increase considerably.<br />
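These removal experiments can be sketched as follows (our own illustration; `adj` is an adjacency-set mapping such as a growth model produces, and breadth-first search finds the connected components):

```python
import random
from collections import deque

def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen, best = set(removed), 0
    for s in adj:
        if s in seen:
            continue
        size, q = 0, deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        best = max(best, size)
    return best

def attack(adj, fraction, targeted, seed=0):
    """Remove a fraction of nodes uniformly at random, or strictly in
    decreasing-degree order, and report the surviving largest component."""
    k = int(len(adj) * fraction)
    if targeted:
        victims = sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k]
    else:
        victims = random.Random(seed).sample(sorted(adj), k)
    return largest_component(adj, set(victims))
```

On a scale-free-like topology, the targeted variant typically shatters the largest component after removing far fewer nodes than the random variant.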
Figure 6. Responses of the three networks to random attacks, plotted as (a) the size of the largest connected component, (b) characteristic path length, and (c) maximum distance in the largest connected component against the percentage of nodes removed from each network.

Figure 7. The three networks’ responses to targeted attacks, plotted as (a) the size of the largest connected component, (b) characteristic path length, and (c) maximum distance in the largest connected component against the percentage of nodes removed from each network.

For random failures, Figure 6 shows that Model 3’s robustness nearly matches that of the preferential-attachment network (note that scale-free networks are highly resilient to random failures). Also, the decrease in the largest connected component’s size is linear with respect to the number of nodes removed, which corresponds to the slowest possible decrease.
So, we can safely conclude that these networks are robust to random failures—most of the nodes in the network have a degree less than four, and removing smaller-degree nodes impacts the networks much less than removing high-degree nodes (called hubs).

The Authors
Hari Prasad Thadakamalla is a PhD student in the Department of <strong>Industrial</strong><br />
<strong>and</strong> <strong>Manufacturing</strong> Engineering at Pennsylvania State University, University<br />
Park. His research interests include supply networks, search in complex networks,<br />
stochastic systems, <strong>and</strong> control of multiagent systems. He obtained<br />
his MS in industrial engineering from Penn State. Contact him at<br />
hpt102@psu.edu.<br />
Usha N<strong>and</strong>ini Raghavan is a PhD student in industrial <strong>and</strong> manufacturing<br />
engineering at Pennsylvania State University, University Park. Her research<br />
interests include supply chain management, graph theory, complex adaptive<br />
systems, <strong>and</strong> complex networks. She obtained her MSc in mathematics from<br />
the Indian Institute of Technology, Madras. Contact her at uxr102@psu.edu.<br />
Soundar Kumara is a Distinguished Professor of industrial <strong>and</strong> manufacturing<br />
engineering. He holds joint appointments with the Department of Computer<br />
Science <strong>and</strong> Engineering <strong>and</strong> School of Information Sciences <strong>and</strong> Technology<br />
at Pennsylvania State University. His research interests include<br />
complexity in logistics <strong>and</strong> manufacturing, software agents, neural networks,<br />
<strong>and</strong> chaos theory as applied to manufacturing process monitoring <strong>and</strong> diagnosis.<br />
He’s an elected active member of the International Institute of Production<br />
Research. Contact him at skumara@psu.edu.<br />
Réka Albert is an assistant professor of physics at Pennsylvania State University<br />
<strong>and</strong> is affiliated with the Huck Institutes of the Life Sciences. Her<br />
main research interest is modeling the organization <strong>and</strong> dynamics of complex<br />
networks. She received her PhD in physics from the University of Notre<br />
Dame. She is a member of the American Physical Society <strong>and</strong> the Society for<br />
Mathematical Biology. Contact her at ralbert@phys.psu.edu.<br />
These networks’ responses to targeted attacks are weaker than their resilience to random attacks (see Figure 7). The
size of the largest component decreases much<br />
faster for the proposed network than for the<br />
other two networks, but the proposed network<br />
performs better on the other two robustness<br />
measures. That is, the distances in the connected<br />
component are considerably smaller<br />
when more than 10 percent of nodes are<br />
removed.<br />
Figure 8. The proposed network’s responses to targeted attacks for different values of m1, m2, and m3 (m1 = 4, m2 = 10, m3 = 25; m1 = 4, m2 = 8, m3 = 12; and m1 = 3, m2 = 6, m3 = 10), plotted as the size of the largest connected component against the percentage of nodes removed.

Table 2. The proposed network’s characteristic path length for different m1, m2, and m3 values.

Values of m1, m2, and m3     Characteristic path length
m1 = ∞, m2 = ∞, m3 = ∞       4.4
m1 = 4, m2 = 10, m3 = 25     6.2
m1 = 4, m2 = 8, m3 = 12      7.1
m1 = 3, m2 = 6, m3 = 10      8.0

We can improve robustness to targeted attacks by introducing constraints in the attachment rules. Here we assume that a node’s type constrains its degree—that is, MSBs, FSBs, and battalions can’t have more than m1, m2, and m3 edges, respectively, incident on them. This is a reasonable assumption because in military logistics (or any organization’s logistics management, for that matter), the suppliers might not be able to cater to more than a certain number of battalions or other suppliers. Initial experiments (see Figure 8) show that a network with these constraints displays improved robustness to targeted attacks while deviating little in clustering coefficient. However, as we restrict how many links a node can receive, the network’s characteristic path length increases (see Table 2). Clearly, a trade-off exists between robustness to targeted attacks and characteristic path length.
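Folding such degree caps into preferential attachment needs only a filter on eligible targets (a sketch; the cap table uses the first setting above, and the roulette-wheel selection realizes degree-proportional choice among uncapped nodes):

```python
import random

CAPS = {'B': 4, 'F': 10, 'M': 25}   # m1, m2, m3 (illustrative values)

def pick_target(degree, kind_of, rng):
    """Preferential attachment restricted to nodes below their type's cap.

    degree:  node -> current degree; kind_of: node -> 'B', 'F', or 'M'.
    Returns None when every node has reached its cap."""
    eligible = [u for u in degree if degree[u] < CAPS[kind_of[u]]]
    if not eligible:
        return None
    weights = [degree[u] for u in eligible]
    total = sum(weights)
    if total == 0:
        return rng.choice(eligible)
    # Roulette-wheel selection: probability proportional to degree.
    r = rng.uniform(0, total)
    acc = 0.0
    for u, w in zip(eligible, weights):
        acc += w
        if r <= acc:
            return u
    return eligible[-1]
```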
The fourth measure of survivability, network<br />
adaptivity, relates more to<br />
node functionality than to<br />
topology. Node functionality<br />
should facilitate the ability to<br />
rewire. For example, if a supplier<br />
can’t fulfill a customer’s<br />
dem<strong>and</strong>s, the customer seeks<br />
an alternate supplier—that is,<br />
the edge connected to the supplier<br />
is rewired to be incident on another supplier.<br />
Our model rewires according to its<br />
attachment rules. We conjecture that in such<br />
a case, other survivability components (clustering<br />
coefficient, characteristic path length,<br />
<strong>and</strong> robustness) will be intact. But to make a<br />
stronger argument we need more analysis in<br />
this direction.<br />
The growth mechanism we describe is<br />
more like an illustration because<br />
real-world data aren’t available, but we can<br />
always modify it to incorporate domain<br />
constraints. For example, we’ve assumed<br />
that a new node can attach preferentially to<br />
any node in the network, which might not<br />
be a realistic assumption. If specific geographical<br />
constraints are known, we can<br />
modify our mechanism to make the new<br />
node entering the system attach preferentially<br />
only within a set of nodes that satisfy<br />
the constraints.<br />
Acknowledgments<br />
We thank the anonymous reviewers for their<br />
helpful comments. We acknowledge <strong>DARPA</strong> for<br />
funding this work under grant MDA972-01-1-<br />
0038 as part of the UltraLog program.<br />
References<br />
1. J.M. Swaminathan, S.F. Smith, <strong>and</strong> N.M.<br />
Sadeh, “Modeling Supply Chain Dynamics:<br />
A Multiagent Approach,” Decision Sciences,<br />
vol. 29, no. 3, 1998, pp. 607–632.<br />
2. A.-L. Barabási <strong>and</strong> R. Albert, “Emergence of<br />
Scaling in R<strong>and</strong>om Networks,” Science, vol.<br />
286, Oct. 1999, pp. 509–512.<br />
Survivability of a Distributed Multi-Agent Application - A Performance Control<br />
Perspective<br />
Nathan Gnanasamb<strong>and</strong>am, Seokcheon Lee, Soundar R.T. Kumara, Natarajan Gautam<br />
Pennsylvania State University<br />
University Park, PA 16802<br />
{gsnathan, stonesky, skumara, ngautam}@psu.edu<br />
Wilbur Peng, Vikram Manikonda<br />
Intelligent Automation Inc.<br />
Rockville, MD 20855<br />
{wpeng, vikram}@i-a-i.com<br />
Marshall Brinn<br />
BBN Technologies<br />
Cambridge, MA 02138<br />
mbrinn@bbn.com<br />
Mark Greaves<br />
<strong>DARPA</strong> IXO<br />
Arlington, VA 22203<br />
mgreaves@darpa.mil<br />
Abstract<br />
Distributed Multi-Agent Systems (DMAS) such as supply<br />
chains functioning in highly dynamic environments need to<br />
achieve maximum overall utility during operation. The utility<br />
from maintaining performance is an important component<br />
of their survivability. This utility is often met by identifying<br />
trade-offs between quality of service <strong>and</strong> performance.<br />
To adaptively choose the operational settings for better utility,<br />
we propose an autonomous <strong>and</strong> scalable queueing theory<br />
based methodology to control the performance of a hierarchical<br />
network of distributed agents. By formulating<br />
the MAS as an open queueing network with multiple classes<br />
of traffic we evaluate the performance <strong>and</strong> subsequently the<br />
utility, from which we identify the control alternative for a<br />
localized, multi-tier zone. When the problem scales, another<br />
larger queueing network could be composed using<br />
zones as building blocks. This method advocates the systematic
specification of the DMAS’s attributes to aid realtime<br />
translation of the DMAS into a queueing network. We<br />
prototype our framework in Cougaar <strong>and</strong> verify our results.<br />
1. Introduction<br />
Distributed multi-agent systems (DMAS), through adaptivity,<br />
have enormous potential to act as the “brains” behind<br />
numerous emerging applications such as computational<br />
grids, e-commerce hubs, supply chains <strong>and</strong> sensor<br />
networks [13]. The fundamental hallmark of all these applications<br />
is dynamic <strong>and</strong> stressful environmental conditions,<br />
of one type or another, in which the MAS as a whole must survive even if it suffers temporary or permanent damage.
While the survival notion necessitates adaptivity to diverse<br />
conditions along the dimensions of performance, security<br />
<strong>and</strong> robustness, delivering the correct proportion of these<br />
quantities can be quite a challenge. From a performance<br />
st<strong>and</strong>point, a survivable system can deliver excellent Quality<br />
of Service (QoS) even when stressed. A DMAS could be<br />
considered survivable if it can maintain at least x% of system<br />
capabilities <strong>and</strong> y% of system performance in the face<br />
of z% of infrastructure loss <strong>and</strong> wartime loads (x, y, z are<br />
user-defined) [7].<br />
We address a piece of the survivability problem by building<br />
an autonomous performance control framework for the<br />
DMAS. It is desirable that the adaptation framework be<br />
generic <strong>and</strong> scalable especially when building large-scale<br />
DMAS such as UltraLog [2]. For this, one can utilize a<br />
methodology similar to Jung <strong>and</strong> Tambe [19], composing<br />
the bigger society of smaller building blocks (i.e. agent<br />
communities). Although Jung <strong>and</strong> Tambe [19] successfully<br />
employ strategies for co-operativeness <strong>and</strong> distributed<br />
POMDP to analyze performance, an increase in the number<br />
of variables in each agent can quickly render POMDP ineffective<br />
even in reasonably sized agent communities due to
the state-space explosion problem. In [27], Rana and Stout identify data-flows in the agent network and model scalability with Petri nets, but their focus is on identifying synchronization points, deadlocks, and dependency constraints, with coarse support for performance metrics relating to delays and processing times for the flows.

Figure 1. Operational Layers forming the MAS.

Tesauro et al. [34]
propose a real-time MAS-based approach for data centers<br />
that is self-optimizing based on application-specific utility.<br />
While [19, 27] motivate the need to estimate performance of<br />
large DMAS using a building block approach, [34] justifies<br />
the need to use domain specific utility whose basis should<br />
be the network’s service-level attributes such as delays, utilization<br />
<strong>and</strong> response times.<br />
We believe that by using queueing theory we can analyze<br />
data-flows within the agent community with greater granularity<br />
in terms of processing delays <strong>and</strong> network latencies<br />
<strong>and</strong> also capitalize on using a building block approach by<br />
restricting the model to the community. Queueing theory<br />
has been widely used in networks <strong>and</strong> operating systems<br />
[5]. However, the authors have not seen the application<br />
of queueing to MAS modeling and analysis. Since agents
lend themselves to being conveniently represented as a network
of queues, we concentrate on engineering a queueing-theory-based
adaptation (control) framework to enhance
application-level performance.
Inherently, the DMAS can be visualized as a multi-layered
system, as depicted in Figure 1. The top-most
layer is where the application resides, usually conforming
to some organization such as a mesh or tree. The infrastructure
layer not only abstracts away many of the complexities
of the underlying resources (such as CPU and bandwidth),
but more importantly provides services (such as Message
Transport) and services that aid agent-agent interaction (such as
naming and directory). The bottommost layer is where the
actual computational resources, memory and bandwidth reside.
Most studies in the literature do not make this distinction<br />
<strong>and</strong> as such control is not executed in a layered<br />
fashion. Some studies, such as [35, 17], consider controlling
attributes in the physical or infrastructural layers so
that certain properties (e.g. robustness) result and/or
the facilities provided by these layers are taken advantage
of. Often, this requires rewiring the physical layer, the availability
of an infrastructure-level service, or the ability of the
application to share information with underlying layers in
a timely fashion for control purposes. In this initial work,
we consider control only due to application-level trade-offs<br />
such as quality of service versus performance <strong>and</strong> assume<br />
that infrastructure level services (such as load-balancing,<br />
priority scheduling) or physical level capabilities (such as<br />
rewiring) are not possible. While we intend to extend the<br />
approach to multi-layered control, it must be noted that it<br />
is not always possible for the application (or the application<br />
manager) to have access to all the underlying layers due<br />
to security reasons. In autonomic control of data centers,<br />
the application manager may have complete authority over<br />
parameters in the physical layer (servers, buffers, network),<br />
the infrastructure (middle-ware) <strong>and</strong> the applications. However,<br />
in DMAS scenarios, especially when dealing with mobile<br />
agents (as an application), trust between the layers is<br />
often partial, forcing them to negotiate parameters through
authorized channels. Hence, each layer must be capable of<br />
adapting with minimum cross-layer dependencies.<br />
Our contribution in this work is to combine queueing
analysis and application-level control to engineer a generic
framework that is capable of self-optimizing its domain-specific
utility. Secondly, we provide a methodology for
engineering a self-optimizing DMAS to assure application-level
survivability. While we see utility improvements
by adopting application-level adaptivity, we understand that
further improvement may be gained by utilizing the adaptive
capabilities of the underlying layers.
Before we consider the details of our framework, we<br />
classify the performance control approaches in literature in<br />
Section 2. We present the details for our Cougaar based<br />
test-bed system in Section 3. The architectural details of<br />
our framework are provided in Section 4. We provide an empirical
evaluation in Section 5 <strong>and</strong> finally conclude with discussions<br />
<strong>and</strong> future work in Section 6.<br />
2. Background <strong>and</strong> Motivation<br />
2.1 Approaches in Literature<br />
Because of the diversity of literature on control frameworks<br />
<strong>and</strong> performance evaluation, we examined a representative<br />
subset primarily on the basis of control objective,<br />
(component) interdependence <strong>and</strong> autonomy, generality,<br />
composability, real-time capability (off-line/on-line<br />
control) <strong>and</strong> layering in control architecture.
In some AI based approaches such as [32, 10], behavioral<br />
or rule based controllers are employed to make the<br />
system exhibit particular behavior based upon logical reasoning<br />
or learning. While performance is not the objective,<br />
layered learning is an interesting capability that may<br />
be helpful in a large-scale MAS. Learning may also be
statistical, where the parameters of a transfer
function are learnt from empirical data to subsequently
drive feedback control [8]. Another architectural framework
called MONAD [37], utilizes a hierarchical <strong>and</strong> distributed<br />
behavior-based control module, with immense flexibility<br />
through scripting for role <strong>and</strong> resource allocation,<br />
and co-ordination. While many of these approaches favor
the “sense-plan-act” or “sense and respond” paradigm, and
some partially support flexibility through scripting, important
unanswered questions remain: what happens when
the system size changes, can all axioms and behaviors be learnt
a priori, and what is the performance impact of size (i.e.
scalability)?
Control theoretic approaches in software performance<br />
optimization are becoming important [22, 29], with software<br />
becoming increasingly more complex, multi-layered<br />
and having real-time requirements. However, because of
the dynamic system boundaries, size, varying measures of
performance and non-linearity in DMAS, it is very difficult
to design a strict control-theoretic control process [21].
Some approaches such as [21, 34] take the heuristic path,<br />
with occasional analogs to control theory, with an emphasis<br />
on application or domain-specific utility. Kokar et al.<br />
[22] refer to this utility as benefit function <strong>and</strong> elaborate on<br />
various analogs between software systems <strong>and</strong> traditional<br />
control systems. From the perspective of autonomic control<br />
of computer systems, Bennani <strong>and</strong> Menasce [4] study<br />
the robustness of self-management techniques for servers<br />
under highly variable workloads. Although queueing theory<br />
has been used in this work, any notion of components<br />
being distributed or agent-based seems to be absent.<br />
Furthermore, exponential smoothing or regression based<br />
load-forecasting may not be sufficient to address situations<br />
caused by wartime dynamics, catastrophic failure <strong>and</strong> distributed<br />
computing. Nevertheless, in our approach we have<br />
a notion of controlling a distributed application’s utility using<br />
queueing theory.<br />
Numerous market-based control mechanisms are available
in the literature, such as [24, 9, 12, 6]. In market-based
control systems, agents emulate buyers <strong>and</strong> sellers in a<br />
market acting only with locally available information yet<br />
helping us realize global behaviour for the community of<br />
agents. While these methods are very effective <strong>and</strong> offer<br />
desirable properties such as decentralization, autonomy <strong>and</strong><br />
control hierarchy, they have been used for resource allocation<br />
[24, 9] and resource control [6]. The Challenger [9]
system seeks to minimize mean flow time (job completion
time minus job origination time); each task is allocated to the agent
providing the least processing time. Load balancing is another
application, as applied by Ferguson et al. [12]. Resource allocation
and load-balancing can be thought of as infrastructure-level
services that agent frameworks such as Cougaar
[1] provide; hence, in our work we focus on application-level
performance and the associated utility to the DMAS.
Finite state machines, hybrid automata and their
variants have been the foci of many research paths in agent
control, as in [11, 23]. The idea here is to utilize the states
of the multi-agent system to represent, validate, evaluate,<br />
<strong>and</strong> choose plans that lead the system towards the goal. Often,<br />
the drawback here is that as the number of agents
increases, the state-space approaches tend to become intractable.
Heuristics have widely been used in controlling multi-agent
systems, primarily in the following sense: searching
and evaluating options based on domain knowledge and
eventually picking a course of action (perhaps a compound action
composed of a schedule of individual actions). The
main idea in recent heuristics based control as exemplified<br />
by [36, 26, 31] is that schedules of actions are chosen based<br />
upon requirements such as costs, feasibilities for real-time<br />
contexts, complexity, quality etc. Opportunistic planning,
an interesting idea mentioned in Soto et al. [31], refers
to best-effort planning (maximum quality) considering
available resources. These meta-heuristics offer very effective,
special-purpose solutions to control agent behavior,<br />
however, to be more flexible, we separate the performance
evaluation <strong>and</strong> the domain-specific application utility computation.<br />
Given that we have a model for performance estimation<br />
(whose parameters <strong>and</strong> state-space are known), dynamic<br />
programming (DP) <strong>and</strong> its adaptive version - reinforcement<br />
learning (RL), <strong>and</strong> model predictive control (MPC) have<br />
been used to find the control policy [3, 33, 20, 28, 25].<br />
Since the complexity of finding the optimal policy grows<br />
exponentially with the state space [3] <strong>and</strong> convergence has<br />
to be ensured in RL [33, 20], we take an MPC-like approach<br />
in our work for finding quick solutions in real-time. We discuss<br />
this further in Section 4.<br />
2.2 Related Work<br />
In large scale MAS applications, performance estimation<br />
<strong>and</strong> modeling itself can be a formidable task as illustrated<br />
by [16] in the UltraLog [2] context. UltraLog [2],
built on Cougaar [1], uses a host of architectural
features for heuristic control, such as operating modes, conditions,
and plays and play-books, as described in [21]. Helsinger
et al. [15] incorporate the aforementioned features into<br />
their closed-loop heuristic framework that balances the different<br />
dimensions of system survivability through targeted
defense mechanisms, trade-offs <strong>and</strong> layered control actions.<br />
The importance of high-level, system specifications (interchangeably<br />
called TechSpecs, specification database, component<br />
database) has been emphasized in many places such<br />
as [18, 21, 14]. These specifications contain componentwise,<br />
static input/output behavior, operating requirements<br />
<strong>and</strong> control actions of agents along with domain measures<br />
of performance <strong>and</strong> computation methodologies [14].<br />
Also, queueing network based methodologies for offline<br />
<strong>and</strong> design-time performance evaluation have been applied<br />
and validated in [14, 30]. Building on these ideas, we develop
a real-time framework with queueing-based performance
prediction capabilities.
Figure 2(a): MAS building block: Community
Figure 2(b): Agent society formed by composing communities
2.3 Problem Statement<br />
Since it resides in the top-most layer, a DMAS application's
survivability depends on its ability to leverage its
knowledge of the domain, the system's overall utility and
the available control-knobs. The utility of the application is the
combined benefit along several conflicting (e.g. completeness
and timeliness [7, 2]) and/or independent (e.g. confidentiality
and correctness [7, 2]) dimensions, which the application
tries to maximize in a best-effort sense through
trade-offs. Understandably, in a distributed multi-agent
setting, mechanisms to measure, monitor <strong>and</strong> control this<br />
multi-criteria utility function become hard <strong>and</strong> inefficient,<br />
especially under conditions of scale-up. Given that the application<br />
does not change its high-level goals, task-structure<br />
or functionality in real-time, it is beneficial to have a framework<br />
that assists in the choice of operational modes (e.g.
plan quality) that maximize the utility from performance.<br />
Hence, the research objective of this work is to design <strong>and</strong><br />
develop a generic, real-time, self-controlling framework for<br />
DMAS, that utilizes a queueing network model for performance<br />
evaluation <strong>and</strong> a learned utility model to select an<br />
appropriate control alternative.<br />
2.4 Solution Methodology<br />
This research concentrates on adjusting the application-level
parameters or operating modes (opmodes for short)
within the distributed agents to make an autonomous choice
of operational parameters for agents in a reasonably-sized
domain (called an agent community). The choice of opmodes
is based on the perceived application-level utility of<br />
the combined system (i.e. the whole community) that current<br />
environmental conditions allow. We assume that the<br />
application’s utility depends on the choice of opmodes at<br />
the agents constituting the community because the opmodes<br />
directly affect the performance. A queueing network model<br />
is utilized to predict the impact of DMAS control settings<br />
<strong>and</strong> environmental conditions on steady-state performance<br />
(in terms of end-to-end delays in flows), which in turn is
used to estimate the application-level utility. After evaluating
and ranking several alternatives from among the feasible
set of operational settings on the basis of utility, the best
choice is picked.

Figure 2. MAS Community and Society
3. Overview of Application (CPE) Scenario<br />
The Continuous Planning <strong>and</strong> Execution (CPE) Society<br />
is a command and control (C2) MAS built on Cougaar
(the DARPA Agent Framework [1]) that serves as the test-bed
for performance control. Designed as a building block for<br />
larger scale MAS, the primary CPE prototype consists of<br />
three tiers (Brigade, Battalion, Company) as shown in Figure<br />
2a. While the discussion is mainly with respect to the<br />
structure of CPE, the system can be grown by combining<br />
many CPE communities to form large agent societies as<br />
shown in Figure 2b.<br />
CPE embodies a complete military logistics scenario<br />
with agents emulating roles such as suppliers, consumers<br />
<strong>and</strong> controllers all functioning in a dynamic <strong>and</strong> hostile (destructive)<br />
external environment. Embedded in the hierarchical<br />
structure of CPE are both command and control, and
superior-subordinate relationships. The subordinates compile
sensor updates <strong>and</strong> furnish them to superiors. This<br />
enables the superiors to perform the designated function<br />
of creating plans (for maneuvering <strong>and</strong> supply) as well as<br />
control directives for downstream subordinates. Upon receipt<br />
of plans, the subordinates execute them. The supply<br />
agents replenish consumed resources periodically. This<br />
high level system definition is to be executed continuously<br />
by the application with maximum achievable performance<br />
in the presence of stresses that include temporary <strong>and</strong> catastrophic<br />
failure. Stresses associated with wartime situations<br />
cause the resource allocation (CPU, memory, bandwidth)
and offered load (due to increased planning requirements)
to fluctuate immensely.

Figure 3. Traffic flow within CPE
As part of the application-level adaptivity features, a set<br />
of opmodes are built into the system. Opmodes allow individual<br />
tasks (such as plans, updates, control) to be executed<br />
at different qualities or to be processed at different rates. We<br />
assume that TechSpecs for the CPE application (similar to<br />
[14]) are available to be utilized by the control framework.<br />
Although CPE and UltraLog are functionally distinct,
the same flavor of activities is reflected in both. Both
share the same Cougaar infrastructure; execute planning
in dynamic, distributed settings with similar QoS requirements;
and are each one application with physically
distributed components interconnected by task flows (as
shown in Figure 3 in the case of CPE), wherein the individual
utilities of the components contribute to the global
survivability.
4. Architecture of the Performance Control<br />
Framework<br />
The distributed performance control framework that<br />
accomplishes application-level survivability while operating<br />
amidst infrastructure/physical layer <strong>and</strong> environmental<br />
stresses is represented in Figure 4. This representation consists<br />
of activities, modules, knowledge repositories <strong>and</strong> information<br />
flow through a distributed collection of agents.<br />
The features for adaptivity are solely at the application level<br />
without considering infrastructure or physical level adaptivity<br />
such as dynamically allocating processor share or adjusting<br />
the buffer sizes.<br />
Figure 4. Architecture Overview<br />
Architecture Overview<br />
When the application is stressed by an amount S by the<br />
underlying layers (due to under-allocation of resources)<br />
<strong>and</strong> the environment (due to increased workloads during<br />
wartime conditions), the DMAS Controller has to examine<br />
all its performance-related variables from set X <strong>and</strong> the<br />
current overall performance P in order to adapt. The variables<br />
that need to be maintained are specified in the TechSpecs
and may include delays, time-stamps, utilizations and
their statistics. They are collected in a distributed fashion
through measurement points (MP), which are “soft” storage
containers residing inside the agents <strong>and</strong> contain information<br />
on what, when <strong>and</strong> how they should be measured.<br />
The DMAS Controller knows the set of flows F that traverse<br />
the network <strong>and</strong> the set of packet types T from the<br />
TechSpecs. With (F, T, X, C), where C is a suggestion<br />
based on prior effectiveness from the DMAS Controller, the<br />
Model Builder can select a suitable queueing model template<br />
Q. The Control Set Evaluator knows the current opmode<br />
set O as well as the set of possible opmodes, OS<br />
from TechSpecs. To evaluate the performance due to a candidate
opmode set O′, the Control Set Evaluator uses the
Queueing Model with a scaled set of operating conditions<br />
X′. Once the performance P′ is estimated by the Queueing
Model, it can be cached in the performance database PDB
and then sent to the Utility Calculator. The Utility Calculator
computes the domain utility (U′) due to (O′, P′)
and caches it in the utility database, UDB. Subsequently,
the optimal operating mode O ∗ is identified <strong>and</strong> sent to the
DMAS Controller. The functional units of the architecture<br />
are distributed but for each community that forms part of<br />
a MAS society, O ∗ will be calculated by a single agent.<br />
We now examine the functionality <strong>and</strong> role offered by each<br />
component of the framework in greater detail.<br />
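One iteration of this loop can be sketched as follows. The component roles mirror Figure 4 (PDB/UDB caches, performance prediction, utility scoring), but the candidate enumeration, the toy delay model and the utility weights are invented for illustration:

```python
from itertools import product

def select_opmodes(flows, types, stats, opmode_space, queueing_model, utility_fn):
    """Sketch of one DMAS Controller iteration (cf. Figure 4): enumerate
    candidate opmode vectors O', predict performance P' with the queueing
    model, score domain utility U', and return the best O*."""
    pdb, udb = {}, {}                         # performance and utility caches
    best_o, best_u = None, float("-inf")
    for candidate in product(*opmode_space):  # feasible opmode vectors from OS
        perf = queueing_model(flows, types, stats, candidate)   # P'
        pdb[candidate] = perf
        util = utility_fn(candidate, perf)                      # U'(O', P')
        udb[candidate] = util
        if util > best_u:
            best_o, best_u = candidate, util
    return best_o, best_u

# Toy usage: two agents, each with a low (0) or high (1) quality opmode;
# utility rewards quality but penalizes predicted delay (numbers invented).
delay = lambda f, t, s, o: sum(1.0 / (2.0 - 0.8 * q) for q in o)
utility = lambda o, p: sum(o) - 0.5 * p
best, u = select_opmodes(None, None, None, [(0, 1), (0, 1)], delay, utility)
print(best)  # -> (1, 1)
```

In the actual framework the enumeration would be pruned (an MPC-like search, Section 4.4) rather than exhaustive, since the candidate set grows exponentially with community size.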
4.1 Self-Monitoring Capability<br />
Any system that wants to control itself should possess<br />
a clear specification of the scope of the variables it has to<br />
monitor. The TechSpecs is a distributed structure that supports<br />
this purpose by housing meta-data about all variables,<br />
X, that have to be monitored in different portions of the<br />
community (refer to [14]). The data/statistics, collected in a
distributed way, are then aggregated to assist in the choice of
control alternatives by the top-level controller that each community
possesses.
The attributes that need to be tracked are formulated<br />
in the form of measurement points (MP ). For example,<br />
one simple measurement could be specified as<br />
{what = delay, when = every packet, how =
timestamp at receiving end − timestamp at sending end},
which is subsequently stored in an MP. Each agent can
look up its own TechSpecs <strong>and</strong> from time-to-time forward<br />
a measurement to its superior. The superior can analyze<br />
this information (e.g. calculate statistics such as mean or
variance) <strong>and</strong>/or add to this information <strong>and</strong> forward it<br />
again. We have measurement points for time-periods, timestamps,<br />
operating-modes, control <strong>and</strong> generic vector-based<br />
measurements. These measurement points can be chained<br />
for tracking information for a flow such that information is<br />
tagged-on at every point the flow traverses. For the sake of<br />
reliability, the information contained in these agents
is replicated at several points, so that when packets
arrive late or fail to arrive at all, previously stored
packets and their corresponding information can be utilized
for control purposes.
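A minimal sketch of such a measurement point follows; the class and field names are illustrative, not Cougaar's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class MeasurementPoint:
    """'Soft' storage container inside an agent: meta-data on what to
    measure, when, and how, plus the collected samples.  (A sketch with
    invented field names, not the framework's real implementation.)"""
    what: str
    when: str
    how: str
    samples: list = field(default_factory=list)

    def record(self, value):
        self.samples.append(value)

    def mean(self):
        # A superior agent could compute such statistics before forwarding.
        return sum(self.samples) / len(self.samples) if self.samples else None

# The delay example from the text: receive minus send timestamp per packet.
mp = MeasurementPoint(what="delay", when="every packet",
                      how="timestamp at receiving end - timestamp at sending end")
for send_ts, recv_ts in [(0.0, 0.4), (1.0, 1.6)]:
    mp.record(recv_ts - send_ts)
print(mp.mean())  # average delay, ready to be forwarded to a superior
```

Chaining, as described above, would amount to each agent on a flow appending its own sample before forwarding the measurement onward.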
4.2 Self-Modeling Capability<br />
One of the key features of this framework is its
capability to choose, from several provided queueing model templates,
a performance model for analyzing the current system configuration.
The type of model that is utilized is based on its accuracy,
its computation time and its history of effectiveness. For example,
a simulation based queueing model may be very accurate<br />
but cannot evaluate enough alternatives in limited time, in<br />
which case an analytical model (such as BCMP [5], QNA<br />
[38]) is preferred.<br />
The inputs to the model builder are the flows that traverse<br />
the network (F ), the types of packets (T ) <strong>and</strong> the current<br />
configuration of the network. If at a given time, we know<br />
that there are n agents interconnected in a hierarchical fashion<br />
then the role of this unit is to represent that information<br />
in the required template format (Q). The current number<br />
of agents is known to the controller by tracking the measurement<br />
points. For example, if there is no response from<br />
an agent for a sufficient period of time, then for the purpose<br />
of modeling, the controller may assume the agent to<br />
be non-existent. In this way dynamic configurations can
be handled. On the other hand, TechSpecs do mandate
connections according to superior-subordinate relationships<br />
thereby maintaining the flow structure at all times. Once the<br />
modeling is complete, the MAS has the capability to analyze
its current performance using the selected type of model.
The MAS retains the flexibility to choose another model
template in a later iteration.
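The template-selection idea can be sketched as follows, assuming hypothetical per-template evaluation costs and effectiveness scores (the latter playing the role of the suggestion C in the text):

```python
def choose_template(templates, time_budget, history):
    """Pick a queueing model template (simulation, BCMP, QNA, ...) that
    fits the evaluation time budget, preferring the one with the best
    record of prior effectiveness.  (Illustrative selection rule; the
    costs and scores below are invented, not measured.)"""
    feasible = [t for t in templates if t["eval_cost"] <= time_budget]
    if not feasible:
        # Nothing fits the budget: fall back to the cheapest template.
        return min(templates, key=lambda t: t["eval_cost"])
    return max(feasible, key=lambda t: history.get(t["name"], 0.0))

templates = [
    {"name": "simulation", "eval_cost": 10.0},  # accurate but slow
    {"name": "BCMP",       "eval_cost": 0.5},
    {"name": "QNA",        "eval_cost": 0.8},
]
history = {"BCMP": 0.7, "QNA": 0.9}            # prior effectiveness scores
print(choose_template(templates, time_budget=1.0, history=history)["name"])  # QNA
```

With a tight real-time budget the analytical templates win even when the simulation would be more accurate, which is exactly the trade-off the text describes.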
4.3 Self-Evaluating Capability<br />
The evaluation capability, the first step in control, allows<br />
the MAS to examine its own performance under a given<br />
set of plausible conditions. This prediction of performance<br />
is used for the elimination of control alternatives that may<br />
lead to instabilities. Our notion of performance evaluation<br />
is similar to [34]. While Tesauro et al. [34] compute the<br />
resource level utility functions (based on the application<br />
manager’s knowledge of the system performance model)<br />
that can be combined to obtain a globally optimal allocation<br />
of resources, we predict the performance of the MAS<br />
as a function of its operating modes in real-time (within<br />
Queueing Model) <strong>and</strong> then use it to calculate its global utility<br />
(some more differences are pointed out in Section 4.4).<br />
By introducing a level of indirection, we may get some<br />
desirable properties because we separate an application’s<br />
domain-specific utility computation from performance prediction<br />
(or analysis). This theoretically enables us to predict<br />
the performance of any application whose TechSpecs are<br />
clearly defined <strong>and</strong> then compute the application-specific<br />
utility. In both cases, control alternatives are picked based<br />
on best-utility. We discuss the notion of control alternatives<br />
in Section 4.4. Also, our performance metrics (and hence
utility) are based on service-level attributes such as end-to-end
delay and latency, which is a desirable attribute of
autonomic systems [34].
When plan, update <strong>and</strong> control tasks (as mentioned in<br />
Section 3) flow in this heterogeneous network of agents<br />
in predefined routes (called flows), the processing <strong>and</strong> wait<br />
times of tasks at various points in the network are not alike.<br />
This is because the configuration (number of agents allocated
on a node), resource availability (load due to other
contending software) and environmental conditions at each
agent are different. In addition, the tasks themselves can be
of varying quality or fidelity, which affects the time taken
to process them. Under these conditions, performance is
estimated on the basis of the end-to-end delay involved in a
“sense-plan-respond” cycle.

Table 1. Notation

Symbol    Description
N         Total number of nodes in the community
λ_ij      Average arrival rate of class j at node i
1/µ_ijk   Average processing time of class j at node i at quality k
M         Total number of classes
T_i       Routing probability matrix for class i
W_ijk     Steady-state waiting time for class j at node i at quality k
Q_ij      Set of qualities at which a class j task can be processed at node i
The primary performance prediction tool that we use is
the Queueing Network Model (QNM) [5]. The QNM
is the representation of the agent community in the queueing
domain. As the first step of performance estimation, the
agent community needs to be translated into a queueing network
model. Table 1 summarizes the notation used in this section.
Inputs <strong>and</strong> outputs at a node are regarded as tasks. The<br />
rate at which tasks of class j are received at node i is captured<br />
by the arrival rate (λ ij ). Actions by agents consume<br />
time, so they get abstracted as processing rates (µ ij ). Further,<br />
each task can be processed at a quality k ∈ Q ij , that<br />
causes the processing rates to be represented as µ ijk . Statistics<br />
of processing times are maintained at each agent in PDB<br />
to arrive at a linear regression model between quality k <strong>and</strong><br />
µ ijk . Flows get associated with classes of traffic denoted<br />
by the index j. If a connection exists between two nodes,<br />
this is converted to a transition probability p ij , where i is<br />
the source <strong>and</strong> j is the target node. Typically, we consider<br />
flows originating from the environment, getting processed<br />
<strong>and</strong> exiting the network making the agent network an open<br />
queueing network [5]. Since we may typically have multiple<br />
flows through a single node, we consider multi-class<br />
queueing networks where the flows are associated with a<br />
class. Performance metrics such as delays for the “sense-plan-respond”
cycle are captured in terms of average waiting
times, W_ijk. As mentioned earlier, TechSpecs is a convenient
place where information such as flows <strong>and</strong> Q ij can be<br />
embedded.<br />
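As a concrete sketch of this translation, the end-to-end delay of one flow can be approximated by summing per-node sojourn times, here with single-class M/M/1 nodes. This is a deliberate simplification of the multi-class QNM above, and all rates, routes and quality levels are invented:

```python
def mm1_wait(lam, mu):
    """Steady-state sojourn time W = 1/(mu - lam) of a stable M/M/1 queue."""
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate >= service rate")
    return 1.0 / (mu - lam)

def flow_delay(route, lam, mu_at_quality, quality):
    """End-to-end delay of one flow: sum of per-node sojourn times, with
    each node i serving at rate mu_ijk for its chosen quality k.
    (Single-class sketch of the multi-class network in the text.)"""
    return sum(mm1_wait(lam, mu_at_quality[node][quality[node]]) for node in route)

# Toy 'sense-plan-respond' flow through three CPE-like tiers; higher
# quality (k = 1) means slower service, hence longer predicted delay.
mu = {"company":   {0: 5.0, 1: 3.0},
      "battalion": {0: 4.0, 1: 2.5},
      "brigade":   {0: 6.0, 1: 4.0}}
route = ["company", "battalion", "brigade"]
low  = flow_delay(route, 1.0, mu, {"company": 0, "battalion": 0, "brigade": 0})
high = flow_delay(route, 1.0, mu, {"company": 1, "battalion": 1, "brigade": 1})
print(round(low, 3), round(high, 3))  # higher quality -> longer delay
```

This is exactly the quality-versus-delay trade-off the opmodes expose: the controller uses such predictions to decide whether the extra quality is worth the added end-to-end delay.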
The choice of QNM depends on the number of classes,<br />
arrival distribution <strong>and</strong> processing discipline as well as<br />
a suggestion C by the DMAS controller that makes this<br />
choice based upon history of prior effectiveness. Some analytical<br />
approaches to estimate performance can be found<br />
in [5, 38]. In the context of agent networks, Jackson and
BCMP queueing networks have been used to estimate performance
in [14]. By extending this work, we support several templates
of queueing models (such as BCMP [5], Whitt’s QNA [38],<br />
Jackson [5], M/G/1, a simulation) that can be utilized for<br />
performance prediction.<br />
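For a single node with Poisson arrivals and general service, the M/G/1 template listed above has a closed-form mean delay via the Pollaczek-Khinchine formula; the parameters below are illustrative.

```python
def mg1_mean_response(lam, es, es2):
    """Pollaczek-Khinchine formula for the mean time in system of an
    M/G/1 queue: W = E[S] + lam*E[S^2] / (2*(1 - rho)), with rho = lam*E[S]."""
    rho = lam * es
    assert rho < 1.0, "queue unstable: utilization must be below 1"
    return es + lam * es2 / (2.0 * (1.0 - rho))

# An agent processing a task stream at lam = 0.5 tasks/sec with
# exponential service of mean 1 sec (E[S^2] = 2 for exponential):
w = mg1_mean_response(0.5, 1.0, 2.0)
# For exponential service this reduces to M/M/1: W = 1/(mu - lam) = 2.0
```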
4.4 Self-Controlling Capability<br />
In contrast to [34], we deal with optimizing the domain utility of a single, distributed MAS, rather than allocating resources optimally across multiple applications that have a good idea of their utility functions (through policies). As mentioned before, opmodes allow quality of service (task quality and response time) to be traded off against performance. We assume there is a maximum ceiling R on the amount of resources, and that the available resources fluctuate depending on stresses S = S_e + S_a, where S_e are the stresses from the environment (i.e., multiple contending applications, changes in the infrastructural or physical layers) and S_a are the application stresses (i.e., increased tasks). The DMAS controller receives from the measurement points (MP) a measurement of the actual performance P and a vector of other statistics (relating to X). Also, at the top level the overall utility is known: U(P, S) = ∑_n w_n x_n, where x_n is an actual utility component and w_n is the associated weight specified by the user or another superior agent. We cannot change S, but we can adjust P to obtain better utility. Since P depends on O, a vector of opmodes collected from the community, we can use the QNM to find O* and hence P* that maximizes U(P, S) for a given S from within the set OS. In words, we find the vector of opmodes (O*) that maximizes domain utility at the current S and opmodes O. This computation is performed in the Utility Calculator module using a learned utility model based on the UDB.
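A minimal sketch of the search for O*: enumerate the opmode set OS, predict the utility components for each candidate with a performance model, and keep the maximizer of U = ∑ w_n x_n. The toy model standing in for the QNM, and all names, are illustrative.

```python
from itertools import product

def best_opmodes(opmode_sets, predict_perf, weights):
    """Exhaustively search the opmode space for the vector O* that
    maximizes U = sum_n w_n * x_n, where predict_perf(O) returns the
    utility components x for an opmode vector O (a stand-in for the QNM)."""
    best_o, best_u = None, float("-inf")
    for o in product(*opmode_sets):
        x = predict_perf(o)
        u = sum(w * xn for w, xn in zip(weights, x))
        if u > best_u:
            best_o, best_u = o, u
    return best_o, best_u

# Toy stand-in for the queueing model: a higher opmode raises quality
# but also delay; the components are (quality, -delay).
def toy_model(o):
    quality = sum(o)
    delay = sum(v * v for v in o)
    return [quality, -delay]

o_star, u_star = best_opmodes([(1, 2, 3), (1, 2, 3)], toy_model, [2.0, 1.0])
# Maximizes 2*quality - delay over all 9 opmode combinations
```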
In addition to the differences pointed out thus far, here are some more differences between this work and [34]:
• Tesauro et al. [34] assume that the application manager in Unity has a model of system performance, which we do not assume. Although they allude to a modeler module, they do not explain the details of their performance model. We use a queueing network model that is constructed in real time to estimate the performance for any set of opmodes O′ by taking the current opmodes O and scaling them appropriately, based on observed histories (X), to X′ in the Control Set Evaluator.
• Because of the interactions involved and the complexity of performance modeling [19, 27], it may be time-consuming to utilize statistical inferencing and learning mechanisms in real time. This is why we use an analytical queueing network model to estimate performance quickly.
Figure 5. Results Overview: cumulative utility versus stress S, comparing the Default Policy with the Controlled scenario.
• Another difference is that [34] assumes operating-system support for tuning parameters such as buffer sizes and operating-system settings, which may not hold in many MAS-based situations because of mobility, security, and real-time constraints. Besides, in addition to estimating performance, the queueing model may have the capability to eliminate instabilities in the queueing sense, which is not apparent in the other approach.
• But most importantly, their work reflects a two-level hierarchy in which the resource manager mediates among several application environments to obtain maximum utility for the data center, whereas our work takes the perspective of a single, self-optimizing application that is trying to be survivable by maximizing its own utility.
In spite of these differences, it is interesting to see that the self-controlling capability can be achieved, with or without explicit layering, in real-world applications.
5. Empirical Evaluation on CPE Test-bed<br />
The aforementioned framework was implemented within CPE, which we use as a test-bed for our experimentation. The main goal of this experimentation was to examine whether application-level adaptivity led to any utility gains in the long run. The superior agents in CPE continuously plan maneuvers for their subordinates, which are executed by the lower-rung nodes. We subjected the entire distributed community to random stresses by simulating enemy clusters of varying sizes and arrival rates. These stresses translated into the need to perform the distributed “sense-plan-respond” cycle more frequently, causing increased load and traffic in the network of agents. The stresses were created by a world-agent whose main purpose was to simulate warlike dynamics within our test-bed.
The CPE prototype consists of 14 agents spread across a physical layer of 6 CPUs. We utilized the prototype CPE framework to run 36 experiments at two stress levels (S = 0.25 and S = 0.75). There were three layers of hierarchy, as shown in Figure 2a, with three-way branching at each level and one supply node. The community’s utility function was based on the achievement of real goals in military engagements, such as terminating or damaging the enemy, and on reducing the penalty involved in consuming resources such as fuel or sustaining damage. To keep our queueing models simple, we assumed that the external arrivals were Poisson while the service times were generally distributed. To cater to general arrival rates, the framework also contains a QNA-based and a simulation-based model. Under the Poisson assumption, a BCMP or M/G/1 queueing model could be selected by the framework for real-time performance estimation. The baseline for comparison was the do-nothing policy (default), where we let the Cougaar infrastructure manage conditions of high load. Although our framework did better than any fixed set of opmodes, as shown in Figure 5 for the two stress modes, we show instantaneous and cumulative utility for two opmodes in particular (Default A, B) in Figure 6. We noticed that in the long run the framework enhanced the utility of the application compared to the default policy.
At both stress levels, the controlled scenario performed better than the default, as shown in Figure 6. We did observe oscillations in the instantaneous utility, which we attribute to the imprecision of the stress predictions. Stresses vary relatively fast, on the order of seconds, while the control granularity was on the order of minutes. Since this is a military engagement situation following no stress patterns, the higher-stress case is hard to cope with. In contrast to MAS applications dealing with data centers, where load can be attributed to time-of-day and other seasonal effects, it is not possible to get accurate load predictions for MAS applications simulating wartime loads. We think that this could be the reason why our utility falls in the latter case. In subsequent work, we intend to enhance Cougaar’s capability to support the application layer by forcing it to guarantee some end-to-end delay requirements.
6. Conclusions <strong>and</strong> Future Work<br />
Figure 6. Sample Results: (a) instantaneous utility at stress 0.25; (b) cumulative utility at stress 0.25; (c) instantaneous utility at stress 0.75; (d) cumulative utility at stress 0.75. Each panel compares the Controlled scenario with Default A and Default B over time (sec.).

In this paper, we were able to successfully control a real-time DMAS to achieve overall better utility in the long run, thus making the application survivable. Utility improvements were made through application-level trade-offs between quality of service and performance. We utilized a queueing-network-based framework for performance analysis and subsequently used a learned utility model for computing the overall benefit to the DMAS (i.e., the community). While Tesauro et al. [34] employ a resource arbiter to maximize the combined utility of several application environments in a data center scenario, we focus on using queueing theory to maximize the utility obtained from the performance of a single distributed application, given that it has been allocated some resources. We think that the approaches are complementary, with this study providing empirical evidence to support the observation by Jennings and Wooldridge in [18] that agents can be used to optimize distributed application environments, including themselves, through flexible high-level (i.e., application-level) interactions.
Furthermore, this work has resulted in a general architectural lesson. We believe that any distributed application will have flows of traffic and will require service-level attributes, such as response times, utilization, or delays of components, to be optimized. The paradigm that we have chosen can capture such quantities and help evaluate choices that may lead to better application utility. This concept of breaking the application into flows and allowing a real-time model-based predictor to steer the system into regions of higher utility is quite generic in nature.

While we continue the empirical evaluation, we keep the building blocks small to ensure scalability and to reduce interactions. We utilize TechSpecs to distribute knowledge and meta-data, thus re-emphasizing the separation principle. Subsequently, we hope to broaden the layered control approach to encompass infrastructure-level control within the framework. Another avenue for improvement is to design self-protecting mechanisms so that the security aspect of the framework is reinforced.
Acknowledgements<br />
This work was performed under DARPA UltraLog Grant #MDA 972-01-1-0038. The authors wish to acknowledge DARPA for their generous support.
References<br />
[1] Cougaar open source site. http://www.cougaar.org. DARPA.
[2] UltraLog program site. http://dtsn.darpa.mil/ixo/. DARPA.
[3] A. G. Barto, S. J. Bradtke, <strong>and</strong> S. Singh. Learning to<br />
act using real-time dynamic programming. Artificial<br />
Intelligence, 72:81–138, 1995.<br />
[4] M. N. Bennani <strong>and</strong> D. A. Menasce. Assessing the<br />
robustness of self-managing computer systems under<br />
highly variable workloads. International Conference<br />
on Autonomic Computing, 2004.<br />
[5] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley and Sons, Inc., 1998.
[6] J. Bredin, D. Kotz, <strong>and</strong> D. Rus. Market-based resource<br />
control for mobile agents. Autonomous Agents, 1998.<br />
[7] M. Brinn <strong>and</strong> M. Greaves. Leveraging agent properties<br />
to assure survivability of distributed multi-agent sys-
tems. Proceedings of the Second Joint Conference on<br />
Autonomous Agents <strong>and</strong> Multi-Agent Systems, 2003.<br />
[8] T. Chao, F. Shan, <strong>and</strong> S. X. Yang. Modeling <strong>and</strong> design<br />
monitor using layered control architecture. Autonomous<br />
Agents <strong>and</strong> Multi-Agent Systems, 2002.<br />
[9] A. Chavez, A. Moukas, and P. Maes. Challenger: A multi-agent system for distributed resource allocation. Agents, 1997.
[10] L. Chen, K. Bechkoum, <strong>and</strong> G. Clapworthy. A logical<br />
approach to high-level agent control. Agents, 2001.<br />
[11] A. E. Fallah-Seghrouchni, I. Degirmenciyan-Cartault,<br />
<strong>and</strong> F. Marc. Modelling, control <strong>and</strong> validation of<br />
multi-agent plans in dynamic context. Autonomous<br />
Agents <strong>and</strong> Multi-Agent Systems, 2004.<br />
[12] D. Ferguson, Y. Yemini, <strong>and</strong> C. Nikolaou. Microeconomic<br />
algorithms for load balancing in distributed<br />
computer systems. Proceedings of the International<br />
Conference on Distributed Systems, 1988.<br />
[13] I. Foster, N. R. Jennings, <strong>and</strong> C. Kesselman. Brain<br />
meets brawn: Why grid <strong>and</strong> agents need each other.<br />
Autonomous Agents <strong>and</strong> Multi-Agent Systems, 2004.<br />
[14] N. Gnanasambandam, S. Lee, N. Gautam, S. R. T. Kumara, W. Peng, V. Manikonda, M. Brinn, and M. Greaves. Reliable MAS performance prediction using queueing models. IEEE Multi-Agent Security and Survivability Symposium, 2004.
[15] A. Helsinger, K. Kleinmann, <strong>and</strong> M. Brinn. A framework<br />
to control emergent survivability of multi agent<br />
systems. Autonomous Agents <strong>and</strong> Multi-Agent Systems,<br />
2004.<br />
[16] A. Helsinger, R. Lazarus, W. Wright, <strong>and</strong> J. Zinky.<br />
Tools <strong>and</strong> techniques for performance measurement<br />
of large distributed multi-agent systems. Autonomous<br />
Agents <strong>and</strong> Multi-Agent Systems, 2003.<br />
[17] Y. Hong <strong>and</strong> S. R. T. Kumara. Coordinating control<br />
decisions of software agents for adaptation to dynamic<br />
environments. Working Paper, Dept. of IME, Pennsylvania<br />
State University, University Park, PA, 2004.<br />
[18] N. R. Jennings <strong>and</strong> M. Wooldridge. H<strong>and</strong>book of<br />
Agent Technology, chapter Agent-Oriented Software<br />
Engineering. AAAI/MIT Press, 2000.<br />
[19] H. Jung <strong>and</strong> M. Tambe. Performance models for large<br />
scale multi-agent systems: Using distributed pomdp<br />
building blocks. Proceedings of the Second Joint Conference<br />
on Autonomous Agents <strong>and</strong> Multi-Agent Systems,<br />
July 2003.<br />
[20] L. P. Kaelbling, M. L. Littman, <strong>and</strong> A. Moore. Reinforcement<br />
learning: A survey. Journal of Artificial<br />
Intelligence Research, 4:237–285, 1996.<br />
[21] K. Kleinmann, R. Lazarus, <strong>and</strong> R. Tomlinson. An infrastructure<br />
for adaptive control of multi-agent systems.<br />
IEEE Conference on Knowledge-Intensive<br />
Multi-Agent Systems, 2003.<br />
[22] M. M. Kokar, K. Baclawski, <strong>and</strong> Y. A. Eracar. Control<br />
theory-based foundations of self-controlling software.<br />
IEEE Intelligent Systems, pages 37–45, May/June<br />
1999.<br />
[23] K. C. Lee, W. H. Mansfield, <strong>and</strong> A. P. Sheth. A framework<br />
for controlling cooperative agents. IEEE Computer,<br />
1993.<br />
[24] T. W. Malone, R. Fikes, K. R. Grant, and M. T. Howard. Enterprise: A Market-like Task Scheduler for Distributed Computing Environments. Elsevier, Holland, 1988.
[25] M. Morari <strong>and</strong> J. H. Lee. Model predictive control:<br />
past, present <strong>and</strong> future. Computers <strong>and</strong> Chemical Engineering,<br />
23(4):667–682, 1999.<br />
[26] A. Raja, V. Lesser, <strong>and</strong> T. Wagner. Toward robust<br />
agent control in open environments. Agents, 2000.<br />
[27] O. F. Rana and K. Stout. What is scalability in multi-agent systems? Proceedings of the Fourth International Conference on Autonomous Agents, 2000.
[28] J. B. Rawlings. Tutorial overview of model predictive<br />
control. IEEE Control Systems, 20(3):38–52, 2000.<br />
[29] R. Sanz <strong>and</strong> K.-E. Arzen. Trends in software <strong>and</strong> control.<br />
IEEE Control Systems Magazine, June 2003.<br />
[30] F. Sheikh, J. Rolia, P. Garg, S. Frolund, <strong>and</strong> A. Shepard.<br />
Performance evaluation of a large scale distributed<br />
application design. World Congress on Systems<br />
Simulation, 1997.<br />
[31] I. Soto, M. Garijo, C. A. Iglesias, <strong>and</strong> M. Ramos.<br />
An agent architecture to fulfill real-time requirement.<br />
Agents, 2000.<br />
[32] P. Stone <strong>and</strong> M. Veloso. Using decision tree confidence<br />
factors for multi-agent control. Autonomous<br />
Agents, 1998.<br />
[33] R. S. Sutton, A. G. Barto, <strong>and</strong> R. J. Williams. Reinforcement<br />
learning is direct adaptive optimal control.<br />
IEEE Control Systems, 12(2):19–22, 1992.<br />
[34] G. Tesauro, D. M. Chess, W. E. Walsh, R. Das,<br />
I. Whalley, J. O. Kephart, <strong>and</strong> S. R. White. A multiagent<br />
systems approach to autonomic computing. Autonomous<br />
Agents <strong>and</strong> Multi-Agent Systems, 2004.<br />
[35] H. P. Thadakamalla, U. N. Raghavan, S. R. T. Kumara,<br />
<strong>and</strong> R. Albert. Survivability of multi-agent supply<br />
networks: A topological perspective. IEEE Intelligent<br />
Systems: Dependable Agent Systems, 19(5):24–<br />
31, September/October 2004.<br />
[36] R. Vincent, B. Horling, V. Lesser, <strong>and</strong> T. Wagner. Implementing<br />
soft real-time agent control. Agents, 2001.<br />
[37] T. Vu, J. Go, G. Kaminka, M. Veloso, and B. Browning. MONAD: A flexible architecture for multi-agent control. Autonomous Agents and Multi-Agent Systems, 2003.
[38] W. Whitt. The queueing network analyzer. The Bell<br />
System Technical Journal, 62(9):2779–2815, 1983.
Proceedings of the 1st Open Cougaar Conference
Survivability through Implementation Alternatives<br />
in Large-scale Information Networks with Finite Load<br />
Seokcheon Lee <strong>and</strong> Soundar Kumara<br />
Department of <strong>Industrial</strong> <strong>and</strong> <strong>Manufacturing</strong> Engineering<br />
The Pennsylvania State University<br />
University Park, PA 16802<br />
{stonesky, skumara}@psu.edu<br />
Abstract<br />
We study a large-scale information network, which is<br />
composed of distributed software components linked with<br />
each other through a task flow structure. The service<br />
provided by the network is to produce a global solution to<br />
a given problem, which is an aggregate solution of<br />
partial solutions from processing tasks. Quality of Service<br />
of this network is determined by the value of the global<br />
solution <strong>and</strong> time for generating the global solution.<br />
Survivability of the network is the capability to provide<br />
high Quality of Service by utilizing implementation<br />
alternatives as control actions, in the presence of<br />
accidental failures <strong>and</strong> malicious attacks. In this paper<br />
we develop an adaptive control mechanism to support<br />
survivability. We stress two desirable properties in<br />
designing the mechanism: scalability <strong>and</strong> predictability.<br />
To address adaptivity we model the stress environment<br />
indirectly by quantifying resource availability of the<br />
system. We build a mathematical programming model<br />
with the resource availability incorporated, which<br />
predicts Quality of Service as a function of control<br />
actions. By periodically solving the programming model<br />
<strong>and</strong> taking optimal control actions with recent resource<br />
availability, the system can be adaptive to the changing<br />
stress environment predictably. But, as the programming<br />
model becomes large-scale <strong>and</strong> complex, we agentify the<br />
components of the network from a control point of view<br />
so that the system can solve the large-scale programming<br />
model in a decentralized mode. We provide an auctionbased<br />
market as a decentralized coordination<br />
mechanism.<br />
1. Introduction<br />
Critical infrastructures have become increasingly dependent on networked systems in many domains for automation or organizational integration. Though such infrastructure can improve efficiency and effectiveness, these systems can easily be exposed to various adverse events such as accidental failures and malicious attacks [1]. Two metrics,
namely survivability <strong>and</strong> scalability, can be used to<br />
determine the efficiency <strong>and</strong> effectiveness of these<br />
systems. Survivability is defined as “the capability of a<br />
system to fulfill its mission, in a timely manner, in the<br />
presence of attacks, failure, or accidents” [2]. One<br />
promising way to achieve survivability is through<br />
adaptivity: changing the system behavior to achieve the<br />
system goal in response to the changing environment [3].<br />
One important consideration of an adaptation is<br />
predictability. Unpredictable adaptation can sometimes<br />
result in worse performance than without adaptation [4].<br />
Scalability is defined as: “the ability of a solution to some<br />
problem to work when the size of the problem increases”<br />
(From Dictionary of Computing at<br />
http://wombat.doc.ic.ac.uk). As the size of networked systems grows, scalability becomes a critical issue when developing practical software systems [5].
As software systems grow larger and more complex, component technology has become one of the foremost topics in the computing community [6][7]. A component is a reusable program element with which developers can build the systems they need by simply wiring components together. To support flexible usage of the components in various forms, the components must be independent, self-contained, and highly specialized. In
component-based software systems, components interact<br />
with each other through a task flow structure with each<br />
component specialized for specific tasks.<br />
We study a large-scale information network, which is<br />
composed of distributed software components linked with<br />
each other through a task flow structure. A problem given to the network is decomposed into a set of tasks for some of the software components, and those tasks are propagated through the task flow structure. The service provided by
the network is to produce a global solution to the given<br />
problem, which is an aggregate solution of partial<br />
solutions from processing tasks. Each component can process a task using one of its available implementation alternatives, which trade off processing time and the value of the partial solution. The Quality of Service (QoS) of this network is determined by the value of the global solution and the time for generating the global solution. Survivability of the network is the capability to provide high QoS in the presence of accidental failures and malicious attacks. A
promising approach to dealing with large-scale systems is multiagent systems (MAS); accordingly, we agentify the components purely from a control point of view. In MAS, agents address the scalability issue by computing solutions locally and then sharing this information in a social way. In this paper we develop a multiagent-based adaptive control mechanism with scalability and predictability to support the survivability of large-scale networks.
Specifically, in Section 2 we discuss the problem domain, and in Section 3 we formally define the problem in detail. We review previous control approaches in Section 4. We design an adaptive control mechanism in Section 5 and show empirical results in Section 6. Finally, we discuss implications and possible extensions of our work in Section 7.
2. Problem domain<br />
The networks we study in this paper represent<br />
distributed <strong>and</strong> component-based architectures. As an<br />
instance, Cougaar (Cognitive Agent Architecture: http://www.cougaar.org), developed by DARPA (Defense Advanced Research Projects Agency), follows such an architecture for building large-scale multiagent systems.
Recently, there have been efforts to combine the<br />
technologies of agents <strong>and</strong> components to improve the<br />
way of building large-scale software systems [8][9][10].<br />
While component technology focuses on reusability,<br />
agent technology focuses on processing complex tasks as<br />
a community. Cougaar is in line with this trend. In Cougaar, a software system is composed of agents, and an agent is composed of components (called plugins). The task flow structure in those systems is that of components, as a combination of intra-agent and inter-agent task flows. As the agents in Cougaar can be distributed in both the geographical and the information-content sense, the networks implemented in Cougaar have a distributed and component-based architecture.
UltraLog (http://www.ultralog.net) networks are military supply chain planning systems implemented in Cougaar. Agents in those networks represent organizations in military supply chains. The objective of an UltraLog network is to provide an appropriate logistics plan for a military operational plan. The system produces a logistics plan by decomposing the operational plan into logistics tasks and processing them through a task flow structure. The system performs initial planning for a given operation and continuous replanning in execution mode to cope with logistics plan deviations or operational plan changes. As the scale of the operation increases, there can be thousands of agents working together to generate a logistics plan.
Initial planning or replanning generates a logistics plan<br />
as a global solution, which is an aggregate of individual<br />
schedules built by plugins through their task flow<br />
structure. Each plugin can use one of its available implementation alternatives, which trade off processing time and the quality of the schedule. Quality of service is determined by two metrics: the quality of the logistics plan and
plan completion time. These two metrics directly affect<br />
the performance of the operation.<br />
Planning and replanning in UltraLog networks are instances of the research problem studied here. An UltraLog network cannot work in isolation from the outside world, because it utilizes external databases and users must be able to access the system. This inevitable connection to the outside exposes the system to malicious attacks in addition to accidental failures. The question, then, is: how can we make this system survivable, so that it generates high-quality logistics plans in a timely manner in the presence of accidental failures and malicious attacks?
3. Problem specification<br />
In this section we formally define the problem by detailing the network model. We concentrate on computational CPU resources, assuming that the system is computation-bounded.
3.1. Network model<br />
We define four elements of the network to clarify its<br />
mechanics: network configuration, implementation<br />
alternatives, quality of service, <strong>and</strong> stress environment.<br />
Network configuration<br />
A network is composed of a set of agents A, with each agent located on its own machine. The task flow structure of the network, which defines precedence relationships between agents, is an acyclic directed graph with each link assigned a positive real number. A link number l_ij (i≠j) indicates the number of tasks generated for successor agent j when agent i processes a task in its queue. Once the accumulated task count for a successor agent exceeds one, the corresponding integer number of tasks is sent to the successor agent. By using real numbers we can represent a wide range of task flow structures, including non-integer aggregation and expansion.

A problem given to a network is decomposed into root tasks for some agents, and those tasks are propagated through the task flow structure.
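The link-number mechanics can be sketched as follows; the two-agent network and the value l_ab = 0.5 are illustrative.

```python
# Sketch of the link-number semantics: when agent i processes a task,
# l_ij (a positive real) tasks accrue for each successor j; once the
# accumulated amount reaches one or more, the integer part is dispatched.
# Agent names and link values are illustrative.

class Agent:
    def __init__(self, name, links):
        self.name = name
        self.links = links                      # {successor: l_ij}
        self.accum = {j: 0.0 for j in links}    # fractional task counts
        self.queue = 0                          # tasks waiting

    def process_one(self, agents):
        """Process one task and propagate fractional task counts."""
        self.queue -= 1
        for succ, l in self.links.items():
            self.accum[succ] += l
            whole = int(self.accum[succ])       # dispatch the integer part
            if whole:
                agents[succ].queue += whole
                self.accum[succ] -= whole

agents = {"a": Agent("a", {"b": 0.5}), "b": Agent("b", {})}
agents["a"].queue = 3
for _ in range(3):
    agents["a"].process_one(agents)
# With l_ab = 0.5, three tasks at "a" generate one task at "b"
# (0.5 + 0.5 -> dispatch 1), with 0.5 still accumulated.
```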
Implementation alternatives<br />
An agent can have multiple implementation alternatives for processing a task. Different alternatives trade off CPU time and solution value, with more CPU time resulting in a higher solution value. As we can find optimal mixed alternatives, an agent has a monotonically increasing convex function (the value function) giving CPU time as a function of value. We call the value in this function the value mode, which the agent can select as its decision variable. A value function is defined with three components as:

〈 f_i(v_i), v_i(min), v_i(max) 〉

This function says that agent i’s expected CPU time to process a task is f_i(v_i) for a value mode v_i with v_i(min) ≤ v_i ≤ v_i(max).
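As a purely illustrative instance of this tuple, a quadratic CPU-time function satisfies the required monotonicity and convexity; the coefficients below are assumptions, not from the paper.

```python
# Illustrative value function <f_i(v_i), v_min, v_max>: CPU time per task
# is a monotonically increasing convex function of the chosen value mode.

def make_value_function(v_min, v_max, base=0.1, coeff=0.4):
    def f(v):
        assert v_min <= v <= v_max, "value mode out of range"
        return base + coeff * v * v   # convex and increasing in v >= 0
    return f, v_min, v_max

f, lo, hi = make_value_function(0.0, 1.0)
# Higher value mode -> higher solution value but more CPU time:
# f(0.0) = 0.1 sec per task, f(1.0) = 0.5 sec per task
```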
Quality of service<br />
A problem given to the network is decomposed into root tasks for some agents, and those tasks are propagated through the task flow structure. The service provided by the network is to produce a global solution to the given problem, which is an aggregate of the partial solutions from processing tasks. The QoS of the network is determined by the value of the global solution and the cost of the completion time for generating the global solution. The value of the global solution is the summation of the partial solution values, and the cost of completion time is determined by a cost function CCT(T), which is monotonically increasing in the completion time T. Let v_i^d denote the value mode used to process the d-th task and e_i the number of tasks processed to completion by agent i.
Then, QoS can be calculated as:<br />
QoS = Σ_{i∈A} Σ_{d=1}^{e_i} v_i^d − CCT(T)

Stress environment
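The QoS definition above can be sketched directly in code; the dictionary layout is our assumption, and the linear cost CCT(T) = 4T used as a default matches the cost function of the paper's later experiments.

```python
def qos(value_modes_used, completion_time, cct=lambda T: 4 * T):
    """QoS = sum over agents and their completed tasks of the value mode used,
    minus the cost of completion time CCT(T).
    value_modes_used: dict mapping agent id -> list of v_i^d for e_i tasks."""
    total_value = sum(sum(modes) for modes in value_modes_used.values())
    return total_value - cct(completion_time)

# two agents; "A1" completed two tasks, "A2" one
print(qos({"A1": [2.0, 3.0], "A2": [1.5]}, completion_time=1.0))  # 2.5
```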
Survivability stresses, such as accidental failures <strong>and</strong><br />
malicious attacks, affect the system by consuming<br />
resources directly or indirectly through activating defense<br />
mechanisms as remedies against them. For example, a "denial of service" attack consumes resources directly, while the relevant defense mechanism also consumes resources through resistance, recognition, and recovery [1]. We consider both survivability stresses and their remedies as the stress environment from the viewpoint of the agents in the network.
The stress environment space is high-dimensional and evolving [11][12]. But since we concentrate on computational CPU resources, a stress environment can be regarded as a set of threads residing in the machines of the network and sharing resources with the agents. These threads, which we call stressors, may have priorities or weights for resource allocation under admission, or may steal resources without admission.
3.2. Problem definition<br />
In this paper we develop an adaptive control<br />
mechanism with scalability <strong>and</strong> predictability to support<br />
the survivability of large-scale networks. The system<br />
needs to adapt to the changing stress environment to<br />
provide high QoS utilizing implementation alternatives<br />
(v) as:<br />
arg max_v QoS
We discuss several characteristics of the problem that will be helpful in understanding it and in developing an appropriate control mechanism:
• Large-scale network: The network can be large-scale, as the number of agents and nodes increases with the scale of the problem given to the network.
• Finite time horizon: The time horizon for a network<br />
to generate a global solution is finite.<br />
• Indecomposable QoS: QoS is not decomposable into individual elements' performance, because one of the two conflicting QoS elements is the completion time, which is common throughout the network.
• Complex dynamics: Agents interact with each other through task flow and with stressors through shared resources. As those interactions occur in parallel with control actions, the dynamics of the system are intrinsically complex, especially in large-scale networks.
• Non-availability of statistics: Statistics such as arrival rates or service rates are not fixed or given; they change as the system evolves. In addition, the stress environment itself changes.
4. Control approaches in dynamic systems<br />
In dynamic systems, both centralized and decentralized control approaches are commonly used.
4.1. Centralized approaches<br />
There are three main centralized control approaches: dynamic programming (DP), reinforcement learning (RL), and model predictive control (MPC). DP solves the optimality equation to produce reactive strategies in the form of an optimal closed-loop control policy, a rule specifying the optimal action as a function of state and time [13]. It assumes that the structure of the dynamic model is fixed and that the model
parameters are known in advance. DP yields the optimal policy, but the complexity of solving the optimality equation grows exponentially with the dimension of the state space. RL is an adaptive version of DP that develops a policy in real time when the model parameters are unknown [14][15]. RL takes longer to converge than DP because of the cost of exploration in addition to exploitation.
In MPC, for each current state, an optimal open-loop control policy is designed over a finite time horizon by solving a static mathematical programming model based on an explicit process model [13][16][17][18][19]. The design process is repeated for each newly observed state, so the feedback forms a closed-loop policy reactive to the current system state. Though MPC does not give an absolutely optimal policy in a stochastic environment, it adapts easily to new contexts by explicitly handling the objective function or constraints. However, it requires effort to develop process models and has scalability problems.
4.2. Decentralized approaches<br />
There are three decentralized control approaches: market-based, insect-behavioral, and learning-based. Market-based control works through the interaction of local agents in the same way as economic markets [20]: agents trade with one another using a relatively simple mechanism, yet desirable global objectives can often be realized. These approaches have been applied to distributed processor allocation problems. Insect-behavioral approaches are
inspired by effective <strong>and</strong> adaptive behavior of social<br />
insect colonies such as ants, bees, wasps, <strong>and</strong> termites<br />
[21]. An important <strong>and</strong> interesting behavior of ant<br />
colonies is their foraging behavior, in particular how ants<br />
can find the shortest paths between food sources <strong>and</strong> their<br />
nest. Algorithms based on this foraging behavior have been applied to routing problems in communication networks and on the shop floor. Similar to ant algorithms, wasp algorithms have been proposed, inspired by wasps' task allocation behavior, and applied to routing problems on the shop floor. Reinforcement learning can be used without prior knowledge of the system model; by making agents learn through their own experience, the method can be used in a decentralized mode. These approaches have been applied to routing problems in communication networks [22].
5. Control mechanism<br />
DP and RL are inefficient in terms of scalability and agility, which are important considerations in our problem. In addition, the dynamic model in our problem is not fixed and is only partially known due to the unpredictable stress environment. Decentralized approaches are scalable and robust, but they lack agility and optimality. We choose an MPC-style approach considering its benefits with respect to complexity, optimality, and agility; however, we need to overcome its scalability problem.
5.1. Overall control procedure<br />
As discussed, we develop an adaptive control mechanism that provides high QoS under a changing stress environment while ensuring scalability and predictability. To address adaptivity, we model the stress environment indirectly by quantifying the resource availability of the system through sensors. We build a mathematical programming model incorporating the resource availability, which predicts QoS as a function of control actions. By periodically solving the programming model and taking optimal control actions based on recent resource availability, the system can adapt to the changing stress environment predictably. But as the programming model can be large-scale, we provide a coordination mechanism that solves it in a decentralized mode.
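The sense-solve-act cycle just described can be sketched as a generic MPC-style loop. The function names and callables here are placeholders of ours, not the paper's API.

```python
def control_loop(sense, solve, act, periods):
    """MPC-style periodic control sketch: at each control point, read sensors,
    solve the programming model with the latest measurements, and apply the
    resulting decision. All four arguments are placeholders we supply."""
    history = []
    for k in range(periods):
        state = sense(k)          # load and resource-availability statistics
        decision = solve(state)   # optimal control action for the current state
        act(decision)             # e.g., set the agents' value modes
        history.append(decision)
    return history

# trivial stand-ins just to show the control flow
h = control_loop(sense=lambda k: k, solve=lambda s: 2 * s, act=lambda d: None, periods=3)
print(h)  # [0, 2, 4]
```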
5.2. Sensors<br />
We employ two types of sensors, a load sensor and a resource sensor, located in each agent; they measure the statistics that form the coefficients of the mathematical programming model.
Load sensor<br />
A load sensor measures the future load L_i of agent i, which is the number of tasks to be processed in the future. Initially, each agent identifies its future load by combining its own root tasks with the incoming tasks expected from its predecessor agents. After identifying the initial future load, agents update it by counting down as they process tasks.
Resource sensor<br />
A resource sensor measures resource availability, defined as the available fraction of a resource when an agent requests it. In a given time window we define two measurements to calculate this statistic: request time and execution time. Request time is the duration for which an agent requests the resource, that is, the duration for which its queue length (including the task in service) is greater than zero. Execution time is the duration for which the agent actually utilizes the resource. Agent i's resource availability between two subsequent control points (k−1, k) is calculated as:
control points (k-1, k) is calculated as:<br />
RA<br />
( k−1,<br />
k)<br />
i<br />
execution time in ( k −1,<br />
k)<br />
=<br />
.<br />
request time in ( k −1,<br />
k)
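As a minimal sketch of the sensor's computation (the idle-window convention of returning 1.0 is our assumption, not stated in the paper):

```python
def resource_availability(execution_time, request_time):
    """RA_i over a control window (k-1, k): the fraction of the time the agent
    requested the resource during which it actually executed."""
    if request_time <= 0:
        return 1.0  # assumption: an agent that never waited sees full availability
    return execution_time / request_time

print(resource_availability(execution_time=30.0, request_time=40.0))  # 0.75
```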
5.3. Mathematical programming model<br />
Agents can estimate their future resource availability from the resource availability observed in the past. An agent i estimates its future resource availability RA_i^f using the resource availability observed in the last control period. The service time to process a task can then be predicted as a function of value mode by incorporating this estimate as:
f_i(v_i) / RA_i^f.
Based on this we build a mathematical programming model. Let T be the completion time and t the current time. An agent's optimal mode is a pure mode common to all its tasks because of the convexity of the value function. When agents use pure modes such that their total service times are less than or equal to T − t, each agent can complete its tasks approximately by T because, in the worst case, tasks will arrive at a constant rate L_i/(T − t). In other words, the completion time is dominantly determined by the bottleneck agents with maximal total service times for their future loads, that is:
T − t ≈ Max_{i∈A} [ L_i · f_i(v_i) / RA_i^f ].
So, given completion time T, each agent can select the maximal mode for which its total service time is less than or equal to T − t. That is, it is optimal for each agent to select a mode maximizing:
L_i · v_i
subject to
L_i · f_i(v_i) / RA_i^f ≤ T − t.
Through this optimality condition we can formulate the control problem as a mathematical programming model that maximizes QoS by trading off the value of the solution against the cost of completion time. Select the v_i's and T satisfying:

Max   Σ_{i∈A} L_i · v_i − CCT(T)
s.t.  L_i · f_i(v_i) / RA_i^f ≤ T − t   for all i ∈ A
      v_i(min) ≤ v_i ≤ v_i(max)   for all i ∈ A
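As a centralized baseline, the model above can be solved by a one-dimensional search over T, since the optimal mode of each agent is determined once T is fixed. This sketch assumes linear value functions f_i(v) = slope · v; the agent dictionary keys are our names, not the paper's notation.

```python
def solve_model(agents, t, cct=lambda T: 4 * T, grid=200):
    """Illustrative centralized solve of the QoS model by 1-D grid search over T."""
    def mode(a, T):
        # largest mode satisfying L_i * f_i(v) / RA_i^f <= T - t, capped at v_max
        return min((T - t) * a["RA"] / (a["L"] * a["slope"]), a["v_max"])

    # feasible T must allow every agent at least its minimum mode
    T_lo = max(t + a["L"] * a["slope"] * a["v_min"] / a["RA"] for a in agents)
    T_hi = max(t + a["L"] * a["slope"] * a["v_max"] / a["RA"] for a in agents)
    best = None
    for k in range(grid + 1):
        T = T_lo + (T_hi - T_lo) * k / grid
        obj = sum(a["L"] * mode(a, T) for a in agents) - cct(T)  # QoS objective
        if best is None or obj > best[1]:
            best = (T, obj)
    return best

agents = [{"L": 200, "slope": 1.0, "v_min": 1.0, "v_max": 6.0, "RA": 1.0}]
T_star, obj = solve_model(agents, t=0.0, cct=lambda T: 0.5 * T)
print(T_star, obj)  # 1200.0 600.0
```

With a cheap completion-time cost the optimum pushes every agent to its highest mode; a steeper cost pulls T, and hence the modes, back down.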
5.4. Decentralization
The next question is how to decentralize the mathematical programming model for scalability and robustness. In addition to these properties, decentralization yields a byproduct: information security. As we discussed earlier, our effort is to support survivability, and if information is revealed directly to others the system is not survivable from the viewpoint of information security. So, decentralization also helps survivability with respect to information security.
One branch of distributed control approaches is the decentralization of structured mathematical programming models. In this branch there are two popular methods: decomposition methods and auction/bidding algorithms. We decentralize the mathematical programming model through a non-iterative auction mechanism, the so-called multiple-unit auction with variable supply [23]. In this auction a seller may be able and willing to adjust the supply as a function of the bidding. In the programming model we have built, all the agents are coupled with each other. But the model has a typical structure in which the objective function and constraints become separable once the single variable T is fixed. This characteristic makes it possible to convert the model into an auction: the completion time T is an unbounded resource whose supply can be adjusted as a function of the bidding.
In the designed auction for the programming model, agents bid for T and the seller decides T* from the bids by maximizing its utility considering the cost, supplying T so that the minimum requirements of the agents are fulfilled. After the seller broadcasts T*, agents select their optimal value modes by maximizing their utility.

• Agents' bids
⟨ b_i(T), T_i(min) ⟩
b_i(T) = L_i · f_i^{-1}( (T − t) · RA_i^f / L_i )   if T ≤ L_i · f_i(v_i(max)) / RA_i^f + t
b_i(T) = L_i · v_i(max)   else
T_i(min) = L_i · f_i(v_i(min)) / RA_i^f + t

• Seller's decision problem
Max   Σ_{i∈A} b_i(T) − CCT(T)
s.t.  T ≥ Max_{i∈A}( T_i(min) )

• Agents' decision
v_i* = f_i^{-1}( (T* − t) · RA_i^f / L_i )   if T* ≤ L_i · f_i(v_i(max)) / RA_i^f + t
v_i* = v_i(max)   else
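A sketch of the auction exchange, again assuming linear value functions f_i(v) = slope · v (our simplification) and a seller that searches a supplied list of candidate completion times:

```python
def bid(a, T, t=0.0):
    """Agent's bid b_i(T): total value it can deliver by completion time T."""
    v = min((T - t) * a["RA"] / (a["L"] * a["slope"]), a["v_max"])
    return a["L"] * v

def t_min(a, t=0.0):
    # minimum completion time T_i(min) the agent can possibly meet
    return a["L"] * a["slope"] * a["v_min"] / a["RA"] + t

def seller_decide(agents, cct, candidates):
    """Seller picks T* maximizing aggregate bids minus CCT(T),
    subject to T >= max_i T_i(min)."""
    floor = max(t_min(a) for a in agents)
    feasible = [T for T in candidates if T >= floor]
    return max(feasible, key=lambda T: sum(bid(a, T) for a in agents) - cct(T))

a = {"L": 200, "slope": 1.0, "v_min": 1.0, "v_max": 6.0, "RA": 1.0}
T_star = seller_decide([a], cct=lambda T: 0.5 * T, candidates=[200.0, 600.0, 1200.0, 1500.0])
print(T_star)  # 1200.0
```

Each agent then sets its mode from the broadcast T* exactly as in the "Agents' decision" rule.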
The auction mechanism described above, though a decentralized coordination scheme, still incorporates a centralized seller. As a centralized auction can still exhibit problems in terms of scalability and robustness, we introduce a hierarchical auction mechanism. Suppose that T_a* is the optimal completion time of agent group a and T_b* that of agent group b. If a ⊂ b, then:
T_a* ≤ T_b*.
Through this property we can convert the auction mechanism into a hierarchical one, in which multiple auction markets are structured hierarchically. Each auction solves its decision problem based on the bids from its agents or subordinate auctions, and makes a bid to its superior auction with T larger than or equal to its own optimal completion time. This hierarchical structure improves scalability and robustness compared to the central auction mechanism: scalability improves because bids and decisions are distributed over multiple auctions in the hierarchy, and robustness improves because there is no single point of failure.
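One consequence of the subset property is that minimum-feasible completion times can be aggregated up the hierarchy by taking maxima, so a superior auction never selects a T below any subordinate's optimum. A minimal sketch with a dict-based tree of our own design:

```python
def hierarchical_floor(node):
    """Propagate minimum-feasible completion times up an auction hierarchy.
    node: {"t_min": float} for a leaf agent, or {"children": [...]} for an
    auction market. Each auction's floor is the max over its subtree."""
    if "children" in node:
        return max(hierarchical_floor(c) for c in node["children"])
    return node["t_min"]

tree = {"children": [{"children": [{"t_min": 300.0}, {"t_min": 450.0}]},
                     {"t_min": 500.0}]}
print(hierarchical_floor(tree))  # 500.0
```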
6. Empirical result<br />
We ran several experiments to validate the proposed<br />
control approach through discrete-event simulation.<br />
6.1. Experimental design<br />
The network is composed of fifteen agents with a convergent structure as in figure 1, in which each link weight l_ij is set to 1. In this network each agent in the lowest position has 200 root tasks. All agents have the same linear value function, and the cost of completion time is linear as described in the figure.
Figure 1. Experimental network configuration (fifteen agents A_1 through A_15 in a convergent structure; each of the eight lowest-position agents has 200 root tasks; CCT(T) = 4T)
To observe adaptive behavior we assign a weight w_i to an agent and w′_i to a stressor residing in the same machine as the agent, for proportional resource sharing between them. A stressor, which has infinite work (continuously
requiring resources), can impose different levels of stress<br />
on the agent directly by changing w′ i . When it is zero there<br />
is no stress, <strong>and</strong> as it increases the stress level increases.<br />
We implement our stress environment simply by using<br />
Weighted Round-Robin scheduling, in which each thread<br />
gets a number of quanta in proportion to its weight.<br />
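Under Weighted Round-Robin, a thread's long-run CPU share is its weight divided by the sum of weights. A sketch with the experimental weights (agent 0.1, stressor 1.0):

```python
def wrr_share(weights):
    """Weighted Round-Robin: each thread receives CPU quanta in proportion to
    its weight, so its long-run CPU share is w / sum(w)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# while the stressor is active, agent A4 keeps only about 9% of the CPU
share = wrr_share({"A4": 0.1, "stressor": 1.0})
print(round(share["A4"], 3))  # 0.091
```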
We set up four different experimental conditions as in table 1. In the stressed conditions we stress agent A_4 in the middle of the run. The distribution of CPU time in the value function can be deterministic or exponential. When using the stochastic value function we ran 5 experiments.
Table 1. Experimental conditions<br />
Condition Stress Value function<br />
Con1 W/o Stress Deterministic<br />
Con2 W/o Stress Exponential<br />
Con3 W/Stress Deterministic<br />
Con4 W/Stress Exponential<br />
* Control period: 100<br />
* w_i: 0.1, w′_4: 1 in (500, 1000)
We use three different control modes for each<br />
experimental condition. Table 2 shows the control modes<br />
we used for experimentation. AC represents the adaptive<br />
control mechanism we have developed.<br />
Table 2. Control modes for experimentation<br />
Control mode   Description
FL             Fixed lowest value mode
FH             Fixed highest value mode
AC             Adaptive control
6.2. Results<br />
Experimental results are summarized in table 3. The proposed adaptive control showed significant advantages over the non-adaptive cases under all conditions.
Table 3. Experimental results<br />
        FL                  FH                  AC
        T     V      QoS    T     V      QoS    T     V      QoS
Con1    1656  13558  6934   6313  30643  5391   1663  22898  16245
Con2    1652  13547  6942   6302  30643  5435   1723  22982  16089
Con3    1656  13558  6934   6313  30643  5391   1966  23401  15539
Con4    1652  13547  6942   6371  30643  5159   2024  23495  15401
* T: Completion time<br />
* V: Value of solution<br />
The adaptive behaviors under the proposed control mechanism are shown in figures 2 and 3 for the deterministic case, and in figures 4 and 5 for the stochastic case. These represent time series of the decision variables at each control point. Under stress the system changes its behavior adaptively to the new environment, and once the stress is removed the system adapts again.
Figure 2. Adaptive value mode under Con3
Figure 3. Adaptive optimal T under Con3
Figure 4. Adaptive value mode under Con4
Figure 5. Adaptive optimal T under Con4<br />
7. Summary <strong>and</strong> conclusions<br />
A typical information network emerges as a result of automation or organizational integration. These networks are large-scale, with distributed, component-based architectures. As such networks are easily exposed to various adverse events, such as accidental failures and malicious attacks, there is a need to study their survivability.
In this paper we studied such emerging networks, supporting survivability by utilizing implementation alternatives. Adopting an MPC-style approach for its benefits with respect to complexity, optimality, and agility, we developed an adaptive control mechanism with scalability and predictability. To address
adaptivity we modeled the stress environment indirectly<br />
by quantifying resource availability of the system. We<br />
built a mathematical programming model with the<br />
resource availability incorporated, which predicts QoS as<br />
a function of control actions. By periodically solving the<br />
programming model <strong>and</strong> taking optimal control actions<br />
with recent resource availability, the system could be<br />
adaptive to the changing stress environment predictably.<br />
But as the programming model can be large-scale and complex, we agentified the components of the network from a control point of view so that the system can solve the large-scale programming model in a decentralized mode, providing a hierarchical auction mechanism for coordination. We showed the effectiveness of our approach with respect to QoS and adaptivity under different experimental conditions.
Our approach can be extended to network configurations where multiple agents in one machine share resources. In this case there is a good opportunity to improve system performance by appropriately allocating resources among the agents.
To implement the proposed control mechanism in information networks such as the UltraLog network, several elements discussed in developing the mechanism must be put in place. Each component should have a value function and sensors. To coordinate the components through the hierarchical auction market, sellers must be built with appropriate optimization algorithms, and components and sellers must be able to make bids so as to provide the necessary information to the auction market. As the system makes periodic decisions, a seller at the top of the hierarchy may send market-opening messages to market participants periodically.
Acknowledgements<br />
Support for this research was provided by DARPA (Grant#: MDA972-01-1-0038) under the UltraLog program. We thank Dr. Mark Greaves (DARPA), Marshall Brinn, Beth DePass, and Aaron Helsinger (all from BBN) for their suggestions in this work.
References<br />
[1] S. Jha and J. M. Wing, "Survivability analysis of networked systems", 23rd International Conference on Software Engineering, pp. 307-317, 2001
[2] R. Ellison, D. Fisher, H. Lipson, T. Longstaff, and N. Mead, "Survivable network systems: An emerging discipline", Technical Report CMU/SEI-97-153, Software Engineering Institute, Carnegie Mellon University, 1997
[3] J. E. Eggleston, S. Jamin, T. P. Kelly, J. K. MacKie-Mason, W. E. Walsh, and M. P. Wellman, "Survivability through Market-Based Adaptivity: The MARX Project", DARPA Information Survivability Conference and Exposition, 2000
[4] S. Bowers, L. Delcambre, D. Maier, C. Cowan, P. Wagle, D. McNamee, A. L. Meur, and H. Hinton, "Applying Adaptation Spaces to Support Quality of Service and Survivability", DARPA Information Survivability Conference and Exposition, 2000
[5] O. F. Rana and K. Stout, "What is scalability in multi-agent systems?", Fourth International Conference on Autonomous Agents, 2000
[6] B. Meyer, "On to components", IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999
[7] P. Clements, "From Subroutine to Subsystems: Component-Based Software Development", in Alan W. Brown, editor, Component Based Software Engineering, IEEE Computer Society Press, pp. 3-6, 1996
[8] F. M. T. Brazier, C. M. Jonker, and J. Treur, "Principles of Component-Based Design of Intelligent Agents", Data and Knowledge Engineering, vol. 41, no. 1, pp. 1-28, 2002
[9] H. J. Goradia and J. M. Vidal, "Building blocks for agent design", Fourth International Workshop on Agent-Oriented Software Engineering, pp. 17-30, 2003
[10] R. Krutisch, P. Meier, and M. Wirsing, "The AgentComponent approach, combining agents and components", Net.objectDays, 2003
[11] A. P. Moore, R. J. Ellison, and R. C. Linger, "Attack Modeling for Information Security and Survivability", Technical Note CMU/SEI-2001-TN-001, Software Engineering Institute, Carnegie Mellon University, 2001
[12] F. Moberg, "Security Analysis of an Information System Using an Attack Tree-based Methodology", Master's Thesis, Automation Engineering Program, Chalmers University of Technology, 2000
[13] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming", Artificial Intelligence, vol. 72, pp. 81-138, 1995
[14] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control", IEEE Control Systems, vol. 12, no. 2, pp. 19-22, 1992
[15] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey", Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996
[16] J. B. Rawlings, "Tutorial overview of model predictive control", IEEE Control Systems, vol. 20, no. 3, pp. 38-52, 2000
[17] M. Morari and J. H. Lee, "Model predictive control: past, present and future", Computers and Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999
[18] M. Nikolaou, "Model Predictive Controllers: A Critical Synthesis of Theory and Industrial Needs", Advances in Chemical Engineering Series, Academic Press, 2001
[19] S. J. Qin and T. A. Badgwell, "A survey of industrial model predictive control technology", Control Engineering Practice, vol. 11, pp. 733-764, 2003
[20] S. Clearwater, Market-Based Control: A Paradigm for Distributed Resource Allocation, World Scientific Publishing, 1996
[21] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press, 1999
[22] S. Kumar, "Confidence based dual reinforcement Q-routing: An on-line adaptive network routing algorithm", Technical Report AI98-267, Department of Computer Sciences, The University of Texas at Austin, 1998
[23] Y. Lengwiler, "The multiple unit auction with variable supply", Economic Theory, vol. 14, pp. 373-392, 1999
SITUATION IDENTIFICATION USING DYNAMIC PARAMETERS IN<br />
COMPLEX AGENT-BASED PLANNING SYSTEMS<br />
SEOKCHEON LEE, N. GAUTAM, S. KUMARA, Y. HONG, H. GUPTA,<br />
A. SURANA, V. NARAYANAN, H. THADAKAMALLA, M. BRINN, M.<br />
GREAVES<br />
Department of <strong>Industrial</strong> Engineering<br />
The Pennsylvania State University<br />
University Park, PA 16802<br />
ABSTRACT<br />
Survivability of multi-agent systems is a critical problem. Real-life systems are<br />
constantly subject to environmental stresses. These include scalability,<br />
robustness <strong>and</strong> security stresses. It is important that a multi-agent system adapts<br />
itself to varying stresses <strong>and</strong> still operates within acceptable performance<br />
regions. Such adaptivity comprises identifying the state of the agents, relating it to stress situations, and then invoking control rules (policies). In this paper, we study a supply chain planning society implemented in COUGAAR (Cognitive Agent Architecture), developed by DARPA (Defense Advanced Research Projects Agency), and develop a methodology to identify behavior parameters and relate those parameters to stress situations. We verify the proposed method experimentally.
1. INTRODUCTION<br />
Survivability of multi-agent systems is a critical problem. Real-life systems are<br />
inherently distributed <strong>and</strong> are constantly subject to environmental <strong>and</strong> internal stresses.<br />
These include scalability, robustness, and security stresses. It is important that a multi-agent system adapts itself to varying stresses and still operates within an acceptable performance region. Such adaptivity comprises identifying the state of the agents, relating it to stress situations, and then invoking control rules (policies). One of the
fundamental problems is agent state (behavior) identification.<br />
In this paper, we study a supply chain planning society called Small Supply Chain (SSC), implemented in COUGAAR (Cognitive Agent Architecture) developed by DARPA (Defense Advanced Research Projects Agency), and develop a methodology for behavior parameter identification and for relating the parameters to stress situations. The two important
steps in our methodology are: 1. Identify the most discriminable behavior parameter set<br />
for situation identification, 2. Apply it to situation identification. To identify the most<br />
discriminable behavior parameter set we collect the time series data from one of the<br />
agents in SSC (TAO) <strong>and</strong> compute 38 statistical <strong>and</strong> deterministic parameters to represent<br />
the collected time series. In essence, these 38 parameters are the features of agent state. In<br />
our earlier work (Ranjan et al., 2002) we proved that SSC shows chaotic behavior from an inventory-fluctuation point of view and computed chaos indicators (which we call deterministic parameters without loss of generality). Although we compute 38 different parameters, the next question we address is whether all of them are really useful and necessary for identifying the stress situations. So, we develop a discriminability index and identify the most discriminable behavior parameter set based on this index as a
2<br />
representative parameter set for identifying several stress situations. Using those<br />
parameters we develop a nearest neighbor classification based method to identify stress<br />
situations.<br />
2. SSC (SMALL SUPPLY CHAIN) SOCIETY<br />
SSC is a COUGAAR society for supply chain planning composed of 26 agents. Each agent generates a logistics plan depending on its relative position in the supply chain. TAO is an important agent of the SSC and we have selected it to test our schema. Figure 1 shows the detailed view. In TAO, GenerateProjection tasks are expanded to Supply tasks, which are for internal consumption. Each Supply task is expanded to a Withdrawal task, which is allocated to the inventory asset. Supply tasks are also transferred from other agents; they too are expanded to Withdrawal tasks, which are allocated to the inventory asset. MaintainInventory tasks, which are for the maintenance of inventory assets in TAO, are expanded to Supply tasks, each of which is allocated to other agents.
Figure 1. TAO in SSC (task-flow diagram: GenerateProjection tasks and incoming ProjectSupply/Supply tasks are expanded to ProjectWithdrawal/Withdrawal tasks allocated to the inventory asset; MaintainInventory tasks are expanded to ProjectSupply/Supply tasks sent to other agents)
3. STRESSES AND BEHAVIOR<br />
For the sake of analysis we have parameterized the stress situations <strong>and</strong> system<br />
behavior.<br />
3.1 Stress<br />
Stress refers to survivability stress and includes scalability, security, and robustness stresses. Scalability is defined as the ability of a solution to a problem to work when the size of the problem increases. Survivability (regarding security and robustness) is defined as the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents (Ellison et al., 1997). There can be diverse stress situations, but in this paper we consider stress situations formed by the two scalability stress types given below:
• Problem Complexity: Problem complexity is determined by the complexity of the planning task. This includes many aspects, and we have chosen one stress type, called OpTempo, for each agent. OpTempo represents the tempo of military operations.
• Query Frequency: Each agent provides a query service for its planning information to human operators. We have chosen the query frequency (number of query requests per second) to each agent as the second stress type.
Although the SSC society is composed of 26 agents, only 8 agents are directly affected by OpTempo. We define three stress levels: Low/Medium/High. So, the size of our stress situation space becomes 3^34.
3.2 Behavior<br />
In the SSC society an agent’s behavior can be described by the behaviors of its task groups. These behaviors can be represented as time series. We define four different time series (Task arrival, Time to solution sorted by generation sequence, Time to solution sorted by completion sequence, and Queue length). A time series may be characterized using deterministic and statistical parameters, as shown in Table 1.
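The six statistical parameters in Table 1 can be computed directly from a time series. A minimal sketch, assuming ‘Radius’ denotes half the range and ‘Variance’ the sample variance (the text does not define either explicitly):

```python
import statistics

def statistical_parameters(series):
    """Compute the six statistical behavior parameters for one time series.

    Assumptions (not defined explicitly in the text):
    - 'Radius' is half the range, (max - min) / 2.
    - 'Variance' is the sample variance.
    """
    return {
        "# of events": len(series),
        "Average": statistics.mean(series),
        "Minimum": min(series),
        "Maximum": max(series),
        "Radius": (max(series) - min(series)) / 2,
        "Variance": statistics.variance(series),
    }

params = statistical_parameters([3.0, 5.0, 4.0, 6.0, 2.0])
print(params["Average"], params["Radius"])  # → 4.0 2.0
```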
Deterministic characterization makes it possible to handle non-stationary, non-periodic, irregular time series, including chaotic deterministic time series. In this study we use five different deterministic behavior parameters. Since the dynamics of the underlying deterministic dynamical system are unknown, we cannot reconstruct the original attractor that gave rise to the observed time series. Instead, we seek an embedding space in which we can reconstruct, from the scalar data, an attractor that preserves the invariant characteristics of the original unknown attractor, using the delay coordinates proposed by Packard et al. (1980) and justified by Takens (1981). Average mutual information has been suggested by Fraser and Swinney (1986) for choosing the time delay, and Schuster (1989) proposed a nearest neighbor algorithm on which to base the choice of the embedding dimension. The local dimension has been used to define the number of dynamical variables that are active in the embedding dimension (Abarbanel et al., 1998). The most popular measure of an attractor’s dimension is the correlation dimension, first defined by Grassberger and Procaccia (1983). And a method to measure the largest Lyapunov exponent, the sensitivity to initial conditions that serves as a measure of chaotic dynamics, was proposed by Wolf et al. (1985). We have systematically studied the use of these methods from the literature and computed 38 different behavioral parameters to characterize the four time series we have considered. These 38 parameters are shown in Table 1.
Table 1. Behavioral parameters

                 Task        Time to Solution   Time to Solution   Queue
                 Arrival     (Generation)       (Completion)       Length
Statistical      # of events, Average, Minimum, Maximum, Radius, Variance
Parameters       (computed for Task Arrival, Time to Solution, and Queue
                 Length; identical for the two Time-to-Solution orderings)
Deterministic    ami, e_dim, l_dim, c_dim, l_exp
Parameters       (computed for each of the four time series)

ami: average mutual information, e_dim: embedding dimension, l_dim: local dimension, c_dim: correlation dimension, l_exp: Lyapunov exponent
4. EXPERIMENTATION AND RESULTS<br />
We ran several simulations of SSC to identify the most discriminable behavior<br />
parameter set.
4.1 Experimental configuration<br />
Figure 2. Experimental configuration (online experimentation: a stressor injects stress situations into the SSC society while behavior data from TAO flow into a database; offline analysis: parameter generation produces the parameter table)
In this experimentation we store event data from TAO and the parameters of the stress situation from the stressor in an online database; from the database we then construct the parameter table with stress parameters and behavior parameters, as shown in Fig. 2. The experimental matrix is shown in Table 2.
Table 2. Experimental matrix<br />
TestID OpTempo Query Repetition<br />
PRE001 Low to all agents Low to all agents 10<br />
PRE002 High to all agents Low to all agents 10<br />
PRE003 Medium to all agents Low to all agents 10<br />
PRE004 Medium to all agents High to all agents 10<br />
4.2 Results<br />
Reduction of stress space<br />
Figure 3 shows an example of the ‘# of events’ parameter in each experiment, repeated 10 times in four different stress conditions. We identified the stresses that have no significant effects on the society’s behavior by comparing the behavior parameters under different conditions. The results show:
• No significant difference between Low and Medium OpTempo stress
• No significant effect of query frequency stress
Figure 3. Comparison of a behavior parameter in different stress conditions (plot of ‘# of events’ from the Task Arrival time series, roughly 1040–1120 events, over 40 experiments under conditions PRE001–PRE004)
This leads to the reduction of the stress space to 2^8 (OpTempo Low/High for 8 agents) from 3^34.
Discriminability of behavior parameters
All behavior parameters may not be equally good in helping the classification of stress situations. Therefore, there is a need for a measure of the discriminating power of each behavior parameter. We call this measure the discriminability index (DI). The DI can be represented as the ratio between sensitivity to the stress situations and random variation, defined as:
Discriminability Index (DI) = [∑(µ − µ_i)²/n] / [∑ s_i²/n] = ∑(µ − µ_i)² / ∑ s_i²   (1)
µ : Average of parameter values
µ_i : Average of parameter values from the i-th condition
s_i : Standard deviation of parameter values from the i-th condition
n : Number of conditions
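Equation (1) can be computed directly from replicated measurements of one behavior parameter grouped by stress condition. A minimal sketch, assuming µ is the grand mean over all values (the text leaves this implicit):

```python
from statistics import mean, stdev

def discriminability_index(groups):
    """Discriminability index per Eq. (1).

    groups: one list of parameter values per stress condition.
    Assumption: mu is the grand mean over all values.
    """
    all_values = [v for g in groups for v in g]
    mu = mean(all_values)
    between = sum((mu - mean(g)) ** 2 for g in groups)  # sensitivity to stress
    within = sum(stdev(g) ** 2 for g in groups)         # random variation
    return between / within

# Two conditions whose means differ far more than their spread -> large DI
di = discriminability_index([[100, 101, 99], [200, 202, 198]])
print(di)  # → 1000.0
```

A parameter whose condition means coincide gives DI = 0, so ranking parameters by DI directly orders them by discriminating power.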
We ranked the 38 behavior parameters using the DI; the top 5 are shown in Table 3. As the table shows, ‘# of events’ from the task arrival time series was the most discriminable behavior parameter. Because this parameter is sensitive to different stress situations and has small variation within the same stress situation, its DI is much larger than those of the other parameters.
Table 3. Discriminability index (DI) of behavior parameters<br />
Rank DI Time Series Behavior Parameter<br />
1 2477 Task arrival # of events<br />
2 6 Time to solution Variance<br />
3 5 Time to solution Radius<br />
4 4 Time to solution Average<br />
5 4 Time to solution Maximum<br />
5. SITUATION IDENTIFICATION<br />
Results from the preliminary experimentation showed that ‘# of events’ from the task arrival time series (# of tasks) is the most discriminable behavior parameter in our stress space. So, assuming that the input to an agent affects its output depending on that agent’s stress situation, we can identify the OpTempo of an agent by using four features based on ‘# of tasks’, as shown in Fig. 4.
Figure 4. Features for situation identification (an agent with a given OpTempo, characterized by four features: # of ProjectSupply tasks from outside, # of Supply tasks from outside, # of ProjectSupply tasks to outside, and # of Supply tasks to outside)
We performed an initial design of experiments and constructed a database of the behavior parameters from 100 experiments. Each agent’s OpTempo is randomly chosen, and the parameters are computed and stored in the database. Given new experimental data, we select the nearest neighbor from the base database using the Euclidean distance between feature vectors. The stress level of the nearest neighbor is used as the stress estimate. We estimated the stress level for 100 new experimental data sets using this approach. The results of the estimation are shown in Table 4. The stress was identified successfully for half of the agents, but not for the other half.
Table 4. Stress estimation result<br />
Stress Correct estimation Stress Correct estimation<br />
OpTempo of agent 1 54% OpTempo of agent 5 100%<br />
OpTempo of agent 2 100% OpTempo of agent 6 94%<br />
OpTempo of agent 3 56% OpTempo of agent 7 53%<br />
OpTempo of agent 4 100% OpTempo of agent 8 46%<br />
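The nearest-neighbor estimation step above can be sketched as follows; the feature vectors and stress labels here are hypothetical stand-ins for the ‘# of tasks’ features and OpTempo levels stored in the base database:

```python
import math

def nearest_neighbor_stress(query, database):
    """Return the stress label of the stored feature vector closest to `query`.

    database: list of (feature_vector, stress_label) pairs.
    Distance is plain Euclidean, as in the text.
    """
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(database, key=lambda rec: euclidean(rec[0], query))
    return label

# Hypothetical base database: ('# of tasks' features, OpTempo level)
base = [
    ([1050.0, 980.0], "Low"),
    ([1115.0, 1040.0], "High"),
]
print(nearest_neighbor_stress([1110.0, 1035.0], base))  # → High
```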
6. CONCLUSIONS<br />
In this paper, we developed a methodology for extracting features (behavior parameters) from time series of an agent-based supply chain planning society and relating them to stress situations. We identified ‘# of tasks’ as the most discriminable of our 38 statistical and deterministic behavior parameters in our stress space. Using this parameter we validated the method’s ability to identify stress situations using nearest neighbor classification. Although our analysis showed that the deterministic parameters cannot identify stress situations in our stress space, it is possible that they are good indicators in other stress spaces, such as those involving security and robustness stresses.
ACKNOWLEDGEMENTS<br />
Support for this research was provided by DARPA (Grant#: MDA 972-01-1-0563) under the UltraLog program.
REFERENCES<br />
Abarbanel, H. D. I., Gilpin, M. E., Rotenberg, M., 1998, Analysis of Observed Chaotic Data, Springer.
Ellison, R. J., Fisher, D. A., Linger, R. C., Lipson, H. F., Longstaff, T., Mead, N. R., 1997, “Survivable Network Systems: An Emerging Discipline”, Technical Report CMU/SEI-97-153, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA.
Fraser, A. M., and Swinney, H., 1986, “Independent coordinates for strange attractors from mutual information”, Physical Review A, Vol. 33, pp. 1134–1140.
Grassberger, P., and Procaccia, I., 1983, “Characterization of Strange Attractors”, Physical Review Letters, Vol. 50, pp. 346.
Grassberger, P., and Procaccia, I., 1983, “Measuring the Strangeness of Strange Attractors”, Physica D, Vol. 9, pp. 189–208.
Packard, N. H., Crutchfield, J. P., Farmer, J. D., and Shaw, R. S., 1980, “Geometry from a Time Series”, Physical Review Letters, Vol. 45, pp. 712.
Ranjan, P., Kumara, S., Surana, A., Manikonda, V., Greaves, M., and Peng, W., 2002, “Decision Making in Logistics: A Chaos Theory Based Analysis”, AAAI Spring Symposium, Technical Report SS-02-03, pp. 130–136.
Schuster, H. G., 1989, Deterministic Chaos: An Introduction, Verlagsgesellschaft, Weinheim.
Takens, F., 1981, “Detecting strange attractors in turbulence”, Dynamical Systems and Turbulence, pp. 366–381, Springer, Berlin.
Wolf, A., Swift, J. B., Swinney, H. L., and Vastano, J., 1985, “Determining Lyapunov Exponents from a Time Series”, Physica D, Vol. 16, pp. 285–317.
Estimating Global Stress Environment by Observing Local Behavior in Distributed Multiagent Systems
Seokcheon Lee and Soundar Kumara
Department of Industrial and Manufacturing Engineering,
The Pennsylvania State University
University Park, PA 16802 USA
Abstract—A multiagent system can be considered survivable if it adapts itself to varying stresses without considerable performance degradation. Such adaptivity comprises identifying the behavior of the agents in a society, relating it to stress situations, and then invoking control rules. This problem is a hard one, especially in distributed multiagent systems, wherein agent behaviors tend to be nonlinear and dynamic. In this paper, we study a supply chain planning system implemented in COUGAAR (Cognitive Agent Architecture) and develop a methodology for identifying the behavior of agents through their behavioral parameters, and relating those parameters to stress situations. One important aspect of our approach is that we identify the stress situations of agents in the society by observing the local behavior of one representative agent. This approach is motivated by the fact that, in deterministic dynamical systems, a local time series can carry information about the dynamics of the entire system. We validate our approach empirically by identifying stress situations using the k-nearest neighbor algorithm based on the behavioral parameters.
I. INTRODUCTION
Survivability is defined as “the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents” [1]. This definition considers
security <strong>and</strong> robustness stresses as components of the stress<br />
environment. With the increasing size of networked systems<br />
scalability becomes a critical issue for a system to fulfill its<br />
mission [2]. We argue that scalability is also an important<br />
component of survivability <strong>and</strong> hence the stress<br />
environment. In this paper we consider only scalability<br />
stress in dealing with survivability.<br />
Survivability of multiagent systems is a critical problem. As infrastructures become large-scale and increasingly dependent on networked systems for automation or organizational integration, this capability becomes more and more important. Real-life systems are inherently distributed and are constantly subject to environmental and internal stresses. Hence, it is important that a multiagent system adapt itself to varying stresses and maintain its performance within acceptable bounds. The three important constituents of adaptivity are: agent behavior identification, mapping the agent behavior to the environment (stresses), and invoking the appropriate control rules (policies).
This work was supported in part by DARPA under Grant MDA 972-01-1-0038.
S. Lee is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (phone: 814-863-4799; fax: 814-863-4745; e-mail: stonesky@psu.edu).
S. Kumara is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (e-mail: skumara@psu.edu).
In this paper, we study a supply chain planning system,<br />
Small Supply Chain (SSC) society implemented in<br />
COUGAAR (Cognitive Agent Architecture:<br />
http://www.cougaar.org) as an example system. We develop<br />
a methodology to identify the stress situations of the agents<br />
in the society by observing local behavior of one<br />
representative agent (called TAO). This information can<br />
subsequently be used to devise <strong>and</strong> invoke control policies.<br />
The two important steps in our methodology are: 1. Extract<br />
meaningful behavioral parameters for situation<br />
identification, 2. Apply these parameters to situation<br />
identification. We collect time series data from TAO <strong>and</strong><br />
compute 38 statistical <strong>and</strong> deterministic parameters to<br />
represent its behavior. In essence, these 38 parameters are<br />
the features of the behavior of the society as we can assume,<br />
from the theory of deterministic dynamic systems [3], [4],<br />
that the behavior of TAO has the information of the<br />
dynamics of the entire system. Not all of the 38 parameters are equally important and independent. We therefore develop a discriminability index for the parameters, based on which we extract the meaningful behavioral parameters.
Using those selected parameters we develop a k-nearest<br />
neighbor classification based method to identify stress<br />
situations of agents in the society.<br />
The organization of the paper is as follows. In section II<br />
we discuss the SSC society. In section III, we parameterize<br />
stress situations <strong>and</strong> behavior. In section IV we analyze the<br />
results from preliminary experimentation to build the<br />
methodology. In section V we implement our approach to<br />
identify the stress situations. <strong>Final</strong>ly, in section VI, we<br />
conclude our work.<br />
II. SSC (SMALL SUPPLY CHAIN) SOCIETY<br />
SSC is a COUGAAR society for military supply chain planning composed of 26 agents, 17 of which perform the actual planning. COUGAAR has a distributed, component-based architecture in which agents are geographically distributed and process their specific types of tasks. The objective of the SSC society is to generate a logistics plan for a given military operation. Each agent, representing an organization of the military supply chain, processes tasks received from other agents or generated internally. Those tasks are allocated to assets after being expanded or aggregated. The allocations in an agent trigger the generation of tasks to its supplier agents to refill the assets. When tasks from customers are allocated in a supplier agent, the results are fed back to the customer agents. Fig. 1 shows the task flow structure of the SSC society. TAO (Agent 3), which provides direct logistics support to combat units (Agents 1 and 2), is an important agent of the SSC society with respect to its relationships with other agents and its volume of tasks. We have selected it as a representative agent to test our schema.
III. STRESS AND BEHAVIOR<br />
For analysis purposes we parameterize the stress <strong>and</strong><br />
behavioral space.<br />
A. Stress<br />
There are diverse survivability stresses with respect to scalability, security, and robustness [5]–[7]. For implementation purposes, we consider three types of scalability stresses, as follows:
- Network topology: We consider one aspect of scalability stress to be adding or removing agent(s) connected to TAO in the existing topology. Theoretically we could randomly add or remove any agent; however, we consider only agent 1 and TAO together. The three stress levels we impose on TAO are: removing agent 1, keeping the single existing agent 1, and adding one more agent of agent 1’s type.
- Problem Complexity: Problem complexity is determined by the complexity of the planning tasks. This includes many aspects, and we have chosen OpTempo, which represents the tempo of military operations, to implement this stress type. We define three stress levels of OpTempo for each of the 16 agents other than agent 1: Low, Medium
and High.
Fig. 1. SSC society (figure: task-flow structure among the 17 planning agents, numbered 1–17; TAO is agent 3)
- User Query: Each agent provides a query service for its planning information to human operators. We have chosen query frequency, the number of query requests per second, to implement this stress type. We define three stress levels of query frequency for each of the 16 agents other than agent 1: Low, Medium and High.
The size of the stress space is very large: combining these three types of stresses, the size of the stress space becomes 3^33 (3·3^16·3^16).
B. Behavior<br />
In the SSC society an agent’s behavior can be abstracted by observing the agent’s task processing. We define four different time series related to the agent’s task processing, as follows:
- Task arrival: Task inter-arrival times from other agents<br />
as well as TAO itself<br />
- Time to solution sorted by generation sequence: Time<br />
durations taken to complete a task from its generation,<br />
sorted by generation sequence<br />
- Time to solution sorted by completion sequence: Time<br />
durations taken to complete a task from its generation,<br />
sorted by completion sequence<br />
- Queue length: Number of tasks that are waiting for<br />
processing<br />
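The four time series above can be derived from a log of task events. A sketch with a hypothetical event-record format (the actual COUGAAR event schema will differ):

```python
# Hypothetical task records: (task_id, arrival_time, completion_time)
tasks = [
    ("t1", 0.0, 4.0),
    ("t2", 1.0, 3.0),
    ("t3", 2.5, 6.0),
]

# Task arrival: inter-arrival times in arrival order
arrivals = sorted(t[1] for t in tasks)
task_arrival = [b - a for a, b in zip(arrivals, arrivals[1:])]

# Time to solution, sorted by generation sequence
tts_generation = [c - a for _, a, c in sorted(tasks, key=lambda t: t[1])]

# Time to solution, sorted by completion sequence
tts_completion = [c - a for _, a, c in sorted(tasks, key=lambda t: t[2])]

# Queue length sampled just after each event (+1 on arrival, -1 on completion)
events = sorted([(a, +1) for _, a, _ in tasks] + [(c, -1) for _, _, c in tasks])
queue_length, q = [], 0
for _, delta in events:
    q += delta
    queue_length.append(q)

print(task_arrival)    # → [1.0, 1.5]
print(queue_length)    # → [1, 2, 3, 2, 1, 0]
```

Note that the two time-to-solution series contain the same values in different orders, which is why their order-independent statistics coincide.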
A time series can be characterized using deterministic <strong>and</strong><br />
statistical parameters. We have systematically studied the<br />
use of the methods from the literature <strong>and</strong> computed 38<br />
different behavioral parameters to characterize the four time<br />
series we have considered. These 38 parameters, composed of 18 statistical and 20 deterministic parameters (from dynamical systems theory), are shown in Table I. They represent the features of the agent’s behavior.
TABLE I
BEHAVIORAL PARAMETERS

                 Task        Time to Solution   Time to Solution   Queue
                 Arrival     (Generation)       (Completion)       Length
Statistical      # of events, Average, Minimum, Maximum, Radius, Variance
                 (computed for Task Arrival, Time to Solution, and Queue
                 Length; identical for the two Time-to-Solution orderings)
Deterministic    AMI, E_Dim, L_Dim, C_Dim, L_Exp
                 (computed for each of the four time series)

AMI: Average Mutual Information, E_Dim: Embedding Dimension, L_Dim: Local Dimension, C_Dim: Correlation Dimension, L_Exp: Lyapunov Exponent
Deterministic characterization makes it possible to h<strong>and</strong>le<br />
non-stationary, non-periodic, irregular time series, including
chaotic deterministic time series. In this paper we use five different deterministic behavioral parameters. Since the dynamics of a deterministic dynamical system are unknown, we cannot reconstruct the original attractor that gives rise to the observed time series. Instead, we seek an embedding space in which we can reconstruct, from the scalar data, an attractor that preserves the invariant characteristics of the original unknown attractor, using delay coordinates [3], [4]. This motivates us to characterize the system dynamics of the society by observing local behavior. Average mutual information has been suggested for selecting the time delay [8]. A nearest neighbor algorithm on which to base the choice of the embedding dimension is proposed in [9]. The local dimension has been used to define the number of dynamical variables that are active in the embedding dimension [10]. The most popular measure of an attractor’s dimension is the correlation dimension [11], [12]. In [13] a method to measure the largest Lyapunov exponent, the sensitivity to initial conditions that serves as a measure of chaotic dynamics, is proposed. As these parameters are well documented in the references we have given, we do not undertake a detailed explanation.
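The delay-coordinate reconstruction step itself is simple once the time delay τ and embedding dimension d have been chosen (via average mutual information and the nearest-neighbor criterion, respectively). A minimal sketch of the embedding step only:

```python
def delay_embed(series, dim, tau):
    """Reconstruct delay-coordinate vectors from a scalar time series.

    Each point is (x[t], x[t + tau], ..., x[t + (dim - 1) * tau]), following
    Packard et al. / Takens. Choosing `tau` and `dim` (e.g., via average
    mutual information and a nearest-neighbor test) is done separately.
    """
    n = len(series) - (dim - 1) * tau
    return [tuple(series[t + k * tau] for k in range(dim)) for t in range(n)]

points = delay_embed([0, 1, 2, 3, 4, 5, 6], dim=3, tau=2)
print(points)  # → [(0, 2, 4), (1, 3, 5), (2, 4, 6)]
```

Invariants such as the correlation dimension and the largest Lyapunov exponent are then estimated on these reconstructed points.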
IV. PRELIMINARY EXPERIMENTATION
We ran several experiments to reduce the stress space by removing ineffective stress situations (stresses which do not change the existing behavior of a given agent). In addition, we use the experiments to extract meaningful behavioral parameters from the 38 behavioral parameters we computed. In the following we undertake a detailed explanation.
A. Experimental configuration
Fig. 2. Experimental configuration (online experimentation: a stressor injects stress situations into the SSC society while behavior data from TAO flow into a database; offline analysis: parameter generation produces the parameter table)
In this experimentation we store event data from TAO and the stress parameters from the stressor (i.e., the injector of stresses) in an online database; from the database we then construct the parameter table with stress parameters and behavioral parameters, as shown in Fig. 2. The experimental matrix for this preliminary experimentation is shown in Table II. There are four different experimental conditions with different OpTempo and query frequency levels. We replicate each condition ten times.
TABLE II
EXPERIMENTAL MATRIX
TestID   OpTempo                Query               Replication
PRE001   Low to all agents      Low to all agents   10
PRE002   High to all agents     Low to all agents   10
PRE003   Medium to all agents   Low to all agents   10
PRE004   Medium to all agents   High to all agents  10
For all experiments the number of agent 1 is one.
B. Results
1) Reduction of stress space: We identified the stress situations that have no significant effects on the system dynamics by analyzing the behavioral parameters. Fig. 3 shows an example, the ‘# of events’ parameter from the ‘Task Arrival’ time series, in four different stress conditions. By analyzing all 38 parameters systematically we concluded that:
- There is no significant difference between Low and Medium levels of OpTempo stress.
- There is no significant effect of query frequency stress.
This analysis leads to the reduction of the stress space to 3·2^16 (the number of agent 1: 0/1/2; OpTempo for each of the 16 agents other than agent 1: Low/High) from 3^33.
Fig. 3. Variance of a behavioral parameter (plot of ‘# of events’, roughly 1030–1120 events, over 40 experiments under conditions PRE001–PRE004)
2) Discriminability of behavioral parameters: All of the 38 behavioral parameters may not be equally good in helping the identification of stress situations. It is important to select good parameters, especially when we use deterministic parameters, because they are computationally expensive. Therefore, there is a need for a measure of the discriminating power of the parameters. We developed an index to measure this discriminating power, which we call the DI (Discriminability Index). The DI is the ratio between sensitivity to the stress situations and random variation, defined as in (1).
DI = [∑(µ − µ_i)²/n] / [∑ s_i²/n] = ∑(µ − µ_i)² / ∑ s_i²   (1)
µ : Average of parameter values
µ_i : Average of parameter values in the i-th condition
s_i : Standard deviation of parameter values in the i-th condition
n : Number of conditions
A DI value greater than one implies that the particular parameter can help in discriminating between the situations (more discriminating power). We calculated DI values for the 38 behavioral parameters and selected those parameters with DI values larger than one. This resulted in 10 parameters, comprising eight statistical and two deterministic parameters, as shown in Table III. ‘# of events’ from ‘Task Arrival’ and ‘Time to Solution’ was the most discriminating behavioral parameter. Note that ‘# of events’ is the same for both time series, as arrived tasks are processed.
TABLE III<br />
DISCRIMINABILITY INDEX OF BEHAVIORAL PARAMETERS<br />
Rank DI Time Series Parameter<br />
1 2477.5 Task Arrival/Time to Solution # of events<br />
2 5.7 Time to Solution(G) Variance<br />
3 5.1 Time to Solution(G) Radius<br />
4 4.4 Time to Solution(G) Average<br />
5 4.2 Time to Solution(G) Maximum<br />
6 2.9 Queue length # of events<br />
7 2.8 Queue length Maximum<br />
8 2.2 Queue length AMI<br />
9 1.2 Queue length Average<br />
10 1.1 Time to Solution(C) L_Expo<br />
(G): Generation, (C): Completion<br />
V. SITUATION IDENTIFICATION<br />
Results from preliminary experimentation show that 10 of the 38 behavioral parameters have better discriminating power in the stress space. Using them as features, we identify the stress situations with the k-nearest neighbor classification algorithm.
A. k-nearest neighbor algorithm<br />
The k-nearest neighbor algorithm, one of the instance-based learning methods, is conceptually straightforward. The summary reported here is based on [14]. In this algorithm, learning simply means storing the training instances, each of which corresponds to a point in the n-dimensional feature space. Given a new query instance, its k nearest neighbors are retrieved from memory and used to classify it. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. One problem with this algorithm is its sensitivity to noisy features in high-dimensional problems. One possible solution would be to normalize each feature. However, normalization does not resolve this problem, because the Euclidean distance can become very noisy in high-dimensional problems where only a few of the features carry classification information. The solution is to modify the Euclidean metric with a set of weights that represents the information content, or goodness, of each feature. Therefore, given a set of weights w, the distance between two normalized instances x_i and x_j with m features can be calculated as in (2).
d(x_i, x_j) = Σ_{k=1}^{m} w[k] (x_i[k] − x_j[k])²   (2)
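The weighted metric of (2) and the resulting classification step can be sketched as follows (an illustrative Python example; the feature values, weights and labels are invented):

```python
# Sketch: k-nearest-neighbor classification with the weighted Euclidean
# metric of (2). Weights (e.g., proportional to DI) are illustrative.

def weighted_distance(x_i, x_j, w):
    # d(x_i, x_j) = sum_k w[k] * (x_i[k] - x_j[k])^2, as in (2)
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x_i, x_j))

def knn_classify(query, instances, labels, w, k):
    # retrieve the k stored instances nearest to the query and take a
    # majority vote over their labels
    ranked = sorted(range(len(instances)),
                    key=lambda idx: weighted_distance(query, instances[idx], w))
    votes = [labels[idx] for idx in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy 2-feature example: only feature 0 carries class information, so it
# gets all the weight; feature 1 is noise.
X = [[0.1, 0.9], [0.2, 0.1], [0.9, 0.5], [1.0, 0.8]]
y = ["low", "low", "high", "high"]
print(knn_classify([0.15, 0.99], X, y, w=[1.0, 0.0], k=3))  # → low
```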
B. Empirical results
We performed 200 experiments with the same experimental configuration as in Fig. 2 to construct the database of training instances. Each training instance is represented by the 10 behavioral parameters. In these experiments each agent's OpTempo (Low/High) and the number of agent 1 (0/1/2) are randomly chosen. Given a new instance, we select its 20 nearest neighbors (10% of the population of training instances) from the database and use them to estimate the stress situations of the agents in the society. To assess the effectiveness of DI we use 12 different sets of weights to calculate the distance. In the first 10 sets only one parameter is considered, with the other parameters' weights set to zero. We weight the parameters equally in the 11th set and proportionally to DI in the 12th set.
We estimated the stress situations for 100 new instances using these 12 different sets of weights. To eliminate the noise from agents that have no significant effect on the behavioral parameters, we removed from our analysis those agents that could not be identified correctly more than 2/3 of the time under any weight set. Through this procedure only 8 agents were retained. The rates of correct estimation for the different weight sets are shown in Fig. 4.
On the whole, the performance of the behavioral parameters in identifying the stress situations correlates well with their DI: the higher a parameter's DI rank, the higher its estimation accuracy. When the parameters are weighted equally the performance is in the middle, and when they are weighted proportionally to DI the performance is the highest. This result demonstrates the effectiveness of DI in capturing the goodness of the behavioral parameters for situation identification. Fig. 5 shows the performance for each agent when weighting proportionally to DI. The farther an agent is located from the TAO, the more its performance degrades.
Fig. 4. Correct estimation using different weight sets (x-axis: DI rank; legend: weighted with DI, weighted equally)
VI. CONCLUSIONS<br />
In this paper, we developed a methodology for extracting features by characterizing time series and relating them to stress situations in distributed multiagent systems. One important aspect of our approach is that we identify the stress situations of the agents in the society by observing the local behavior of one representative agent. This approach is motivated by the fact that, in deterministic dynamic systems, a local time series can carry information about the dynamics of the entire system. Identifying the situations of other agents is important when agents are interdependent in networked systems.

When we have a large society we will be able to predict the stress levels of some other agents in the society, which helps in invoking an appropriate control policy. For example, by studying the local behavior of the TAO during a certain period, we may be able to estimate that agent 11's OpTempo is high with 62% accuracy. This may prompt us to reduce the number of tasks sent to that agent, since high OpTempo requires more computational resources.
To extract meaningful behavioral parameters we collected time series data from a representative agent and computed 38 statistical and deterministic parameters to represent its behavior. The Discriminability Index, defined in this paper as a measure of the discriminating power of the parameters, appears to be a promising direction for agent behavior estimation. Using the selected parameters, we validated our approach by identifying the stress situations with the k-nearest neighbor algorithm, using the index values as weights. Although our analysis showed that the deterministic parameters do not have a significant ability to identify stress situations in our stress space, they may be good indicators in other stress spaces, such as security and robustness stresses.
Fig. 5. Correct estimation with proportional weights to DI
REFERENCES
[1] R. Ellison, D. Fisher, H. Lipson, T. Longstaff, and N. Mead, "Survivable network systems: An emerging discipline," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU/SEI-97-153, 1997.
[2] O. F. Rana and K. Stout, "What is scalability in multi-agent systems?" in Proc. 4th Int. Conf. Autonomous Agents, 2000, pp. 56–63.
[3] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw, "Geometry from a time series," Physical Review Letters, vol. 45, pp. 712–716, 1980.
[4] F. Takens, "Detecting strange attractors in turbulence," in Dynamical Systems and Turbulence, D. Rand and L.-S. Young, Eds. Springer: Berlin, 1981, pp. 366–381.
[5] A. P. Moore, R. J. Ellison, and R. C. Linger, "Attack modeling for information security and survivability," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Note CMU/SEI-2001-TN-001, 2001.
[6] F. Moberg, "Security analysis of an information system using an attack tree-based methodology," M.S. thesis, Automation Engineering Program, Chalmers University of Technology, Sweden, 2000.
[7] S. Jha and J. M. Wing, "Survivability analysis of networked systems," in Proc. 23rd Int. Conf. Software Engineering, 2001, pp. 307–317.
[8] A. M. Fraser and H. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, pp. 1134–1140, 1986.
[9] H. G. Schuster, Deterministic Chaos: An Introduction, Verlagsgesellschaft: Weinheim, 1989.
[10] H. D. I. Abarbanel, M. E. Gilpin, and M. Rotenberg, Analysis of Observed Chaotic Data, Springer: New York, 1998.
[11] P. Grassberger and I. Procaccia, "Characterization of strange attractors," Physical Review Letters, vol. 50, pp. 346–349, 1983.
[12] P. Grassberger and I. Procaccia, "Measuring the strangeness of strange attractors," Physica D, vol. 9, pp. 189–208, 1983.
[13] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano, "Determining Lyapunov exponents from a time series," Physica D, vol. 16, pp. 285–317, 1985.
[14] T. M. Mitchell, Machine Learning, McGraw-Hill, pp. 230–236, 1997.
Using Predictors to Improve the Robustness of Multi-Agent Systems: Design and Implementation in Cougaar

† Himanshu Gupta, ‡ Yunho Hong, ‡ Hari Prasad Thadakamalla, † Vikram Manikonda, ‡ Soundar Kumara and † Wilbur Peng
† Intelligent Automation Incorporated
7519 Standish Place, Suite 200, Rockville, MD 20855
{hgupta, vikram, wpeng}@i-a-i.com
‡ Industrial and Manufacturing Engineering
310 Leonhard Building, The Pennsylvania State University, University Park, PA 16802
{yyh101, hpt102, skumara}@psu.edu
Abstract<br />
In this paper we discuss the use of predictors as a means to improve the robustness of a multi-agent system in the event of information attacks that result in a loss of communication between agents. We focus on an adaptive logistics application developed under DARPA's Ultralog program using the Cougaar agent infrastructure. The objective of the predictors is to estimate key "state variables", such as demand and inventory, in the absence of communication, allowing logistics planning and execution to continue when communication resources are limited or lost. Prediction schemes based on model-based linear state estimation and on moving averages are discussed. A generalized software implementation of the predictors as plugins within a Cougaar agent, and approaches to reconcile errors between the "estimated" and "actual" states when communication is restored, are also discussed. Experimental results based on the implementation of the predictors in a logistics society subject to simulated communication losses and changes to the operational plan are presented.
1 Introduction<br />
Agent-based technology provides a natural solution for inherently complex, distributed and decentralized systems, where a desired solution emerges as a set of autonomous, interacting entities execute and optimize their individual and group behavior in a dynamically changing environment. Adaptive logistics is one such example. In this setting, agents represent logistics entities such as Units of Action (UA), Forward Support Battalions (FSB), Brigades, and Companies. These agents, distributed across various physical and logical boundaries, collaborate to perform logistics sustainment operations such as forecasting logistics consumption trends, identifying potential shortfalls, and planning, executing, monitoring and re-planning logistics operations in a dynamically changing environment.

When deployed in a battlefield environment, the agent infrastructure is subject to several stresses, such as wartime loads (e.g. CPU stress due to variable loads), information attacks (e.g. denial of service, communication loss, reduced bandwidth) and kinetic attacks (e.g. loss of hardware resources). For successful deployment, the overall agent infrastructure needs to be robust and resilient to these stresses and attacks.

In this paper we discuss the use of predictors as a means to achieve robust behavior in the event of information attacks that result in a loss of communication between agents. We use an adaptive logistics application developed under DARPA's Ultralog program using the Cougaar agent infrastructure [9] as a testbed to motivate, implement and test the predictor designs. Two prediction schemes are discussed: the first is based on a linear state estimation approach that models agent interactions as a dynamical system, and the second is based on moving averages.
The paper is organized as follows. Section 2 discusses the modeling approach adopted to build the predictors. Section 3 describes the software implementation of the predictors as plugins within a Cougaar agent. Section 4 presents experimental results based on the implementation of the predictors in a logistics system built on Cougaar, along with the approaches adopted to tune the predictors using historical data. Section 5 discusses conclusions and possible future areas of research and development.
2 Predictor Design and Algorithms
In this section we discuss the predictor algorithms that were implemented in Cougaar. Before presenting the technical details of the algorithms, we give a brief description of the logistics application domain, as some aspects of the design and implementation are specific to the application.
2.1 Logistics Scenario<br />
The Cougaar multi-agent society considered in this effort is the Full society developed as part of DARPA's Ultralog program (see [1] for more details). The Full society is a military supply chain logistics society that spans many different supply classes. Each agent in the society represents a military unit performing a certain logistics operation in the supply chain. For example, the TRANSCOM agent represents the transportation command authority for the US military; it issues directives to its subordinate units regarding the transportation to be provided to a particular agent for a particular type of shipment. Figure 1 shows the organizational structure of the prototype Full society.
Figure 1. Full society hierarchical structure<br />
There are five main supply chain threads in the prototype military logistics society: (i) the Ammunition Supply Chain; (ii) the Petroleum, Oil and Lubricants Supply Chain (BulkPOL and PackagedPOL); (iii) the Subsistence Supply Chain (Food, Water); (iv) the Repair Parts Supply Chain; and (v) the Transportation Supply Chain.
Within each supply chain there exists a customer-supplier relationship between the various agents. A customer makes requests for various items (POL, ammunition, etc.) to its supplier, and the supplier in turn attempts to meet these demands from its current inventory or forwards the requests up the supply chain hierarchy. Thus, depending on its position in the hierarchy, a supplier can also be a customer of another agent.

Figure 2 shows a part of the supply chain. Here, FSB is a supplier, and ARBN and INFBN are customers of FSB (note that FSB is in turn a customer of MSB). These agents send demand requests to the FSB, which are managed by its Inventory Manager. Based on the operation plan, Optempo and current inventory, each customer (agent) requests items from its supplier.
Figure 2. Predictor Implementation in Agent<br />
Network
2.2 Role of Predictors<br />
As mentioned earlier, when deployed in a battlefield this agent society is subject to several stresses related to varying wartime loads and to kinetic and information warfare. These stresses may result in node failures, denial of service and other network-related faults that leave agents unable to communicate. In this decentralized application, the inability of customers and suppliers to make and meet requests can significantly impact the performance of the various operational units.

In this setting, predictors can play an important role in maintaining supply chain connectivity while network-related faults are being repaired. The predictors provide the ability to approximate the "expected behavior" by continuing to make appropriate demand/supply projections.
We focus on two classes of predictors: (i) a customer predictor that resides at the supplier agent and estimates the customer's demand when communications are lost; and (ii) a supplier predictor, inserted at the customer agent, that predicts the allocation results for the tasks generated by the supplier. As shown in Figure 2, the customer predictor residing on FSB forecasts each customer's (ARBN and INFBN) demand when communications are lost. In a similar fashion, the supplier predictor residing on the INFBN agent predicts the supplier's behavior. These agents use the predicted values to continue executing their functionality.
Depending on the accuracy of the predictors, the predicted states are typically not identical to the actual states. Thus, when communications are restored and actual demand/supply values become available, any errors between estimated and actual values need to be resolved. This process, termed "reconciliation", requires surplus tasks to be rescinded and new tasks to be added for any shortfalls. The predictors in turn need to update their models based on the available data.
2.3 Predictor Design<br />
2.3.1 Customer Predictor<br />
The customer predictor is implemented in the form of two plugins. One plugin is used during the planning mode, where it collects data about the customer-supplier relationship and the items involved, along with the Optempo of these items. The other plugin is used during the execution mode; it monitors the demand from the customer and predicts the demand when there is a communication loss. Figure 3 shows the framework of the customer predictor during the execution mode.
Figure 3. Customer Predictor<br />
2.3.2 Supplier Predictor<br />
This predictor is also built in the form of two plugins: one resides at the supplier and the other at the customer. The plugin at the supplier periodically sends snapshots of the inventory levels for each item in all the supply classes to the plugin at the customer, which uses this information to predict the allocation results of the demand tasks. The design of the supplier predictor is shown diagrammatically in Figure 4.
Figure 4. Supplier Predictor<br />
2.4 Predictor Algorithms<br />
Different approaches for the predictors were investigated based on the nature of the supply chain dynamics (uncertainty in demand, model complexity), the duration of communication loss, and the computational requirements. These ranged from dynamical systems and classification theory to traditional forecasting [3,4,6,7,8]. While some of our research [5] and prototypes indicate that better prediction results may be obtained using a nonparametric method such as a support vector machine or a radial basis function neural network, from an implementation and computational perspective this becomes impractical: as the society size scales, it becomes increasingly difficult to generate historical patterns for each agent and classify its behavior states under varying system configurations and environments. Given the practical nature of the logistics application, we adopted more generic, computationally inexpensive approaches based on moving averages and linear-model-based estimation. In this section the design and implementation of the two schemes are discussed.
2.4.1 Model-based State Estimator<br />
This approach is motivated in part by traditional approaches to estimation such as the linear Kalman filter [2], in which the system state is estimated by propagating the state using a model of the system and updating it using actual state measurements. The implementation adopted here is fairly simplistic, but it has been shown to perform well in a large number of settings.

Recall that a Cougaar-based logistics society operates in two modes: a planning mode and an execution mode. In the planning mode, a logistics plan is generated from the demands projected by each of the agents, based on the anticipated operational plan and Optempo. In the execution mode this plan is executed, with modifications and re-planning done as needed based on actual demands.

The model-based state estimator uses plan-time information as the "best" estimate of the state (demand) in the event that actual state information is not available. In this approach, plan-time data is used to build a linear estimator model of the system of the form
x(k+1|k) = x(k) + u(k)   (1)

where x(k) is the demand at time k and u(k) is the request for the "change in demand" at time k. Since x(k) is available at each time step during the planning mode, u(k) is explicitly computed as

u(k) = x(k+1) − x(k)   (2)

for each time step and saved. Thus, from the data available during plan generation, a model with a state and an input for each time step is built. During the execution mode, as time evolves, the estimator projects the demand for the next time step using (1) and corrects its estimate based on the actual demand as follows:
x̂(k+1|k+1) = x(k+1|k) + K( x(k+1|k) − xm(k+1|k+1) )   (3)

Here x̂ denotes the estimated state, K is the filter gain, and xm is the actual (measured) demand. K is computed offline based on historical data and execution demand error covariances. In the event of a communication loss at time j, the demand x_cl for the next time step is projected as

x_cl(j+1) = x̂(j|j) + u(j)   (4)
Thus, using the above model, the customers and suppliers continue to execute their functionality using the estimated states until communication is restored. At that point, any differences between the estimated demands and the actual demands during the communication loss are reconciled, and supplies and demands are adjusted accordingly.
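Equations (1)-(4) can be sketched as follows (a minimal Python illustration, not the Cougaar implementation; the plan data and gain value are invented, and with the sign convention of (3) a negative gain K moves the estimate toward the measurement):

```python
# Sketch of the model-based estimator (1)-(4): plan-time demand x_plan
# gives the inputs u(k) = x(k+1) - x(k); at execution time the estimate
# is corrected toward the measured demand with a fixed gain. All numbers
# below are illustrative.

class DemandEstimator:
    def __init__(self, x_plan, gain):
        # u(k) = x(k+1) - x(k), computed from plan-time data as in (2)
        self.u = [x_plan[k + 1] - x_plan[k] for k in range(len(x_plan) - 1)]
        self.K = gain
        self.x_hat = x_plan[0]

    def step(self, k, measured=None):
        pred = self.x_hat + self.u[k]        # x(k+1|k), as in (1)
        if measured is None:                 # communication lost: project via (4)
            self.x_hat = pred
        else:                                # correction step, as in (3)
            self.x_hat = pred + self.K * (pred - measured)
        return self.x_hat

est = DemandEstimator([10.0, 12.0, 15.0], gain=-0.5)
print(est.step(0, measured=14.0))  # corrected estimate
print(est.step(1))                 # communication lost: pure projection
```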
2.4.2 Moving Average Model<br />
In the moving-average approach the forecasted demand for day t, denoted F_t, for the time window i is given by

F_t = ( Σ_{k=t−i}^{t−1} D_k ) / i
where D_k is the demand for the k-th day. For example, let the time window be 4. Then the forecasted demand for day 10 is given by

F_10 = ( Σ_{k=6}^{9} D_k ) / 4 = (D_6 + D_7 + D_8 + D_9) / 4
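As a quick sketch of the forecast (the demand values are invented):

```python
# Sketch: moving-average forecast F_t over a demand history with window i.
D = {k: float(k) for k in range(1, 10)}  # illustrative demands D_1..D_9

def forecast(D, t, i):
    # F_t = sum_{k=t-i}^{t-1} D_k / i
    return sum(D[k] for k in range(t - i, t)) / i

print(forecast(D, 10, 4))  # (D_6 + D_7 + D_8 + D_9) / 4 = (6+7+8+9)/4 = 7.5
```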
We evaluate the effectiveness of the forecasting method with the following two error criteria, which are chosen depending on the requirements of the system. The corresponding optimal time window i is calculated using these criteria.

Error criterion 1: difference between the average of the demands and the average of the forecasted values,

E_t = ( Σ_{k=1}^{t} D_k ) / t − ( Σ_{k=1}^{t} F_k ) / t

where t = 1, 2, 3, …, D_t denotes the demand at day t, F_t denotes the forecasted value for day t, and E_t denotes the error at day t.
Error criterion 2: difference between the daily demand and the daily forecasted value,

E′_t = |D_t − F_t|
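A brute-force choice of the window i under criterion 2 might look like this (an illustrative sketch; the candidate windows and the demand series are invented):

```python
# Sketch: score each candidate window by the mean daily error
# E'_t = |D_t - F_t| (criterion 2) and keep the best one.

def forecast(D, t, i):
    # F_t = sum_{k=t-i}^{t-1} D_k / i
    return sum(D[k] for k in range(t - i, t)) / i

def best_window(D, candidates):
    first = min(D)  # earliest day with demand data
    scores = {}
    for i in candidates:
        # only score days with a full window of history available
        errs = [abs(D[t] - forecast(D, t, i)) for t in D if t - i >= first]
        scores[i] = sum(errs) / len(errs)
    return min(scores, key=scores.get)

# For a steadily growing demand, the shortest window lags the least.
D = {k: float(k) for k in range(1, 11)}
print(best_window(D, [1, 2, 3]))  # → 1
```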
3 Implementing Predictors as Cougaar<br />
Plugins<br />
A generalized predictor framework was adopted to implement predictors in Cougaar, ensuring component reusability with minimal code replication. In this approach each algorithm does not need a separate implementation; instead it extends the predictor implementation interface to make use of the data collection and other common functionality. The predictors were implemented as plugins, providing a lightweight implementation capability for different agents without any risk of corrupting or jamming other services provided by the architecture. The predictors are coupled with the application domain (in our case, the logistics application) and hence are not part of the Cougaar core release, but are available with the logistics functionality package.
3.1 The Generalized Predictor Framework<br />
The implementation of the predictor framework has the following features and services:
• It has a set of two plugins, one for the planning mode and the other for the execution mode. The plugins differ in the types of tasks they subscribe to and in their task-processing logic.
• The plugins automatically identify the customers of the agent in which the plugin is inserted. Thus the predictor does not need to know in advance which agents represent its customers.
• The plugins automatically identify the supply classes (e.g. Ammunition, Subsistence) and their respective items for each of the customers that the supplier predictor serves.
• The plan-time plugin subscribes to demand projection tasks and generates a demand/day model for each unique customer-supplyclass-item relationship. It publishes the data structure (the model) to the blackboard.
• The execution-time plugin subscribes to actual demand tasks and generates demand/day quantity values for each unique customer-supplyclass-item relationship.
• The execution-time plugin subscribes to the model generated by the plan-time plugin in order to update the model with actual demand values (this feature is used by the linear state estimator approach). The plugin does not subscribe to the model when other approaches, such as the moving average, are used.
• Different algorithm implementations can be hooked into the execution-time plugin to perform prediction and update the model.
• A predictor servlet allows the predictor to be turned ON or OFF or put to SLEEP manually. This feature is used during testing, e.g. in the event of low CPU availability or heavy memory usage. SLEEP mode refers to a passive data-collection mode with no communication loss, whereas OFF mode completely shuts down the predictor.
• The execution-time plugin can access the communication-loss object to automatically switch the predictor mode between ON and SLEEP.
• The framework is rehydration compatible. This enables the agent to store the data model and current state values so that it keeps functioning normally when rehydrated.
The above framework is robust and generic enough to be plugged into different agents, and it offers a plug-and-play mechanism for hooking up different prediction algorithms. It should be noted that in the current implementation the type of algorithm (model-based, average-based, etc.) cannot be chosen at run time; it is set as a rule in the society definition. Future work involves making algorithm selection a run-time capability.
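The shape of this framework can be sketched as follows (the real implementation is a pair of Cougaar plugins written in Java; the Python class, method names and trivial last-value algorithm below are all invented for illustration):

```python
# Sketch of the generalized framework's shape: a mode switch (ON/SLEEP/OFF),
# per customer-supplyclass-item demand history, and a pluggable algorithm hook.
from enum import Enum

class Mode(Enum):
    ON = 1      # actively predicting during a communication loss
    SLEEP = 2   # passive data collection, no communication loss
    OFF = 3     # predictor shut down

class ExecutionTimePredictor:
    def __init__(self, algorithm):
        self.algorithm = algorithm  # pluggable prediction algorithm
        self.mode = Mode.SLEEP
        self.history = {}           # demand/day per customer-supplyclass-item key

    def observe(self, key, day, demand):
        if self.mode is not Mode.OFF:
            self.history.setdefault(key, {})[day] = demand

    def predict(self, key, day):
        if self.mode is not Mode.ON:
            return None             # only predict while communications are lost
        return self.algorithm(self.history.get(key, {}), day)

# Hooking in a trivial algorithm: forecast the last observed demand.
last_value = lambda hist, day: hist[max(hist)] if hist else 0.0
p = ExecutionTimePredictor(last_value)
p.observe(("ARBN", "Ammunition", "item-1"), 1, 40.0)
p.mode = Mode.ON
print(p.predict(("ARBN", "Ammunition", "item-1"), 2))  # → 40.0
```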
3.2 Reconciliation Mechanism<br />
A reconciliation mechanism has been developed that, once communications are back up, resolves the differences between the actual and predicted values to avoid any overages or shortages. Furthermore, since at run time actual demands are often available for some period into the future, the mechanism uses estimated demands only for those days (after a communication loss) for which actual demand is not available. Reconciling the predictions with the actual demand after communication is restored has a significant impact, since it eliminates or reduces the cascading (bullwhip) effect up the supply chain. It also reduces the re-planning of tasks and resources that shortages or overages would otherwise cause.
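The reconciliation step can be sketched as follows (an illustrative Python sketch; the day/quantity data and the function name are invented, and the real mechanism operates on Cougaar tasks rather than plain numbers):

```python
# Sketch of reconciliation: estimated demands are kept only for days with
# no actual demand; over-predicted quantities are rescinded and
# under-predicted quantities become new tasks.

def reconcile(predicted, actual):
    """predicted/actual: {day: quantity}. Returns (rescind, add) adjustments."""
    rescind, add = {}, {}
    for day, p_qty in predicted.items():
        if day not in actual:
            continue                  # no actual demand yet: keep the estimate
        diff = p_qty - actual[day]
        if diff > 0:
            rescind[day] = diff       # over-predicted: rescind the surplus
        elif diff < 0:
            add[day] = -diff          # under-predicted: add a shortfall task
    return rescind, add

r, a = reconcile({43: 10.0, 44: 6.0, 45: 8.0}, {43: 7.0, 44: 9.0})
print(r, a)  # → {43: 3.0} {44: 3.0}
```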
4 Experimental Results<br />
4.1 Customer Predictor using Model-based<br />
State Estimation<br />
For our implementation and analysis, we consider two agents: FSB as the customer and MSB as the supplier. MSB provides FSB support for several supply classes, viz. Ammunition, Fuel, Subsistence and Consumable, which in turn contain various types and items. In total, more than 100 items are supplied by MSB to FSB.
Extensive testing and validation of the algorithms was performed on a number of societies with varying numbers of agents. The results obtained with the linear estimator approach are encouraging across different supply classes. Due to space limitations, we show results for only a few of these items.¹ Figure 5 shows the actual and predicted demand values for an Ammunition item. The predicted values are very close to the actual values and match the reorder periods of the actual demand. Communications were cut for 12 days (days 43-55), during which we still see predictions but no actual demand tasks. In Figure 6 (for a Subsistence item), observe that the predicted demand catches up with the actual demand. The initial error is due to initial model inaccuracies, which are reduced as the model is updated with observed data.
Figure 5. Actual vs. Predicted values for Ammo item
Figure 6. Actual vs. Predicted values for Subsistence item
Figure 7 and Figure 8 show the planned demand (model), actual demand, predicted demand without communication loss, and predicted demand with communication loss for a BulkPOL item and a Subsistence item, respectively. As actual demand values roll in, the measurement equations reduce the error, causing the model to closely mimic the actual demand pattern. With communication loss, the predictor uses the last updated value to forecast demand.
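The measurement-update loop just described can be illustrated with a scalar Kalman-filter step in the spirit of reference [2]. This is a hedged sketch: the initial state, the noise variances `q` and `r`, and the demand values are arbitrary choices for illustration, not taken from the UltraLog implementation.

```python
def kalman_step(x, p, z, q=1.0, r=4.0):
    """One predict/update cycle of a scalar Kalman filter.
    x = state estimate (demand level), p = estimate variance,
    z = new measurement (actual demand), q/r = process/measurement noise."""
    p = p + q                 # time update: uncertainty grows
    k = p / (p + r)           # Kalman gain
    x = x + k * (z - x)       # measurement update pulls x toward z
    p = (1 - k) * p
    return x, p

x, p = 50.0, 10.0             # rough (inaccurate) initial model
for z in [80, 82, 81, 83]:    # actual demand values roll in
    x, p = kalman_step(x, p, z)
# During a communication loss no measurements arrive, so the
# predictor simply holds the last updated estimate x.
print(round(x, 1))
```

The estimate starts far from the true level (50 vs. roughly 80) and converges within a few measurements, mirroring the "initial error reduced as the model is updated" behavior seen in Figure 6.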
Figure 7. With & W/o Comm. Loss Prediction for BulkPOL item
Figure 8. With & W/o Comm. Loss Prediction for Subsistence item
¹ Some of the experimental results in this paper are shown for the Tiny and Small societies, which are smaller versions of the Full society. Note that since the agents are generic, the demand patterns are similar in all the societies.
4.2 Moving Average Model Results
Table 1 shows typical data collected for the moving-average-based predictor. Each column shows the demand sent to the supplier (MSB) from the customer (FSB). On each execution day the customer sends the demand for about 20 days ahead. If, for example, there is a communication loss on day 51, the predictor forecasts the demand for day 52.
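The forecast for the missing day can be sketched as a plain moving average over the demands already received; the window size of 5 here is our assumption, not a documented parameter.

```python
def forecast_missing_day(demand_history, window=5):
    """Moving-average forecast: the mean of the last `window`
    observed demands stands in for the missing day's demand."""
    recent = demand_history[-window:]
    return sum(recent) / len(recent)

# Demands observed through day 51; day 52's message is lost.
history = [120, 118, 122, 119, 121]
print(forecast_missing_day(history))   # 120.0
```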
The graphs below (Figure 9, Figure 10 and Figure 11) show some of the results for the moving-average-based predictor. These are a few representative examples of the several runs performed across a number of Cougaar societies for various supply classes. The results show that the forecasted demand is quite close to the original demand, validating the predictor's methodology.
Table 1. Demand Requests from FSB to MSB
Figure 9. Comparison of the forecasted demand with the actual demand for BulkPOL in Small society
Figure 10. Comparison of the forecasted demand with the actual demand for BulkPOL in Tiny society
Figure 11. Comparison of the forecasted demand with the actual demand for Ammunition in Tiny society
5 Conclusions and Future Research
The generic predictor framework provides core functionality to Cougaar, making Cougaar a more survivable agent infrastructure. The predictor plugins can be invoked by any agent participating in the logistics supply process. Different algorithms can be hooked into the framework to use the data and other predictor services, eliminating the need to write predictor plugins from scratch. Initial studies show that the estimator models work well for items across different supply classes, with the predictions almost mimicking the actual demand values. However, due to the low variability and uncertainty of the observed demand, the performance of the predictors has not been extensively tested; testing with variable demand is currently in progress. Furthermore, as a certain class of predictors seems to perform better for a particular class of data, hybrid approaches that intelligently select the predictor algorithm based on data type and demand are being investigated. One such approach is the SMART predictor (Figure 12): a smart predictor monitors the demand coming from the customers and chooses which method should be used during the communication loss.
We observe that each method (model-based state estimator and moving average) gives good forecasts for certain types of data. A SMART predictor that chooses the type of predictor depending on the situation would therefore yield better forecasts.
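One simple way such a selector could work is a backtest: score each candidate on the demand already observed and use the best scorer during the outage. This is our illustrative sketch of the idea, not the SMART predictor's actual logic; `last_value` and `mean_of_last_3` are crude stand-ins for the model-based and moving-average methods.

```python
def smart_select(predictors, history):
    """Pick the named predictor with the smallest mean one-step-ahead
    error over the observed demand history (a simple backtest)."""
    def backtest_error(predict):
        errors = [abs(predict(history[:t]) - history[t])
                  for t in range(1, len(history))]
        return sum(errors) / len(errors)
    return min(predictors, key=lambda name: backtest_error(predictors[name]))

def last_value(h):         # crude stand-in for the model-based estimator
    return h[-1]

def mean_of_last_3(h):     # crude stand-in for the moving-average predictor
    return sum(h[-3:]) / len(h[-3:])

history = [10, 10, 10, 30, 10, 10]   # a spiky demand pattern
choice = smart_select({"model-based": last_value,
                       "moving-average": mean_of_last_3}, history)
print(choice)   # the averaging method wins on this spiky series
```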
Figure 12. Description of SMART Predictor
6 Acknowledgements
This research was performed under the DARPA UltraLog effort and was supported by DARPA grant MDA972-1-1-0038 and Contract 2087-IAI-ARPA-0038. We would like to thank Dr. Mark Greaves, Marshall Brinn and Beth DePass for their support, comments and insightful discussions. We would also like to thank Lora Goldston for her support in the development of the reconciliation code and in the testing and integration of the predictor algorithms.
7 References
1. Ultra*Log Adaptive Logistics Defense Team Plan, revised version 2.0, 2003.
2. Welch, G. and Bishop, G., An Introduction to the Kalman Filter, Department of Computer Science, University of North Carolina at Chapel Hill, TR 95-041, March 11, 2002.
3. Moody, J. and Darken, C. J., Fast learning in networks of locally-tuned processing units, Neural Computation, Vol. 1, pp. 281-294, 1989.
4. Rätsch, G., Onoda, T. and Müller, K.-R., Soft margins for AdaBoost, Machine Learning, Vol. 42, No. 3, pp. 287-320, March 2001.
5. Hong, Y., Gautam, N., Kumara, S. R. T., Surana, A., Gupta, H., Lee, S., Narayanan, V., Thadakamalla, H., Greaves, M. and Brinn, M., Survivability of Complex System - Support Vector Machine Based Approach, Proc. Artificial Neural Networks in Engineering (ANNIE), 2002.
6. Osuna, E. E., Freund, R. and Girosi, F., Support Vector Machines: Training and Applications, Technical Report AIM-1602, MIT A.I. Lab, 1997.
7. Vapnik, V. N., Statistical Learning Theory, John Wiley & Sons, New York, 1998.
8. Burges, C. J. C., A Tutorial on Support Vector Machines for Pattern Recognition, Knowledge Discovery and Data Mining, Vol. 2, No. 2, pp. 121-167, 1998.
9. Cougaar Website (www.cougaar.org)
SURVIVABILITY OF COMPLEX SYSTEM - SUPPORT VECTOR MACHINE BASED APPROACH
Y. HONG, N. GAUTAM, S. R. T. KUMARA, A. SURANA, H. GUPTA, S. LEE, V. NARAYANAN, H. THADAKAMALLA
The Dept. of Industrial Engineering, The Pennsylvania State University, University Park, PA 16802
M. BRINN
BBN Technologies, Cambridge, MA
M. GREAVES
DARPA IXO/JLTO, Arlington, VA 22203
ABSTRACT
Logistic systems, which are inherently distributed, can in general be classified as complex systems. Survivability of these systems under varying environmental conditions is of paramount importance. The different environmental conditions in which a logistic system resides translate into several stresses, which in turn manifest as internal stresses. Logistic systems can be modeled as a collection of software agents, where each agent's behavior is a result of the stresses imposed. Predicting the agents' collective behavior is essential to ensure survivability, yet analytical modeling of such systems is very difficult, if not impossible. In this paper, we study a supply chain based on a real-life scenario. We implement the supply chain in Cougaar (the Cognitive Agent Architecture developed by DARPA) and develop a predictor based on Support Vector Machines. We report our methodology and results with real-life experiments and stress scenarios.
INTRODUCTION
Logistic systems can be classified as complex systems (Choi et al., 2001; Baranger, http://necsi.org/projects/baranger/cce.pdf). Logistic systems have many components, such as suppliers and distributors at several stages. These components are geographically distributed but interdependent, and at each component some form of nonlinear decision making takes place. Typically the system responds in a stable manner to external disturbances, but due to information delays, the inherent feedback structure and nonlinear components, unstable phenomena can arise that may ultimately manifest as chaotic behavior. Efficient resource allocation and collective oscillations (of, say, inventory levels) are examples of emergent behavior shown by supply chains. Supply chains have structure at many scales: each component itself represents a simple supply chain. The components compete due to resource limitations but collaborate and cooperate to maximize their gains, which is another characteristic feature of a complex system.
The survivability of logistic systems under varying environmental conditions is of paramount importance. Survivability is itself an emergent property of a logistic system; it represents the ability of the system
to function critically even under adverse conditions. We refer to these adverse conditions as stresses. To improve survivability, agents should detect stresses and take appropriate actions so that they can adapt to stress conditions.
Due to the lack of analytical tools for predicting the emergent behavior of a complex system from its component behavior, simulation is the primary tool for designing and optimizing such systems. In this paper, we show how an agent learns to detect stresses, as a first step towards improving survivability. We implemented the supply chain in COUGAAR (the Cognitive Agent Architecture developed by DARPA) as a simulation model. Through an extensive design of experiments we subjected the supply chain to various stress conditions and made the agents learn to predict them using Support Vector Machines.
THE SMALL SUPPLY CHAIN (SSC)
We built a multi-agent system for a small supply chain using Cougaar version 8.6.0 (http://www.cougaar.org). Cougaar is an open-source multi-agent architecture appropriate for modelling large-scale distributed supply chain management applications. We call our supply chain system 'the Small Supply Chain (SSC)'.
Each agent in the SSC represents an individual organization in the supply chain, such as a retailer or a supplier. Figure 1 shows the demand flows in this small supply chain.
Figure 1. Demand flows in the Small Supply Chain (SSC)
STRESS TYPES AND LEVELS
After some preliminary experiments and observations, we used the following stress conditions to demonstrate our approach.
Stress 1. Changing OPTEMPO. The SSC works according to a Logistics Plan; the plan for each agent is prespecified. Every activity of each agent has an OPTEMPO value that represents the level of the activity, and changing OPTEMPOs can result in a different plan. OPTEMPO takes one of three values: 'low', 'medium' and 'high'.
Stress 2. Adding and Dropping Agents. Dropping agents can represent situations such as communication loss due to physical accidents or cyber attacks. When a retailer agent is dropped, its supporting agent will not receive tasks from the dropped agent, and its retailer agents will not receive responses from the
dropped agent. These changes affect planning significantly. By adding new retailer agents, we can evaluate how sensitive the SSC is to scaling; the addition of a new retailer agent increases the load on the other supplier agents.
PREDICTORS
In Cougaar, every agent has its own blackboard. During logistics planning, intermediate planning results are continuously accumulated on that blackboard, so by observing the blackboard we can recognize the planning state. Our idea is to detect stresses by observing the blackboard. Each agent should be able to detect stresses coming from outside so that it can decide autonomously how to handle them.
In this work, we build a separate supervised learning model for each agent. Many types of task classes are instantiated on the blackboard, and the collection of the counts of tasks of each type represents the state of the agent. (A task is a Java class in Cougaar that represents a logistic requirement or activity; tasks are generated successively along the supply chain, starting from the tasks of the retailers.) The learning model takes the state of the agent as its input features and predicts the corresponding stress type and level. The pattern recognition model - the predictor - is built using a Support Vector Machine.
To prepare training and test data, the blackboard of each agent is monitored during experiments and the data is stored in a database by a monitoring facility consisting of a specialized Plugin (a Java class provided by Cougaar) and a separate server machine. The pattern recognition model is trained off-line on the data from the database.
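The feature extraction step described here, counting the tasks of each type on a blackboard, can be sketched as follows. The snapshot data and the tuple representation of a task are hypothetical; only the task-type names come from Table 3.

```python
from collections import Counter

def state_features(blackboard_tasks, feature_types):
    """Count tasks of each type on an agent's blackboard and return
    the counts, in a fixed order, as the input vector for the SVM."""
    counts = Counter(task_type for task_type, _ in blackboard_tasks)
    return [counts[t] for t in feature_types]

# Hypothetical snapshot of Retailer 2's blackboard (its features per
# Table 3 are PS = ProjectSupply and PW = ProjectWithdraw):
tasks = [("ProjectSupply", "task-1"),
         ("ProjectWithdraw", "task-2"),
         ("ProjectSupply", "task-3")]
print(state_features(tasks, ["ProjectSupply", "ProjectWithdraw"]))  # [2, 1]
```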
SUPPORT VECTOR MACHINES (SVM)
A Support Vector Machine is a pattern recognition method. It has been popular since the mid-90s because of its theoretical clarity and good performance. Many pattern recognition applications have been reported since the theory was developed by Vapnik (Müller et al., 2001), which also exemplify its superiority over similar techniques. Moghaddam and Yang (2002) applied SVMs to appearance-based gender classification and showed that they are superior to other classifiers such as nearest-neighbor and radial basis function classifiers. Liang and Lin (2001) showed that SVMs perform better than conventional neural networks in the detection of delayed gastric emptying. For an exhaustive review, we refer the reader to (Burges, 1998), (Chapelle et al., 1999) and (Müller et al., 2001).
The main idea of the SVM is to separate the classes with a surface that maximizes the margin between them; this is an approximate implementation of the Structural Risk Minimization induction principle (Osuna et al., 1997). To construct a classifier for a given data set, an SVM solves a quadratic programming problem with one variable per data point; when the data set is large, special techniques such as decomposition are required to handle the large number of variables. The SVM is basically a linear classifier, so to handle a dataset that is not linearly separable, inner-product kernel functions are used. The role of the inner-product kernel functions is to convert an inner product of low-dimensional data
points into a corresponding inner product in a high-dimensional space without performing the actual mapping; the principle of the mapping is based on Mercer's theorem (Vapnik, 1998). By doing so, the SVM handles cases that are not linearly separable.
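For instance, the radial basis function kernel used later in this paper, k(x, y) = exp(-gamma * ||x - y||^2), evaluates an inner product in a high-dimensional feature space directly from the low-dimensional points. A minimal sketch (the input points and gamma value are arbitrary):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2): an inner product in a
    high-dimensional feature space, computed without the mapping."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))        # 1.0 for identical points
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))        # exp(-2), decaying with distance
```

The kernel value shrinks as the points move apart, which is what lets the SVM carve nonlinear boundaries in the original space.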
The selection of the kernel function depends on the problem, and an appropriate function should be chosen through experiments. The other control parameters for the SVM are the extra cost for errors (C), the loss-sensitivity constant (ε) and the maximum number of iterations. The extra cost for errors, C, is the cost assigned to training errors in cases that are not linearly separable; a larger C corresponds to a higher penalty on errors (Burges, 1998). The loss-sensitivity constant ε represents the allowable error range for the prediction values.
SVMs were originally developed as binary classifiers, and much current research addresses multi-class SVMs. We use BSVM 2.0, the multi-class SVM program of Hsu and Lin (2002).
EXPERIMENT CONDITIONS
We simulated the SSC under various stress conditions. Twelve stress conditions were formed from the combinations of the following factors: the number of Retailer 1 agents (zero, one, two), the OPTEMPO of Retailer 2 (LOW, HIGH) and the OPTEMPO of Retailer 3 (LOW, HIGH).
Of the 313 data sets in total, 219 were used for training and 94 for testing the prediction power. The conditions and number of experiments are shown in Table 1.
Condition Retailer 1 Retailer 2 Retailer 3 Training Test Total
1 Zero LOW LOW 25 11 36
2 Zero HIGH LOW 19 8 27
3 Zero LOW HIGH 18 8 26
4 Zero HIGH HIGH 17 7 24
5 One LOW LOW 29 13 42
6 One HIGH LOW 17 7 24
7 One LOW HIGH 18 7 25
8 One HIGH HIGH 18 8 26
9 Two LOW LOW 21 9 30
10 Two HIGH LOW 12 5 17
11 Two LOW HIGH 13 6 19
12 Two HIGH HIGH 12 5 17
Grand Total - - - 219 94 313
Table 1. The stress conditions and number of experiments
TRAINING
Through preliminary studies, apart from the experiments tabulated above, we found that not every stress condition affects every agent. Thus, we prepared different classification definitions for the training set depending on the agent (see Table 2).
As the classification definitions differ between agents, the input features also differ. The tasks used as input features for each agent are shown in Table 3.
We select the 'multi-class bound-constrained support vector classification' option in BSVM 2.0. For the other settings we use the BSVM 2.0 defaults, such as the radial basis function kernel with gamma = 1/(number of input features).
Retailer 2, Warehouse 2: Class 1 = conditions 1,3,5,7,9,11 (Retailer 2 LOW); Class 2 = conditions 2,4,6,8,10,12 (Retailer 2 HIGH)
Retailer 1: Class 1 = conditions 1,2,3,4 (zero Retailer 1); Class 2 = conditions 5,6,7,8 (one Retailer 1); Class 3 = conditions 9,10,11,12 (two Retailer 1)
Factory 2, Supplier 2: Class 1 = conditions 1,3 (Retailer 2 LOW at zero Retailer 1); Class 2 = conditions 2,4 (Retailer 2 HIGH at zero Retailer 1); Class 3 = conditions 5-12 (all other cases)
Warehouse 1, Factory 1: 12 classes, each condition regarded as one class
Retailer 3, Distribution Center 2: Class 1 = conditions 1,2,5,6,9,10 (Retailer 3 LOW); Class 2 = conditions 3,4,7,8,11,12 (Retailer 3 HIGH)
Distribution Center 1, Supplier 1: Class 1 = conditions 1,2 (Retailer 3 LOW at zero Retailer 1); Class 2 = conditions 3,4 (Retailer 3 HIGH at zero Retailer 1); Class 3 = conditions 5,6 (LOW at one Retailer 1); Class 4 = conditions 7,8 (HIGH at one Retailer 1); Class 5 = conditions 9,10 (LOW at two Retailer 1); Class 6 = conditions 11,12 (HIGH at two Retailer 1)
Wholesaler 1: Class 1 = conditions 9,10,11,12 (two Retailer 1); Class 2 = conditions 1-8 (other conditions)
Table 2. The class definitions by agent
Retailer 2: PS, PW | Warehouse 1: PS, PW, OS
Distribution Center 1: W, OPS, OS | Factory 1: TP, W, OPS, OS
Retailer 3: PS, PW | Supplier 1: TR, TP, OTP
Distribution Center 2: PS, PW | Retailer 1: PS, PW
Factory 2: TP | Warehouse 2: TP, W
Supplier 2: OTP | Wholesaler 1: S
* PS = ProjectSupply, PW = ProjectWithdraw, W = Withdraw, TP = Transport, TR = Transit, S = Supply, OPS = ProjectSupply coming from outside, OS = Supply coming from outside, OTP = Transport coming from outside
Table 3. The input features by agent
Retailer 2: 100% | Warehouse 1: 100%
Retailer 1: 100% | Distribution Center 1: 100%
Retailer 3: 100% | Factory 1: 22.34%
Distribution Center 2: 100% | Supplier 1: 40.43%
Factory 2: 100% | Warehouse 2: 84.04%
Supplier 2: 100% | Wholesaler 1: 86.17%
Table 4. The success rate in classifying the stress condition at each agent
RESULTS
Table 4 contains the test results. Overall performance is good. In addition, we can see that the agents near the retailers in the supply chain detect stresses well. The Warehouse 1 agent detects all the stress types exactly, even though it is far from the retailers (see Figure 1).
CONCLUSIONS
We have shown an effective application of a pattern recognition model for detecting stresses by observing the internal state of each agent. Each agent has its own SVM, since the same stress can influence different agents differently. Agents near the retailers can detect stresses very well; however, it is hard to detect the influence of a stress on agents that are far from the retailers. Overall, the performance of each agent's predictor is good. Constructing this stress-detection capability is the first step towards improving the survivability of a multi-agent system. The result is important because it enables further research on how to dampen the effect of stresses: based on the currently detected state, each agent can change its behavior - ordering or planning - to adapt to stress conditions without serious performance degradation of the overall supply chain. In addition, our approach is generally useful because it is very hard to model a complex system analytically.
ACKNOWLEDGEMENTS
Support for this research was provided by DARPA (Grant #: MDA 972-01-1-0563) under the UltraLog program.
REFERENCES
Baranger, M., "Chaos, Complexity, and Entropy - A physics talk for non-physicists," MIT-CTP-3112, http://necsi.org/projects/baranger/cce.pdf.
Burges, C. J. C., 1998, "A Tutorial on Support Vector Machines for Pattern Recognition," Knowledge Discovery and Data Mining, Vol. 2, No. 2, pp. 121-167.
Chapelle, O., Haffner, P. and Vapnik, V. N., 1999, "Support Vector Machines for Histogram-Based Image Classification," IEEE Transactions on Neural Networks, Vol. 10, No. 5, pp. 1055-1064.
Choi, T., Dooley, K. and Rungtusanatham, M., 2001, "Supply Networks and Complex Adaptive Systems: Control versus Emergence," Journal of Operations Management, Vol. 19, pp. 351-366.
Dooley, K., 2002, "Simulation Research Methods," Companion to Organizations, Joel Baum (ed.), London: Blackwell, pp. 829-848.
Hsu, C.-W. and Lin, C.-J., 2002, "A Comparison of Methods for Multiclass Support Vector Machines," IEEE Transactions on Neural Networks, Vol. 13, No. 2, pp. 415-425.
Liang, H. and Lin, Z., 2001, "Detection of Delayed Gastric Emptying from Electrogastrograms with Support Vector Machine," IEEE Transactions on Biomedical Engineering, Vol. 13, No. 2, pp. 415-425.
Moghaddam, B. and Yang, M., 2002, "Learning Gender with Support Faces," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 5, pp. 707-711.
Müller, K., Mika, S., Rätsch, G., Tsuda, K. and Schölkopf, B., 2001, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, Vol. 12, No. 2, pp. 181-202.
Osuna, E. E., Freund, R. and Girosi, F., 1997, Support Vector Machines: Training and Applications, Technical Report AIM-1602, MIT A.I. Lab.
Vapnik, V. N., 1998, Statistical Learning Theory, John Wiley & Sons, New York.
A Framework for Performance Control of Distributed Autonomous Agents
Nathan Gnanasambandam, Seokcheon Lee, Soundar R.T. Kumara and Natarajan Gautam
310 Leonhard Building, The Pennsylvania State University, University Park, PA 16802, USA
Abstract
We propose an autonomous and scalable queueing-theory-based methodology to control the performance of a hierarchical network of distributed agents. Multi-agent systems (MAS) such as supply chains functioning in highly dynamic environments need to achieve maximum overall utility during operation. Hence, the objective of the control framework is to identify the trade-offs between quality and performance and adaptively choose the operational settings that posture the MAS for better utility. By formulating the MAS as an open queueing network with multiple classes of traffic, we evaluate the performance, and subsequently the utility, from which we identify the control alternative for a localized, multi-tier zone.
Keywords: Queueing Network, Multi-Agent Systems, Performance Control.
1 Introduction
With the growing adoption of agent-oriented software systems [1] and the increased deployment of distributed multi-agent systems (DMAS) for numerous emerging applications such as computational grids, e-commerce hubs, supply chains and sensor networks, we are faced with large-scale distributed agent systems whose performance needs to be estimated and controlled. Often, these DMAS operate in dynamic and stressful environmental conditions of one type or another, in which the MAS as a whole must survive. While the notion of survival necessitates adaptivity to diverse conditions along the dimensions of performance, security and robustness, delivering the correct proportion of these quantities can be quite a challenge. In this paper, we address a piece of this problem by building an autonomous performance control framework for MAS.
When building large multi-agent societies (such as UltraLog [2]), it is desirable that the associated adaptation framework be generic and scalable. One way to do this is to use a methodology similar to Jung and Tambe [3], where the bigger society is composed of smaller building blocks, in this case corresponding to communities of agents. Although strategies for cooperativeness and distributed POMDPs have been utilized to analyze performance in [3], an increase in the number of variables in each agent can quickly render POMDPs ineffective even in reasonably sized agent communities, due to the state-space explosion problem. In [4], Rana and Stout identify data flows in the agent network and model scalability with Petri nets, but their focus is on identifying synchronization points, deadlocks and dependency constraints, with only coarse support for performance metrics relating to delays and processing times for the flows. In a recent architecture for autonomic computing, Tesauro et al. [5] build a real-time MAS-based framework that is self-optimizing based on application-specific utility. While [3, 4] motivate the need to estimate the performance of large DMAS using a building-block approach, [5] justifies the need for domain-specific utility whose basis should be the network's service-level attributes, such as delays, utilization and response times.
We believe that by using queueing theory we can analyze data flows within the agent community with greater granularity in terms of processing delays and network latencies, while capitalizing on a building-block approach by restricting the model to the agent community. Queueing theory has been widely used in networks and operating systems [6]; however, the authors have not seen queueing applied to MAS modeling and analysis. Since agents lend themselves to being conveniently represented as a network of queues, we concentrate on engineering a queueing-theory-based adaptation (control) framework to enhance application-level performance.
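The paper formulates the MAS as an open multi-class queueing network. As a toy illustration of the kind of service-level attributes such an analysis produces, consider a single agent approximated as an M/M/1 queue (tasks arrive at rate lambda, are served at rate mu); the rates below are made up for the example.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue: utilization rho,
    mean number in system L = rho/(1-rho), and mean response
    time W = L/lambda (Little's law)."""
    rho = arrival_rate / service_rate
    if rho >= 1:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    L = rho / (1 - rho)
    W = L / arrival_rate
    return rho, L, W

rho, L, W = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
print(rho, round(L, 3), round(W, 3))   # 0.8 4.0 0.5
```

Utilization, queue length and response time of this kind feed directly into the domain-specific utility that the control framework optimizes.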
Inherently, the DMAS can be visualized as a multi-layered system, as depicted in Figure 1a. The top-most layer is where the application resides, usually conforming to some organization such as a mesh, tree, etc. The infrastructure layer not only abstracts away many of the complexities of the underlying resources (such as CPU and bandwidth), but more importantly provides services (such as Message Transport) and supporting agent-to-agent services (such as naming and directory services). The bottom-most layer is where the actual computational resources, memory and bandwidth reside. Most studies in the literature do not make this distinction, and as such control is not executed in a layered fashion. Some studies, such as [7, 8], consider controlling attributes in the physical or infrastructure layers so that certain properties (e.g., survivability) result and/or the facilities provided by these layers can be taken advantage of. Often, this requires rewiring the physical layer, the availability of an infrastructure-level service, or the ability of the application to share information with underlying layers in a timely fashion for control purposes. In this work, we consider control only via application-level trade-offs, such as quality of service versus performance, and assume that infrastructure-level services (such as load balancing and priority scheduling) and physical-level capabilities (such as rewiring) are not available. This does not exclude the possibility of combining all approaches in the future to achieve multi-layered control.
Our contribution in this work is to combine queueing analysis and application-level control to engineer a generic framework that is capable of self-optimizing its domain-specific utility.
Figure 1: MAS framework. (a) Operational Layers; (b) Framework Architecture.
1.1 Problem Statement

Typically, the top-most layer in the computing infrastructure (here the DMAS-based application) has the greatest visibility into the system’s overall utility, control knobs and domain knowledge. The utility of the application is the combined benefit along several conflicting (e.g., completeness and timeliness [9, 2]) and/or independent (e.g., confidentiality and correctness [9, 2]) dimensions, which the application tries to maximize in a best-effort sense through trade-offs. Understandably, in a distributed multi-agent setting, mechanisms to measure, monitor and control this multi-criteria utility function become hard and inefficient, especially under conditions of scale-up. Given that the application does not change its high-level goals, task-structure or functionality in real-time, it is beneficial to have a framework that assists in the choice of operational modes (or opmodes) in a distributed way. Hence, the research objective of this work is to design and develop a generic, real-time framework for DMAS that utilizes a queueing network model for performance evaluation and a learned utility model to select an appropriate control alternative.
1.2 Solution Methodology

The focus of this research is to adjust the application-level parameters, or opmodes, within the distributed agents so as to make an autonomous choice of operational parameters for agents in a reasonably sized domain (called an agent community). The choice of opmodes is based on the perceived application-level utility of the combined system (i.e., the whole community) that current environmental conditions allow. We assume that the application’s utility depends on the choice of opmodes at the agents constituting the community, because the opmodes directly affect performance. A queueing network model is utilized to predict the impact of DMAS control settings and environmental conditions on steady-state performance (in terms of end-to-end delays in tasks), which in turn is used to estimate the application-level utility. After evaluating and ranking several alternatives from among the feasible set of operational settings on the basis of utility, the best choice is picked.
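The evaluate-rank-pick loop just described can be sketched in a few lines. This is a minimal illustration only, not the framework’s implementation: the candidate opmodes, the M/M/1-style delay formula standing in for the queueing network model, and the utility weights are all assumptions made for the sketch.

```python
# Illustrative opmode-selection loop (Section 1.2): predict steady-state
# delay for each candidate opmode, score it with a utility model, pick the
# best. All numbers and formulas here are hypothetical.

def predict_delay(rate, stress):
    """Stand-in for the queueing network model: M/M/1-style steady-state
    delay W = 1/(mu - lambda); infinite (unstable) when the processing
    rate cannot keep up with the offered load."""
    return float('inf') if rate <= stress else 1.0 / (rate - stress)

def utility(quality, delay):
    """Stand-in application-level utility trading quality against delay."""
    return 10.0 * quality - 2.0 * delay

# Each candidate opmode trades task quality against processing rate.
OPMODES = [(0.9, 1.0),   # high quality, slow processing
           (0.6, 2.0),   # medium
           (0.3, 4.0)]   # low quality, fast processing

def select_opmode(opmodes, stress):
    """Rank the feasible opmodes by predicted utility and pick the best."""
    return max(opmodes,
               key=lambda qr: utility(qr[0], predict_delay(qr[1], stress)))
```

Under light stress the high-quality mode maximizes utility; as stress rises, the faster, lower-quality mode wins, mirroring the quality-of-service versus performance trade-off described above.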
2 Architecture of the Performance Control Framework

We implement the performance control framework for the Continuous Planning and Execution (CPE) Society, which is a command and control MAS built on Cougaar (the DARPA agent platform [10]). While we describe the functionality of the components of the framework (Figure 1b) in this section, we highlight the autonomic capabilities that are built into the system.
2.1 Overview of Application (CPE) Scenario

In our set-up, the primary building block consists of three tiers in the application layer. CPE embodies a complete military logistics scenario with agents emulating roles such as suppliers, consumers and controllers, all functioning in a dynamic and hostile (destructive) external environment. Embedded in the hierarchical structure of CPE are both command-and-control and superior-subordinate relationships. The subordinates compile sensor updates and furnish them to superiors. This enables the superiors to perform their designated function of creating plans (for maneuvering and supply) as well as control directives for downstream subordinates. Upon receipt of plans, the subordinates execute them. The supply agents replenish consumed resources periodically. This high-level system definition is the functionality that CPE seeks to perform repeatedly, with maximum utility, while residing in the application layer. As part of the application-level adaptivity features, a set of opmodes is built into the system. Opmodes allow individual tasks (such as plans, updates and control) to be executed at different qualities or to be processed at different rates. We assume that TechSpecs for the CPE scenario are available to be utilized by the control framework. The framework that accomplishes the aforementioned goal of CPE in a distributed fashion, while performing at the maximum possible level of utility, is represented in Figure 1b.
2.2 Self-Monitoring Capability

Any system that wants to control itself should possess a clear specification of the scope of the variables it has to monitor. TechSpecs is a distributed structure that supports this purpose by housing all variables, X, that have to be monitored in different portions of the community (or sub-system). The data/statistics collected in a distributed way are then aggregated by the top-level controller that each community possesses, to assist in choosing control alternatives.

The attributes that need to be tracked are formulated in the form of measurement points (MP). The measurement points are “soft” storage containers residing inside the agents and contain information on what, where and how frequently they should be measured. Each agent can look up its own TechSpecs and from time to time forward that information to its superior. The superior can analyse this information (e.g., calculate statistics such as delay and delay-jitter) and/or add to it and forward it again. We have measurement points for time-periods, time-stamps, operating modes, control and generic vector-based measurements. These measurement points can be chained for tracking information for a flow, such that information is tagged on at every point the flow traverses. For the sake of reliability, the information contained in these agents is replicated at several points, so that when packets do not arrive on time, or do not arrive at all, previously stored packets can be utilized for control purposes.
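As a concrete picture of such a container, a measurement point can be modeled as a bounded, timestamped sample buffer that a superior periodically drains and summarizes. The class below is an illustrative sketch under that assumption, not the actual TechSpecs/Cougaar implementation; the attribute name and statistics are hypothetical.

```python
import statistics
from collections import deque

class MeasurementPoint:
    """Sketch of a "soft" storage container (MP): it knows what it
    measures and how often, keeps a bounded history, and can summarize
    that history for a superior."""

    def __init__(self, what, period_s, maxlen=128):
        self.what = what              # what is measured (e.g. "plan-delay")
        self.period_s = period_s      # how frequently it should be measured
        self.samples = deque(maxlen=maxlen)

    def record(self, stamp, value):
        self.samples.append((stamp, value))

    def summarize(self):
        """Statistics a superior might compute (e.g. delay, delay-jitter)."""
        values = [v for _, v in self.samples]
        return {"what": self.what,
                "mean": statistics.mean(values),
                "jitter": statistics.pstdev(values)}

# Chaining: each point a flow traverses tags on its own summary.
chain = []
mp = MeasurementPoint("plan-delay", period_s=5)
for stamp, delay in [(0, 0.8), (5, 1.2), (10, 1.0)]:
    mp.record(stamp, delay)
chain.append(mp.summarize())
```

The bounded deque also reflects the replication-for-reliability point: a superior that misses one forwarded packet can still fall back on previously stored samples.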
2.3 Self-Modeling Capability

One of the key features of this framework is its capability to choose a type of model for representing itself for the purpose of performance evaluation. The system is equipped with several queueing model templates that it can utilize to analyze itself. The type of model that is utilized at any given moment is based on accuracy, computation time and history of effectiveness. For example, a simulation-based queueing model may be very accurate but unable to evaluate enough alternatives in limited time, in which case an analytical model (such as BCMP or QNA [11]) is preferred.

The inputs to the model builder are the flows that traverse the network (F), the types of packets (T) and the current configuration of the network. If at a given time we know that there are n agents interconnected in a hierarchical fashion, then the role of this unit is to represent that information in the required template format (Q). The current number of agents is known to the controller by tracking the measurement points. For example, if there is no response from an agent for a sufficient period of time, then for the purpose of modeling the controller may assume the agent to be non-existent. In this way, dynamic configurations can be handled. On the other hand, TechSpecs do mandate connections according to superior-subordinate relationships, thereby maintaining the flow structure at all times. Once the modeling is complete, the MAS has the capability to analyze its current performance using the selected type of model. The MAS has the flexibility to choose another model template for a different iteration.
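The selection rule among templates can be sketched as below. The template names mirror those mentioned in the text, but the accuracy/cost figures and the scoring policy are purely illustrative assumptions.

```python
# Illustrative model-template selection (Section 2.3): prefer the most
# accurate template that can still evaluate every control alternative
# within the time budget, weighting accuracy by a history of observed
# effectiveness. Accuracy and cost figures are hypothetical.

TEMPLATES = {
    # name: (relative accuracy, cost in seconds per alternative evaluated)
    "simulation": (0.95, 5.0),
    "BCMP":       (0.80, 0.05),
    "QNA":        (0.75, 0.05),
    "Jackson":    (0.70, 0.01),
}

def choose_template(n_alternatives, time_budget_s, history=None):
    """Return the feasible template with the best effectiveness-weighted
    accuracy, where feasible means it fits in the time budget."""
    history = history or {}
    feasible = [(acc * history.get(name, 1.0), name)
                for name, (acc, cost) in TEMPLATES.items()
                if n_alternatives * cost <= time_budget_s]
    return max(feasible)[1]
```

With many alternatives and little time, an analytical template wins; with few alternatives, the more accurate simulation can be afforded, matching the trade-off described in the text.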
2.4 Self-Evaluating Capability

The evaluation capability, the first step in control, allows the MAS to examine its own performance under a given set of plausible conditions. This prediction of performance is used to eliminate control alternatives that may lead to instabilities. Our notion of performance evaluation is similar to [5]. While Tesauro et al. [5] compute resource-level utility functions (based on the application manager’s knowledge of system performance) that can be combined to obtain a globally optimal allocation of resources, we predict the performance of the MAS as a function of its operating modes in real-time (within the Queueing Model) and then use it to calculate the global utility. By introducing a level of indirection, we may gain some desirable properties (explained in Section 4.2), because we separate an application’s domain-specific utility computation from performance prediction (or analysis). This theoretically enables us to predict the performance of any application whose TechSpecs are clearly defined and then compute the application-specific utility. In both cases, control alternatives are picked on the basis of best utility. We discuss the notion of control alternatives in Section 2.5. Also, our performance metrics (and hence utility) are based on service-level attributes such as end-to-end delay and latency, which is a desirable attribute of autonomic systems [5].

When plan, update and control tasks (as mentioned in Section 2.1) flow in this heterogeneous network of agents along predefined routes (called flows), the processing and wait times of tasks at various points in the network are not alike. This is because the configuration (number of agents allocated on a node), resource availability (load due to other contending software) and environmental conditions at each agent are different. In addition, the tasks themselves can be of varying qualities or fidelities, which affects the time taken to process a task. Under these conditions, performance is estimated on the basis of the end-to-end delay involved in a “sense-plan-respond” cycle.
Table 1: Notation

Symbol      Description
N           Total number of nodes in the community
λ_ij        Average arrival rate of class j at node i
1/µ_ijk     Average processing time of class j at node i at quality k
M           Total number of classes
T_i         Routing probability matrix for class i
W_ijk       Steady-state waiting time for class j at node i at quality k
Q_ij        Set of qualities at which a class j task can be processed at node i
The primary performance prediction tool that we use is the Queueing Network Model (QNM) [6]. The QNM is the representation of the agent community in the queueing domain. As the first step of performance estimation, the agent community needs to be translated into a queueing network model. Table 1 provides the notation used in this section. Inputs and outputs at a node are regarded as tasks. The rate at which tasks of class j are received at node i is captured by the arrival rate λ_ij. Actions by agents consume time, so they are abstracted as processing rates µ_ij. Further, each task can be processed at a quality k ∈ Q_ij, which causes the processing rates to be represented as µ_ijk. Statistics of processing times are maintained at each agent in the Performance Database (PDB) to arrive at a linear regression model between quality k and µ_ijk. Flows are associated with classes of traffic denoted by the index j. If a connection exists between two nodes, it is converted to a transition probability p_ij, where i is the source and j is the target node. Typically, we consider flows originating from the environment, getting processed and exiting the network, making the agent network an open queueing network [6]. Since we may typically have multiple flows through a single node, we consider multi-class queueing networks where each flow is associated with a class. Performance metrics such as delays for the “sense-plan-respond” cycle are captured in terms of average waiting times, W_ijk. As mentioned earlier, TechSpecs is a convenient place where information such as flows and Q_ij can be embedded.
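For a single class, the translation above can be made concrete with an open Jackson network: external arrival rates, a transition-probability matrix p_ij, effective arrival rates obtained from the traffic equations, and per-node M/M/1 waiting times. The three-node chain and its rates below are hypothetical, chosen only to illustrate the computation.

```python
# Minimal open Jackson-network sketch of the QNM translation: solve the
# traffic equations lambda_i = gamma_i + sum_j lambda_j * p[j][i] by
# fixed-point iteration, then take per-node M/M/1 sojourn times
# W_i = 1/(mu_i - lambda_i). All rates are hypothetical.

def effective_arrival_rates(gamma, p, iterations=100):
    n = len(gamma)
    lam = list(gamma)
    for _ in range(iterations):
        lam = [gamma[i] + sum(lam[j] * p[j][i] for j in range(n))
               for i in range(n)]
    return lam

def sojourn_times(lam, mu):
    """Per-node delay; requires lam_i < mu_i at every node (stability)."""
    return [1.0 / (m - l) for l, m in zip(lam, mu)]

# Environment -> agent 0 -> agent 1 -> agent 2 -> out of the network.
gamma = [2.0, 0.0, 0.0]            # external arrivals at agent 0 only
p = [[0.0, 1.0, 0.0],              # agent 0 forwards everything to agent 1
     [0.0, 0.0, 1.0],              # agent 1 forwards everything to agent 2
     [0.0, 0.0, 0.0]]              # agent 2 routes tasks out
mu = [4.0, 3.0, 5.0]               # processing rates

lam = effective_arrival_rates(gamma, p)       # -> [2.0, 2.0, 2.0]
end_to_end = sum(sojourn_times(lam, mu))      # "sense-plan-respond" delay
```

Summing the per-node sojourn times along a flow's route gives exactly the end-to-end delay metric used for performance estimation in this section.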
The choice of QNM depends on the number of classes, the arrival distribution and the processing discipline, as well as a suggestion C by the DMAS controller, which makes this choice based upon a history of prior effectiveness. Some analytical approaches to estimating performance can be found in [6, 11]. In the context of agent networks, Jackson and BCMP queueing networks have been applied to estimate performance in [12]. Extending this work, we provide several templates of queueing models (such as BCMP, Whitt’s QNA [11], Jackson, M/G/1 and a simulation) that can be utilized for performance prediction.
2.5 Self-Controlling Capability

In contrast to [5], we deal with optimizing the domain utility of a MAS that is distributed, rather than allocating resources in an optimal fashion to multiple applications that have a good idea of their utility function (through policies). As mentioned before, opmodes allow for trading off quality of service (task quality and response time) against performance. We assume there is a maximum ceiling R on the amount of resources, and the available resources fluctuate depending on the stresses S = S_e + S_a, where S_e are the stresses from the environment (i.e., multiple contending applications and changes in the infrastructural or physical layers) and S_a are the application stresses (i.e., increased tasks). The DMAS controller receives from the measurement points MP a measurement of the actual performance P and a vector of other statistics (X) about task processing times. Also, at the top level the overall utility U(P, S) = Σ_n w_n x_n is known, where x_n is an actual utility component and w_n is the associated weight. We cannot change S, but we can adjust P to get better utility. Since P depends on O, a vector of opmodes collected from the community, we can use the QNM to find the O* and hence P* that maximize U(P, S) for a given S. In words, we find the vector of opmodes O* that maximizes domain utility at the current S, and update O. This computation is performed in the Utility Calculator module using a utility model that is learned and stored in the Utility Database (UDB). This formulation, although arrived at independently, matches the self-optimization notion in [5]. Some differences exist, however. Tesauro et al. [5] assume that the system’s knowledge includes a performance model, which we do not assume. We use a queueing network model to estimate the performance in real-time for any set of opmodes O′, by taking the current set of opmodes O and scaling the observed histories X appropriately to X′ in the Control Set Evaluator. Also, we deal with a single MAS with an overall utility function for the entire distributed functionality (within the community). Because of the interactions involved and the complexity of performance modeling [3, 4], it may be time-consuming to utilize inferencing and learning mechanisms in real-time. This is why we use an analytical queueing network to get the performance estimate quickly. Another difference is that [5] assumes operating-system support, which may not be available in many MAS-based situations because of mobility, security and real-time constraints. Furthermore, in addition to estimating performance, the queueing model may have the capability to eliminate instabilities in the queueing sense, which is not apparent in the other approach. In spite of these differences, it is interesting to see that self-controlling capability can be achieved, with or without explicit layering, in a couple of real-world applications.
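The scaling-and-ranking step performed by the Control Set Evaluator and Utility Calculator can be illustrated as follows. The scaling rule (observed processing times divided by a per-stage speed factor), the stand-in performance model, and the weights are all assumptions made for the sketch, not the framework's actual rules.

```python
# Sketch of the Control Set Evaluator / Utility Calculator interaction
# (Section 2.5): observed processing-time statistics X under the current
# opmodes are scaled to X' for each candidate O', turned into a
# performance estimate P', and ranked by U = w_q * quality - w_d * P'.
# All values below are hypothetical.

WEIGHTS = (10.0, 2.0)   # (w_quality, w_delay)

# Each candidate opmode set: (name, per-stage speed factors, task quality).
CANDIDATES = [
    ("high-quality", (1.0, 1.0, 1.0), 0.9),
    ("fast",         (2.0, 2.0, 2.0), 0.5),
]

def scale_history(X, speeds):
    """X -> X': faster opmodes shrink the observed mean processing times."""
    return [x / s for x, s in zip(X, speeds)]

def predict_performance(X_prime):
    """Stand-in queueing model: end-to-end delay as the sum of stage times."""
    return sum(X_prime)

def best_opmodes(X):
    """Return the name of the O* maximizing U for the observed history X."""
    def score(candidate):
        _, speeds, quality = candidate
        delay = predict_performance(scale_history(X, speeds))
        return WEIGHTS[0] * quality - WEIGHTS[1] * delay
    return max(CANDIDATES, key=score)[0]
```

Under light observed load the high-quality mode maximizes utility; under heavy load the fast mode does, so O* adapts to S even though S itself cannot be changed.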
Figure 2: Results overview (utility versus stress S for the Default Policy and Controlled runs)

Figure 3: Sample results. (a) Instantaneous Utility (stress 0.25); (b) Cumulative Utility (stress 0.25); (c) Instantaneous Utility (stress 0.75); (d) Cumulative Utility (stress 0.75). Each panel plots utility against time (sec.) for the Controlled, Default A and Default B runs.
3 Empirical Evaluation on CPE Test-bed

We utilized the prototype CPE framework to run 36 experiments at two stress levels (S = 0.25 and 0.75). The scenario consisted of 14 agents, besides a world agent that created random military-logistics scenarios for the agents to react to. There were three layers of hierarchy with three-way branching at each level, and one supply node. The community’s utility function was based on the achievement of real goals in military engagements, such as terminating or damaging the enemy, and on reducing the penalty involved in consuming resources such as fuel or sustaining damage. For the model selection process, we assumed that the external arrivals were Poisson and the service times were exponentially distributed; under this assumption, a BCMP or M/G/1 queueing model could be utilized. To cater to general arrival processes, our framework also contains QNA- and simulation-based models. We used the Cougaar-based default control, without additional support from our framework, as the baseline (denoted Default A and Default B) and found that controlling the agent community using our framework (denoted Controlled) was beneficial in the long run. An overview of the results is provided in Figure 2.
At both stress levels, the controlled scenario performed better than the default, as shown in Figure 3. We did observe oscillations in the instantaneous utility, which we attribute to the imprecision of the stress predictions. Stresses vary relatively fast, on the order of seconds, while the control granularity was on the order of minutes. Since this is a military engagement situation following no pre-determined stress patterns, the higher-stress case is hard to cope with; we think this could be the reason why our utility falls in the latter case.
4 Conclusions and Future Work

4.1 Conclusions

In this paper, we were able to successfully control a real-time MAS to achieve better overall utility in the long run using application-level trade-offs between quality of service and performance. We utilized a queueing-network-based framework for performance analysis and subsequently used a learned utility model for computing the overall benefit to the MAS (i.e., the community). While Tesauro et al. [5] have found a similar construction to improve utility in multiple applications, we concentrated on optimizing the utility of a single distributed application using queueing theory. We think that the approaches are complementary, with this study providing empirical evidence to support the observation in [1] that agents can be used to optimize distributed application environments, including themselves, through flexible high-level (i.e., application-level) interactions.
4.2 Discussion and Future Work

We believe that keeping the building blocks small and the number of interactions (between performance and utility models) minimal may assist in making the framework more flexible and scalable. For example, if the system size increases, we can consider a superior agent or human user at the next higher level controlling the weights in the utility function without affecting the performance model. The larger system with supervisory control would then be analyzed using another higher-level QNM or a network of networks. TechSpecs has assisted this effort to a large extent, re-emphasizing the well-founded separation principle (separating knowledge/policy from mechanism) in the computing field. While we think that the aforementioned architectural principles have been well utilized, we hope to broaden the layered control approach to encompass infrastructure-level control within the framework. Another avenue for improvement is to design self-protecting mechanisms within our framework so that its security aspect is reinforced.
Acknowledgments

The work described here was performed under DARPA UltraLog Grant # MDA972-1-1-0038. The authors wish to acknowledge DARPA for their generous support.
References

[1] Jennings, N. R., and Wooldridge, M., 2000, “Agent-Oriented Software Engineering”, Handbook of Agent Technology, AAAI/MIT Press.

[2] UltraLog Program Site. www.ultralog.net. DARPA.

[3] Jung, H., and Tambe, M., 2003, “Performance Models for Large Scale Multi-Agent Systems”, Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems.

[4] Rana, O. F., and Stout, K., 2000, “What is Scalability in Multi-Agent Systems?”, Proceedings of the Fourth International Conference on Autonomous Agents.

[5] Tesauro, G., Chess, D. M., Walsh, W. E., Das, R., Whalley, I., Kephart, J. O., and White, S. R., 2004, “A Multi-Agent Systems Approach to Autonomic Computing”, Autonomous Agents and Multi-Agent Systems.

[6] Bolch, G., de Meer, H., Greiner, S., and Trivedi, K. S., 1998, Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, John Wiley and Sons.

[7] Thadakamalla, H. P., Raghavan, U. N., Kumara, S. R. T., and Albert, R., 2004, “Survivability of Multi-Agent Systems - A Topological Perspective”, IEEE Intelligent Systems: Dependable Agent Systems, vol. 19, no. 5, pp. 24-31, Sep/Oct 2004.

[8] Hong, Y., and Kumara, S. R. T., 2004, “Coordinating Control Decisions of Software Agents for Adaptation to Dynamic Environments”, Working Paper, Marcus Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, PA.

[9] Brinn, M., and Greaves, M., 2003, “Leveraging Agent Properties to Assure Survivability of Distributed Multi-Agent Systems”, Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems.

[10] Cougaar Open Source Site. www.cougaar.org. DARPA.

[11] Whitt, W., 1983, “The Queueing Network Analyzer”, The Bell System Technical Journal, vol. 62, no. 9, pp. 2779-2815.

[12] Gnanasambandam, N., Lee, S., Gautam, N., Kumara, S. R. T., Peng, W., Manikonda, V., Brinn, M., and Greaves, M., 2004, “Reliable MAS Performance Prediction Using Queueing Models”, IEEE Multi-Agent Security and Survivability Symposium.
An Autonomous Performance Control Framework for Distributed Multi-Agent Systems: A Queueing Theory Based Approach

Nathan Gnanasambandam (gsnathan@psu.edu), Seokcheon Lee (stonesky@psu.edu), Soundar R.T. Kumara (skumara@psu.edu)
Pennsylvania State University, 310 Leonhard Building, University Park, PA 16802
ABSTRACT

Distributed Multi-Agent Systems (DMAS) such as supply chains functioning in highly dynamic environments need to achieve maximum overall utility during operation. The utility from maintaining performance is an important component of their survivability. This utility is often met by identifying trade-offs between quality of service and performance. To adaptively choose the operational settings for better utility, we propose an autonomous and scalable queueing-theory-based methodology to control the performance of a hierarchical network of distributed agents.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Design studies, modeling techniques, performance attributes

General Terms

Performance

Keywords

Multi-Agent Systems, Survivability, Queueing Models
1. INTRODUCTION

With the emerging popularity of distributed multi-agent systems as application platforms, it is necessary that they survive dynamic and stressful environmental conditions, even partial permanent damage. While the survival notion necessitates adaptivity to diverse conditions along the dimensions of performance, security and robustness, delivering the correct proportion of these quantities can be quite a challenge. From a performance standpoint, a survivable system can deliver excellent Quality of Service (QoS) even when stressed. A DMAS could be considered survivable if it can maintain
at least x% of system capabilities and y% of system performance in the face of z% of infrastructure loss and wartime loads (x, y, z are user-defined) [1].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AAMAS’05, July 25-29, 2005, Utrecht, Netherlands.
Copyright 2005 ACM 1-59593-094-9/05/0007 ...$5.00.
We address a piece of the survivability problem by building an autonomous performance control framework for the DMAS, drawing on the idea of composing the bigger society of smaller building blocks (i.e., agent communities) [3]. Identifying data-flows in the agent network (similar to [4]) and utilizing the network’s service-level attributes such as delays, utilization and response times as a basis for its utility (as in [5]), we build a self-optimizing framework for DMAS. We believe that by using queueing theory we can analyze data-flows within the agent community as a network of queues with greater granularity in terms of processing delays and network latencies, and also capitalize on a building-block approach by restricting the model to the community. We contribute by engineering a queueing-theory-based adaptation (control) framework to enhance the performance of the application layer, which inherently can be visualized as residing over the infrastructure (logical layer or middleware) and the physical layer (resources such as CPU and bandwidth).
2. FRAMEWORK ARCHITECTURE

Building on the ideas of high-level system specifications (or TechSpecs) and utilizing queueing network models (QNMs) for performance estimation as in [2], we build a real-time framework for application-level survivability. This framework is represented in Figure 1 and consists of activities, modules, knowledge repositories and information flow through a distributed collection of agents.
2.1 Architecture Overview
When the DMAS is stressed by an amount S by the underlying layers (due to under-allocation of resources) and by the environment (due to increased workloads during wartime conditions), the DMAS Controller has to examine all its performance-related variables from the set X and the current overall performance P in order to adapt. The variables that need to be maintained are specified in the TechSpecs and may include delays, time-stamps, utilization and their statistics. They are collected in a distributed fashion through the measurement points (MP), which are "soft" storage containers residing inside the agents and contain information on what, where and how frequently measurements should be taken. The DMAS Controller knows the set of flows F that traverse the network and the set of packet types T from the TechSpecs. With {F, T, X, C}, where C is a suggestion from the DMAS Controller, the Model Builder
can select a suitable queueing model template Q.

Figure 1: Architecture Overview. (Diagram omitted: it shows the TechSpecs, measurement points (MP), DMAS Controller, Model Builder, Queueing Model, Control Set Evaluator and Utility Calculator, together with the performance and utility databases (PDB, UDB) and the information flows {F, T, X, C, Q, O, P, U} among them, situated between the application/user layer and the physical/infrastructure layer, each of which exerts stresses Sa and Se.)

The Control Set
Evaluator knows the current operating mode (opmode) set O as well as the set of possible opmodes OS from the TechSpecs. To evaluate the performance due to a candidate opmode set O', the Control Set Evaluator uses the Queueing Model with a scaled set of operating conditions X'. Once the performance P' is estimated by the Queueing Model, it can be cached in the performance database PDB and then sent to the Utility Calculator. The Utility Calculator computes the domain utility due to (O', P') and caches it in the utility database UDB. Subsequently, the optimal opmode set O* is identified and sent to the DMAS Controller. The functional units of the architecture are distributed, but for each community that forms part of a MAS society, O* will be calculated by a single agent. We now examine the capabilities of the framework.
2.1.1 Self-Monitoring Capability
TechSpecs acts as a distributed structure that contains meta-data about all the variables X that have to be monitored in different portions of the community. The data and statistics collected in a distributed way are then aggregated to assist in choosing control alternatives by the top-level controller that each community possesses. Each agent can look up its own TechSpecs and from time to time forward a measurement to its superior. The superior can analyze this information (e.g., calculate statistics such as delay and delay-jitter) and/or add to this information and forward it again.
2.1.2 Self-Modeling Capability
One of the key features of this framework is its capability to choose a type of model for representing itself for the purpose of performance evaluation. The system is equipped with several queueing model templates that it can use to analyze the system configuration. The inputs to the Model Builder are the flows that traverse the network (F), the types of packets (T) and the current configuration of the network. Given that there are n agents interconnected in a hierarchical fashion, this unit represents the information in the required template format (Q), which is subsequently used to analyze the current performance.
2.1.3 Self-Evaluating Capability
The evaluation capability allows the MAS to examine its own performance under a given set of plausible conditions. This prediction of performance is used to eliminate control alternatives that may lead to instabilities. Given that a variety of tasks traverse the heterogeneous network of agents in predefined routes (called flows), the processing and wait times of tasks at various points in the network are not alike because of dissimilar configurations, resource availabilities and/or environmental stresses. Under these conditions, performance is evaluated in terms of end-to-end delays for the "sense-plan-respond" cycles.
2.1.4 Self-Controlling Capability
Since tasks can be processed at various pre-defined qualities, opmodes allow for trading off quality of service (task quality) against performance (end-to-end response time). The available resources fluctuate depending on stresses S = Se + Sa, where Se are the stresses from the environment (i.e., multiple contending applications) and Sa are the application stresses (i.e., increased tasks). Using the current measured performance P and the measured stress S, the DMAS Controller relates the overall utility U as U(P, S) = Σ_n w_n x_n, where x_n is the actual utility component and w_n is the associated weight specified by the user. To adjust P to get the best achievable utility under S, the following is done. Since P depends on O, which is a vector of opmodes collected from the community, we can use the QNM to find the O* and hence P* that maximize U(P, S) for a given S from within the set OS. In words, we find the vector of opmodes (O*) that maximizes domain utility at the current S. The utility computation is performed in the Utility Calculator module using a learned utility model based on UDB.
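The opmode search described above can be sketched as a brute-force loop: enumerate candidate opmode vectors O' from OS, ask a stand-in for the queueing model to predict performance under stress S, score the result with U(P, S) = Σ_n w_n x_n, and keep the maximizer O*. This is a minimal illustration under our own naming, not the ULTRALOG implementation; the performance model is whatever callable the caller supplies.

```python
from itertools import product

def overall_utility(components, weights):
    """U = sum of weighted utility components x_n (Section 2.1.4)."""
    return sum(w * x for w, x in zip(weights, components))

def find_best_opmodes(opmode_choices, predict_performance, utility_of, stress):
    """Exhaustively evaluate candidate opmode vectors O' drawn from OS.

    opmode_choices:       list of per-agent opmode option lists (the set OS).
    predict_performance:  stand-in for the queueing model, (O', S) -> P'.
    utility_of:           maps predicted performance P' to a utility value.
    Returns the opmode vector O* with maximal utility and that utility.
    """
    best_o, best_u = None, float("-inf")
    for candidate in product(*opmode_choices):
        p = predict_performance(candidate, stress)  # P' for this O'
        u = utility_of(p)
        if u > best_u:
            best_o, best_u = candidate, u
    return best_o, best_u
```

In the framework this evaluation is cached (PDB, UDB) and performed by a single agent per community; the sketch omits both concerns.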
3. CONCLUSIONS
We combined queueing analysis and application-level control to engineer a generic framework that is capable of self-optimizing its domain-specific utility to assure application-level survivability. While application-level adaptivity yields improvements in utility, further gains are possible by leveraging the underlying layers.
4. ADDITIONAL AUTHORS
Natarajan Gautam (Pennsylvania State University, email: ngautam@psu.edu), Wilbur Peng and Vikram Manikonda (IAI Inc., email: wpeng,vikram@i-a-i.com), Marshall Brinn (BBN Technologies, email: mbrinn@bbn.com) and Mark Greaves (DARPA IXO, email: mgreaves@darpa.mil).
5. REFERENCES
[1] M. Brinn and M. Greaves. Leveraging agent properties to assure survivability of distributed multi-agent systems. Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems, 2003.
[2] N. Gnanasambandam, S. Lee, N. Gautam, S. R. T. Kumara, W. Peng, V. Manikonda, M. Brinn, and M. Greaves. Reliable MAS performance prediction using queueing models. IEEE Multi-Agent Security and Survivability Symposium, 2004.
[3] H. Jung and M. Tambe. Performance models for large scale multi-agent systems: Using distributed POMDP building blocks. Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems, July 2003.
[4] O. F. Rana and K. Stout. What is scalability in multi-agent systems? Proceedings of the Fourth International Conference on Autonomous Agents, 2000.
[5] G. Tesauro, D. M. Chess, W. E. Walsh, R. Das, I. Whalley, J. O. Kephart, and S. R. White. A multi-agent systems approach to autonomic computing. Autonomous Agents and Multi-Agent Systems, 2004.
Manuscript for IEEE Transactions on Automatic Control

ADAPTIVE CONTROL FOR LARGE-SCALE INFORMATION NETWORKS THROUGH ALTERNATIVE ALGORITHMS TO SUPPORT SURVIVABILITY*

Seokcheon Lee† and Soundar Kumara‡
†‡ Department of Industrial & Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802
† Phone: 814-863-4799; Fax: 814-863-4745; E-mail: stonesky@psu.edu
‡ Corresponding author. Phone: 814-863-2359; Fax: 814-863-4745; E-mail: skumara@psu.edu
ABSTRACT
As modern networks can be easily exposed to various adverse events such as malicious attacks and accidental failures, there is a need to study their survivability. We study a large-scale information network composed of distributed software components linked together through a task flow structure. The service provided by the network is to produce a global solution to a given problem, which is an aggregate of the partial solutions of individual tasks. Quality of service of the network is determined by the value of the global solution and the time taken to generate it. In this paper we design an adaptive control mechanism, along the lines of model predictive control, to support the survivability of such networks by utilizing alternative algorithms. To address adaptivity, we model the stress environment by quantifying resource availability through sensors. We build a mathematical programming model, with resource availability incorporated, which predicts quality of service as a function of the alternative algorithms. The programming model is decentralized through an auction market without any degradation of solution optimality. By periodically opening the auction market, the system can achieve desirable performance adaptive to changing stress environments while assuring the scalability property. We verify the designed control mechanism empirically.

Key Words: Adaptive control, survivability, alternative algorithms, scalability

* This work is supported in part by DARPA under Grant MDA 972-01-1-0038.
1. Introduction
Critical infrastructures are becoming increasingly dependent on networked systems in many domains, for automation or organizational integration. Though such infrastructure can improve efficiency and effectiveness, these systems can be easily exposed to various adverse events such as malicious attacks and accidental failures [1]. Two metrics, namely survivability and scalability, can be used to determine the efficiency and effectiveness of these systems. Survivability is defined as "the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents" [2]. One promising way to achieve survivability is through adaptivity: changing the system behavior to achieve the system goal in response to the changing environment [3]. But unpredictable adaptation can sometimes result in worse performance than no adaptation at all [4]. Scalability is defined as "the ability of a solution to some problem to work when the size of the problem increases" (from the Dictionary of Computing at http://wombat.doc.ic.ac.uk). As the size of networked systems grows, scalability becomes a critical issue when developing practical software systems [5].

As software systems grow larger and more complex, component technology has become one of the important research topics in the computing community [6][7]. A component is a reusable program element, with which developers can build the systems they need simply by defining the components' specific roles and wiring them together. In networks with a component-based architecture, each component is highly specialized for specific tasks. Another emerging technology is adaptive software [8][9]. Adaptive software has alternative algorithms for the same numerical problem and a switching function for selecting the best algorithm in response to environmental changes. As modern operating environments are highly dynamic, adaptive software becomes an important tool for achieving portable high performance.
We study a large-scale information network, which is composed of distributed software components linked together through a task flow structure. A problem given to the network is decomposed in terms of root tasks for some components, and those tasks are propagated through the task flow structure to other components. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. The service provided by the network is to produce a global solution to the given problem, which is an aggregate of the partial solutions of individual tasks. Each component can have alternative algorithms to process a task, which trade off processing time and the value of the partial solution. Quality of Service (QoS) of the network is determined by the value of the global solution and the time for generating it (i.e., the completion time). Survivability of the network is the capability to provide high QoS in the presence of adverse events such as malicious attacks and accidental failures. In this paper we design an adaptive control mechanism to support the survivability of such networks by utilizing alternative algorithms.

The organization of this paper is as follows. In Section 2 we discuss the problem domain and in Section 3 formally define the problem in detail. We design the control mechanism in Sections 4 through 7 and show empirical results in Section 8. Finally, we discuss implications and possible extensions of our work in Section 9.
2. Problem domain
The networks we study represent distributed, component-based architectures for providing a solution to a given problem. A problem is decomposed in terms of root tasks and solved by distributed components through a task flow structure. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. When the size of a problem becomes large, the size of the network as well as the number of tasks for each component can be large. One can imagine a wide range of scientific and engineering problems that can be solved with such architectures.

Cougaar (Cognitive Agent Architecture: http://www.cougaar.org), developed by DARPA (Defense Advanced Research Projects Agency), is such an architecture for building large-scale multi-agent systems. Recently, there have been efforts to combine the technologies of agents and components to improve the building of large-scale software systems [10]-[12]. While component technology focuses on reusability, agent technology focuses on processing complex tasks as a community. Cougaar is in line with this trend. In Cougaar a software system comprises agents, and an agent comprises components (called plugins). The task flow structure in those systems is thus one of components, a combination of intra-agent and inter-agent task flows. As the agents in Cougaar can be distributed in both the geographical and the information-content sense, the networks implemented in Cougaar have a distributed, component-based architecture.
UltraLog (http://www.ultralog.net) networks are military supply chain planning systems implemented in Cougaar [13]-[17]. Each agent in these networks represents an organization of the military supply chain and has a set of components specialized for each functionality (allocation, expansion, inventory management, etc.) and class of supply (ammunition, water, fuel, etc.). The objective of an UltraLog network is to provide an appropriate logistics plan for a given military operational plan. A logistics plan is a global solution which is an aggregate of the individual schedules built by components. An operational plan is decomposed into logistics requirements of each thread for each agent, and a requirement is further decomposed into root tasks (one task per day) for a designated component. As a result, a component can have hundreds of root tasks, depending on the horizon of an operation, and thousands of tasks to process as the root tasks are propagated. As the scale of operation increases there can be thousands of agents (tens of thousands of components) working together to generate a logistics plan. The system performs initial planning and continuous replanning to cope with logistics plan deviations or operational plan changes. Initial planning and replanning are both instances of the current research problem.

QoS of these networks is determined by the quality of the logistics plan (value of solution) and the (plan) completion time. These two metrics directly affect the performance of an operation. As the networks operate in a military environment, they are especially vulnerable to malicious attacks and accidental failures. Now, the question is: how can we make such a system survivable, generating high-quality logistics plans in a timely manner in the presence of such adverse events?
3. Problem specification
In this section we formally define the problem by detailing the network configuration, control action, and stress environment. We focus on computational CPU resources, assuming that the system is computation-bounded.

3.1 Network configuration
A network is composed of a set of components A, and each component resides in its own machine.¹ The task flow structure of the network, which defines the precedence relationships between components, is an arbitrary directed acyclic graph. A problem given to the network is decomposed in terms of root tasks for some components, and those tasks are propagated through the task flow structure. Each component processes one of the tasks in its queue (which holds root tasks as well as tasks from predecessor components) and then sends it to its successor components. We denote the number of root tasks of component i as rt_i. Fig. 1 shows an example network composed of four components. Each of A1 and A2 has 100 root tasks. A3 and A4 have no root tasks, but they have 200 and 100 tasks respectively from the corresponding predecessors.

¹ For simplicity we consider the cases where there is one component per machine. Though the designed control mechanism is also applicable to resource-sharing environments, resource allocation may then need to be considered in addition, as will be discussed in Section 9.
Fig. 1. An example network. (Diagram: four components; A1 and A2 each have 100 root tasks, while A3 and A4 have none.)
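The propagated task counts in this example follow from adding each component's own root tasks to the counts inherited from its immediate predecessors (equation (6) of Section 5). A minimal sketch with hypothetical helper names; the topology below (A3 fed by A1 and A2, A4 fed by A2) is one reading of the figure that reproduces the 200 and 100 tasks quoted above.

```python
def initial_task_counts(root_tasks, predecessors):
    """L_i(0) = rt_i + sum of L_a(0) over immediate predecessors a of i.

    root_tasks:   dict component -> number of root tasks rt_i.
    predecessors: dict component -> list of immediate predecessors.
    """
    counts = {}
    def count(i):
        if i not in counts:
            counts[i] = root_tasks.get(i, 0) + sum(
                count(a) for a in predecessors.get(i, ())
            )
        return counts[i]
    for i in set(root_tasks) | set(predecessors):
        count(i)  # the DAG guarantees this recursion terminates
    return counts
```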
3.2 Control action
A component can use one of several alternative algorithms to process a task. Different alternatives trade off CPU time and solution value, with more CPU time resulting in a higher solution value. As one can optimally mix alternatives, a component has a monotonically increasing convex function, the value function, giving CPU time as a function of value. We call the value in this function the value mode, which a component can select as its decision variable. A value function is defined by three elements, ⟨f_i(v_i), v_i(min), v_i(max)⟩, as shown in Fig. 1. This function indicates that component i's expected CPU time² to process a task is f_i(v_i) with a value mode v_i, where v_i(min) ≤ v_i ≤ v_i(max). We assume that components cannot change the mode for a task in process.
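The triple ⟨f_i(v_i), v_i(min), v_i(max)⟩ can be captured in a small container. The quadratic form of f_i below is purely illustrative — the paper only requires f_i to be monotonically increasing and convex — and the class name is ours.

```python
class ValueFunction:
    """A value function <f_i(v_i), v_i(min), v_i(max)>: expected CPU time
    as a convex, monotonically increasing function of the chosen value
    mode, valid only on [v_min, v_max]. The quadratic shape is a stand-in.
    """
    def __init__(self, coeff, v_min, v_max):
        self.coeff, self.v_min, self.v_max = coeff, v_min, v_max

    def cpu_time(self, v):
        """f_i(v): expected CPU time to process one task at value mode v."""
        if not (self.v_min <= v <= self.v_max):
            raise ValueError("value mode outside [v_min, v_max]")
        return self.coeff * v * v  # increasing and convex for v >= 0
```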
3.3 Stress environment
Survivability stresses, such as malicious attacks and accidental failures, affect the system by directly consuming resources or by indirectly invoking defense mechanisms as remedies. For example, a "Denial of Service" attack consumes resources directly, while the relevant defense mechanism also consumes resources for resistance, recognition, and recovery [1]. We consider both survivability stresses and remedies as the stress environment from the viewpoint of components. The space of stress environments is high-dimensional and also evolving [18][19]. But, as we concentrate on CPU resources, a stress environment can be considered as a set of threads residing in the machines of the network and sharing resources with components. These threads, the stressors, may have admission to access resources or may be stealing resources without admission.

² The distribution of CPU time can be arbitrary, though we use only the expected CPU time.
3.4 Problem definition
The service provided by the network is to produce a global solution to a given problem, which is an aggregate of the partial solutions of individual tasks. QoS of the network is determined by the value of the global solution and the cost of completion time. The value of the global solution is the summation of the partial solution values, and the cost of completion time is determined by a cost function CCT(T), which is a monotonically increasing function of the completion time T. We assume that the solution values and cost are represented in a common unit.³ Consider v_i^d as the value mode used to process the d-th task by component i, and e_i the number of tasks processed by component i up to completion. Then, the control objective is to maximize QoS by utilizing the alternative algorithms (v) as in (2). As stated earlier, we design an adaptive control mechanism to achieve this objective, supporting the survivability of large-scale information networks.

arg max_v  Σ_{i∈A} Σ_{d=1}^{e_i} v_i^d − CCT(T)   (2)

³ Relative importance can be considered by scaling the functions, and it results in the same function structures.
4. Overall control procedure
There are two representative optimal control approaches for dynamic systems: Dynamic Programming (DP) and Model Predictive Control (MPC). Though DP gives an optimal closed-loop policy, it is inefficient for large-scale systems, especially when systems work over a finite time horizon [20]-[22]. In MPC, for each current state, an optimal open-loop control policy is designed over a finite time horizon by solving a static mathematical programming model [23]-[26]. The design process is repeated for the next observed state feedback, forming a closed-loop policy reactive to each current system state. Though MPC does not give an absolutely optimal policy in stochastic environments, the periodic design process alleviates the impact of stochasticity, and it is easy to adapt to new contexts by explicitly handling the objective function or constraints.

Considering the characteristics of the current problem, we choose the MPC framework. Our networks are large-scale, work over a finite time horizon, and need to adapt to an unpredictable stress environment. Therefore, under the MPC framework, we develop an adaptive control mechanism as depicted in Fig. 2. First, to address adaptivity we model the stress environment by quantifying resource availability through sensors. Second, we build a mathematical programming model, with the resource availability incorporated, which predicts QoS as a function of the alternative algorithms. Third, we provide an auction market as a decentralized coordination mechanism for solving the programming model. By periodically opening the auction market, the system can achieve desirable performance adaptive to the changing stress environment while assuring the scalability property. We define sensors and build a mathematical programming model in Section 5, and refine it based on stability analysis in Section 6. The refined programming model is decentralized in Section 7.
Fig. 2. Overall control procedure. (Diagram: stress-environment modeling through per-component sensor design, feeding a mathematical programming model that is solved by periodic auctioning as decentralized coordination.)
5. Mathematical programming model
In this section we define sensors and build a mathematical programming model under the MPC framework.

5.1 Sensors
Each component monitors its operating environment through a sensor. The sensor measures the resource availability MRA_i(t), defined as the available fraction of a resource when component i requests that resource during the last control period, at control point t. Two quantities are needed to extract this measurement: request time and execution time. Request time is the duration for which the component requests the resource, or equivalently during which its queue length (including a task in service) is more than zero. Execution time is the duration for which the component actually utilizes the resource. If the control period is SW, the resource sensor calculates MRA_i(t) as:

MRA_i(t) = (execution time in (t − SW, t)) / (request time in (t − SW, t)).   (3)
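Equation (3) can be computed directly from logged (start, end) intervals of request and execution activity, clipped to the sliding window (t − SW, t). A sketch with hypothetical names; returning full availability when the component made no requests in the window is our own convention, not specified in the text.

```python
def measured_resource_availability(request_intervals, execution_intervals, t, sw):
    """MRA_i(t): execution time over request time in the window (t - SW, t),
    per equation (3). Intervals are (start, end) pairs of timestamps.
    """
    def time_in_window(intervals):
        lo = t - sw
        # Sum each interval's overlap with the window, ignoring empty overlaps.
        return sum(max(0.0, min(end, t) - max(start, lo)) for start, end in intervals)

    requested = time_in_window(request_intervals)
    if requested <= 0.0:
        return 1.0  # assumption: an idle component sees the resource as fully available
    return time_in_window(execution_intervals) / requested
```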
5.2 Mathematical programming model
A component can estimate its future resource availability by using the resource availability observed in the past. By incorporating this estimate, the component can derive its service time per task as a function of the value mode:

f_i(v_i) / MRA_i(t).   (4)

Now, consider the current time as t and estimate the completion time T by assuming that each component uses a mode common to all its tasks (i.e., a pure strategy). We will discuss the optimality of the pure strategy later in this subsection.

In a task flow structure where each component processes only one task after its predecessors complete their tasks, the completion time would be the length of the longest path (i.e., the critical path), as widely studied in the project management literature. However, as the number of tasks increases, the bottleneck component, the one with maximal total service time, comes to dominate the completion time. As each component in our networks can have a large number of tasks to process rather than just one, the completion time T can be estimated as:

T − t ≈ Max_{i∈A} [R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t),   (5)

in which R_i(t) denotes the remaining CPU time for a task in process and L_i(t) the number of remaining tasks excluding a task in process. After identifying the initial number of tasks L_i(0) as in (6), where P(i) denotes the immediate predecessors of component i, each component updates it by counting down as it processes tasks.

L_i(0) = rt_i + Σ_{a∈P(i)} L_a(0)   (6)
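The bottleneck estimate (5) then reduces to a maximum over per-component backlogs. A minimal sketch with names of our choosing, where each component is summarized by the tuple (R_i(t), L_i(t), f_i(v_i), MRA_i(t)):

```python
def estimated_completion(t, components):
    """Estimate T per equation (5):
    T - t ~= max over i of [R_i(t) + L_i(t) * f_i(v_i)] / MRA_i(t).

    components: iterable of tuples (remaining_cpu R, remaining_tasks L,
    per-task service time f, measured availability MRA).
    """
    backlog = max((r + l * f) / mra for r, l, f, mra in components)
    return t + backlog
```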
So, given a completion time T, it is optimal for each component i to select a mode by the following:

Max_{v_i}  L_i(t) v_i   (7)

subject to

[R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t) ≤ T − t.   (8)

Consequently, the programming model can be formulated in a straightforward way as in (9), named the naïve decision model. The model maximizes QoS by trading off the value of the solution and the cost of completion time.

Naïve decision model:

Max   Σ_{i∈A} L_i(t) v_i − CCT(T)
s.t.  [R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t) ≤ T − t   for all i ∈ A
      v_i(min) ≤ v_i ≤ v_i(max)                        for all i ∈ A
(9)

The naïve decision model maximizes QoS as if all the tasks of each component were available in its queue at the current time t. That is, a network under the naïve decision model can achieve a performance close to the optimal performance of an ideal network with maximal task availability when the L_i(t) are large. As no mixed strategy (i.e., using different modes for processing tasks) can perform better in the ideal network, due to the convexity of the value functions, it is optimal for each component to use a pure strategy. In the next section we refine the model so that it remains applicable even when the L_i(t) are small.
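Because the value functions are increasing in v_i, constraint (8) binds at the optimum for any fixed T, so model (9) can be approximated by a one-dimensional search over candidate completion times. The grid search below is our simplification (the paper solves the model exactly and later decentralizes it through an auction); the names, the dictionary layout, and the use of an inverse value function f_inv are all assumptions of this sketch.

```python
def solve_naive_model(t, comps, cct, t_grid):
    """Grid-search sketch of the naive decision model (9).

    For each candidate T, each component takes the largest feasible value
    mode (constraint (8) binding), and QoS = sum_i L_i * v_i - CCT(T) is
    compared across the grid. comps: dicts with keys R, L, MRA, f_inv
    (inverse of f_i), v_min, v_max. Returns (best_T, best_QoS) or None.
    """
    best = None
    for T in t_grid:
        qos = -cct(T)
        feasible = True
        for c in comps:
            if c["L"] == 0:
                continue
            budget = c["MRA"] * (T - t) - c["R"]  # CPU time left for L tasks
            v = c["f_inv"](budget / c["L"])       # largest mode fitting (8)
            if v < c["v_min"]:
                feasible = False                  # even v_min misses T
                break
            qos += c["L"] * min(v, c["v_max"])
        if feasible and (best is None or qos > best[1]):
            best = (T, qos)
    return best
```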
6. Model refinement
In this section we analyze system behavior under the naïve decision model and refine the model to eliminate undesirable behavioral properties.

6.1 Analysis of system behavior
To analyze system behavior under the naïve decision model, we conducted experiments using discrete-event simulation. There are five components in the system, linked serially as in Fig. 3. Component A1, in the lowest position, is assigned 100 root tasks. The components have a common deterministic value function, and the cost of completion time is a linearly increasing function, as indicated in the figure. There is no stress in the system, and components measure MRA_i(t) equal to 1 all the time. The system makes decisions every 100 time units (i.e., SW = 100) by solving the naïve decision model.
Fig. 3. An example network for stability analysis. (Diagram: five serially linked components A1 through A5; A1 has 100 root tasks; CCT(T) = 10T.)
Figs. 4 and 5 show the resultant behavior of the system, in which the decisions T* and v_i* are divergent. The divergent behavior indicates that there is an inefficiency in the naïve decision model, and system performance can be improved if we eliminate it. The divergence is due to the inaccurate prediction of the naïve decision model. In the example network the system (or A5) can complete at T* only when A4 completes before T*. As each component tries to complete at T* without considering its position in the task flow structure, the components cannot receive tasks in time from their predecessors. This inaccuracy leads to changed decisions at subsequent decision points, resulting in the divergent behavior.
[Figure omitted: optimal T plotted against time]
Fig. 4. Behavior of T* under the naïve decision model
[Figure omitted: value modes of components A_1 to A_5 plotted against time]
Fig. 5. Behavior of v_i* under the naïve decision model
6.2 Model refinement

To stabilize the system behavior we reinforce the naïve decision model by taking into account the components' positions in the task flow structure. For this purpose, we define the depth
D_i(t) as a quantitative representation of a component's position. D_i(t) is the required time gap between the system's and the component's completion times at time t. Each component must complete at or before T − D_i(t) to keep the system completion time T. Components without successors have depth 0, while components with successors have positive depths. A component a can keep its depth if each of its predecessors' depths is D_a(t) plus a's total service time for the last-arriving tasks in the worst case. Hence, component i's depth, which keeps the depths of all its successors, is the maximum of the depths required by its successors:
D_i(t) = max_{a∈S_i} [ D_a(t) + Σ_{b∈P_a} f_a(v_a) / MRA_a(t) ] ,    (10)

in which S_i denotes the set of successors of component i and P_a the set of predecessors of component a.
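The recursion in (10) can be sketched as backward propagation over the task-flow graph. The code below is an illustrative sketch, not the authors' implementation: the serial chain mirrors Fig. 3, but the component names, service times, and the assumption of one worst-case task per predecessor are hypothetical.

```python
# Illustrative sketch of the depth recursion (10). Assumption: the worst-case
# service term at successor a counts one last-arriving task per predecessor
# of a, each costing f_a(v_a) / MRA_a(t).
def compute_depths(successors, predecessors, service_time, mra):
    """D_i = max over successors a of [D_a + |P_a| * f_a(v_a) / MRA_a]."""
    memo = {}

    def depth(i):
        if i not in memo:
            succ = successors.get(i, [])
            if not succ:
                memo[i] = 0.0  # components without successors have depth 0
            else:
                memo[i] = max(
                    depth(a)
                    + len(predecessors[a]) * service_time[a] / mra[a]
                    for a in succ
                )
        return memo[i]

    for node in successors:
        depth(node)
    return memo
```

On a serial chain A_1 to A_5 with a per-task service time of 30 and MRA = 1 everywhere, the depths grow by 30 per level upstream: 0 for A_5, 30 for A_4, and so on up to 120 for A_1.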
Though it is possible to refine the naïve decision model by incorporating the depths as variables, the model complexity would increase because each component's constraint would be intertwined with the decision variables of all connected components. So we simply estimate components' depths from the decisions made at the last control point. At each control point each successor informs its predecessors of the required depths, and each predecessor takes the maximum as its depth. As a result, the depth can be treated as a constant rather than a variable, so the refined model adds no complexity. We call the refined model in (11) the stable decision model. If the depth is ignored, i.e., D_i(t) = 0, the model reduces to the naïve decision model. Moreover, when D_i(t) is treated as a variable, the stable decision model becomes, as a special case, an exact CPM/PERT formulation of the kind found in the project-management literature.
Stable decision model

Max  Σ_{i∈A} L_i(t) v_i − CCT(T)
s.t. [R_i(t) + L_i(t) f_i(v_i)] / MRA_i(t) ≤ T − t − D_i(t)   for all i ∈ A
     v_i(min) ≤ v_i ≤ v_i(max)                                for all i ∈ A    (11)
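As an illustration, the constraints of the stable decision model (11) can be checked for a candidate decision (T, v_i) as below. The component-dictionary layout and the identity value function used in the example are assumptions of this sketch, not part of the paper's formulation.

```python
# Hypothetical feasibility check for model (11): a candidate pair (T, v_i)
# is feasible iff every component can finish its load by T - t - D_i(t)
# and every value mode lies within its bounds.
def feasible(T, t, components):
    """Return True iff (T, v_i) satisfies the constraints of model (11)."""
    for c in components:
        # completion-time constraint: [R + L f(v)] / MRA <= T - t - D
        if (c["R"] + c["L"] * c["f"](c["v"])) / c["mra"] > T - t - c["D"]:
            return False
        # value-mode bounds: v_min <= v <= v_max
        if not c["v_min"] <= c["v"] <= c["v_max"]:
            return False
    return True
```

For a component with L = 100 tasks, f(v) = v, and v = 5, the required time is 500, so T = 600 is feasible at t = 0 while T = 400 is not.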
6.3 System behavior under stable decision model

To observe system behavior under the stable decision model, we experimented with the example network described in Fig. 3. Figs. 6 and 7 show the resultant behavior of the system, in which the decisions T* and v_i* are stable. This stability indicates that the inefficiency of the naïve decision model has been removed through improved prediction accuracy.
[Figure omitted: optimal T plotted against time]
Fig. 6. Behavior of T* under the stable decision model
[Figure omitted: value modes of components A_1 to A_5 plotted against time]
Fig. 7. Behavior of v_i* under the stable decision model
The effects of stability on performance are shown in Table 1. QoS improves significantly under the stable decision model: the improved prediction accuracy makes the system behave stably and consequently perform better.
Table 1. The effects of stability on performance

Decision model    T      V      QoS
Naïve             1171   15289  3583
Stable            1082   15104  4282

T: Completion time, V: Value of solution
7. Decentralization

The next question is how to decentralize the programming model. Centralized control mechanisms scale poorly because computational and communication overheads grow rapidly with system size. A single point of failure can bring down the complete system, making the network non-robust. Decentralization addresses these issues by distributing the computations and communications across multiple entities. Beyond these properties,
decentralization yields a byproduct: information security. As discussed earlier, our effort supports survivability, and if information is revealed directly to others, information security is compromised. In this section we decentralize the programming model through an auction market.
7.1 Two-tier auction market

There are two popular methods for decentralizing structured programming models: decomposition methods and auction/bidding algorithms. Given the compatible structure of our programming model, we decentralize it through a non-iterative auction mechanism, the so-called multiple-unit auction with variable supply [27]. In this auction a seller may be able and willing to adjust the supply as a function of the bidding. In the programming model we have built, all components are coupled with one another; however, the objective function and constraints become separable once the single variable T is fixed. This characteristic makes it possible to solve the model through an auction process for T. The completion time T is an unbounded resource whose supply can be adjusted as a function of the bidding. To design the auction market we define a seller that determines T* based on the bids from the components. We call this auction market the two-tier auctioning model.

We define T_i as the available resource of component i, which is required minimally in the amount of T_i(min) as in (12) and maximally T_i(max) as in (13).
T_i(min) = [R_i(t) + L_i(t) f_i(v_i(min))] / MRA_i(t)    (12)
T_i(max) = [R_i(t) + L_i(t) f_i(v_i(max))] / MRA_i(t)    (13)
Each component bids to the seller its maximal value as a function of T, as in (14). The seller decides T* based on the bids, taking CCT(T) into account, as in (15). After the seller broadcasts T*, each component selects its optimal value mode within the limit T*, as in (16). Though this auctioning
process gives a solution equivalent to that of the centralized programming model, it offers further benefits, as communications and computations are distributed over multiple market participants.
Two-tier auctioning model

Component's bid:
b_i(T) = −∞                                                           if T − t − D_i(t) < T_i(min)
       = L_i(t) v_i(max)                                              if T − t − D_i(t) > T_i(max)
       = L_i(t) f_i^{−1}( [(T − t − D_i(t)) MRA_i(t) − R_i(t)] / L_i(t) )   else    (14)

Seller's decision:
Max_T  Σ_{i∈A} b_i(T) − CCT(T)    (15)

Component's decision:
v_i* = v_i(max)                                                       if T* − t − D_i(t) > T_i(max)
     = f_i^{−1}( [(T* − t − D_i(t)) MRA_i(t) − R_i(t)] / L_i(t) )     else    (16)
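The auction in (14) to (16) can be sketched as follows. This is a hypothetical illustration that assumes a linear service-time function f_i(v) = v (so f_i^{-1} is the identity) and a discrete grid of candidate completion times; the parameter values are not taken from the paper.

```python
# Illustrative sketch of the two-tier auction (Section 7.1). Components bid
# their maximal attainable value b_i(T); the seller picks T* maximizing
# total bids minus the completion cost CCT(T). Assumes f_i(v) = v.
NEG_INF = float("-inf")

def make_bidder(L, R, D, mra, v_min, v_max, t=0.0):
    """Return the bid function b_i(T) of eq. (14) for one component."""
    T_min = (R + L * v_min) / mra   # minimal required resource, eq. (12)
    T_max = (R + L * v_max) / mra   # maximal useful resource, eq. (13)

    def bid(T):
        avail = T - t - D           # resource available to this component
        if avail < T_min:
            return NEG_INF          # infeasible: cannot finish by T
        if avail > T_max:
            return L * v_max        # saturated at the highest value mode
        v = (avail * mra - R) / L   # f_i^{-1} of the per-task time budget
        return L * v

    return bid

def run_auction(bidders, cct, candidates):
    """Seller's decision (15): maximize the sum of bids minus CCT(T)."""
    best_T, best_obj = None, NEG_INF
    for T in candidates:
        obj = sum(b(T) for b in bidders) - cct(T)
        if obj > best_obj:
            best_T, best_obj = T, obj
    return best_T, best_obj
```

With a single bidder holding 100 tasks and CCT(T) = 0.1T, the bid grows linearly until it saturates at the highest value mode, and the seller settles on the saturation point.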
7.2 Multi-tier auction market

Though the designed auction process is decentralized, it still incorporates a centralized seller that must coordinate all the components. Since this centralized auction can still suffer from scalability and robustness problems, we introduce a multi-tier auction market. Suppose there are two component groups a and b with a ⊂ b, and denote by S_a the set of optimal completion-time solutions of group a and by S_b that of group b. Then the maximum of S_b is greater than or equal to the maximum of S_a, as in (17).
max S_a ≤ max S_b   if a ⊂ b    (17)
Proof. Suppose not; that is, T_a = max S_a > T_b = max S_b. Then, for group b,
Σ_{i∈b} b_i(T_b) − CCT(T_b) > Σ_{i∈b} b_i(T_a) − CCT(T_a)
≡ Σ_{i∈a} b_i(T_b) + Σ_{i∉a} b_i(T_b) − CCT(T_b) > Σ_{i∈a} b_i(T_a) + Σ_{i∉a} b_i(T_a) − CCT(T_a)
≡ [Σ_{i∈a} b_i(T_b) − CCT(T_b)] − [Σ_{i∈a} b_i(T_a) − CCT(T_a)] > Σ_{i∉a} b_i(T_a) − Σ_{i∉a} b_i(T_b) .    (18)

And, for group a, since T_a is optimal,

Σ_{i∈a} b_i(T_b) − CCT(T_b) ≤ Σ_{i∈a} b_i(T_a) − CCT(T_a) .    (19)

So, the inequality in (20) should hold:

Σ_{i∉a} b_i(T_a) < Σ_{i∉a} b_i(T_b) .    (20)

But this inequality is impossible because b_i(T) is an increasing function of T. ∎
Through this property the two-tier auctioning model can be transformed into a multi-tier model, in which multiple brokers arbitrate between the components and the seller. A broker bids to its superior broker, or to the seller, for T ≥ T_s(m) as in (21), in which T_s(m) denotes the maximum of the optimal completion-time solutions of group s(m), and s(m) denotes the subordinate components and brokers of broker m. In this way the search space shrinks as the bidding process moves up the hierarchy. In this multi-tier auctioning model communications and computations are further distributed through the brokers, overcoming the problems of the two-tier model.
Multi-tier auctioning model

Broker's bid:
b_m(T) = −∞                      if T < max{ argmax_T Σ_{a∈s(m)} b_a(T) − CCT(T) }
       = Σ_{a∈s(m)} b_a(T)       else    (21)
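A broker's bid (21) can be sketched as below: the broker first finds the largest maximizer of its group's local objective, then bids negative infinity for any T below it, pruning the search space that is passed upward. The discrete candidate grid and the example bid function are illustrative assumptions.

```python
# Hypothetical sketch of a broker's bid, eq. (21): aggregate subordinate
# bids, and refuse (bid -inf for) any T below the group's own optimal
# completion time T_s(m).
def make_broker(sub_bids, cct, candidates):
    """Return b_m(T) for a broker over subordinate bid functions."""
    best_T, best_obj = None, float("-inf")
    for T in candidates:
        obj = sum(b(T) for b in sub_bids) - cct(T)
        if obj >= best_obj:          # ties resolved toward the larger T
            best_T, best_obj = T, obj

    def bid(T):
        if T < best_T:
            return float("-inf")     # prune: T below the group's optimum
        return sum(b(T) for b in sub_bids)

    return bid
```

A superior broker or the seller then treats `bid` exactly like a component's bid function, so the same seller decision (15) applies at every tier.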
8. Empirical results

We ran several experiments using discrete-event simulation to validate the designed control mechanism. Though we use a small network for these validation experiments, the decentralized model in particular can handle much larger networks.
8.1 Experimental design

The experimental network comprises fifteen components with a tree structure, as shown in Fig. 8. Each component in the lowest position has 200 root tasks. All components share a common linear value function, and the cost of completion time is a linearly increasing function, as indicated in the figure.
[Figure omitted: a tree of fifteen components A_1 to A_15, with 200 root tasks at each of the eight leaf components and CCT(T) = 4T]
Fig. 8. Experimental network configuration
We set up four experimental conditions, as shown in Table 2. There can be stressors that share resources with components. We assign weight w_i to component i and w_i′ to a stressor sharing a resource with component i. A stressor, which has infinite work (it requires the resource continuously), can impose different levels of stress on the component directly through w_i′: when w_i′ is zero there is no stress, and the stress level increases with w_i′. We implement the stress environment using weighted round-robin scheduling, in which the CPU time received
by each thread in a round equals its assigned weight. The distribution of CPU time can be deterministic or stochastic. When using a stochastic value function we repeat each experiment 5 times.
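The weighted round-robin stress environment can be sketched as follows. The helper names and the interpretation of MRA_i(t) as the component's long-run share of CPU time are assumptions of this illustration, not definitions from the paper.

```python
# Hypothetical sketch of the weighted round-robin stress environment of
# Section 8.1: in each round every thread receives CPU time equal to its
# weight, so a stressor thread with weight w' dilutes component i's share.
def wrr_shares(weights):
    """Long-run fraction of CPU each thread receives per round."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def effective_mra(w_i, w_stress):
    """Component i's resource availability under a stressor of weight w'."""
    return w_i / (w_i + w_stress)
```

For example, with w_A4 = 0.1 and a stressor weight of 1 (the Con3/Con4 setting), component A_4 retains only about 9% of its CPU while the stressor is active.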
Table 2. Experimental conditions

Condition   Stress       f_i(v_i)
Con1        Unstressed   Deterministic
Con2        Unstressed   Exponential
Con3        Stressed     Deterministic
Con4        Stressed     Exponential

w_i = 0.1 for all i ∈ A; w_A4′ = 1 during 500 ≤ t ≤ 1000 for Con3 and Con4
Initial value mode: (2, 5, 5, 3 for A_4 to A_15)
We use four control policies for each experimental condition, as shown in Table 3. FL and FH use fixed value modes over time. The AC-X policies represent the adaptive control mechanism we have designed: under AC-N the system is controlled by the naïve decision model, and under AC-S by the stable decision model. When using the adaptive policies the system makes decisions every 100 time units (i.e., SW = 100).
Table 3. Control policies used for experimentation

Control policy   Description
FL               Fixed with lowest value mode
FH               Fixed with highest value mode
AC-N             Adaptive control under naïve decision model
AC-S             Adaptive control under stable decision model
8.2 Results

Numerical results from the experiments are summarized in Table 4. The adaptive control policies show significant advantages over the non-adaptive ones under all conditions. The benefit of AC-S, however, is less clear in the numerical results: though AC-S outperforms AC-N in deterministic environments, AC-N outperforms AC-S in stochastic environments. This means that AC-S cannot guarantee better performance, especially in stochastic environments. But we
can say that AC-S is a robust policy that keeps the system from diverging and from degrading performance significantly, as shown in the earlier stability analysis.
Table 4. Experimental results

            FL                   FH                   AC-N                  AC-S
Condition   T     V      QoS    T     V      QoS     T     V      QoS     T     V      QoS
Con1        1656  13558  6934   6313  30643  5391    1663  22898  16245   1656  22884  16259
Con2        1652  13547  6942   6302  30643  5435    1723  22982  16089   1728  22959  16046
Con3        1656  13558  6934   6313  30643  5391    1966  23401  15539   1965  23403  15542
Con4        1652  13547  6942   6371  30643  5159    2024  23495  15401   2007  23406  15376

T: Completion time, V: Value of solution
Fig. 9 shows the behavior of T* under the adaptive control policies in unstressed environments. In the deterministic environment the system behaves stably under AC-S while diverging under AC-N. This does not hold in the stochastic environment, however, where the system appears more stable under AC-N. This may partially explain why AC-S does not perform better in stochastic environments.
[Figure omitted: optimal T plotted against time for AC-N and AC-S under Con1 and Con2]
Fig. 9. Behavior of T* in unstressed environments
A system controlled by the adaptive policies is naturally adaptive to changing environments, since components monitor their environments and incorporate them into the decision process. As shown in Figs. 10 and 11 for the deterministic case and Figs. 12 and 13 for the stochastic case, when the environment changes the system adapts to the new environment.
[Figure omitted: optimal T plotted against time]
Fig. 10. Adaptive behavior of T* in a deterministic environment (Con3)
[Figure omitted: value modes of A_1, A_2, A_4, and A_8 plotted against time]
Fig. 11. Adaptive behavior of v_i* in a deterministic environment (Con3)
[Figure omitted: optimal T plotted against time]
Fig. 12. Adaptive behavior of T* in a stochastic environment (Con4)
[Figure omitted: value modes of A_1, A_2, A_4, and A_8 plotted against time]
Fig. 13. Adaptive behavior of v_i* in a stochastic environment (Con4)
9. Conclusions

A typical information network emerges as a result of automation or organizational integration and is large-scale, with a distributed, component-based architecture. In this paper we developed an adaptive control mechanism that supports the survivability of such networks by
utilizing alternative algorithms. We designed an auction market that coordinates the components of a network: each component bids based on its measured resource availability, and optimal decisions are made through a multi-tier auctioning process. By periodically opening the auction market, the system achieves desirable performance, adapting to a changing stress environment while preserving scalability.
Our work can be extended to more general network configurations. Multiple components on one machine may share resources, and in such resource-sharing environments there is an opportunity to improve system performance by allocating resources appropriately. Though the designed control mechanism is applicable to resource-sharing environments, it would be desirable to explore a control mechanism that additionally incorporates resource allocation.
References

[1] S. Jha and J. M. Wing, “Survivability analysis of networked systems,” in Proc. 23rd Int. Conf. Software Engineering, 2001, pp. 307-317.
[2] R. Ellison, D. Fisher, H. Lipson, T. Longstaff, and N. Mead, “Survivable network systems: An emerging discipline,” Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU/SEI-97-153, 1997.
[3] J. E. Eggleston, S. Jamin, T. P. Kelly, J. K. MacKie-Mason, W. E. Walsh, and M. P. Wellman, “Survivability through market-based adaptivity: The MARX project,” in Proc. DARPA Information Survivability Conference and Exposition, 2000, pp. 145-156.
[4] S. Bowers, L. Delcambre, D. Maier, C. Cowan, P. Wagle, D. McNamee, A. L. Meur, and H. Hinton, “Applying adaptation spaces to support quality of service and survivability,” in
Proc. DARPA Information Survivability Conference and Exposition, 2000, pp. 271-283.
[5] O. F. Rana and K. Stout, “What is scalability in multi-agent systems?,” in Proc. 4th Int. Conf. Autonomous Agents, 2000, pp. 56-63.
[6] B. Meyer, “On to components,” IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.
[7] P. Clements, “From subroutines to subsystems: Component-based software development,” in Component Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press, pp. 3-6, 1996.
[8] M. O. McCracken, A. Snavely, and A. Malony, “Performance modeling for dynamic algorithm selection,” in Proc. Int. Conf. Computational Science, 2003, pp. 749-758.
[9] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A. Quilici, D. S. Rosenblum, and A. L. Wolf, “An architecture-based approach to self-adaptive software,” IEEE Intelligent Systems, vol. 14, no. 3, pp. 54-62, 1999.
[10] F. M. T. Brazier, C. M. Jonker, and J. Treur, “Principles of component-based design of intelligent agents,” Data and Knowledge Engineering, vol. 41, no. 1, pp. 1-28, 2002.
[11] H. J. Goradia and J. M. Vidal, “Building blocks for agent design,” in Proc. 4th Int. Workshop on Agent-Oriented Software Engineering, 2003, pp. 17-30.
[12] R. Krutisch, P. Meier, and M. Wirsing, “The AgentComponent approach, combining agents and components,” in Proc. 1st German Conf. Multiagent System Technologies, 2003, pp. 1-12.
[13] D. Moore, W. Wright, and R. Kilmer, “Control surfaces for Cougaar,” in Proc. First Open Cougaar Conference, 2004, pp. 37-44.
[14] W. Peng, V. Manikonda, and S. Kumara, “Understanding agent societies using distributed monitoring and profiling,” in Proc. First Open Cougaar Conference, 2004, pp. 53-60.
[15] H. Gupta, Y. Hong, H. P. Thadakamalla, V. Manikonda, S. Kumara, and W. Peng, “Using predictors to improve the robustness of multi-agent systems: Design and implementation in Cougaar,” in Proc. First Open Cougaar Conference, 2004, pp. 81-88.
[16] D. Moore, A. Helsinger, and D. Wells, “Deconfliction in ultra-large MAS: Issues and a potential architecture,” in Proc. First Open Cougaar Conference, 2004, pp. 125-133.
[17] R. D. Snyder and D. C. Mackenzie, “Cougaar agent communities,” in Proc. First Open Cougaar Conference, 2004, pp. 143-147.
[18] A. P. Moore, R. J. Ellison, and R. C. Linger, “Attack modeling for information security and survivability,” Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Note CMU/SEI-2001-TN-001, 2001.
[19] F. Moberg, “Security analysis of an information system using an attack tree-based methodology,” M.S. thesis, Automation Engineering Program, Chalmers University of Technology, Sweden, 2000.
[20] A. G. Barto, S. J. Bradtke, and S. P. Singh, “Learning to act using real-time dynamic programming,” Artificial Intelligence, vol. 72, no. 1-2, pp. 81-138, 1995.
[21] R. S. Sutton, A. G. Barto, and R. J. Williams, “Reinforcement learning is direct adaptive optimal control,” IEEE Control Systems, vol. 12, no. 2, pp. 19-22, 1992.
[22] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[23] J. B. Rawlings, “Tutorial overview of model predictive control,” IEEE Control Systems, vol. 20, no. 3, pp. 38-52, 2000.
[24] M. Morari and J. H. Lee, “Model predictive control: Past, present and future,” Computers and Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999.
[25] M. Nikolaou, “Model predictive controllers: A critical synthesis of theory and industrial needs,” in Advances in Chemical Engineering Series, Academic Press, 2001.
[26] S. J. Qin and T. A. Badgwell, “A survey of industrial model predictive control technology,” Control Engineering Practice, vol. 11, pp. 733-764, 2003.
[27] Y. Lengwiler, “The multiple unit auction with variable supply,” Economic Theory, vol. 14, no. 2, pp. 373-392, 1999.
Self-Organizing Resource Allocation for Minimizing Completion Time in Large-Scale Distributed Information Networks

Seokcheon Lee, Soundar Kumara, and Natarajan Gautam
Abstract—As information networks grow larger due to automation or organizational integration, it is important to provide simple decision-making mechanisms for each entity or group of entities that lead to desirable global performance. In this paper we study a large-scale information network consisting of distributed software components linked together through a task flow structure, and we design a resource control mechanism for minimizing completion time. We define a load index that represents a component's workload. When resources are allocated locally in proportion to the load index, the network can maximize the utilization of distributed resources and achieve optimal performance in the limit of a large number of tasks. Coordinated resource allocation throughout the network emerges from using the load index as global information. To make the notion of a "large number of tasks" precise, we provide a quantitative criterion for the adequacy of proportional resource allocation for a given network. By periodically allocating resources within the framework of model predictive control, a closed-loop policy reactive to the current system state is formed. The designed resource control mechanism has several emergent properties found in many self-organizing systems, such as social or biological systems: though it is localized and requires almost no computation, it realizes desirable global performance adaptive to changing environments.

Index Terms—Distributed information networks, emergence, resource allocation, scalability.
I. INTRODUCTION

Critical infrastructures are increasingly becoming dependent on networked systems in many domains due to automation
on networked systems in many domains due to automation<br />
or organizational integration. The growth in complexity <strong>and</strong><br />
size of software systems is leading to the increasing importance<br />
of distributed <strong>and</strong> component-based architectures. Distributed<br />
computing aims at using computing power of machines<br />
Manuscript received June 24, 2005. This work was supported in part by DARPA under Grant MDA 972-01-1-0038.
S. Lee is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (phone: 814-863-4799; fax: 814-863-4745; e-mail: stonesky@psu.edu).
S. Kumara is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (e-mail: skumara@psu.edu).
N. Gautam is with the Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802 USA (e-mail: ngautam@psu.edu).
connected by a network; when a task requires intensive computation, it becomes a natural choice for achieving high performance. A component is a reusable program element. Component technology utilizes components so that developers can build the systems they need simply by defining the components' specific roles and wiring them together [1][2]. In networks with component-based architectures, each component is highly specialized for specific tasks.
We study a large-scale information network (with respect to the number of components as well as machines) comprising distributed software components linked together through a task flow structure. A problem given to the network is decomposed into root tasks for some components, and those tasks are propagated through the task flow structure to other components. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. The service provided by the network is to produce a global solution to the given problem, which is an aggregation of the partial solutions of the individual tasks. The Quality of Service (QoS) of the network is determined by the time needed to generate the global solution, i.e., the completion time. For a given topology, components share resources, and the network can control its behavior through resource allocation. Specifically, we address allocating the resources of each machine to the components residing on that machine. In this paper we develop a resource control mechanism for minimizing the completion time of such networks.
Many self-organizing systems, such as social and biological systems, exhibit emergent properties: though entities act by simple mechanisms without a central authority, these systems are adaptive, and desirable global performance can often be realized. The control mechanism designed in this paper has such properties, so it is applicable to large-scale networks operating in dynamic environments. Scalability, defined as "the ability of a solution to some problem to work when the size of the problem increases" (from the Dictionary of Computing at http://wombat.doc.ic.ac.uk), becomes a critical issue in developing practical software systems as network size grows [3]. We also provide a criterion by which one can evaluate whether the emergent properties hold for a given network.
The organization of this paper is as follows. In Section II we discuss the problem domain, and in Section III we formally define the problem in detail. After designing the resource control mechanism in Sections IV and V, we present empirical results in Section VI. Finally, we conclude in Section VII.
replanning to cope with logistics plan deviations or operational plan changes. Initial planning and replanning are instances of the current research problem. The plan completion time of such networks directly affects the performance of the military operation.
II. PROBLEM DOMAIN

The networks we study represent distributed, component-based architectures for providing a solution to a given problem. A problem is decomposed into root tasks and solved by distributed components through a task flow structure. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. When the size of a problem becomes large, the size of the network, as well as the number of tasks for each component, can be large. One can imagine a wide range of scientific and engineering problems that can be solved with such architectures.
Cougaar (Cognitive Agent Architecture:<br />
http://www.cougaar.org) developed by <strong>DARPA</strong> (Defense<br />
Advanced Research Project Agency), is such an architecture<br />
for building large-scale multi-agent systems. Recently, there<br />
have been efforts to combine the technologies of agents <strong>and</strong><br />
components to improve building large-scale software systems<br />
[4]-[6]. While component technology focuses on reusability,<br />
agent technology focuses on processing complex tasks as a<br />
community. Cougaar is in line with this trend. In Cougaar a<br />
software system comprises of agents <strong>and</strong> an agent of<br />
components (called plugins). The task flow structure in those<br />
systems is that of components as a combination of intra-agent<br />
<strong>and</strong> inter-agent task flows. As the agents in Cougaar can be<br />
distributed both from geographical <strong>and</strong> information content<br />
sense, the networks implemented in Cougaar have distributed<br />
<strong>and</strong> component-based architecture.<br />
UltraLog (http://www.ultralog.net) networks are military<br />
supply chain planning systems implemented in Cougaar<br />
[7]-[11]. Each agent in these networks represents an<br />
organization of military supply chain <strong>and</strong> has a set of<br />
components specialized for each functionality (allocation,<br />
expansion, inventory management, etc) <strong>and</strong> class (ammunition,<br />
water, fuel, etc). The objective of an UltraLog network is to<br />
provide an appropriate logistics plan for a given military<br />
operational plan. A logistics plan is a global solution which is<br />
an aggregate of individual schedules built by components. An<br />
operational plan is decomposed into logistics requirements of<br />
each thread for each agent, <strong>and</strong> a requirement is further<br />
decomposed into root tasks (one task per day) for a designated<br />
component. As a result, a component can have hundreds of root<br />
tasks depending on the horizon of an operation <strong>and</strong> thous<strong>and</strong>s<br />
of tasks to process as the root tasks are propagated. As the scale<br />
of operation increases there can be thous<strong>and</strong>s of agents (tens of<br />
thous<strong>and</strong>s of components) in hundreds of machines working<br />
together to generate a logistics plan.<br />
An UltraLog network makes initial planning <strong>and</strong> continuous<br />
III. PROBLEM SPECIFICATION<br />
In this section we formally define the problem in a general form by detailing the network model and resource allocation. We concentrate on computational CPU resources, assuming that the system is computation-bound.
A. Network Model<br />
A network is composed of a set of components A and a set of nodes (i.e., machines) N. K_n denotes the set of components that reside on node n, sharing the node's CPU resource. The task flow structure of the network, which defines the precedence relationships between components, is an arbitrary directed acyclic graph. A problem given to the network is decomposed into root tasks for some components, and those tasks are propagated through the task flow structure. Each component processes one of the tasks in its queue (which holds root tasks as well as tasks from predecessor components) and then sends it to its successor components.
components. We denote the number of root tasks <strong>and</strong> expected<br />
CPU time 1 per task of component i as respectively. Fig.<br />
1 shows an example network in which there are four<br />
components residing in three nodes. Components A 1 <strong>and</strong> A 2<br />
resides in N 1 <strong>and</strong> each of them has 100 root tasks. A 3 in N 2 <strong>and</strong><br />
A 4 in N 3 have no root tasks, but each of them has 100 tasks from<br />
the corresponding predecessors, namely A 1 <strong>and</strong> A 2<br />
respectively.<br />
[Fig. 1 layout: N_1 hosts A_1 (100 root tasks, P=1) and A_2 (100 root tasks, P=2); A_1 feeds A_3 (P=2) on N_2; A_2 feeds A_4 (P=3) on N_3.]
Fig. 1. An example network. The network is composed of four components on three nodes, and the performance can depend on the resource allocation of node N_1.
B. Resource Allocation<br />
When there are multiple components in a node, the network<br />
needs to control its behavior through resource allocation. In the<br />
example network, node N 1 has two components <strong>and</strong> the system<br />
performance can depend on its resource allocation to these two<br />
components. There are several CPU scheduling algorithms for<br />
allocating a CPU resource amongst multiple threads. Among<br />
the scheduling algorithms, proportional CPU share (PS)<br />
scheduling is known for its simplicity, flexibility, <strong>and</strong> fairness<br />
[12]. In PS scheduling, threads are assigned weights and resource shares are determined in proportion to those weights [13]. Excess CPU time from some threads is allocated fairly to other threads. There are many PS scheduling algorithms, such as Weighted Round-Robin scheduling, Lottery scheduling, and Stride scheduling [14]-[16].
(¹ The distribution of CPU time can be arbitrary, though we use only the expected CPU time.)
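To illustrate how a PS scheduler realizes weight-proportional shares, here is a minimal sketch of stride scheduling, one of the algorithms cited above; the numerator constant and thread names are illustrative choices, not from the paper.

```python
import heapq

def stride_schedule(weights, quanta):
    """Run `quanta` scheduling decisions. Each thread's stride is
    inversely proportional to its weight; the thread with the minimum
    pass value runs next, so the quanta each thread receives converge
    to the weight proportions."""
    BIG = 10_000  # common stride numerator (illustrative constant)
    # heap entries are (pass, stride, name); pass starts at stride
    heap = [(BIG // w, BIG // w, name) for name, w in sorted(weights.items())]
    heapq.heapify(heap)
    runs = {name: 0 for name in weights}
    for _ in range(quanta):
        p, stride, name = heapq.heappop(heap)
        runs[name] += 1                           # grant one quantum
        heapq.heappush(heap, (p + stride, stride, name))
    return runs
```

With weights 1:2 over 300 quanta, the two threads receive quanta in exactly a 1:2 ratio, which is the proportional-share behavior the control mechanism below relies on.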
We adopt PS scheduling as the resource allocation scheme because of its generality, in addition to the benefits mentioned above. We define the resource allocation variable set w = {w_i(t): i∈A, t≥0}, in which w_i(t) is a non-negative weight of component i at time t. If the total managed weight of node n is ω_n, the boundary condition for assigning weights over time is:

∑_{i∈K_n} w_i(t) = ω_n, where w_i(t) ≥ 0. (1)

C. Problem Definition
The service provided by a network is to produce a global solution to a given problem, which is an aggregate of the partial solutions of individual tasks. QoS is determined by the completion time taken to generate the global solution. In this paper we develop a resource control mechanism to minimize the completion time T through the resource allocation w, as in (2):

arg min_w T. (2)
IV. OVERALL SOLUTION METHODOLOGY
There are two representative optimal control approaches for dynamic systems: Dynamic Programming (DP) and Model Predictive Control (MPC). Though DP gives an optimal closed-loop policy, it is inefficient for large-scale systems, especially when systems work over a finite time horizon [17]-[19]. In MPC, for each current state, an optimal open-loop control policy is designed over a finite time horizon by solving a static mathematical programming model [20]-[23]. The design process is repeated for each next observed state feedback, forming a closed-loop policy reactive to the current system state. Though MPC does not give an absolutely optimal policy in stochastic environments, the periodic design process alleviates the impact of stochasticity. Considering the characteristics of our problem, we choose the MPC framework: our networks are large-scale and work over a finite time horizon. So, we need to build a mathematical programming model.
The mathematical programming model is essentially a scheduling problem formulation. There are a variety of formulations and algorithms available for diverse scheduling problems in the contexts of multiprocessing, manufacturing, and project management. In general, a scheduling problem is to allocate limited resources to a set of tasks so as to optimize a specific objective. One widely studied objective is completion time (also called makespan in the manufacturing literature), as in the problem we have considered. Though it is not easy to find a problem exactly the same as ours, it is possible to convert our problem into one of these scheduling problems. For example, in a job shop there are a set of jobs and a set of machines. Each job has a set of serial operations, and each operation must be processed on a specific machine. The job shop scheduling problem is to sequence the operations on each machine, subject to a set of job precedence constraints, such that the completion time is minimized. Our problem can be exactly transformed into such a job shop scheduling problem. However, scheduling problems are in general intractable. Though the job shop scheduling problem is polynomially solvable when there are two machines and each job has two operations, it becomes NP-hard in the number of jobs when the number of machines or operations is more than two [24][25]. Considering that the task flow structure of our networks is arbitrary, our scheduling problem is NP-hard in the number of components in general, and the increase in the number of tasks imposes additional complexity. Moreover, there can be a large number of nodes in our networks.
Though it is possible to use available heuristic algorithms from the job shop scheduling literature, our scheduling problem has a particular characteristic: the number of tasks for each component can be large. Though the increase in the number of tasks adds complexity, it also gives us a great opportunity to develop an efficient heuristic solution. So, we analyze the impact of this largeness on optimal scheduling in the course of developing a resource control mechanism.
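The MPC cycle described above (observe the state, solve a static program for an open-loop policy, apply it until the next review) can be sketched generically as follows; the function names are placeholders for those steps, not an API from the text.

```python
def mpc_loop(observe, design_open_loop, apply_policy, period, horizon):
    """Model Predictive Control skeleton. At each review point the
    current state is observed, a static program is solved to obtain an
    open-loop policy, and that policy is held until the next review.
    Repeating the design at every period yields a closed-loop policy
    reactive to state feedback."""
    t = 0
    history = []
    while t < horizon:
        state = observe(t)                    # state feedback
        policy = design_open_loop(state)      # static mathematical program
        apply_policy(policy, t, t + period)   # hold policy for one period
        history.append((t, policy))
        t += period
    return history
```

In the resource control mechanism of Section V, `design_open_loop` reduces to the proportional allocation rule, which is why the per-period computation is negligible.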
V. RESOURCE CONTROL MECHANISM
In this section we develop a resource control mechanism under the MPC framework. After illustrating the effects of resource allocation, we develop the mechanism by characterizing an optimal open-loop resource allocation policy in the limit of a large number of tasks, and by providing a quantitative criterion for this largeness. For the theoretical analysis we assume a hypothetical weighted round-robin server for CPU scheduling, though, as will be discussed, this is not strictly required in practice. The hypothetical server has idealized fairness: the CPU time received by each thread in a round is infinitesimal and proportional to the weight of the thread.
A. Effects of Resource Allocation<br />
The completion time T is the time taken to generate the global solution, i.e., to process all the tasks of the network. We denote by T_n the completion time taken to process all the tasks of node n, and by T_i that of component i. Then the relationships in (3) hold:

T = Max_{n∈N} T_n = Max_{i∈A} T_i,  T_n = Max_{i∈K_n} T_i. (3)
A component’s instantaneous resource availability RA_i(t) is the fraction of a resource available when the component requests the resource at time t. The service time S_i(t) is the time taken to process a task at time t and is related to RA_i(t) by:

∫_t^{t+S_i(t)} RA_i(τ) dτ = P_i. (4)

When RA_i(t) remains constant, S_i(t) becomes:

S_i(t) = P_i / RA_i(t). (5)
Now consider the example network in Fig. 1. In this network only N_1 has the chance to allocate its resource, as it has two residing components. T_N1 is invariant to resource allocation and equal to 300 (=100*1+100*2). But T_A1 and T_A2 can vary depending on the resource allocation of N_1. When the resource is allocated equally to the two components, both RA_A1(t) and RA_A2(t) are equal to 0.5 initially. As A_1 completes at t=200 (=100*1/0.5), A_2 utilizes the resource fully from then on, i.e., RA_A2(t)=1 for t≥200. So A_2 completes 50 tasks by t=200 (=50*2/0.5) and the remaining 50 tasks by t=300 (=200+50*2/1). A_3 completes at t=202 (=200+1*2/1) because its task inter-arrival time from A_1 equals its service time. As A_4's service time is less than its task inter-arrival time (=4) for t≤200, A_4 completes 49 tasks by t=200, with one task in queue arriving at t=200. From t=200 the task inter-arrival time from A_2 is reduced to 2, which is less than A_4's service time. So tasks accumulate until t=300 and A_4 completes at t=353 (=200+51*3/1). In this way we can trace the exact system behavior under three resource allocation strategies, as shown in Fig. 2.
[Fig. 2 plots omitted: (b) and (c) show the resource availability profiles RA_A1(t) and RA_A2(t) over 0≤t≤300 for weight ratios 1:1, 1:2, and 1:4.]

w_A1 : w_A2   1:1   1:2   1:4
T_A1          200   300   300
T_A2          300   300   250
T_A3          202   302   352
T_A4          353   303   302.5
T             353   303   352
(a) Completion time

Fig. 2. Effects of resource allocation. Depending on the resource allocation of node N_1, each of components A_1 and A_2 follows a different resource availability profile, as in (b) and (c). Consequently, the differences result in the different completion times in (a).

The network cannot complete before t=300 because each of N_1 and N_3 requires 300 CPU time. When the resource is allocated with a 1:2 ratio, the completion time T is minimal and close to 300. The ratio is proportional to each component's total required CPU time, i.e., 1:2 ≡ 100*1:100*2. One interesting question is whether the proportional allocation can give the best performance even if the successors have different parameters.
The answer is yes. If component A_1 is allocated more resource than under the proportional allocation, T_A3 is dominated by the maximum of T_A1 and A_3's total CPU time. But the first quantity is less than T_N1 and the second quantity is an invariant. So, allocating more resource than the proportional allocation cannot help reduce the completion time of the network. However, if a component is allocated less resource than under the proportional allocation, its successor's task inter-arrival time is stepwise decreasing. As a result, the successor underutilizes its resource and can complete later than under the proportional allocation. Therefore, the proportional allocation leads the network to utilize distributed resources efficiently and consequently helps minimize the completion time of the network, even though it is localized and independent of the successors' parameters.
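The hand trace above can be checked mechanically. The following is a small fluid simulation of the Fig. 1 network under idealized proportional-share scheduling, written for this report as a sketch; it uses exact rational arithmetic, and the step size is chosen so that all task completions fall on step boundaries for these particular parameters.

```python
from fractions import Fraction as F

def simulate(w1, w2, dt=F(1, 2)):
    """Fluid trace of the Fig. 1 network: A1 and A2 share N1 with
    weights w1:w2; A1 feeds A3 on N2, A2 feeds A4 on N3. Each
    component processes 100 tasks; returns per-component completion
    times."""
    P = {'A1': F(1), 'A2': F(2), 'A3': F(2), 'A4': F(3)}   # CPU time per task
    succ = {'A1': 'A3', 'A2': 'A4', 'A3': None, 'A4': None}
    node = {'A1': 'N1', 'A2': 'N1', 'A3': 'N2', 'A4': 'N3'}
    weight = {'A1': F(w1), 'A2': F(w2), 'A3': F(1), 'A4': F(1)}
    queue = {'A1': 100, 'A2': 100, 'A3': 0, 'A4': 0}        # waiting tasks
    work = {c: F(0) for c in P}                             # progress on task in service
    serving = {}
    for c in P:                                             # pull first task into service
        serving[c] = queue[c] > 0
        if serving[c]:
            queue[c] -= 1
    done = {c: 0 for c in P}
    finish = {c: None for c in P}
    t = F(0)
    while t < 1000 and any(done[c] < 100 for c in P):       # safety-bounded loop
        for n in ('N1', 'N2', 'N3'):                        # proportional share per node
            busy = [c for c in P if node[c] == n and serving[c]]
            wsum = sum(weight[c] for c in busy)
            for c in busy:
                work[c] += dt * weight[c] / wsum
        t += dt
        for c in ('A1', 'A2', 'A3', 'A4'):                  # detect task completions
            if serving[c] and work[c] >= P[c]:
                work[c] = F(0)
                done[c] += 1
                finish[c] = t
                if succ[c]:
                    queue[succ[c]] += 1                     # propagate task downstream
                if queue[c] > 0:
                    queue[c] -= 1                           # next task enters service
                else:
                    serving[c] = False
        for c in P:                                         # wake idle components on arrival
            if not serving[c] and queue[c] > 0:
                serving[c] = True
                queue[c] -= 1
    return finish
```

simulate(1, 1) reproduces the traced completion times (T_A1=200, T_A2=300, T_A3=202, T_A4=353), and simulate(1, 2) gives the minimal overall completion time 303, matching Fig. 2(a).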
B. Optimal Open-loop Policy<br />
To generalize the arguments to arbitrary network configurations, we define the load index LI_i, which represents component i's total CPU time required to process its tasks. As a component needs to process its own root tasks as well as incoming tasks from its predecessors, its number of tasks L_i is given by (6), where pred(i) denotes the set of immediate predecessors of component i. Then LI_i is given by (7):

L_i = rt_i + ∑_{a∈pred(i)} L_a, (6)

LI_i = L_i P_i. (7)

To provide a theoretical foundation for the optimal resource allocation policy, we convert a network into a network whose tasks have infinitesimal processing times. Each root task is divided into r infinitesimal tasks and each P_i is replaced with P_i/r. Then the load index of each component is the same as in the original network, but the tasks are infinitesimal. We denote the completion time of the network with infinitesimal tasks by T′. We also define a term called task availability as an indicator of relative preference between task arrival patterns: an arrival pattern gives higher task availability than another if its cumulative number of arrived tasks is larger or equal over time. A component prefers a task arrival pattern with higher task availability, as it can then utilize more resource. Consider a network and reconfigure it such that all components have all their tasks in their queues at t=0. Each component has maximal task availability in the reconfigured network, and the completion time of the reconfigured network forms the lower bound T_LB of the network's completion time T, given by:

T_LB = Max_{n∈N} ∑_{i∈K_n} LI_i. (8)
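Equations (6)-(8) can be evaluated directly for the Fig. 1 example; this is a sketch with the example's parameters as read off the worked example in the previous subsection.

```python
from fractions import Fraction as F

P     = {'A1': F(1), 'A2': F(2), 'A3': F(2), 'A4': F(3)}  # CPU time per task
rt    = {'A1': 100, 'A2': 100, 'A3': 0, 'A4': 0}          # root tasks
preds = {'A1': [], 'A2': [], 'A3': ['A1'], 'A4': ['A2']}  # immediate predecessors
node  = {'A1': 'N1', 'A2': 'N1', 'A3': 'N2', 'A4': 'N3'}

L = {}
def num_tasks(i):
    # (6): L_i = rt_i + sum of predecessors' task counts
    if i not in L:
        L[i] = rt[i] + sum(num_tasks(a) for a in preds[i])
    return L[i]

LI = {i: num_tasks(i) * P[i] for i in P}                  # (7): LI_i = L_i * P_i
node_load = {}
for i in P:                                               # per-node sum of load indices
    node_load[node[i]] = node_load.get(node[i], F(0)) + LI[i]
T_LB = max(node_load.values())                            # (8)
```

This gives LI = {A1: 100, A2: 200, A3: 200, A4: 300} and T_LB = 300, matching the observation that the example network cannot complete before t=300.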
Theorem 1. T′ equals T_LB when each node allocates its resource in proportion to its residing components' load indices:

w_i(t) = w_i = ω_{n(i)} LI_i / ∑_{p∈K_{n(i)}} LI_p for all t ≥ 0, (9)

where n(i) denotes the node on which component i resides.

Proof. RA_i(t) is greater than or equal to the assigned weight proportion:

RA_i(t) ≥ w_i(t) / ω_{n(i)} for t ≥ 0. (10)

Suppose a component i receives its tasks at a constant interval of T_LB/L_i. Then, under proportional allocation, S_i(t) is less than or equal to T_LB/L_i over time, as shown in (11):

P_i = ∫_t^{t+S_i(t)} RA_i(τ) dτ ≥ ∫_t^{t+S_i(t)} [w_i(t)/ω_{n(i)}] dτ = [LI_i / ∑_{p∈K_{n(i)}} LI_p] S_i(t)
⇒ T_LB/L_i ≥ S_i(t) for t ≥ 0, (11)

since L_i P_i = LI_i and ∑_{p∈K_{n(i)}} LI_p ≤ T_LB. So any component can complete by T_LB and generate tasks at a constant interval of T_LB/L_i from t = T_LB/L_i (its first task generation time) under proportional allocation when it receives tasks at a constant interval of T_LB/L_i from t = 0 (its first task arrival time). As tasks are infinitesimal and root tasks increase task availability, each component can receive infinitesimal tasks at a constant interval over 0 ≤ t ≤ T_LB or more preferably, and complete at or before T_LB. So the network completes at T_LB under proportional allocation. □

From Theorem 1 we can conjecture that a network can achieve performance close to T_LB under proportional allocation in the limit of a large number of tasks. We propose proportional allocation as the optimal resource allocation policy. Though proportional allocation is localized, the network can maximize the utilization of distributed resources and achieve the desired performance. Coordinated resource allocation throughout the network emerges as a result of using the load index as global information. If nodes do not follow the proportional allocation policy, some components can receive their tasks less preferably, resulting in underutilization and consequently increased completion time, as shown in the previous subsection.

Another important property of the proportional allocation policy is that it is itself adaptive. Suppose there are stressors sharing resources with the components. We denote by ω_n^s the amount of resource taken by a stressor on node n. Then the lower bound performance T_s^LB under stress is given by:

T_s^LB = Max_{n∈N} [(ω_n + ω_n^s) / ω_n] ∑_{i∈K_n} LI_i. (12)

We denote the completion time under stress by T_s′.

Theorem 2. T_s′ equals T_s^LB under proportional allocation.

Proof. RA_i(t) becomes:

RA_i(t) ≥ w_i(t) / (ω_{n(i)} + ω_{n(i)}^s) for t ≥ 0. (13)

Then (11) becomes (14) under proportional allocation:

T_s^LB/L_i ≥ S_i(t) for t ≥ 0. (14)

Therefore, the network completes at T_s^LB under proportional allocation. □
Theorem 2 shows that the proportional allocation policy is optimal independent of the stress environment. Though we do not model stressors explicitly, the policy adaptively achieves the lower bound performance. This characteristic is especially important when the system is vulnerable to unpredictable stress environments: modern networked systems can easily be exposed to various adverse events, such as accidental failures and malicious attacks, and the space of stress environments is high-dimensional and evolving [26]-[28].
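As an illustration of the stressed bound (12), consider the Fig. 1 example with a stressor on one node; the stressor share below is a hypothetical value chosen for illustration, not taken from the paper.

```python
from fractions import Fraction as F

node_load = {'N1': F(300), 'N2': F(200), 'N3': F(300)}  # sum of LI_i per node (Fig. 1)
omega     = {'N1': F(1), 'N2': F(1), 'N3': F(1)}        # managed weights
omega_s   = {'N1': F(0), 'N2': F(1), 'N3': F(0)}        # hypothetical stressor on N2

# (12): T_s^LB = max over nodes of (omega_n + omega_n^s)/omega_n * sum of LI_i
T_s_LB = max((omega[n] + omega_s[n]) / omega[n] * node_load[n] for n in node_load)
```

Here the stressor halves N_2's effective share, so N_2's term rises to 400 and becomes the binding bound even though N_2 carries the smallest load.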
C. Adequacy criterion<br />
The arguments we have made hold in the limit of a large number of tasks. As the term "large" is vague, we give it a concrete definition through an adequacy criterion, by which one can evaluate whether the desirable properties of the proportional allocation hold for a given network. For this purpose we characterize the upper bound performance of a network under proportional allocation.
Theorem 3. Under proportional allocation, a network's upper bound T_UB on the completion time T is given by:

T_UB = T_LB + Max_{e∈E} Max_{j∈S_e} ∑_{i∈j} [P_i ∑_{p∈K_{n(i)}} LI_p / LI_i], (15)

where E denotes the set of components which have no successor, and S_e the set of task paths to component e. A task path to component e is the set of components on a path from a component with no predecessor to component e, and does not include component e.
Proof. From (11) we can derive the lowest upper bound S_i^UB of S_i(t):

S_i^UB = P_i ∑_{p∈K_{n(i)}} LI_p / LI_i. (16)

So a component i can complete by T_LB and generate tasks at a constant interval of T_LB/L_i from t = S_i^UB when it receives tasks at a constant interval of T_LB/L_i from t = 0. Now consider component i's successor s, which has only one predecessor. As the successor receives tasks at a constant interval of T_LB/L_s from t = S_i^UB or more preferably, it can complete by S_i^UB + T_LB. So a component e∈E (with no successor) can receive tasks at a constant interval of T_LB/L_e from the maximal task traveling time to the component,

Max_{j∈S_e} ∑_{i∈j} S_i^UB, (17)

(note that a path j does not include component e) or more preferably, so that its completion time T_e is bounded as:

T_e ≤ T_LB + Max_{j∈S_e} ∑_{i∈j} S_i^UB. (18)

The upper bound of T is then the maximum of these bounds over e∈E. □
Though we formulated the upper bound performance without considering stress environments, one can easily modify it so that it reflects them (if each ω_n^s is identifiable or can be assumed). The adequacy criterion is defined as the ratio between T_LB and T_UB:

Adequacy = T_LB / T_UB. (19)

When the criterion is close to one, a network can achieve the lower bound performance using the proportional allocation policy. Typically, the criterion converges to one as each L_i increases. However, as the criterion approaches zero, the policy becomes more and more inadequate. The example network in Fig. 1 is quite adequate, with an adequacy of 0.99 (300/303).
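Continuing the Fig. 1 example, (15), (16), and (19) can be computed as follows; this is a sketch using the example's parameters, and it assumes every sink component has at least one predecessor (true here).

```python
from fractions import Fraction as F

P     = {'A1': F(1), 'A2': F(2), 'A3': F(2), 'A4': F(3)}
L     = {'A1': 100, 'A2': 100, 'A3': 100, 'A4': 100}      # task counts from (6)
preds = {'A1': [], 'A2': [], 'A3': ['A1'], 'A4': ['A2']}
node  = {'A1': 'N1', 'A2': 'N1', 'A3': 'N2', 'A4': 'N3'}

LI = {i: L[i] * P[i] for i in P}                          # (7)
node_load = {n: sum(LI[i] for i in P if node[i] == n) for n in set(node.values())}
T_LB = max(node_load.values())                            # (8)
S_UB = {i: P[i] * node_load[node[i]] / LI[i] for i in P}  # (16)

succ_count = {i: sum(i in preds[j] for j in P) for i in P}
sinks = [e for e in P if succ_count[e] == 0]              # E: no successors

def path_cost(i):
    # max over task paths ending at (and including) component i of sum of S_UB
    return S_UB[i] + max((path_cost(a) for a in preds[i]), default=F(0))

# (15): costliest path into each sink, excluding the sink itself
T_UB = T_LB + max(max(path_cost(a) for a in preds[e]) for e in sinks)
adequacy = T_LB / T_UB                                    # (19)
```

This yields T_UB = 303 and adequacy 300/303 ≈ 0.99, matching the value quoted above.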
So far, we have assumed a hypothetical weighted round-robin server, which is difficult to realize in practice. Our arguments nevertheless remain valid, because they are based on worst-case analysis and because, in reality, the quantum size is effectively infinitesimal compared to the working horizon.
D. Resource control mechanism<br />
Once a network's adequacy is above an appropriate level (depending on the nature of the network), the proportional allocation is deployed periodically under the MPC framework. Consider the current time t. To update the load index as the system moves on, we slightly modify it to represent the total CPU time for the remaining tasks:

LI_i(t) = R_i(t) + L_i(t) P_i, (20)

in which R_i(t) denotes the remaining CPU time for the task in process and L_i(t) the number of remaining tasks excluding the task in process. After identifying the initial number of tasks L_i(0)=L_i, each component updates the count by counting down as it processes tasks. Periodically, a resource manager on each node collects the current LI_i(t) values from its residing components and allocates resource proportional to the indices, as in (21). As the resource allocation policy is purely localized, there is no need for synchronization between nodes. The designed resource control mechanism is scalable, as each node can make decisions independently of the others while requiring almost no computation.
w_i(t) = ω_{n(i)} LI_i(t) / ∑_{p∈K_{n(i)}} LI_p(t). (21)
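A node-local resource manager implementing (20) and (21) amounts to a few lines; the dictionary of per-component state below is illustrative, not an interface from the text.

```python
def remaining_load_index(R_i, L_i, P_i):
    # (20): LI_i(t) = R_i(t) + L_i(t) * P_i
    return R_i + L_i * P_i

def reallocate(components, omega_n):
    """(21): set each component's weight proportional to its remaining
    load index. Called periodically by the node's resource manager;
    `components` maps names to (R_i, L_i, P_i) state tuples."""
    li = {name: remaining_load_index(*state) for name, state in components.items()}
    total = sum(li.values())
    if total == 0:
        return {name: omega_n / len(li) for name in li}   # nothing left: split evenly
    return {name: omega_n * li[name] / total for name in li}
```

Midway through the Fig. 1 run, for example, if A_1 has 50 tasks left (R=0, P=1) and A_2 has 50 left (R=0, P=2), the weights come out as 1/3 and 2/3 of ω_n, i.e., the 1:2 ratio identified earlier.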
VI. EMPIRICAL RESULTS<br />
We ran several experiments using discrete-event simulation to validate the designed resource control mechanism.
A. Experimental design
The experimental network is composed of eight components on four nodes, as in Fig. 3. Two components share a resource on N_3, and four components on N_4. Also, ω_n is 1 for all n∈N, and CPU is allocated using weighted round-robin scheduling in which the CPU time received by each component in a round is equal to its assigned weight.
[Fig. 3 layout: N_1 hosts A_1; N_2 hosts A_2; N_3 hosts A_3 and A_4; N_4 hosts A_5, A_6, A_7, and A_8.]
Fig. 3. Experimental network configuration. The network is composed of eight components on four nodes, and the performance can depend on the resource allocation of nodes N_3 and N_4.
We set up ten different experimental conditions, as shown in Table I. We vary the number of root tasks rt_i and the CPU time per task P_i, and the distribution of P_i is either deterministic or exponential. For each stochastic condition we repeat the experiment 5 times.
We use three different resource control policies for each<br />
experimental condition. Table II shows these control policies.<br />
In the round-robin allocation policy (RR), the components in each node are assigned equal weights over time. PA-O and PA-C use the proportional allocation policy in open loop and closed loop, respectively: in PA-O, resources are allocated only at t=0 and the allocation is kept fixed over time, while in PA-C resources are reallocated periodically (every 100 time units). PA-C is the resource control mechanism we have designed.
B. Results<br />
TABLE I
EXPERIMENTAL CONDITIONS
Condition  Distribution of P_i  rt_i (A_1..A_8)                    P_i (A_1..A_8)
Con1-1     Deterministic        [0 0 0 0 200 200 200 200]          [4 12 4 8 2 2 2 2]
Con1-2     Exponential          [0 0 0 0 200 200 200 200]          [4 12 4 8 2 2 2 2]
Con2-1     Deterministic        [100 100 100 100 200 200 200 200]  [4 12 4 10 2 2 6 6]
Con2-2     Exponential          [100 100 100 100 200 200 200 200]  [4 12 4 10 2 2 6 6]
Con3-1     Deterministic        [100 100 100 100 200 200 200 100]  [4 12 4 10 2 2 20 10]
Con3-2     Exponential          [100 100 100 100 200 200 200 100]  [4 12 4 10 2 2 20 10]
Con4-1     Deterministic        [100 100 100 100 200 200 200 200]  [4 12 4 8 2 2 10 2]
Con4-2     Exponential          [100 100 100 100 200 200 200 200]  [4 12 4 8 2 2 10 2]
Con5-1     Deterministic        [100 100 200 200 200 200 200 200]  [4 10 4 8 2 2 2 2]
Con5-2     Exponential          [100 100 200 200 200 200 200 200]  [4 10 4 8 2 2 2 2]
TABLE II
CONTROL POLICIES FOR EXPERIMENTATION
Control policy  Description
RR              Round-robin allocation
PA-O            Proportional allocation, open loop
PA-C            Proportional allocation, closed loop
Numerical results from the experiments are shown in Table III. Lower and upper bounds are calculated for each experimental condition. The network adequacy of each condition is close to one, so the proportional allocation policy can be used effectively for all the conditions.
The proportional allocation policies (PA-O and PA-C) show significant advantages over round-robin allocation under all conditions. The completion time T under proportional allocation is bounded by T_UB and close to T_LB in all deterministic conditions (note that the performance of PA-O and PA-C is identical in deterministic environments), supporting the effectiveness of the resource allocation policy. Though T_UB does not hold exactly in stochastic environments, the performance improves to close to T_LB when the proportional allocation is implemented in closed loop: the periodic design process alleviates the impact of stochasticity. So, we can conclude that the designed control mechanism can be used effectively, even in stochastic environments, for networks with high adequacy.
The performance differences can be explained by resource utilization, as discussed earlier. A node with the maximal total CPU time needs to utilize its resource almost fully to achieve performance close to T_LB. For example, N_2 is such a node in Con1-1 (deterministic) and Con1-2 (stochastic). Resource utilization profiles of N_2 are shown in Fig. 4 for Con1-1 and Fig. 5 for Con1-2, in which each data point corresponds to the amount of resource utilized during a control period (100 time units). In the deterministic environment (Con1-1), N_2 utilizes its resource almost fully under both proportional allocation
[Fig. 4 plot omitted: resource utilization (%) of N_2 versus time (0-6000) for RR, PA-O, and PA-C in Con1-1.]
Fig. 4. Resource utilization of N_2 in Con1-1. In a deterministic environment, N_2 utilizes its resource almost fully under both proportional allocation policies (PA-O, PA-C) while underutilizing it in the initial stage under the round-robin allocation policy (RR).
[Figure: Utilization (%) versus Time (0-6000) for PA-O, PA-C, and RR]
Fig. 5. Resource utilization of N_2 in Con1-2. In a stochastic environment, N_2 utilizes its resource more under the proportional allocation policies (PA-O, PA-C) than under the round-robin allocation policy (RR), and resource utilization under the closed-loop policy (PA-C) is larger than under the open-loop policy (PA-O).
TABLE III
EXPERIMENTAL RESULTS

Condition | T_LB | T_UB | Adequacy | RR: T | RR: T_LB/T | PA-O: T | PA-O: T_LB/T | PA-C: T | PA-C: T_LB/T
Con1-1 | 4800 | 4820 | 0.996 | 5619 | 0.854 | 4820 | 0.996 | 4820 | 0.996
Con1-2 | 4800 | 4820 | 0.996 | 5618 | 0.854 | 5021 | 0.956 | 4939 | 0.972
Con2-1 | 7200 | 7230 | 0.996 | 7612 | 0.946 | 7200 | 1.000 | 7200 | 1.000
Con2-2 | 7200 | 7230 | 0.996 | 7679 | 0.938 | 7323 | 0.983 | 7252 | 0.993
Con3-1 | 6000 | 6073 | 0.988 | 6412 | 0.936 | 6012 | 0.998 | 6012 | 0.998
Con3-2 | 6000 | 6073 | 0.988 | 6408 | 0.936 | 6193 | 0.969 | 6013 | 0.998
Con4-1 | 7200 | 7228 | 0.996 | 7200 | 1.000 | 7200 | 1.000 | 7200 | 1.000
Con4-2 | 7200 | 7228 | 0.996 | 7231 | 0.996 | 7109 | 1.013 | 7169 | 1.004
Con5-1 | 7200 | 7220 | 0.997 | 7810 | 0.922 | 7210 | 0.999 | 7210 | 0.999
Con5-2 | 7200 | 7220 | 0.997 | 7979 | 0.902 | 7351 | 0.979 | 7319 | 0.984
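The derived columns of Table III follow directly from the raw times: adequacy is T_LB/T_UB, and each policy's ratio column is T_LB/T. A minimal sketch that recomputes the Con1-1 and Con1-2 rows (values copied from the table):

```python
# Recompute the derived columns of Table III for two conditions.
rows = {
    # condition: (T_LB, T_UB, {policy: measured completion time T})
    "Con1-1": (4800, 4820, {"RR": 5619, "PA-O": 4820, "PA-C": 4820}),
    "Con1-2": (4800, 4820, {"RR": 5618, "PA-O": 5021, "PA-C": 4939}),
}

for cond, (t_lb, t_ub, ts) in rows.items():
    adequacy = t_lb / t_ub                          # network adequacy = T_LB / T_UB
    ratios = {p: round(t_lb / t, 3) for p, t in ts.items()}
    print(cond, round(adequacy, 3), ratios)
# Con1-1 -> 0.996 {'RR': 0.854, 'PA-O': 0.996, 'PA-C': 0.996}
```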
policies, while underutilizing it in the initial stage under round-robin allocation. In the stochastic environment (Con1-2), the resource utilization profiles under the two proportional allocation policies diverge. Though both give higher utilization than round-robin allocation, utilization under the closed-loop policy is larger than under the open-loop policy. These differences in resource utilization account for the performance differences in Table III. The designed control mechanism helps maximize the utilization of distributed resources so as to achieve the desired performance.
VII. CONCLUSIONS

A typical information network emerges as a result of automation or organizational integration; it is large-scale, with a distributed, component-based architecture. In this paper we designed a resource control mechanism for such networks that minimizes completion time. The designed mechanism has several desirable properties. First, it is localized: each node can make decisions independently of the others. Second, it requires almost no computation. Third, the network can nevertheless achieve desirable performance. Fourth, it adapts to stress environments without considering them explicitly. Such emergent properties can be found in many self-organizing systems, such as social or biological systems: though entities act by a simple mechanism without central authority, desirable global performance can often be realized. A large-scale network working in a dynamic environment under the designed control mechanism is truly a self-organizing system.
REFERENCES

[1] B. Meyer, "On to components," IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.
[2] P. Clements, "From subroutines to subsystems: Component-based software development," in Component-Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press, 1996, pp. 3-6.
[3] O. F. Rana and K. Stout, "What is scalability in multi-agent systems?," in Proc. 4th Int. Conf. Autonomous Agents, 2000, pp. 56-63.
[4] F. M. T. Brazier, C. M. Jonker, and J. Treur, "Principles of component-based design of intelligent agents," Data and Knowledge Engineering, vol. 41, no. 1, pp. 1-28, 2002.
[5] H. J. Goradia and J. M. Vidal, "Building blocks for agent design," in Proc. 4th Int. Workshop on Agent-Oriented Software Engineering, 2003, pp. 17-30.
[6] R. Krutisch, P. Meier, and M. Wirsing, "The AgentComponent approach, combining agents and components," in Proc. 1st German Conf. Multiagent System Technologies, 2003, pp. 1-12.
[7] D. Moore, W. Wright, and R. Kilmer, "Control surfaces for Cougaar," in Proc. First Open Cougaar Conference, 2004, pp. 37-44.
[8] W. Peng, V. Manikonda, and S. Kumara, "Understanding agent societies using distributed monitoring and profiling," in Proc. First Open Cougaar Conference, 2004, pp. 53-60.
[9] H. Gupta, Y. Hong, H. P. Thadakamalla, V. Manikonda, S. Kumara, and W. Peng, "Using predictors to improve the robustness of multi-agent systems: Design and implementation in Cougaar," in Proc. First Open Cougaar Conference, 2004, pp. 81-88.
[10] D. Moore, A. Helsinger, and D. Wells, "Deconfliction in ultra-large MAS: Issues and a potential architecture," in Proc. First Open Cougaar Conference, 2004, pp. 125-133.
[11] R. D. Snyder and D. C. Mackenzie, "Cougaar agent communities," in Proc. First Open Cougaar Conference, 2004, pp. 143-147.
[12] J. Regehr, "Some guidelines for proportional share CPU scheduling in general-purpose operating systems," presented as a work in progress at the 22nd IEEE Real-Time Systems Symposium, London, UK, Dec. 3-6, 2001.
[13] I. Stoica, H. Abdel-Wahab, J. Gehrke, K. Jeffay, S. K. Baruah, and C. G. Plaxton, "A proportional share resource allocation algorithm for real-time, time-shared systems," in Proc. 17th IEEE Real-Time Systems Symposium, 1996, pp. 288-299.
[14] C. A. Waldspurger and W. E. Weihl, "Lottery scheduling: Flexible proportional-share resource management," in Proc. First Symposium on Operating Systems Design and Implementation, 1994, pp. 1-11.
[15] C. Waldspurger and W. Weihl, "Stride scheduling: Deterministic proportional-share resource management," Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Tech. Rep. MIT/LCS/TM-528, 1995.
[16] C. Waldspurger, "Lottery and stride scheduling: Flexible proportional-share resource management," Ph.D. dissertation, Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1995.
[17] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intelligence, vol. 72, pp. 81-138, 1995.
[18] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems, vol. 12, no. 2, pp. 19-22, 1992.
[19] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[20] J. B. Rawlings, "Tutorial overview of model predictive control," IEEE Control Systems, vol. 20, no. 3, pp. 38-52, 2000.
[21] M. Morari and J. H. Lee, "Model predictive control: Past, present and future," Computers and Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999.
[22] M. Nikolaou, "Model predictive controllers: A critical synthesis of theory and industrial needs," Advances in Chemical Engineering Series, Academic Press, 2001.
[23] S. J. Qin and T. A. Badgwell, "A survey of industrial model predictive control technology," Control Engineering Practice, vol. 11, pp. 733-764, 2003.
[24] T. Gonzalez and S. Sahni, "Flowshop and jobshop schedules: Complexity and approximation," Operations Research, vol. 26, pp. 36-52, 1978.
[25] J. K. Lenstra, A. H. G. Rinnooy Kan, and P. Brucker, "Complexity of machine scheduling problems," Annals of Discrete Mathematics, vol. 1, pp. 343-362, 1977.
[26] S. Jha and J. M. Wing, "Survivability analysis of networked systems," in Proc. 23rd Int. Conf. Software Engineering, 2001, pp. 307-317.
[27] A. P. Moore, R. J. Ellison, and R. C. Linger, "Attack modeling for information security and survivability," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Note CMU/SEI-2001-TN-001, 2001.
[28] F. Moberg, "Security analysis of an information system using an attack tree-based methodology," M.S. thesis, Automation Engineering Program, Chalmers University of Technology, Sweden, 2000.
Efficient Method of Quantifying Minimal Completion Time for Component-Based Service Networks: Network Topology and Resource Allocation *

Seokcheon Lee, Soundar Kumara, and Natarajan Gautam
Department of Industrial and Manufacturing Engineering
The Pennsylvania State University
University Park, PA 16802
{stonesky, skumara, ngautam}@psu.edu
ABSTRACT

In a grid service environment, it is important to be able to agilely quantify the quality of service achievable by each alternative composition of resources and services. This capability is an essential driver not only for utilizing the resources and services efficiently, but also for promoting the virtual economy. In this paper, we develop such a method of quantifying the minimal completion time for component-based service networks whose task flow structure is a combination of intra-service and inter-service task flows. The performance of the network is a function of network topology and resource allocation. Network topology assigns components to available machines, and resource allocation allocates the resources of each machine to the residing components. Though similar problems can be found in the multiprocessor scheduling literature, our problem is different especially because a component in our networks can have multiple tasks to process, i.e., a component can process tasks in parallel with its successor or predecessor components. The designed method incorporates the fact that the components in a network can be considered independent under a certain resource allocation policy when the number of tasks of each component is large.

Index Terms: Multiprocessor systems, sequencing and scheduling, network topology, modeling and prediction, optimization
1. Introduction

Individual systems are becoming interoperable by virtue of several enabling technologies. Grid technology provides inexpensive access to large computational resources across institutional boundaries [1]. Services can be composed over the Internet via Web Service technology, creating enormous opportunities for the automation of business processes [2]. OGSA (Open Grid Services Architecture: http://www.globus.org/ogsa/) defines a grid system

* This work was supported, in part, by DARPA (Grant #: MDA972-01-1-0038) under the UltraLog program.
architecture based on both the Grid and Web Service technologies. The Grid Service enables the integration of resources and services across distributed, heterogeneous, dynamic virtual organizations [3]. Cost and quality considerations may force a large number of customers to look for resources and services via such an architecture to deal with their own computing problems. Ubiquitous computing technology embeds computers in various objects and places for sensing and controlling environments [4]. As this technology becomes realized and gives rise to complex computing problems, the use of such an architecture might be inevitable.

In a grid service environment, a problem is processed by composing multiple resources and services. As there can be several alternative compositions of resources and services for a given problem, virtual markets will play a critical role in coordinating a huge number of economic entities such as customers, service providers, and resource providers. Various market mechanisms, such as OCEAN [5], Compute Power Market [6], and Nimrod/G [7], have been proposed for the large-scale virtual economy. However, one essential enabler of such markets is the ability to agilely quantify the quality of service (QoS) achievable by each alternative. Without such a capability, the alternatives cannot be valued in a timely manner, and the virtual economy will fail to utilize the resources and services efficiently.

There can be various ways of defining QoS depending on the nature of the problems. We consider a class of problems whose QoS is determined by the completion time for generating a solution. The completion time (also called makespan) is one of the most widely studied objectives for diverse scheduling problems in the contexts of multiprocessing, manufacturing, and project management. Regarding the problem-solving structure, we adopt the component-based architecture as a general framework. A component is a reusable program element. Component technology utilizes components so that developers can build the systems they need by simply
defining their specific roles and wiring them together [8][9]. In service networks with component-based architecture, each component is highly specialized for specific tasks, and the task flow structure between components is a combination of intra-service and inter-service task flows. A problem given to such a network is decomposed in terms of root tasks for some components, and those tasks are propagated through the task flow structure to other components. As a problem can be decomposed with respect to space, time, or both, a component can have multiple root tasks that can be considered independent and identical in nature. One can imagine a wide range of scientific and engineering problems that can be solved by such a network.

In this paper, we develop an efficient method of quantifying the minimal completion time for component-based service networks. For a given set of resources and services, the performance can vary depending on the way the distributed heterogeneous resources are utilized. Network topology assigns components to available machines subject to a set of constraints. The components of a web service may not be separable onto different machines, and a web service may be restricted to specific machines. Though mobile code provides great flexibility for creating distributed systems, there are technical challenges, such as security, to fulfilling its promise [10]-[13]. Given a network topology, there can be multiple components in a machine sharing the machine's resources. So, resource allocation can play an important role in controlling the performance of a network. These two control facilities determine the performance of a network, and the minimal completion time represents the QoS achievable by a set of resources and services.

Similar problems can be found in the multiprocessor scheduling literature¹. There is a set of components with a task flow structure between them, and each component without predecessors has one root task. Each component processes exactly one task, only after all of its predecessors complete their tasks. A multiprocessor schedule is composed of an assignment of components

¹ We adapt the terms used in multiprocessor scheduling to our context throughout this paper.
to machines (network topology) and a sequence of components for each machine (resource allocation). However, our problem is different especially because a component in our networks can have multiple tasks to process, i.e., a component can process tasks in parallel with its successors or predecessors. The easiest multiprocessor scheduling problem is the one in which components are independent, i.e., there is no task flow between components; yet even this problem is known to be NP-complete [14][15]. Considering that the task flow structure of our networks is arbitrary and that each component can have multiple tasks to process, our scheduling problem is even harder.

In this context, the method designed in this paper is a heuristic applicable to the cases where the number of tasks to be processed by each component is large. Though the increase in the number of tasks adds more complexity, it also gives us a great opportunity to develop an efficient heuristic. Our method also addresses resource reservation. When different applications share resources, their performance can be guaranteed through resource reservation. The method quantifies the minimal completion time by incorporating the resource reservations of other applications, and it also enables making resource reservations for the service network under consideration.

The organization of this paper is as follows. In Section 2 we formally define the problem in detail. After designing the method in Section 3, we show empirical results in Section 4. Finally, we discuss implications and possible extensions of our work in Section 5.

2. Problem statement

In this section we formally define the problem by detailing the component-based service network, network topology, and resource allocation. We focus on computational CPU resources, assuming that the system is computation-bound.
2.1 Component-based service network

A network is composed of a set I = {i: i∈I} of components and a task flow structure between them. The task flow structure of the network, which defines precedence relationships between components, is an arbitrary directed acyclic graph. A problem given to a network is decomposed in terms of root tasks for some components, and those tasks are propagated through the task flow structure. Each component processes one of the tasks in its queue (which holds root tasks as well as tasks from predecessor components) and then sends it to its successor components. We denote the number of root tasks of component i as rt_i. There is a set K = {k: k∈K} of available machines, and P_i(k) represents the CPU time per task of component i at machine k, reflecting the computation speed differences between machines.

Fig. 1 shows an example network composed of four components in three machines. In the figure, ⟨rt_i, P_i(k)⟩ denotes rt_i and P_i(k) at the residing machine, respectively. Components I_1 and I_2 reside in machine K_1 and each of them has 100 root tasks. I_3 in K_2 and I_4 in K_3 have no root tasks, but they have 200 and 100 tasks, respectively, from their corresponding predecessors.
[Figure: four components I_1 and I_2 in machine K_1, I_3 in K_2, and I_4 in K_3, each labeled ⟨rt_i, P_i(k)⟩]
Fig. 1. An example network composed of four components in three machines. ⟨rt_i, P_i(k)⟩ denotes the number of root tasks and the CPU time per task at the residing machine.
2.2 Network topology

Considering that the components of a web service may not be separable onto different machines, we define a set J = {j: j∈J} of clusters and denote the components of a cluster j as M_j. Each component is a member of one of the clusters, and the components in a cluster must be assigned to the same machine. Each cluster can be assigned to a set of machines, and we denote the assignable machine set of cluster j as N_j. We define the topology variable set X = {x_jk: j∈J, k∈K}, in which x_jk is 1 if cluster j is assigned to machine k and 0 otherwise. The constraints on the topology variables are as in (1).
• Network topology constraints

  Σ_{k∈N_j} x_jk = 1    for all j ∈ J
  Σ_{k∉N_j} x_jk = 0    for all j ∈ J
  x_jk ∈ {0, 1}         for all j ∈ J and k ∈ K        (1)
2.3 Resource allocation<br />
When there are multiple components in a machine, a network can control its behavior through<br />
resource allocation. In the example network, machine K 1 has two components <strong>and</strong> the system<br />
performance depends on its resource allocation to these two components. There are several CPU<br />
scheduling algorithms for allocating a CPU resource amongst multiple threads. Among the<br />
scheduling algorithms, proportional CPU share (PS) scheduling is known for its simplicity,<br />
flexibility, <strong>and</strong> fairness [16]. In PS scheduling threads are assigned weights <strong>and</strong> resource shares<br />
are determined proportional to the weights [17]. Excess CPU time from some threads is allocated<br />
fairly to other threads. There are many PS scheduling algorithms such as Weighted Round-Robin<br />
scheduling, Lottery scheduling, <strong>and</strong> Stride scheduling [18]-[20].<br />
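To illustrate how a PS scheduler realizes weight-proportional shares, the following sketch implements the core loop of stride scheduling in the spirit of [20]: each thread's "pass" value advances by a stride inversely proportional to its weight, and the thread with the smallest pass runs next. The scaling constant and thread names are illustrative, not taken from the cited work.

```python
import heapq
from collections import Counter

def stride_schedule(weights, quanta):
    """Allocate `quanta` CPU quanta among threads so that the share each
    thread receives is proportional to its weight."""
    STRIDE1 = 1_000_000                       # scaling constant; stride = STRIDE1 / weight
    # heap entries: (pass, stride, name); the smallest pass runs next
    heap = [(STRIDE1 // w, STRIDE1 // w, name) for name, w in weights.items()]
    heapq.heapify(heap)
    received = Counter()
    for _ in range(quanta):
        pass_, stride, name = heapq.heappop(heap)
        received[name] += 1                   # thread runs for one quantum
        heapq.heappush(heap, (pass_ + stride, stride, name))
    return received

shares = stride_schedule({"A": 3, "B": 1}, 400)   # A receives 3x B's quanta
```

The deterministic pass accounting is what distinguishes stride scheduling from its randomized counterpart, lottery scheduling, which draws a weighted random winner each quantum.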
We adopt PS scheduling as the resource allocation scheme because of its generality, in addition to the benefits mentioned above. We define the resource allocation variable set w = {w_i(t): i∈I, t≥0}, in which w_i(t) is a non-negative weight of component i at time t. We denote the components assigned to machine k as S_I[k] and the clusters assigned to machine k as S_J[k]. If ω_k^a of the total managed weight ω_k is available to assign in machine k (i.e., ω_k − ω_k^a is reserved by other applications), the constraints on the resource allocation variables for a given topology are as in (2).

• Resource allocation constraints

  Σ_{i∈S_I[k]} w_i(t) ≤ ω_k^a    for all k ∈ K        (2)
2.4 Problem definition

As the completion time T is a function of network topology (X) and resource allocation (w), the objective is to quantify the minimal completion time T* represented in (3), subject to the constraints of (1) and (2).

  T* = Min_{X,w} T.        (3)
3. Minimal completion time

As stated earlier, we design a method of quantifying the minimal completion time by limiting ourselves to the cases where the number of tasks to be processed by each component is large. In this section, we investigate the impact of this largeness on the optimal resource allocation for a given topology. Then, we formulate the problem by incorporating network topology and provide a heuristic algorithm for solving the problem formulation.

3.1 Optimal resource allocation

For a given topology, we define the Load Index LI_i, which represents component i's total CPU time required to process its tasks. As a component needs to process its own root tasks as well as incoming tasks from its predecessors, its number of tasks L_i is identified as in (4), where Γ_i denotes the set of immediate predecessors of component i. Then, by denoting the CPU time per task in the given topology as P_i, LI_i is represented as in (5).

  L_i = rt_i + Σ_{a∈Γ_i} L_a        (4)

  LI_i = L_i P_i.        (5)
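Equations (4) and (5) amount to a single pass over the DAG. A sketch for the example of Fig. 1, assuming one task flow consistent with the description (I_1, I_2 → I_3 and I_2 → I_4); the exact edges and the P_i values here are illustrative:

```python
# Number of tasks L_i (eq. 4): own root tasks plus the task counts of all
# immediate predecessors; load index LI_i (eq. 5): L_i * P_i.
root_tasks = {"I1": 100, "I2": 100, "I3": 0, "I4": 0}
predecessors = {"I1": [], "I2": [], "I3": ["I1", "I2"], "I4": ["I2"]}
cpu_per_task = {"I1": 2, "I2": 1, "I3": 3, "I4": 2}   # illustrative P_i

def task_counts(root_tasks, predecessors):
    L = {}
    def count(i):                          # memoized recursion over the DAG
        if i not in L:
            L[i] = root_tasks[i] + sum(count(a) for a in predecessors[i])
        return L[i]
    for i in root_tasks:
        count(i)
    return L

L = task_counts(root_tasks, predecessors)     # I3 -> 200, I4 -> 100 tasks
LI = {i: L[i] * cpu_per_task[i] for i in L}   # eq. (5)
```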
To provide a theoretical foundation for optimal resource allocation, we convert a network into a network with infinitesimal tasks. Each root task is divided into r infinitesimal tasks, and each P_i is replaced with P_i/r. Then, the load index of each component is the same as in the original network, but the tasks are infinitesimal. We denote the completion time of the network with infinitesimal tasks as T′. Also, we define a term called task availability as an indicator of relative preference for task arrival patterns. A component's task availability for one arrival pattern is higher than for another if the cumulative number of arrived tasks is larger or equal over time. A component prefers a task arrival pattern with higher task availability, as it can utilize more resource. Consider a network and reconfigure it such that all components have all their tasks in their queues at t=0. Each component has maximal task availability in the reconfigured network, and the completion time of the reconfigured network forms the lower bound T_LB of a network's completion time T, given by:

  T_LB = Max_{k∈K} (ω_k / ω_k^a) Σ_{i∈S_I[k]} LI_i.        (6)

Then, assuming a hypothetical weighted round-robin server² for CPU scheduling, T′ equals T_LB when each machine allocates its resource to the residing components according to (7), where k(i) denotes the machine in which component i resides.

² The hypothetical server has idealized fairness, as the CPU time received by each thread in a round is infinitesimal and proportional to the weight of the thread. This assumption is reasonable because the quantum size is relatively infinitesimal compared to the working horizon in reality.
  w_i(t) ≥ ω_{k(i)} LI_i / T_LB    for all i ∈ I and t ≥ 0        (7)
Proof. A component's instantaneous resource availability RA_i(t), which is the available fraction of a resource when the component requests the resource at time t, is greater than or equal to its assigned weight proportion:

  RA_i(t) ≥ w_i(t) / ω_{k(i)}    for t ≥ 0.        (8)

The service time S_i(t) is the time taken to process a task at time t and is related to RA_i(t) by:

  ∫_t^{t+S_i(t)} RA_i(τ) dτ = P_i.        (9)

Suppose a component i receives its tasks at a constant interval of T_LB/L_i. Then, under the resource allocation in (7), S_i(t) is less than or equal to T_LB/L_i over time, as shown in (10).

  P_i = ∫_t^{t+S_i(t)} RA_i(τ) dτ ≥ ∫_t^{t+S_i(t)} [w_i(τ)/ω_{k(i)}] dτ ≥ [LI_i/T_LB] S_i(t)
  ⇒ T_LB/L_i ≥ S_i(t)        (10)
So, under the resource allocation in (7), any component that receives tasks at a constant interval of T_LB/L_i from t=0 (first task arrival time) can complete by T_LB and generates tasks at a constant interval of T_LB/L_i from t=T_LB/L_i (first task generation time). As tasks are infinitesimal and root tasks increase task availability, each component can receive infinitesimal tasks at a constant interval in 0≤t≤T_LB, or more preferably, and complete at or before T_LB. So, the network completes at T_LB.  □
So, a network can achieve performance close to T_LB under this resource allocation in the limit of a large number of tasks. If machines do not follow this resource allocation, some components can receive their tasks less preferably than at a constant interval, resulting in underutilization and, consequently, an increased completion time. The minimal weights required to achieve T_LB are constant over time, as in (11), and the summation of these weights for each machine forms the required amount ω_k^r of resource reservation in the machine, as in (12). Note that ω_k^r is less than or equal to ω_k^a, satisfying the resource allocation constraints in (2).
• Constant resource allocation

  w_i = ω_{k(i)} LI_i / T_LB    for all i ∈ I and t ≥ 0        (11)

• Resource reservation

  ω_k^r = (ω_k / T_LB) Σ_{i∈S_I[k]} LI_i    for all k ∈ K        (12)
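Equations (6), (11), and (12) translate directly into code. A minimal sketch with made-up inputs (the LI values, ω_k, and ω_k^a below are illustrative, not from the paper's experiments):

```python
# Each machine: residing components' load indices LI_i, total managed
# weight omega_k ("w_total"), and available weight omega_k^a ("w_avail").
machines = {
    "K1": {"LI": {"I1": 100, "I2": 200}, "w_total": 1.0, "w_avail": 0.8},
    "K2": {"LI": {"I3": 200},            "w_total": 1.0, "w_avail": 1.0},
}

# Eq. (6): T_LB = max over k of (omega_k / omega_k^a) * sum of LI_i on k.
T_LB = max(m["w_total"] / m["w_avail"] * sum(m["LI"].values())
           for m in machines.values())                    # 375.0 here

# Eq. (11): constant weight w_i = omega_k(i) * LI_i / T_LB.
weights = {i: m["w_total"] * li / T_LB
           for m in machines.values() for i, li in m["LI"].items()}

# Eq. (12): reservation omega_k^r = (omega_k / T_LB) * sum of LI_i on k.
reservations = {k: m["w_total"] / T_LB * sum(m["LI"].values())
                for k, m in machines.items()}
# The bottleneck machine (K1) reserves exactly its available weight, 0.8.
```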
3.2 Optimal network topology

As the CPU time per task is machine-dependent, we rewrite the load index as a function of the machine:

  LI_i(k) = L_i P_i(k).        (13)

Considering that the components in a cluster cannot be assigned to separate machines, we define the Cluster Load Index CLI_j(k) as:

  CLI_j(k) = Σ_{i∈M_j} LI_i(k).        (14)

Then, under the constant resource allocation, the completion time for a given topology can be estimated by:

  Max_{k∈K} (ω_k / ω_k^a) Σ_{j∈S_J[k]} CLI_j(k).        (15)

Consequently, the minimal completion time T* can be formulated as in (16) by incorporating the topology variables and constraints in (1).
• Topology problem formulation

  T* = Min_X Max_{k∈K} (ω_k / ω_k^a) Σ_{j∈J} CLI_j(k) x_jk
  s.t.
    Σ_{k∈N_j} x_jk = 1    for all j ∈ J
    Σ_{k∉N_j} x_jk = 0    for all j ∈ J
    x_jk ∈ {0, 1}         for all j ∈ J and k ∈ K        (16)
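For small instances, (16) can be solved exactly by enumerating every feasible assignment of clusters to machines. A brute-force sketch, where each cluster's N_j is encoded as the keys of its CLI row (the CLI and weight-ratio values are illustrative):

```python
from itertools import product

# CLI[j][k]: cluster load index of cluster j on machine k (eq. 14),
# restricted to the cluster's assignable machine set N_j.
CLI = {
    "J1": {"K1": 300, "K2": 450},
    "J2": {"K1": 500, "K2": 400},
    "J3": {"K1": 200, "K2": 250},
}
ratio = {"K1": 1.0, "K2": 1.25}   # omega_k / omega_k^a per machine

def exact_T_star(CLI, ratio):
    """Minimize the max machine completion time over all feasible x_jk."""
    best = float("inf")
    clusters = list(CLI)
    for choice in product(*(CLI[j].keys() for j in clusters)):
        load = {k: 0.0 for k in ratio}
        for j, k in zip(clusters, choice):
            load[k] += CLI[j][k]          # sum CLI_j(k) * x_jk on machine k
        T = max(ratio[k] * load[k] for k in ratio)
        best = min(best, T)
    return best

print(exact_T_star(CLI, ratio))   # prints 500.0
```

This enumeration grows exponentially in the number of clusters, which is exactly why the paper recommends a heuristic for realistic instances.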
The formulation has a simple form because it is completely separated from the resource<br />
allocation variables. As a result, it can be mapped onto the simplest multiprocessor<br />
scheduling problem, i.e., the assignment of independent clusters to machines. As discussed, this<br />
problem is NP-complete, and diverse heuristic algorithms are available in the literature.<br />
Eleven heuristics were selected and examined under various problem configurations in [21]:<br />
Opportunistic Load Balancing, Minimum Execution Time, Minimum Completion Time,<br />
Min-min, Max-min, Duplex, Genetic Algorithm, Simulated Annealing, Genetic Simulated<br />
Annealing, Tabu, and A*. Though the Genetic Algorithm always gave the best performance, when<br />
algorithm execution time is also considered, the simple Min-min heuristic was shown to<br />
perform well in comparison to the others. We therefore recommend the Min-min heuristic as the<br />
algorithm for solving the problem formulation. Adapted to our context, the Min-min heuristic is as<br />
follows.<br />
Min-min heuristic algorithm<br />
Step 1: Initialize the set of all unassigned clusters, U←J, and the current machine-level<br />
completion times, mc(k)←0 for all k∈K.<br />
Step 2: Compute the minimal completion time after assignment for each unassigned cluster,<br />
M = { min_{k∈N_j} [ (ω_k/ω_k^a)·CLI_j(k) + mc(k) ] : j∈U }.<br />
Step 3: Select the minimum from M, mmc←min M, and find the corresponding cluster and<br />
machine, c and m respectively.<br />
Step 4: Assign c to m and update mc(m), mc(m)←mmc.<br />
Step 5: Remove c from U.<br />
Step 6: If U=∅ then go to Step 7. Otherwise go to Step 2.<br />
Step 7: T* ← Max_{k∈K} mc(k).<br />
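The steps above can be sketched compactly as follows (input names are illustrative; note that since mmc already includes the machine's current completion time, Step 4 sets mc(m) to mmc rather than adding to it):

```python
def min_min(cli, assignable, omega, omega_a):
    """Min-min heuristic: cli[j][k] is the cluster load index CLI_j(k),
    assignable[j] the set N_j of machines cluster j may be placed on."""
    U = set(cli)                              # Step 1: unassigned clusters
    mc = {k: 0.0 for k in omega}              # machine-level completion times
    assignment = {}
    while U:                                  # Steps 2-6
        best = None
        for j in U:
            # Step 2: minimal completion time after assigning cluster j
            t, k = min(((omega[k] / omega_a[k]) * cli[j][k] + mc[k], k)
                       for k in assignable[j])
            if best is None or t < best[0]:
                best = (t, j, k)              # Step 3: overall minimum
        mmc, c, m = best
        assignment[c] = m                     # Step 4: assign c to m
        mc[m] = mmc
        U.remove(c)                           # Step 5
    return max(mc.values()), assignment       # Step 7: T* = max over machines
```

Each pass assigns one cluster, so the loop runs |J| times over at most |J|·|K| candidates, keeping the heuristic polynomial as required.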
4. Empirical results<br />
We ran several experiments through discrete-event simulation to validate the designed<br />
method. Though we have not considered stochasticity so far, this empirical study also supports the<br />
effectiveness of the method in stochastic environments.<br />
4.1 Network description<br />
The network is composed of eight components in four clusters as in Table 1. The task flow<br />
structure between components is described in Fig. 2. There are three available machines {K1, K2,<br />
K3} with ω_k = ω_k^a = 1 for all k, and each cluster is assignable to any machine.<br />
Table 1. Experimental network parameters<br />
Component  rt_i  P_i(k)^a  Cluster<br />
I1         0     4         J1<br />
I2         0     12        J2<br />
I3         0     4         J3<br />
I4         0     8         J3<br />
I5         200   2         J4<br />
I6         200   2         J4<br />
I7         200   2         J4<br />
I8         200   2         J4<br />
^a for all k∈K<br />
[Fig. 2 diagram omitted in text extraction]<br />
Fig. 2. Experimental task flow structure between eight components. The components are members<br />
of four clusters and each cluster is assignable to any of three machines.<br />
4.2 Performance evaluation<br />
The Min-min heuristic algorithm gives T* = 4800 and the resulting topology is as in Fig. 3(b).<br />
The heuristic solution is equivalent to the exact solution of (16) for this experimental network.<br />
[Fig. 3 diagrams omitted in text extraction: (a) non-optimal topology, (b) optimal topology]<br />
Fig. 3. Experimental network topologies. In (a), clusters J1 and J3 are assigned to machine K1, J2 to<br />
K2, and J4 to K3. In (b), J4 is reassigned to K1 and J3 to K3.<br />
We set up eight different experimental conditions by combining three independent factors as<br />
shown in Table 2. We use the two network topologies of Fig. 3, the non-optimal and the optimal<br />
topology. Two resource allocation policies are used: round-robin allocation and constant<br />
allocation. In round-robin allocation the components on each machine are assigned equal<br />
weights, and in constant allocation weights are assigned according to the components' load<br />
indices as in (11). To implement PS scheduling we use a weighted round-robin scheduling in<br />
which the CPU time received by each component in a round is equal to its assigned weight.<br />
Also, the distribution of P_i(k) can be deterministic or stochastic. In the stochastic case we repeat<br />
each experiment five times.<br />
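The weighted round-robin discipline described above can be simulated on a single machine as below (a simplified sketch with hypothetical inputs; in each round every component with remaining work receives CPU time equal to its weight):

```python
def weighted_round_robin(demands, weights, quantum=1.0):
    """Simulate weighted round-robin on one CPU: demands[i] is component
    i's total required CPU time, weights[i] its per-round slice (times a
    base quantum). Returns each component's finish time."""
    remaining = dict(demands)
    clock, finish = 0.0, {}
    while remaining:
        for i in list(remaining):
            slice_ = weights[i] * quantum
            used = min(slice_, remaining[i])   # may finish mid-slice
            clock += used
            remaining[i] -= used
            if remaining[i] <= 1e-12:
                del remaining[i]               # done: record finish time
                finish[i] = clock
    return finish
```

With equal weights the two components alternate unit slices, which approximates equal proportional shares as the quantum shrinks.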
Table 2. Experimental design<br />
Condition  Topology     Resource allocation  P_i(k)<br />
Con1       Non-optimal  Round-Robin          Deterministic<br />
Con2       Non-optimal  Round-Robin          Exponential<br />
Con3       Non-optimal  Constant             Deterministic<br />
Con4       Non-optimal  Constant             Exponential<br />
Con5       Optimal      Round-Robin          Deterministic<br />
Con6       Optimal      Round-Robin          Exponential<br />
Con7       Optimal      Constant             Deterministic<br />
Con8       Optimal      Constant             Exponential<br />
Numerical results from the experimentation are shown in Table 3. The last two conditions<br />
(Con7 and Con8), which use the optimal network topology and constant resource allocation,<br />
give a performance close to T* and outperform the other conditions significantly. Also, constant<br />
allocation for both the non-optimal (Con3 and Con4) and optimal (Con7 and Con8) topologies gives<br />
a performance superior to round-robin allocation and close to the lower-bound performance T_LB in<br />
both deterministic and stochastic environments. These facts support the optimality of the<br />
constant resource allocation and consequently the validity of the method of quantifying the<br />
minimal completion time.<br />
Table 3. Experimental results<br />
Condition  T_LB  T*    Actual T  %^a<br />
Con1       6400  4800  7215      150.3<br />
Con2       6400  4800  7314      152.4<br />
Con3       6400  4800  6416      133.7<br />
Con4       6400  4800  6404      133.4<br />
Con5       4800  4800  5619      117.1<br />
Con6       4800  4800  5645      117.6<br />
Con7       4800  4800  4820      100.4<br />
Con8       4800  4800  4899      102.1<br />
^a Actual T / T*<br />
4.3 Resource reservation<br />
The resource reservations required in the optimal topology are [ω^r_K1 = 0.667, ω^r_K2 = 1, ω^r_K3 = 1],<br />
computed from (12). Our argument is that the network can achieve the optimal performance T*<br />
with these reservations even though unreserved resources are allocated to other applications. To<br />
validate this, we use eleven different reservations for machine K1 as shown in Table 4. In each<br />
condition, ω^r_K1 is allocated proportionally to the load indices of the residing components, and<br />
unreserved resources are assigned to an application with infinite work (i.e., one that continuously<br />
requires resources). The numerical results are shown in Table 4 and Figs. 4 and 5. Overall, the<br />
completion time decreases as ω^r_K1 increases. However, when ω^r_K1 is greater than 0.667, there is no<br />
significant advantage in the deterministic environment. In contrast, the threshold in the stochastic<br />
environment lies somewhere between 0.667 and 0.7. Considering that the other applications may<br />
not require resources continuously, such a slight difference (≤ 0.033) does not seem to be<br />
significant.<br />
Table 4. The effects of resource reservation<br />
           Actual T<br />
ω^r_K1     Deterministic P_i(k)  Exponential P_i(k)<br />
0.1        32284                 34502<br />
0.2        16132                 16605<br />
0.3        10746                 11069<br />
0.4        8057                  8237<br />
0.5        6439                  6635<br />
0.667      4829                  5135<br />
0.7        4827                  4941<br />
0.8        4824                  4965<br />
0.9        4822                  4984<br />
1.0        4820                  4946<br />
[Fig. 4 plot omitted in text extraction: Actual T (0–35000) versus resource reservation (0.0–1.0)]<br />
Fig. 4. The effects of resource reservation in the deterministic environment. When the resource<br />
reservation in K1 is greater than 0.667, there is no significant decrease of completion time.<br />
[Fig. 5 plot omitted in text extraction: Actual T (0–35000) versus resource reservation (0.0–1.0)]<br />
Fig. 5. The effects of resource reservation in the stochastic environment. When the resource<br />
reservation in K1 is greater than a threshold between 0.667 and 0.7, there is no significant<br />
decrease of completion time.<br />
5. Conclusions<br />
The simple Min-min heuristic algorithm was proposed as a method of quantifying the<br />
minimal completion time for component-based service networks. A network can achieve this<br />
performance under the constant resource allocation in the limit of a large number of tasks. Also,<br />
the performance can be guaranteed with the resource reservations we have formulated. The<br />
designed method is efficient enough to satisfy the requirements for use in a grid service<br />
environment. In spite of its simplicity, the method can quantify the quality of service effectively.<br />
Virtual markets driven by such methods will make timely transactions with desirable<br />
surpluses, leading to a productive virtual economy.<br />
Our work can be extended by taking alternative algorithms into account. Each component can<br />
have alternative algorithms to process a task, which trade off processing time and quality of<br />
solution. While network topology and resource allocation try to utilize limited resources<br />
efficiently, alternative algorithms can change the amount of required resources. As modern<br />
operating environments are highly dynamic, alternative algorithms become an important tool to<br />
achieve portable high performance [22][23]. Quality of service is determined not only by<br />
completion time but also by quality of solution. The open question is how to quantify the optimal<br />
quality of service that can be provided by such a network.<br />
References<br />
[1] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. San Francisco: Morgan Kaufmann Publishers, 1999.<br />
[2] R. Hamadi and B. Benatallah, "A Petri net-based model for web service composition," in Proc. 14th Australasian Database Conf. on Database Technologies, Adelaide, Australia, 2003, pp. 191-200.<br />
[3] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke, "Grid services for distributed system integration," IEEE Computer, vol. 35, no. 6, pp. 37-46, 2002.<br />
[4] M. Weiser, "The computer for the 21st century," Scientific American, vol. 265, no. 3, pp. 94-104, 1991.<br />
[5] P. Padala, C. Harrison, N. Pelfort, E. Jansen, M. P. Frank, and C. Chokkareddy, "OCEAN: The open computation exchange and arbitration network, a market approach to meta computing," in Proc. 2nd Int. Symp. Parallel and Distributed Computing, 2003, pp. 185-192.<br />
[6] R. Buyya and S. Vazhkudai, "Compute power market: Towards a market-oriented grid," in Proc. First IEEE/ACM Int. Symp. Cluster Computing and the Grid, 2001, pp. 574-581.<br />
[7] R. Buyya, D. Abramson, and J. Giddy, "Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid," in Proc. 4th Int. Conf. High Performance Computing in Asia-Pacific Region, 2000, pp. 283-289.<br />
[8] B. Meyer, "On to components," IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.<br />
[9] P. Clements, "From subroutines to subsystems: Component-based software development," in Component-Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press, 1996, pp. 3-6.<br />
[10] D. B. Lange, "Mobile objects and mobile agents: The future of distributed computing?," in Proc. 12th European Conf. Object-Oriented Programming, 1998, pp. 1-12.<br />
[11] D. Schoder and T. Eymann, "The real challenges of mobile agents," Communications of the ACM, vol. 43, no. 6, pp. 111-112, 2000.<br />
[12] D. B. Lange and M. Oshima, "Seven good reasons for mobile agents," Communications of the ACM, vol. 42, no. 3, pp. 88-89, 1999.<br />
[13] D. Chess, C. Harrison, and A. Kershenbaum, "Mobile agents: Are they a good idea?," in Mobile Object Systems: Towards the Programmable Internet, Lecture Notes in Computer Science, vol. 1222, J. Vitek and C. Tschudin, Eds. Springer-Verlag, 1997, pp. 25-47.<br />
[14] O. H. Ibarra and C. E. Kim, "Heuristic algorithms for scheduling independent tasks on nonidentical processors," Journal of the Association for Computing Machinery, vol. 24, no. 2, pp. 280-289, 1977.<br />
[15] D. Fernandez-Baca, "Allocating modules to processors in a distributed system," IEEE Transactions on Software Engineering, vol. 15, no. 11, pp. 1427-1436, 1989.<br />
[16] J. Regehr, "Some guidelines for proportional share CPU scheduling in general-purpose operating systems," presented as a work in progress at the 22nd IEEE Real-Time Systems Symposium, London, UK, Dec. 3-6, 2001.<br />
[17] I. Stoica, H. Abdel-Wahab, J. Gehrke, K. Jeffay, S. K. Baruah, and C. G. Plaxton, "A proportional share resource allocation algorithm for real-time, time-shared systems," in Proc. 17th IEEE Real-Time Systems Symposium, 1996, pp. 288-299.<br />
[18] C. A. Waldspurger and W. E. Weihl, "Lottery scheduling: Flexible proportional-share resource management," in Proc. First Symposium on Operating Systems Design and Implementation, 1994, pp. 1-11.<br />
[19] C. Waldspurger and W. Weihl, "Stride scheduling: Deterministic proportional-share resource management," Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Tech. Rep. MIT/LCS/TM-528, 1995.<br />
[20] C. Waldspurger, "Lottery and stride scheduling: Flexible proportional share resource management," Ph.D. dissertation, Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1995.<br />
[21] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund, "A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems," Journal of Parallel and Distributed Computing, vol. 61, pp. 810-837, 2001.<br />
[22] M. O. McCracken, A. Snavely, and A. Malony, "Performance modeling for dynamic algorithm selection," in Proc. Int. Conf. Computational Science, 2003, pp. 749-758.<br />
[23] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A. Quilici, D. S. Rosenblum, and A. L. Wolf, "An architecture-based approach to self-adaptive software," IEEE Intelligent Systems, vol. 14, no. 3, pp. 54-62, 1999.<br />
Manuscript for IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 1<br />
MARKET-BASED MODEL PREDICTIVE CONTROL FOR LARGE-SCALE<br />
INFORMATION NETWORKS: COMPLETION TIME AND VALUE OF SOLUTION<br />
Seokcheon Lee, Soundar Kumara, and Natarajan Gautam<br />
Department of Industrial and Manufacturing Engineering<br />
The Pennsylvania State University<br />
University Park, PA 16802<br />
{stonesky, skumara, ngautam}@psu.edu<br />
ABSTRACT<br />
There are several important properties of modern software systems. They tend to be large-scale<br />
with distributed and component-based architectures. Also, the dynamic nature of operating<br />
environments leads them to utilize alternative algorithms. On the other hand, these<br />
properties make it hard to provide appropriate control mechanisms because of the increased<br />
complexity. Components share resources, and each component can have alternative<br />
algorithms. As a result, the behavior of a software system can be controlled through resource<br />
allocation as well as algorithm selection. This novel control problem is worth investigating in<br />
order to reap the benefits of both properties. In this paper we design a scalable control<br />
mechanism for such systems. The quality of service we consider is determined by the value<br />
of the solution and the time for generating the solution to a given problem. We build a mathematical<br />
programming model that trades off these two conflicting objectives and decentralize the model<br />
through an auction market. By periodically opening the auction market for each current system<br />
state, a closed-loop policy is formed. We verify the designed control mechanism empirically.<br />
Index Terms: Distributed applications, modeling and prediction, optimization, scalability<br />
1. Introduction<br />
The growth in complexity and size of software systems due to automation or organizational<br />
integration is leading to the increasing importance of distributed and component-based<br />
architectures. Distributed computing aims at using the computing power of machines connected by a<br />
network. When a task requires intensive computation, it becomes a natural choice for achieving high<br />
performance. A component is a reusable program element. Component technology utilizes<br />
components so that developers can build the systems they need simply by defining the components'<br />
specific roles and wiring them together [1][2]. In networks with a component-based architecture, each<br />
component is highly specialized for specific tasks. Another emerging technology is adaptive<br />
software [3][4]. Adaptive software has alternative algorithms for the same numerical problem<br />
and a switching function for selecting the best algorithm in response to environmental changes.<br />
As modern operating environments are highly dynamic, adaptive software becomes an important<br />
tool for achieving portable high performance.<br />
We study a large-scale information network (with respect to the number of components as<br />
well as machines) comprising distributed software components linked together through a task<br />
flow structure. A problem given to the network is decomposed in terms of root tasks for some<br />
components, and those tasks are propagated through the task flow structure to other components.<br />
As a problem can be decomposed with respect to space, time, or both, a component can have<br />
multiple root tasks that can be considered independent and identical in nature. The service<br />
provided by the network is to produce a global solution to a given problem, which is an<br />
aggregate of the partial solutions of individual tasks. Each component can have alternative<br />
algorithms to process a task, which trade off processing time and value of partial solution. Quality<br />
of Service (QoS) of the network is determined by the value of the global solution and the time for<br />
generating the global solution (i.e., completion time). For a given topology, the network can control<br />
its behavior by utilizing two different kinds of control actions: algorithm selection and resource<br />
allocation. While resource allocation tries to utilize limited resources efficiently, algorithm<br />
selection can change the amount of required resources. The resource allocation we are addressing<br />
here is allocating the resources of each machine to the residing components for a given topology. As<br />
problems are decomposed in various ways depending on their nature and size, and their QoS<br />
functions are context-dependent, the network needs to provide adaptive solutions to given<br />
problems by utilizing such control actions.<br />
One can imagine a wide range of scientific and engineering problems that can be solved by<br />
such a network. UltraLog (http://www.ultralog.net) networks, implemented in Cougaar<br />
(Cognitive Agent Architecture: http://www.cougaar.org) developed by DARPA (Defense<br />
Advanced Research Projects Agency), are instances [5]-[9]. Each agent in these networks<br />
represents an organization of a military supply chain and has a set of components specialized by<br />
functionality (allocation, expansion, inventory management, etc.) and class (ammunition,<br />
water, fuel, etc.). The objective of an UltraLog network is to provide an appropriate logistics plan<br />
for a given military operational plan. A logistics plan is a global solution which is an aggregate<br />
of individual schedules built by components. An operational plan is decomposed into logistics<br />
requirements of each thread for each agent, and a requirement is further decomposed into root<br />
tasks (one task per day) for a designated component. As a result, a component can have hundreds<br />
of root tasks depending on the horizon of an operation, and thousands of tasks to process as the<br />
root tasks are propagated. As the scale of operation increases, there can be thousands of agents<br />
(tens of thousands of components) on hundreds of machines working together to generate a<br />
logistics plan. QoS of these networks is determined by the quality of the logistics plan (value of<br />
solution) and (plan) completion time. These two metrics directly affect the performance of the<br />
operation.<br />
In this paper we design a control mechanism for such novel networks. We stress scalability,<br />
with respect to computational complexity as well as communication overhead, as an important<br />
consideration of the control mechanism for its practical use. The control mechanism should be<br />
able to supply an appropriate control policy in a timely manner even when the size of the network<br />
is large. Such a property is especially important when completion time is an explicit<br />
consideration, as in our control problem. However, the property is hard to achieve in general if<br />
one pursues an exactly optimal policy. Therefore, we design a scalable control mechanism by<br />
sacrificing some amount of optimality in a systematic way, as follows.<br />
First, we adopt Model Predictive Control (MPC) as our control framework. In MPC, for each<br />
current state, an optimal open-loop control policy is designed over a finite-time horizon by solving a<br />
static mathematical programming model [10]-[13]. The design process is repeated for the next<br />
observed state feedback, forming a closed-loop policy reactive to each current system state.<br />
Though MPC does not give an absolutely optimal policy in stochastic environments, the periodic<br />
design process alleviates the impact of stochasticity. Note that technologies such as Dynamic<br />
Programming are not efficient in terms of computational complexity, as they try to give an optimal<br />
closed-loop control policy. Second, under the MPC framework, we build a heuristic programming<br />
model to contain computational complexity. The heuristic model is solvable in polynomial time and<br />
its solution converges to the solution of the exact model in the limit of a large number of tasks. Third,<br />
we provide a decentralized coordination mechanism for solving the programming model.<br />
Computations and communications are distributed to multiple entities through an auction market<br />
while giving a solution equivalent to the solution of the programming model.<br />
The organization of this paper is as follows. In Section 2 we formally define the problem in<br />
detail. After designing the control mechanism in Sections 3 and 4, we show empirical results in<br />
Section 5. Finally, we discuss implications and possible extensions of our work in Section 6.<br />
2. Problem specification<br />
In this section we formally define the control problem by detailing the network configuration and<br />
control actions. We focus on computational CPU resources, assuming that the system is<br />
computation-bound.<br />
2.1 Network configuration<br />
A network is composed of a set of components A and a set of nodes (i.e., machines) N. K_n<br />
denotes the set of components that reside in node n, sharing the node's CPU resource. The task flow<br />
structure of the network, which defines precedence relationships between components, is an<br />
arbitrary directed acyclic graph. A problem given to the network is decomposed in terms of root<br />
tasks for some components, and those tasks are propagated through the task flow structure. Each<br />
component processes one of the tasks in its queue (which holds root tasks as well as tasks from<br />
predecessor components) and then sends it to its successor components. We denote the number of<br />
root tasks of component i as rt_i. Fig. 1 shows an example network in which there are four<br />
components residing in three nodes. Components A1 and A2 reside in N1 and each of them has<br />
100 root tasks. A3 in N2 and A4 in N3 have no root tasks, but they have 200 and 100 tasks<br />
respectively from the corresponding predecessors.<br />
[Fig. 1 diagram omitted in text extraction: four components A1–A4 on nodes N1–N3]<br />
Fig. 1. An example network<br />
2.2 Control actions<br />
The network can utilize two different kinds of control actions in controlling its behavior:<br />
algorithm selection and resource allocation.<br />
Algorithm selection<br />
A component can use one of several alternative algorithms to process a task. Different alternatives<br />
trade off CPU time and value of solution, with more CPU time resulting in higher solution value.<br />
As one can find optimal mixed alternatives, a component has a monotonically increasing<br />
piecewise-linear convex function, called the value function, giving CPU time as a function of value.<br />
We call the value in this function the value mode, which a component can select as its decision variable.<br />
A value function is defined by three elements as ⟨f_i(v_i), v_i(min), v_i(max)⟩ as shown in Fig. 1.<br />
This function indicates that component i's expected CPU time¹ to process a task is f_i(v_i) with a<br />
value mode v_i, where v_i(min) ≤ v_i ≤ v_i(max). We assume that components cannot change the mode for a<br />
task in process.<br />
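A piecewise-linear value function of this kind can be represented by its breakpoints; the sketch below (class name and breakpoint values are hypothetical, not from the paper) interpolates the expected CPU time f_i(v_i) for any value mode in [v_i(min), v_i(max)]:

```python
import bisect

class ValueFunction:
    """Piecewise-linear value function of a component: maps a value mode
    v in [v_min, v_max] to expected CPU time f(v). Defined by sorted
    breakpoints (value, cpu_time); convexity holds if slopes increase."""
    def __init__(self, breakpoints):
        self.points = sorted(breakpoints)
        self.v_min = self.points[0][0]
        self.v_max = self.points[-1][0]

    def cpu_time(self, v):
        if not (self.v_min <= v <= self.v_max):
            raise ValueError("value mode out of range")
        xs = [p[0] for p in self.points]
        i = bisect.bisect_right(xs, v)
        if i == len(xs):                      # v == v_max
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (y1 - y0) * (v - x0) / (x1 - x0)   # linear segment
```

A component would pick its value mode v_i and report f_i(v_i) as its per-task CPU demand.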
Resource allocation<br />
When there are multiple components in a node, the network needs to control its behavior<br />
through resource allocation. In the example network, node N1 has two components and the<br />
¹ The distribution of CPU time can be arbitrary, though we use only the expected CPU time.<br />
system performance can depend on its resource allocation to these two components. There are<br />
several CPU scheduling algorithms for allocating a CPU resource amongst multiple threads.<br />
Among them, proportional CPU share (PS) scheduling is known for its<br />
simplicity, flexibility, and fairness [14]. In PS scheduling, threads are assigned weights and<br />
resource shares are determined in proportion to the weights [15]. Excess CPU time from some<br />
threads is allocated fairly to other threads. There are many PS scheduling algorithms, such as<br />
Weighted Round-Robin scheduling, Lottery scheduling, and Stride scheduling [16]-[18]. We<br />
adopt PS scheduling as the resource allocation scheme because of its generality in addition to the<br />
benefits mentioned above. We define the resource allocation variable set w = {w_i(t): i∈A, t≥0}, in<br />
which w_i(t) is a non-negative weight of component i at time t. If the total managed weight of a node<br />
n is ω n , the boundary condition for assigning weights over time can be described as:<br />
∑<br />
i∈K<br />
n<br />
wi<br />
( t)<br />
= ωn<br />
where wi<br />
( t)<br />
≥ 0 . (1)<br />
2.3 Problem definition<br />
The service provided by the network is to produce a global solution to a given problem, which<br />
is an aggregate of the partial solutions of individual tasks. QoS of the network is determined by the<br />
value of the global solution and the cost of completion time. The value of the global solution is the<br />
summation of partial solution values, and the cost of completion time is determined by a cost<br />
function CCT(T), which is monotonically increasing in the completion time T. We<br />
assume that the solution values and cost are represented in a common unit². Consider v_i^d as the<br />
value mode used to process the d-th task of component i, and e_i the number of tasks processed by<br />
component i up to completion. Then, the control objective is to maximize QoS by utilizing<br />
² Relative importance can be considered by scaling the functions, and it results in the same function structures.<br />
algorithm selection (v) and resource allocation (w), as in (2). As stated earlier, we design a<br />
scalable control mechanism to achieve this objective in the framework of MPC by building a<br />
mathematical programming model and decentralizing it.<br />
arg max_{v,w} Σ_{i∈A} Σ_{d=1}^{e_i} v_i^d − CCT(T) (2)<br />
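Objective (2) is straightforward to evaluate for a finished run; a small sketch (the cost function and the per-task value modes are hypothetical inputs):

```python
def qos(value_modes, cct, T):
    """Evaluate objective (2): the sum of value modes v_i^d over all
    tasks processed by each component i, minus the cost CCT(T) of the
    completion time T."""
    total_value = sum(sum(modes) for modes in value_modes.values())
    return total_value - cct(T)
```
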
3. Mathematical programming model<br />
The mathematical programming model is essentially a scheduling problem formulation. There<br />
are a variety of formulations and algorithms available for diverse scheduling problems in the<br />
contexts of multiprocessor, manufacturing, and project management. In general, a scheduling<br />
problem allocates limited resources to a set of tasks to optimize a specific objective. One<br />
widely studied objective is completion time (also called makespan in the manufacturing<br />
literature), as in the problem we consider. Though it is not easy to find a problem exactly the<br />
same as ours, it is possible to convert our problem into one of the known scheduling problems. For<br />
example, in a job shop there are a set of jobs and a set of machines. Each job has a set of serial<br />
operations and each operation must be processed on a specific machine. A job shop scheduling<br />
problem sequences the operations on each machine, subject to the jobs' precedence<br />
constraints, such that the completion time is minimized. When we assign a value mode to each<br />
task, our problem can be transformed exactly into a job shop scheduling problem. However,<br />
scheduling problems are in general intractable. Though the job shop scheduling problem is<br />
polynomially solvable when there are two machines and each job has at most two operations, it becomes<br />
NP-hard in the number of jobs when the number of machines or operations exceeds two<br />
[19][20]. Considering that the task flow structure of our networks is arbitrary, our scheduling<br />
problem is NP-hard on the number of components in general. The increase of the number of<br />
tasks <strong>and</strong> consideration of alternative algorithms make the problem even harder. Moreover, there<br />
can be large number of nodes in our networks.<br />
Though it may be possible to adapt available heuristic algorithms for the job shop scheduling problem to account for alternative algorithms, our scheduling problem has a particular characteristic: the number of tasks per component can be large. While this largeness adds complexity, it also enables an efficient heuristic programming model. In this section, we characterize an optimal resource allocation by analyzing the impact of this largeness and then build a mathematical programming model solvable in polynomial time.
3.1 Optimal resource allocation
Consider current time t = 0 and assume that each component uses a value mode common to all of its tasks (i.e., a pure strategy); we discuss the optimality of pure strategies later in this subsection. We define the load index LI_i as component i's total CPU time required to process its tasks. Since a component processes its own root tasks as well as tasks arriving from its predecessors, its number of tasks L_i is given by (3), where P_i denotes the set of immediate predecessors of component i. LI_i is then given by (4).

    L_i = rt_i + \sum_{a \in P_i} L_a,   (3)

    LI_i = L_i f_i(v_i).   (4)

To provide a theoretical foundation for the optimal resource allocation policy, we convert a network into one whose tasks have infinitesimal processing times: each root task is divided into r infinitesimal tasks and each f_i(v_i) is replaced with f_i(v_i)/r. The load index of each component is then the same as in the original network, but the tasks are infinitesimal. We denote the completion time of the network with infinitesimal tasks by T′. We also define a term called task availability as an indicator of relative preference between task arrival patterns: one arrival pattern gives higher task availability than another if its cumulative number of arrived tasks is greater than or equal to the other's at every point in time. A component prefers an arrival pattern with higher task availability because it can utilize more resource. Now take a network and reconfigure it so that every component has all of its tasks in its queue at t = 0. Each component then has maximal task availability, and the completion time of the reconfigured network forms a lower bound T_LB on the network's completion time T, given by:

    T_LB = \max_{n \in N} \sum_{i \in K_n} LI_i.   (5)
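The quantities in (3)–(5) are straightforward to compute once the task-flow graph is known. The following sketch (function and variable names are ours, not from the paper) propagates task counts in topological order and takes the per-node maximum of summed load indices; it assumes the task-flow network is acyclic, as the formulation requires.

```python
from collections import defaultdict

def lower_bound_completion_time(root_tasks, preds, f, v, node_of):
    """Compute L_i (eq. 3), LI_i (eq. 4) and T_LB (eq. 5).

    root_tasks[i] : number of root tasks rt_i at component i
    preds[i]      : immediate predecessor components of i
    f[i]          : per-task CPU-time function f_i, called as f[i](v[i])
    v[i]          : value mode selected by component i (pure strategy)
    node_of[i]    : node n(i) hosting component i
    """
    # L_i = rt_i + sum of predecessors' task counts, resolved in
    # topological order since the task-flow network is acyclic.
    order = _topological_order(preds)
    L = {}
    for i in order:
        L[i] = root_tasks[i] + sum(L[a] for a in preds[i])
    LI = {i: L[i] * f[i](v[i]) for i in L}          # eq. (4)
    per_node = defaultdict(float)
    for i, li in LI.items():
        per_node[node_of[i]] += li
    return L, LI, max(per_node.values())            # eq. (5)

def _topological_order(preds):
    # Kahn's algorithm over the predecessor lists.
    indeg = {i: len(preds[i]) for i in preds}
    succ = defaultdict(list)
    for i, ps in preds.items():
        for a in ps:
            succ[a].append(i)
    ready = [i for i, d in indeg.items() if d == 0]
    order = []
    while ready:
        i = ready.pop()
        order.append(i)
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return order
```

For a two-component chain where b consumes a's output, L_b automatically includes a's task count, matching (3).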
For the theoretical analysis, we assume a hypothetical weighted round-robin server for CPU scheduling, though this is not strictly required in practice, as will be discussed. The hypothetical server has idealized fairness: the CPU time received by each thread in a round is infinitesimal and proportional to the weight of the thread.
Theorem 1. T′ equals T_LB when each node allocates its resource in proportion to its resident components' load indices:

    w_i(t) = w_i = \frac{LI_i}{\sum_{p \in K_{n(i)}} LI_p} \, \omega_{n(i)}   for all i \in A and t \ge 0,   (6)

where n(i) denotes the node on which component i resides.

Proof. A component's instantaneous resource availability RA_i(t), the fraction of the resource available to the component when it requests the resource at time t, is at least its assigned weight proportion:

    RA_i(t) \ge \frac{w_i(t)}{\omega_{n(i)}}   for t \ge 0.   (7)
The service time S_i(t) is the time taken to process a task started at time t, and it relates to RA_i(t) by:

    \int_t^{t+S_i(t)} RA_i(\tau)\, d\tau = f_i(v_i).   (8)
Suppose a component i receives its tasks at a constant interval of T_LB/L_i. Then, under proportional allocation, S_i(t) is at most T_LB/L_i over time, as shown in (9):

    f_i(v_i) = \int_t^{t+S_i(t)} RA_i(\tau)\, d\tau
             \ge \int_t^{t+S_i(t)} \frac{w_i(\tau)}{\omega_{n(i)}}\, d\tau
             = \frac{w_i}{\omega_{n(i)}} S_i(t)
             = \frac{LI_i}{\sum_{p \in K_{n(i)}} LI_p} S_i(t)
             \ge \frac{LI_i}{T_{LB}} S_i(t)
    \;\Rightarrow\; S_i(t) \le \frac{T_{LB}}{L_i}   for t \ge 0.   (9)
So, under proportional allocation, any component that receives its tasks at a constant interval of T_LB/L_i from t = 0 (its first task arrival time) can complete by T_LB and generates tasks at a constant interval of T_LB/L_i from t = T_LB/L_i (its first task generation time). Because the tasks are infinitesimal and root tasks only increase task availability, each component can receive its infinitesimal tasks at a constant interval over 0 \le t \le T_LB, or more preferably, and complete at or before T_LB. Hence the network completes at T_LB under proportional allocation. ∎
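As a concrete illustration of the proportional policy in (6), a node's share ω_n can be split among its resident components in proportion to their load indices. A minimal sketch (names are ours):

```python
def proportional_weights(LI, node_of, omega):
    """Split each node's CPU share omega[n] among its resident
    components in proportion to their load indices, per eq. (6)."""
    totals = {}
    for i, li in LI.items():
        totals[node_of[i]] = totals.get(node_of[i], 0.0) + li
    return {i: li / totals[node_of[i]] * omega[node_of[i]]
            for i, li in LI.items()}
```

A component holding 40% of its node's total load index thus receives 40% of that node's CPU share, independent of how many components the node hosts.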
From Theorem 1 we conjecture that a network can achieve performance close to T_LB under proportional allocation in the limit of a large number of tasks. If nodes do not follow the proportional allocation policy, some components receive their tasks less preferably than at a constant interval, resulting in underutilization and, consequently, increased completion time. It is also optimal for each component to use a pure strategy: in the network with maximal task availability, each component's optimal strategy is pure due to the convexity of the value functions, and the network achieves the optimal performance under proportional allocation. Although we assumed a hypothetical weighted round-robin server that is difficult to realize in practice, the arguments remain sound because they rest on worst-case analysis, and in reality the quantum size is negligible compared with the working horizon.
3.2 Programming model
As discussed, each component's optimal strategy is a pure strategy, and the completion time T is close to T_LB under proportional resource allocation in the limit of a large number of tasks. Now consider the current time to be t. To keep the load index up to date as the system runs, we modify it slightly to represent the total CPU time for the remaining tasks:

    LI_i(t) = R_i(t) + L_i(t) f_i(v_i),   (10)

where R_i(t) denotes the remaining CPU time of the task in process and L_i(t) the number of remaining tasks excluding the one in process. After identifying the initial number of tasks L_i(0) = L_i, each component updates L_i(t) by counting down as it processes tasks.
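The countdown bookkeeping behind (10) can be sketched as a small per-component state holder (a hypothetical illustration; the paper does not prescribe an implementation):

```python
class ComponentLoad:
    """Track the remaining load index LI_i(t) of eq. (10).

    R: remaining CPU time of the task in process.
    L: count of queued tasks, excluding the one in process."""

    def __init__(self, initial_tasks, per_task_time):
        self.L = initial_tasks        # L_i(0) = L_i
        self.R = 0.0                  # no task in process yet
        self.f = per_task_time        # f_i(v_i) for the current mode

    def start_task(self):
        # Count down the queue as a task is taken into process.
        if self.L > 0 and self.R == 0.0:
            self.L -= 1
            self.R = self.f

    def run(self, cpu_time):
        # Consume CPU time on the task in process.
        self.R = max(0.0, self.R - cpu_time)

    def load_index(self):
        return self.R + self.L * self.f   # eq. (10)
```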
Then, under proportional resource allocation, the completion time T can be estimated as:

    T - t \approx \max_{n \in N} \sum_{i \in K_n} [R_i(t) + L_i(t) f_i(v_i)].   (11)
This estimate leads to a programming model in a straightforward way. Given a completion time T, it is optimal for a node n to select modes by:

    \max \sum_{i \in K_n} L_i(t)\, v_i   (12)

subject to

    \sum_{i \in K_n} [R_i(t) + L_i(t) f_i(v_i)] \le T - t.   (13)
Consequently, the programming model is formulated with two sub-models: an optimization model (14) and a resource allocation model (15). The optimization model maximizes QoS by trading off the value of the solution against the cost of completion time, and the resource allocation model allocates resources in proportion to the load indices of the resident components based on the solution of (14).
Programming model

    \max \sum_{i \in A} L_i(t)\, v_i - CCT(T)
    s.t. \sum_{i \in K_n} [R_i(t) + L_i(t) f_i(v_i)] \le T - t   for all n \in N,
         v_{i(min)} \le v_i \le v_{i(max)}   for all i \in A.   (14)

    w_i^* = \frac{R_i(t) + L_i(t) f_i(v_i^*)}{\sum_{p \in K_{n(i)}} [R_p(t) + L_p(t) f_p(v_p^*)]} \, \omega_{n(i)}.   (15)
The optimal QoS obtained from (14) with t = 0 forms an upper bound QoS_UB, and a network can achieve performance close to QoS_UB in the limit of a large number of tasks. The programming model is efficient in terms of complexity because the two kinds of control action are completely separated; it is solvable in polynomial time, as discussed in the next section.
4. Decentralization

The next question is how to decentralize the mathematical programming model. Centralized control mechanisms scale poorly because computational and communication overheads grow rapidly with system size, and a single point of failure at the controller can bring down the complete system, making the network non-robust. Decentralization addresses these issues by distributing the computations and communications to multiple entities in the system. There are two popular methods for decentralizing structured programming models: decomposition methods and auction/bidding algorithms. Given the compatible structure of our programming model, we decentralize it through a non-iterative auction mechanism, the so-called multiple-unit auction with variable supply [21]. In this auction, a seller may be able and willing to adjust the supply as a function of the bidding.
4.1 Auction market design

In the programming model we have built, all nodes and components are coupled with each other. However, the model has a special structure: the objective function and constraints become separable by node once the single variable T is fixed. This characteristic makes it possible to solve the model through an auctioning process for T. The completion time T is an unbounded resource whose supply can be adjusted as a function of the bidding.

To design the auction market we define two types of participant in addition to the components: the seller and the resource managers. There is one seller in the system, which determines T^* based on the bids from the resource managers. Each node's resource manager manages the node's resource and arbitrates between its components and the seller.

We define T_i as the resource available to component i, which is required minimally in the amount T_{i(min)} as in (16) and maximally T_{i(max)} as in (17).
    T_{i(min)} = R_i(t) + L_i(t) f_i(v_{i(min)}),   (16)

    T_{i(max)} = R_i(t) + L_i(t) f_i(v_{i(max)}).   (17)
A component i bids to its resource manager with its maximal value as a function of T_i, as in (18). The resource manager bids to the seller with the maximal total value of its components as a function of T, based on the bids from its components, as in (19). The seller decides T^* based on the bids from the resource managers, taking into account the cost of T, as in (20). After the seller broadcasts T^*, each resource manager decides T_i^* and w_i^* as in (21) and (22). In (21), T_i^* is at most the maximally required resource T_{i(max)}, so that the resource can be allocated in proportion to the components' load indices. Each component selects its optimal value mode within the limit of T_i^*, as in (23). This auctioning process yields a solution equivalent to the programming model.
Auctioning model

Component's bid:

    b_i(T_i) = -\infty,                                                if T_i < T_{i(min)};
             = L_i(t)\, v_{i(max)},                                    if T_i > T_{i(max)};
             = L_i(t)\, f_i^{-1}\!\left(\frac{T_i - R_i(t)}{L_i(t)}\right),   otherwise.   (18)
Resource manager's bid:

    b_n(T) = -\infty,                                                  if T - t < \sum_{i \in K_n} T_{i(min)};
           = \sum_{i \in K_n} b_i(T_{i(max)}),                         if T - t > \sum_{i \in K_n} T_{i(max)};
           = \max \left\{ \sum_{i \in K_n} b_i(T_i) : \sum_{i \in K_n} T_i \le T - t \right\},   otherwise.   (19)
Seller's decision:

    T^* = \arg\max_T \left[ \sum_{n \in N} b_n(T) - CCT(T) \right].   (20)
Resource manager's decision:

    \{T_i^* : i \in K_n\} = \arg\max_{\{T_i : i \in K_n\}} \left\{ \sum_{i \in K_n} b_i(T_i) : \sum_{i \in K_n} T_i \le T^* - t,\; T_i \le \min(T^* - t,\, T_{i(max)}) \right\},   (21)
    w_i^* = \frac{T_i^*}{\sum_{p \in K_{n(i)}} T_p^*} \, \omega_{n(i)}.   (22)
Component's decision:

    v_i^* = f_i^{-1}\!\left(\frac{T_i^* - R_i(t)}{L_i(t)}\right).   (23)
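The component-side rules (16)–(18) and (23) can be sketched directly. Here f and its inverse f_inv are assumed to be supplied by the caller; the identity function in the test below is purely illustrative, and all names are ours:

```python
def component_bid(T_i, R, L, f, f_inv, v_min, v_max):
    """Component's bid b_i(T_i) per eq. (18), with the bounds
    T_{i(min)} and T_{i(max)} from eqs. (16)-(17)."""
    T_min = R + L * f(v_min)   # eq. (16): least resource that suffices
    T_max = R + L * f(v_max)   # eq. (17): beyond this, value saturates
    if T_i < T_min:
        return float('-inf')   # infeasible allocation
    if T_i > T_max:
        return L * v_max       # value saturates at v_{i(max)}
    return L * f_inv((T_i - R) / L)   # value afforded by per-task slack

def component_decision(T_star_i, R, L, f_inv):
    """Component's mode choice v_i^* per eq. (23)."""
    return f_inv((T_star_i - R) / L)
```

With a convex increasing f, the middle branch makes b_i a piecewise increasing concave function of T_i, which is what the analysis in Section 4.2 relies on.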
4.2 Analysis

The resource manager's bidding function b_n(T) in (19) can be composed by analogy with the solution algorithm for the fractional knapsack problem. In the fractional knapsack problem there are multiple items that can be broken into fractions; given the unit weight and unit value of each item, the problem is to determine the amount of each item so as to maximize total value subject to a weight capacity. It is easily solved by a greedy algorithm: take as much as possible of the item that is most valuable per unit weight, then the next most valuable, until the capacity is reached. Similarly, b_n(T) can be composed by a greedy algorithm. As each b_i(T_i) in (18) is a piecewise-linear increasing concave function, repeatedly take the piece that is most valuable per unit of T_i among each component's first available pieces until all pieces are taken. This greedy algorithm builds the resource manager's bidding function in O(|K_n|^2), where |X| denotes the cardinality of set X. The resource manager's decision problem in (21) can be solved in O(|K_n|^2) by the same greedy algorithm, except that (fractional) pieces are taken only until the capacity is reached. So the complexity of all resource managers' local problems is O(|A|^2) in the worst case, which occurs when |N| = 1.
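Since every b_i is piecewise-linear, increasing, and concave, its pieces already come in decreasing slope order, so pooling all components' pieces and sorting by slope reproduces the greedy order described above. A sort-based sketch (O(|K_n| log |K_n|) rather than the quadratic bound quoted; names are ours):

```python
def compose_node_bid(component_pieces):
    """Merge components' bid pieces into the node bid b_n.

    component_pieces: per component, a list of (slope, length) linear
    pieces with decreasing slopes (concavity). Sorting the pooled
    pieces by decreasing slope preserves each component's internal
    order, so the result is the piecewise-linear concave b_n."""
    pieces = [p for plist in component_pieces for p in plist]
    pieces.sort(key=lambda p: p[0], reverse=True)
    return pieces

def eval_bid(pieces, budget, base=0.0):
    """Evaluate a piecewise-linear bid at a given time budget.
    Taking (fractional) pieces only until the capacity is reached is
    also how the decision problem (21) is solved greedily."""
    value = base
    for slope, length in pieces:
        take = min(length, budget)
        value += slope * take
        budget -= take
        if budget <= 0:
            break
    return value
```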
The seller's decision problem in (20) is a single-variable problem that can be solved by various search methods depending on the structure of the objective function. As each b_n(T) is a piecewise-linear increasing concave function, \sum_n b_n(T) is also piecewise-linear, increasing, and concave, and its number of pieces is proportional to |A|. To compose \sum_n b_n(T) from the individual b_n(T), sort the starting T of every piece in ascending order (O(|A| log |A|)) and, for each T, sum across the b_n(T) by advancing to the corresponding pieces (O(|A||N|)). So the seller can compose \sum_n b_n(T) in O(|A|^2) in the worst case, which occurs when |N| = |A|. Once \sum_n b_n(T) is composed, the complexity of the decision problem is proportional to the number of pieces, i.e., O(|A|). The seller's decision problem is therefore solvable in O(|A|^2). The complexity of the other local problems, (18), (22), and (23), is O(|A|).
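When the cost of completion time is linear, CCT(T) = cT (as in the experiments of Section 5), the concave objective in (20) is maximized at the breakpoint where the combined marginal bid value first drops to the cost rate. A sketch under that linear-cost assumption (names are ours; note that it consumes its input lists):

```python
def seller_decision(node_bids, cost_rate, t=0.0):
    """Pick T* per eq. (20) for the linear cost CCT(T) = cost_rate*T.

    node_bids: per node, the (slope, length) pieces of b_n in
    decreasing-slope order. All nodes run over the same horizon, so
    the marginal value of extending T is the SUM of the nodes'
    current slopes; extend T until that sum drops to cost_rate."""
    idx = [0] * len(node_bids)
    horizon = 0.0
    while True:
        active = [n for n in range(len(node_bids))
                  if idx[n] < len(node_bids[n])]
        total_slope = sum(node_bids[n][idx[n]][0] for n in active)
        if not active or total_slope <= cost_rate:
            return t + horizon
        # Advance to the nearest remaining breakpoint among the
        # active pieces; every node consumes the same amount of T.
        step = min(node_bids[n][idx[n]][1] for n in active)
        horizon += step
        for n in active:
            slope, length = node_bids[n][idx[n]]
            if length - step <= 1e-12:
                idx[n] += 1
            else:
                node_bids[n][idx[n]] = (slope, length - step)
```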
Therefore, even if the auctioning model were solved by a centralized controller, it would be solvable in O(|A|^2); that is, the complexity of the programming model is O(|A|^2). The auctioning model, however, improves scalability because computations and communications are distributed among multiple market participants. Components as well as resource managers solve their local problems in parallel rather than sequentially, which reduces the time taken to solve the programming model. In addition, the participants communicate locally in terms of bids rather than reporting all details to a centralized controller.
5. Empirical results

We ran several experiments using discrete-event simulation to validate the designed control mechanism. Although we use a small network for validation purposes, the decentralized model in particular can handle much larger networks.
5.1 Experimental design

The experimental network is composed of sixteen components on seven nodes, as shown in Fig. 2. Each component in the lowest position holds root tasks as indicated in the figure. The value function is … for A_7 and A_8, and … for the others. Also, ω_n is 1 for all n \in N, and CPU is allocated using weighted round-robin scheduling in which the CPU time received by each component in a round equals its assigned weight.
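The weighted round-robin allocator used in the simulation can be sketched as a minimal single-CPU model (names and the quantum parameter are ours):

```python
def weighted_round_robin(work, weights, quantum=1.0):
    """Single-CPU weighted round-robin: in each round, every
    unfinished component in turn receives up to quantum * weight of
    CPU time. Returns the completion time of each component."""
    elapsed, finish = 0.0, {}
    remaining = dict(work)            # component -> outstanding CPU time
    while remaining:
        for i in list(remaining):     # copy keys: we delete while iterating
            used = min(quantum * weights[i], remaining[i])
            elapsed += used           # one CPU, so time advances by use
            remaining[i] -= used
            if remaining[i] <= 1e-12:
                finish[i] = elapsed
                del remaining[i]
    return finish
```

Under equal weights, two equal jobs finish one quantum apart at the very end; with load-proportional weights, a component with twice the work gets twice the slice, so all components finish close together, which is the behavior Theorem 1 exploits.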
[Fig. 2. Experimental network configuration: seven nodes N_1–N_7 hosting components A_1–A_16; the leaf components A_9–A_12 each hold 200 root tasks and A_13–A_16 each hold 400.]
We set up six experimental conditions, as shown in Table 1, varying the cost of completion time and whether the distribution of per-task CPU time is deterministic or exponential. For the stochastic (exponential) conditions we repeat each experiment five times. QoS_UB is calculated from (14) with t = 0 for each condition, as shown in the table.
Table 1. Experimental conditions

Condition  CCT(T)  f_i(v_i)       QoS_UB
Con1-1     0.5T    Deterministic  30000
Con1-2     0.5T    Exponential    30000
Con2-1     1.5T    Deterministic  19200
Con2-2     1.5T    Exponential    19200
Con3-1     2.5T    Deterministic  12000
Con3-2     2.5T    Exponential    12000
We use ten control policies for each experimental condition, as shown in Table 2. The first eight (FX-XX) use value modes fixed over time, while the predictive control policies (PC-XX) select value modes by solving the optimization model (14). Under round-robin resource allocation (XX-RR) the components on each node are assigned equal weights, and under proportional allocation (XX-PA) the weights are proportional to the components' load indices, as in (15). PC-PA is the control policy corresponding to the programming model we have developed. The system makes decisions every 100 time units.
Table 2. Control policies used for experimentation

Control policy  Description
F2-RR           v_i = 2 for all i, round-robin allocation
F2-PA           v_i = 2 for all i, proportional allocation
F3-RR           v_i = 3 for all i, round-robin allocation
F3-PA           v_i = 3 for all i, proportional allocation
F4-RR           v_i = 4 for all i, round-robin allocation
F4-PA           v_i = 4 for all i, proportional allocation
F5-RR           v_i = 5 for all i, round-robin allocation
F5-PA           v_i = 5 for all i, proportional allocation
PC-RR           Predictive control, round-robin allocation
PC-PA           Predictive control, proportional allocation
5.2 Results

Numerical results are shown in Table 3. PC-PA gives the best performance, close to QoS_UB, under all conditions. As the cost of completion time increases, the system under PC-PA completes earlier as a result of the trade-off between the value of the solution and the cost of completion time. In many cases the value of the solution under PC-PA is even larger despite a shorter completion time, because the programming model yields the maximal solution value for a given completion time.

Although both PC-PA and PC-RR choose value modes by solving the optimization model (14), PC-RR performs worse because the optimization model presumes proportional resource allocation. Proportional allocation shows significant advantages over round-robin allocation in all thirty comparison instances. This superiority supports the optimality of proportional resource allocation and, consequently, the effectiveness of the programming model.
Table 3. Experimental results

Condition       F2-RR  F2-PA  F3-RR  F3-PA  F4-RR  F4-PA  F5-RR  F5-PA  PC-RR  PC-PA
Con1-1  T        5614   4814   8019   7219  10423   9624  12828  12028  12828  12028
        V       14400  14400  21600  21600  28800  28800  36000  36000  36000  36000
        QoS     11593  11993  17590  17990  23588  23988  29585  29985  29585  29985
        %       0.386  0.400  0.586  0.600  0.786  0.800  0.986  1.000  0.986  1.000
Con1-2  T        5592   4993   8093   7114  10356   9593  12846  11885  12846  11885
        V       14400  14400  21600  21600  28800  28800  36000  36000  36000  36000
        QoS     11604  11903  17553  18043  23622  24004  29577  30058  29577  30058
        %       0.387  0.397  0.585  0.601  0.787  0.800  0.986  1.002  0.986  1.002
Con2-1  T        5614   4814   8019   7219  10423   9624  12828  12028  11282   9742
        V       14400  14400  21600  21600  28800  28800  36000  36000  34283  33700
        QoS      5980   7179   9572  10771  13164  14364  16757  17956  17359  19087
        %       0.311  0.374  0.499  0.561  0.686  0.748  0.873  0.935  0.904  0.994
Con2-2  T        5592   4993   8093   7114  10356   9593  12846  11885  11313  10062
        V       14400  14400  21600  21600  28800  28800  36000  36000  34183  33845
        QoS      6011   6910   9460  10929  13266  14411  16731  18173  17214  18752
        %       0.313  0.360  0.493  0.569  0.691  0.751  0.871  0.947  0.897  0.977
Con3-1  T        5614   4814   8019   7219  10423   9624  12828  12028   6171   4881
        V       14400  14400  21600  21600  28800  28800  36000  36000  25354  24055
        QoS       366   2365   1553   3553   2741   4740   3928   5928   9927  11853
        %       0.031  0.197  0.129  0.296  0.228  0.395  0.327  0.494  0.827  0.988
Con3-2  T        5592   4993   8093   7114  10356   9593  12846  11885   6309   5089
        V       14400  14400  21600  21600  28800  28800  36000  36000  25593  24277
        QoS       419   1917   1367   3816   2910   4818   3886   6289   9820  11554
        %       0.035  0.160  0.114  0.318  0.243  0.402  0.324  0.524  0.818  0.963

T: completion time; V: value of solution; %: QoS/QoS_UB
6. Conclusions

The increasing complexity of modern software systems gives rise to the need for more sophisticated yet scalable control mechanisms. In this paper we designed such a control mechanism for an emerging information network. The network is large-scale, with a distributed, component-based architecture, and its behavior can be controlled by algorithm selection and resource allocation. In the designed control mechanism, an auction market coordinates the components of the network to produce optimal decisions, and the market opens periodically for each current system state.
Our work can be extended by providing adaptivity to changing stress environments. As modern systems are easily exposed to adverse events such as accidental failures and malicious attacks, there is a need to adapt to such environments. Because adverse events affect the system by limiting available resources, it should be possible to model such environments by quantifying the resource availability of the system through appropriate sensors.
Acknowledgements

The authors acknowledge the support for this research provided by DARPA (Grant #: MDA972-01-1-0038) under the UltraLog program.
References

[1] B. Meyer, "On to components," IEEE Computer, vol. 32, no. 1, pp. 139-140, 1999.
[2] P. Clements, "From subroutines to subsystems: Component-based software development," in Component-Based Software Engineering, A. W. Brown, Ed. IEEE Computer Society Press, 1996, pp. 3-6.
[3] M. O. McCracken, A. Snavely, and A. Malony, "Performance modeling for dynamic algorithm selection," in Proc. Int. Conf. Computational Science, 2003, pp. 749-758.
[4] P. Oreizy, M. M. Gorlick, R. N. Taylor, D. Heimbigner, G. Johnson, N. Medvidovic, A. Quilici, D. S. Rosenblum, and A. L. Wolf, "An architecture-based approach to self-adaptive software," IEEE Intelligent Systems, vol. 14, no. 3, pp. 54-62, 1999.
[5] D. Moore, W. Wright, and R. Kilmer, "Control surfaces for Cougaar," in Proc. First Open Cougaar Conference, 2004, pp. 37-44.
[6] W. Peng, V. Manikonda, and S. Kumara, "Understanding agent societies using distributed monitoring and profiling," in Proc. First Open Cougaar Conference, 2004, pp. 53-60.
[7] H. Gupta, Y. Hong, H. P. Thadakamalla, V. Manikonda, S. Kumara, and W. Peng, "Using predictors to improve the robustness of multi-agent systems: Design and implementation in Cougaar," in Proc. First Open Cougaar Conference, 2004, pp. 81-88.
[8] D. Moore, A. Helsinger, and D. Wells, "Deconfliction in ultra-large MAS: Issues and a potential architecture," in Proc. First Open Cougaar Conference, 2004, pp. 125-133.
[9] R. D. Snyder and D. C. Mackenzie, "Cougaar agent communities," in Proc. First Open Cougaar Conference, 2004, pp. 143-147.
[10] J. B. Rawlings, "Tutorial overview of model predictive control," IEEE Control Systems, vol. 20, no. 3, pp. 38-52, 2000.
[11] M. Morari and J. H. Lee, "Model predictive control: Past, present and future," Computers and Chemical Engineering, vol. 23, no. 4, pp. 667-682, 1999.
[12] M. Nikolaou, "Model predictive controllers: A critical synthesis of theory and industrial needs," Advances in Chemical Engineering Series, Academic Press, 2001.
[13] S. J. Qin and T. A. Badgwell, "A survey of industrial model predictive control technology," Control Engineering Practice, vol. 11, pp. 733-764, 2003.
[14] J. Regehr, "Some guidelines for proportional share CPU scheduling in general-purpose operating systems," presented as a work in progress at the 22nd IEEE Real-Time Systems Symposium, London, UK, Dec. 3-6, 2001.
[15] I. Stoica, H. Abdel-Wahab, J. Gehrke, K. Jeffay, S. K. Baruah, and C. G. Plaxton, "A proportional share resource allocation algorithm for real-time, time-shared systems," in Proc. 17th IEEE Real-Time Systems Symposium, 1996, pp. 288-299.
[16] C. A. Waldspurger and W. E. Weihl, "Lottery scheduling: Flexible proportional-share resource management," in Proc. First Symposium on Operating Systems Design and Implementation, 1994, pp. 1-11.
[17] C. A. Waldspurger and W. E. Weihl, "Stride scheduling: Deterministic proportional-share resource management," Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, Tech. Rep. MIT/LCS/TM-528, 1995.
[18] C. A. Waldspurger, "Lottery and stride scheduling: Flexible proportional-share resource management," Ph.D. dissertation, Lab. for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1995.
[19] T. Gonzalez and S. Sahni, "Flowshop and jobshop schedules: Complexity and approximation," Operations Research, vol. 26, pp. 36-52, 1978.
[20] J. K. Lenstra, A. H. G. Rinnooy Kan, and P. Brucker, "Complexity of machine scheduling problems," Annals of Discrete Mathematics, vol. 1, pp. 343-362, 1977.
[21] Y. Lengwiler, "The multiple unit auction with variable supply," Economic Theory, vol. 14, no. 2, pp. 373-392, 1999.
Coordinating Control Decisions of Software Agents for Adaptation to Dynamic Environments

Y. Hong¹, S. R. T. Kumara¹

¹ Harold and Inge Marcus Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA
Abstract

We present a design for an infrastructure-level load control mechanism for the multiagent system Cougaar. The purpose of the control is to strengthen the robustness of a software multiagent system with respect to load balancing, so that the system keeps working without disastrous performance degradation even in occasionally harsh running environments. Resource control in multiagent systems is carried out mainly through each agent's self-control, which makes the control problem very difficult. We propose a hierarchical control structure that reduces the complexity of control while inducing coherent movement of the agents.

Keywords: load balancing, hierarchical control, multi-agent system
1 INTRODUCTION

Multiagent systems have significant advantages in the development of complex distributed software systems [1]. Agents map naturally to the components of complex systems, so complicated interactions among subcomponents can be represented as agent interactions. Owing to the modularity and autonomy of agents, an application can be composed by assembling agents. Multiagent systems are also flexible in design: partial changes to the system can be localized to a few agents without affecting the rest. Constructing or altering a large software system can therefore become easier with agent technology.
Beyond these advantages in designing and constructing large systems, robustness is also an important factor if multiagent systems are to be a good software construction technology. Robustness of software is "the ability of software to react appropriately to abnormal circumstances" [2]. Like many biological and man-made systems, a software system can cope with uncertainties in dynamic environments through feedback control and redundancy of components (agents), improving its robustness at the expense of increased complexity [3][4]. Time-varying computational load is one such threat to robustness: a sudden excessive workload can degrade performance to the point where the system cannot meet minimum response-time requirements, which is especially critical for real-time applications.
Because agent systems are distributed and decentralized, it is hard to build a control mechanism by which agents can adapt to changing environments effectively and coherently. To address this problem, we propose an infrastructure-level load control mechanism for the multiagent system Cougaar. We consider an infrastructure-level control mechanism because it greatly reduces the effort application developers must spend to secure the robustness of their software with respect to load control. Multiagent systems such as Cougaar [5] and Jade [6] provide many infrastructure-level services, which save application developers the effort of building basic multiagent system functions. A load control function can be included in the infrastructure, and its necessity has been emphasized [7]. The infrastructure can hide the complexity of controlling resource allocation, so that application developers tune performance using high-level abstract parameters for load control.
2 LOAD BALANCING IN MULTIAGENT SYSTEMS
In multiagent systems, system functions are decomposed into software agents, which carry out those functions by exchanging services with each other [7]. Each agent has its own work and specializes in a specific service; an agent requests a service from another agent that specializes in it. Providing a service requires computational resources such as CPU time. Agents are distributed over multiple machines connected through communication networks. More than one agent can reside on a machine and share its CPU time. The frequency of each agent's service requests varies over time, depending on the real-world process the application deals with.
Considerable research has been done on dynamic load balancing for computer clusters. However, it cannot be applied directly to a multiagent system [7]. As noted by Chow and Kwok [7], multiagent systems (MAS) differ from computer clusters with respect to load balancing. First, in MAS, agents run continuously, while in computer clusters, jobs submitted by users are killed after completion. Second, communications between agents in a multiagent system are highly variable, whereas communications between jobs usually have static patterns. Another difference, not pointed out by Chow and Kwok, is that agents can proactively manage their own workload.
Load balancing has not received much attention in MAS studies [7][8], and there are few papers on multiagent load balancing. Schaerf et al. [9] study how an agent can adapt to its environment. They separate resources from agents; in their model, agents assign their jobs to these resources. Using reinforcement learning, they show that agents can adapt to each other under fixed or even dynamic loads. Chow and Kwok [7] devise an agent reallocation algorithm, called the 'Comet' algorithm, which selects agents to be moved to other machines. Agents are distributed over multiple machines, and the Comet algorithm chooses agents based on credits, which are continuously evaluated for each agent; the agent with the lowest credit is moved. An agent's credit decreases as its workload increases or as it communicates more with agents on other machines. We consider an agent system environment similar to that of Chow and Kwok, but add agents' self-regulation of their workload.
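The credit idea can be illustrated with a small sketch. This is not the actual Comet algorithm from [7]; the function names, weights, and normalized inputs are our own illustrative assumptions.

```python
# Illustrative sketch of credit-based migration selection. Weights and
# agent names are assumptions, not the actual Comet algorithm of [7].

def credit(workload, remote_comm, w_load=1.0, w_comm=1.0):
    """Credit falls as an agent's workload or cross-machine traffic grows."""
    return -(w_load * workload + w_comm * remote_comm)

def pick_agent_to_move(agents):
    """Return the agent with the lowest credit on this machine."""
    return min(agents, key=lambda name: credit(*agents[name]))

agents = {
    "planner":   (0.9, 0.7),  # (normalized workload, remote communication)
    "inventory": (0.2, 0.1),
    "transport": (0.6, 0.8),
}
print(pick_agent_to_move(agents))  # -> planner (lowest credit)
```

The most loaded and most remotely communicating agent is the migration candidate, matching the qualitative behavior described above.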
3 QUEUEING MODEL FOR WORKLOAD DYNAMICS
We conjecture that workload dynamics can be modeled as a queueing system. A service request from outside or from other agents can be seen as a customer in a queueing system. While a request is being served, later incoming service requests wait in the queue. We consider a situation in which agents have multiple alternative algorithms for providing their service, and these algorithms trade off computation time against quality of solution. Thus, depending on the workload in the queue, an agent can choose the algorithm that improves the overall performance measure. This is similar to anytime algorithm composition [10]. In an anytime algorithm, one must determine the time duration in which the algorithm solves the problem. Here, we assume instead that the problem-solving time is not predetermined; it is a statistical characteristic of the algorithm. From the queueing model perspective, this can be seen as a service rate control problem [11].
The multiagent system infrastructure can have a facility in which each machine works as a server, assigning computational resources (run-rights) to agents for CPU time-sharing. We call this server a node. This can be seen as a polling model, which has been used to model time-sharing in computer operating systems and link sharing in communication networks. The node can give priority to a certain agent by visiting that agent more frequently. The node can monitor the amount of workload or the arrival rate of service requests through the agents and, based on the detected changes, adjust an agent's priority.
Imbalance among machines can be controlled by reallocating agents from a highly loaded machine to a lightly loaded machine for better performance. In this paper, however, we consider only agent- and machine-level control.
4 DECENTRALIZED CONTROL
In view of the workload dynamics described above, load control in multiagent systems can be seen as a decentralized stochastic control problem [12]. A decentralized control system consists of multiple control posts, each locally sensing and controlling the part of the system it is in charge of. However, their controls influence the system dynamics collectively. Thus, to operate the system optimally, the decisions taken by the controllers should be compatible and coherent. Information sharing between agents is the main issue: for the controllers to obtain globally optimal control decisions, exchanging all local information is inevitable, and each controller must make its decision by solving a larger problem that accounts for the other agents' behavior. This is unrealistic for multiagent systems because of long communication times, and finding optimal controls may be very difficult due to the size of the problem. It is difficult, if not impossible, to find a purely decentralized optimal control policy for a multiagent system in this way. There are a few systems in which locally made decisions can be globally compatible [13], but such results are limited to specific problems. Thus, we need a control structure in which each control component (for example, an agent) makes control decisions by communicating only with closely connected components, while well-coordinated decisions can still be generated. In this paper, we propose a hierarchical control structure that aims to achieve these goals.
5 HIERARCHICAL CONTROL
5.1 General Description
Hierarchical control approaches have been studied in various areas as a way to manage the complexity of large-scale problems [14][15]. For multiagent systems, hierarchical control has been adopted as an intermediate form between centralized and decentralized control, trading off the advantages of the two approaches [14]. Hierarchical control requires less computation for gathering information and finding an optimal control than centralized control does; on the other hand, it has better coordination capabilities than decentralized control.
We consider three levels of hierarchy: the entire system, nodes, and agents. There are usually multiple subcomponents under a higher-level controller, i.e., multiple agents under a node and multiple nodes under a top controller. Agents and a node controller have a direct communication connection through which they share information; information sharing is restricted to components that are connected in the hierarchy. In this case, an agent reports its workload and performance (see Section 5.2), and the node controller announces controls (the visit order and frequency) or state information, such as estimates of environment parameters that can be observed more effectively by the node than by the agents. Similar information exchanges occur between nodes and the top controller: nodes report their node-level workload trends to the top controller, while the top controller can announce system-level environment parameters and order agents to move from one node to another.
We assume the control frequency differs across levels: it is higher at lower levels than at higher levels. Control decisions on service rate are more frequent than changes to the CPU time assignment policy at the node level. For a given arrival rate and configuration, the node-level CPU time assignment policy does not change until the arrival rate or configuration changes. At higher levels, events are less frequent and the time intervals between them become longer, so the system can be said to operate on multiple time scales depending on the level [16]. Our problem differs from other multi-time-scale problems in that there are multiple components at the lower levels. In addition, it is reasonable to assume that the environment does not change so frequently that the system cannot estimate environment parameters and control accordingly.
A higher-level controller has coarser information than its lower-level subsystems. It can have better global information over its territory because it collects information from its subcomponents, but it does not use the gathered information directly as a system state; operating at a coarser scale, it neglects a certain range of fluctuations in the measurements from subcomponents. In this framework, controls at one level do not affect the higher level; higher-level controls, however, constrain the working conditions of the lower-level components and thus decrease the degrees of freedom in the lower level's control problem.
We next present our ideas on an optimal control problem that reflects the features of this hierarchical control structure.
5.2 Optimal Control Problem
The load control problem is to find optimal control policies for each component that minimize the long-run cost of the overall system while it is subjected to time-varying computational workload. The cost function we optimize through load control is a multi-objective function of holding cost and a penalty cost for service quality. Performance is measured for each agent independently, and the performance of the overall system is assumed to be the sum of the individual agents' performance.
Each agent can have different algorithms for a service, depending on the service type. For simplicity, we assume that every agent has two algorithms, called Level 6 and Level 2. The quality of solutions from the Level 6 algorithm is higher than that of Level 2 on average, while the computation time of Level 2 is less than that of Level 6. The default is the Level 6 algorithm, and we impose a penalty cost for using the Level 2 algorithm. Thus, when the system is congested, using the Level 2 algorithm can be helpful even though it incurs the penalty cost.
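A simple way to realize this tradeoff is a queue-length threshold policy: serve with the default Level 6 algorithm while the queue is short, and fall back to the faster Level 2 algorithm when congestion builds. The sketch below is a hypothetical policy; the threshold and cost weights are illustrative, not the result of an optimization.

```python
# Hypothetical threshold policy for the two-algorithm example.
# Threshold, holding cost, and penalty cost are illustrative assumptions.

def choose_algorithm(queue_length, threshold=5):
    """Fall back to the faster, lower-quality Level 2 when congested."""
    return "level2" if queue_length > threshold else "level6"

def stage_cost(queue_length, algorithm, holding=1.0, penalty=3.0):
    """One-period cost: holding cost per queued request plus the
    quality penalty whenever the Level 2 algorithm is used."""
    return holding * queue_length + (penalty if algorithm == "level2" else 0.0)

print(choose_algorithm(2))                    # -> level6 (short queue)
print(stage_cost(9, choose_algorithm(9)))     # 9 * 1.0 holding + 3.0 penalty = 12.0
```

An optimal policy would choose the threshold to balance the holding cost of a long queue against the penalty for degraded service quality.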
The system can experience various stresses, such as (1) an unexpected increase in service requests, and (2) loss of CPU time to other applications. Stress (1) can be seen as an arrival rate change in the queueing model; in agent systems, arrivals of service requests can be time-varying and bursty. Stress (2) can be seen as a server in a polling system serving an imaginary additional queue (an agent) at random times and for random durations. The sources of these two stresses are environmental factors that we cannot control, so we model them as random events. The controller chooses its control actions using estimates of these events.
An increasing arrival rate can be modeled using a Markov-Modulated Poisson Process (MMPP). An MMPP describes an arrival rate that changes depending on a source state: if the source state is i, the arrival process is Poisson with rate λi. The source state is modeled as a continuous-time Markov chain [17].
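A two-state MMPP is straightforward to simulate with competing exponential clocks: at each step the next event is either an arrival (at the rate of the current state) or a switch of the modulating chain. The rates below are illustrative assumptions, not values from the paper.

```python
# Small simulation of a two-state Markov-Modulated Poisson Process:
# a continuous-time Markov chain switches the source state, and arrivals
# are Poisson with rate lam[state]. All rates here are illustrative.
import random

def simulate_mmpp(lam, switch_rate, horizon, seed=1):
    """Return arrival times of a 2-state MMPP over [0, horizon)."""
    rng = random.Random(seed)
    t, state, arrivals = 0.0, 0, []
    while t < horizon:
        # Competing exponentials: next event is an arrival or a state switch.
        total = lam[state] + switch_rate[state]
        t += rng.expovariate(total)
        if t >= horizon:
            break
        if rng.random() < lam[state] / total:
            arrivals.append(t)       # Poisson arrival in the current state
        else:
            state = 1 - state        # modulating chain switches state
    return arrivals

arrivals = simulate_mmpp(lam=(1.0, 10.0), switch_rate=(0.5, 0.5), horizon=100.0)
print(len(arrivals))  # bursty: far more arrivals accumulate in the fast state
```

The resulting trace shows the bursty behavior described above: quiet stretches while the chain sits in the low-rate state, and dense clusters of arrivals in the high-rate state.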
The problem definition differs by level. From the agent's perspective, the problem is to find the optimal service rate in a situation with random server vacations. A server vacation means that other agents receive the run-right and process their work; the frequency with which an agent is visited is controlled by the node. The agent should find an optimal service control policy for the given expected vacation time. To the best of our knowledge, service-rate control in server-vacation problems has not been studied; usually in server-vacation control problems, the optimal service beginning time or the optimal time to add an additional server (the removable-server case) is studied [18].
A node faces the problem of finding a polling policy. The run-right assignment frequencies do not change often, because we assume that changes in the circumstances around agents are infrequent. Thus, for a given node state, the problem is to find a static polling table [19]. Whenever the node state changes, the node picks another appropriate polling table.
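One simple, non-optimal way to derive such a static table is to make visit frequencies roughly proportional to each agent's estimated arrival rate; optimal table construction is the subject of [19]. The agent names and the proportional rule below are illustrative assumptions.

```python
# Sketch of a node-level static polling table: visit frequencies roughly
# proportional to each agent's estimated arrival rate. The node would
# re-derive the table only when its state changes.
from itertools import cycle

def build_polling_table(rates, slots=10):
    """Return a visit sequence of length ~slots; heavier agents appear more often."""
    total = sum(rates.values())
    table = []
    for agent, rate in sorted(rates.items()):
        table += [agent] * max(1, round(slots * rate / total))
    return table

table = build_polling_table({"A": 0.6, "B": 0.3, "C": 0.1})
print(table)            # ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C']
server = cycle(table)   # the node grants run-rights in this repeating order
```

Cycling through the table realizes the priority scheme from Section 3: the node visits a heavily loaded agent more frequently simply because it occupies more slots.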
The problem is fairly complex because the self-regulating agents share a single resource, the CPU, and are therefore interdependent: one agent's decision can affect other agents' waiting time for the CPU. This feature makes our problem significantly different from other queueing models.
6 HIERARCHICAL CONTROL FOR COUGAAR
In this section, we show how to apply our hierarchical control ideas to the multiagent system Cougaar [5]. We use Cougaar version 10.2.1.
6.1 The Cougaar Infrastructure
Distinguishing features of Cougaar are blackboard communication and plugins [5]. An agent consists of plugins that implement the agent's functions; in Cougaar, it is recommended that the functions of agents be divided into sufficiently small program modules, the plugins. Plugins communicate by posting and reading messages on a blackboard that exists in every agent, and communication between agents is also conducted through the blackboard.
Cougaar runs its code (plugins and some infrastructure-level modules) via shared threads. The total number of shared threads in a node is limited, which means the number of simultaneously running plugins cannot exceed this upper limit. The number of threads available to the node is predetermined at the initial loading stage.
Cougaar also provides mobile agent functions: agents can move from one machine to another. This function can be used for load control by moving agents from highly loaded machines to less loaded machines.
6.2 Queueing Model in Cougaar
An agent usually has many plugins in Cougaar applications [5]. From the infrastructure's perspective, a plugin can be seen as a workload or a job: if the agent does not get a thread for a plugin, the plugin is put in a queue until the agent obtains a thread.
A service request is processed in an agent through a sequence of plugins. The service request is represented as an object called a Task. As Tasks go through these plugins, they are expanded or aggregated and finally allocated to assets, which represent actual physical resources. Each plugin repeats a series of steps: retrieving Tasks, processing Tasks, and publishing new Tasks. For example, if the application is a planning system and re-planning is triggered whenever there are discrepancies between the plan and the real world, plugins arrive continuously at the agents' queues. This phenomenon can naturally be described by queueing models.
6.3 Control Using Thread Services
Even though the shared threads are managed by the thread service, they are not currently used for control purposes; the current infrastructure simply assigns threads to agents in a round-robin fashion. We build a control structure utilizing the thread services so that agent- and node-level controls become feasible in the hierarchical control structure described above. Through this control structure, agents (nodes) assign threads to plugins (agents) at will, following a predetermined scheme, via the thread service.
After a plugin finishes its work, it releases its run-right, which can then be reassigned to other plugins. A node, in turn, can dynamically change the limit on the number of run-rights of each agent. If a certain agent has a high workload, the node can reduce the number of run-rights of other agents so that this agent gets more opportunities to run its work.
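Such a node policy might look like the following sketch, which splits a fixed pool of run-rights in proportion to reported workload while guaranteeing every agent at least one. This is an illustrative assumption about how such a policy could look, not Cougaar's actual thread-service API.

```python
# Illustrative node policy: rebalance a fixed pool of run-rights
# (shared threads) toward agents reporting higher workload.
# Agent names, pool size, and the proportional rule are assumptions.

def rebalance_run_rights(workloads, pool=8):
    """Split `pool` run-rights in proportion to reported workload,
    guaranteeing each agent at least one."""
    total = sum(workloads.values())
    limits = {agent: 1 for agent in workloads}   # floor of one right each
    spare = pool - len(workloads)
    for agent, load in sorted(workloads.items()):
        limits[agent] += int(spare * load / total)
    return limits

print(rebalance_run_rights({"A": 6.0, "B": 1.0, "C": 1.0}))  # {'A': 4, 'B': 1, 'C': 1}
```

A heavily loaded agent ends up with most of the pool, which is the behavior described above: reducing other agents' run-rights gives the busy agent more opportunities to run.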
[Figure 1: Infrastructure-level hierarchical control structure within a node. A node scheduler with a resource allocator (RightsSelector) sits above the agent schedulers, each equipped with a sensor (ThreadListener) and a DynamicSortedQueue, organized as a tree (TreeNode).]
Figure 1 shows a schematic representation of the hierarchical control structure within a node. In Cougaar, nodes and agents have their own schedulers, organized in a tree data structure. Cougaar does not have a direct communication channel between agents and nodes, so we modified the schedulers to exchange feedback reports and control messages. Each agent can monitor every plugin's arrival, service start, and service end through the ThreadListener. Plugins have algorithms for processing Tasks, and an agent can choose a plugin's algorithm for service rate control through the DynamicSortedQueue, which every agent has. A Java interface to Plugin was added to let the agent set the algorithm in a plugin. The node can assign the run-right (thread) to a specific agent using the RightsSelector. We let the schedulers hold a set of control policies. Experiments on a small example agent society showed that the modified infrastructure can effectively control the plugins and run-right assignment.
7 CONCLUDING REMARKS
Load balancing in multiagent systems differs from other load balancing problems because of agents' self-regulation and the highly dynamic communication load between agents [7]. This paper discussed a hierarchical control structure that helps agents and nodes make control decisions based on local information while achieving overall optimal system performance. In addition, the higher-level controller's estimates of changes in system parameters can help agents adapt to a changing environment.
Agents' self-regulation makes the load-balancing problem significantly harder. Each agent wants to use more CPU time, but if every agent uses CPU time greedily, overall system performance may not be optimal. This can be seen as a social dilemma, and game theory is a promising approach for finding an equilibrium among agents.
8 ACKNOWLEDGMENTS
The work described here was performed under DARPA UltraLog Grant #MDA972-1-1-0038. The authors wish to acknowledge DARPA for its generous support.
9 REFERENCES
[1] Jennings, N. R., 2001, An agent-based approach for building complex software systems, Communications of the ACM, 44/4:35-41.
[2] Meyer, B., 1997, Object-Oriented Software Construction, Second Edition, Upper Saddle River, N.J., Prentice Hall.
[3] Csete, M.E. and Doyle, J.C., 2002, Reverse Engineering of Biological Complexity, Science, 295:1664-1669.
[4] Huhns, M.N. and Holderfield, V.T., 2002, Robust Software, IEEE Internet Computing, March/April:80-82.
[5] Cougaar Open Source Site. http://www.cougaar.org
[6] Java Agent DEvelopment Framework (JADE). http://sharon.cselt.it/projects/jade/
[7] Chow, K. and Kwok, Y., 2002, On Load Balancing for Distributed Multiagent Computing, IEEE Transactions on Parallel and Distributed Systems, 13/8:787-801.
[8] Lee, L.C., Nwana, H.S., Ndumu, D.T. and De Wilde, P., 1998, The stability, scalability and performance of multiagent systems, BT Technology Journal, 16/3:94-103.
[9] Schaerf, A., Shoham, Y., and Tennenholtz, M., 1995, Adaptive Load Balancing: A Study in Multiagent Learning, Journal of Artificial Intelligence Research, 2:475-500.
[10] Zilberstein, S. and Russell, S., 1996, Optimal composition of real-time systems, Artificial Intelligence, 82:181-213.
[11] George, J.M. and Harrison, J.M., 2001, Dynamic control of a queue with adjustable service rate, Operations Research, 49/5:720-731.
[12] Ooi, J.M., Verbout, S.M., Ludwig, J.T., and Wornell, G.W., 1997, A Separation Theorem for Periodic Sharing Information Patterns in Decentralized Control, IEEE Transactions on Automatic Control, 32/2:1546-1550.
[13] Yao, D.D. and Schechner, Z., 1989, Decentralized Control of Service Rates in a Closed Jackson Network, IEEE Transactions on Automatic Control, 42/11:236-240.
[14] Lygeros, J., Godbole, D.N., and Sastry, S., 1997, A Design Framework for Hierarchical, Hybrid Control, California PATH Research Report, UCB-ITS-PRR-97-24.
[15] Gershwin, S.B., 1989, Hierarchical Flow Control: A Framework for Scheduling and Planning Discrete Events in Manufacturing Systems, Proceedings of the IEEE, 77/1:195-209.
[16] Chang, H.S., Fard, P.J., Marcus, S.I., and Shayman, M., 2003, Multitime Scale Markov Decision Processes, IEEE Transactions on Automatic Control, 48/6:976-987.
[17] Gusella, R., 1991, Characterizing the variability of arrival processes with indexes of dispersion, IEEE Journal on Selected Areas in Communications, 9/2:203-211.
[18] Zhang, R., Phillis, Y.A., and Zhu, X., 1998, Fuzzy Control of Queueing Systems with Removable Servers, IEEE International Conference on Systems, Man, and Cybernetics, 3:2160-2165.
[19] Levy, H. and Sidi, M., 1990, Polling Systems: Applications, Modeling, and Optimization, IEEE Transactions on Communications, 38/10:1750-1760.
Understanding Agent Societies Using Distributed Monitoring and Profiling
† Wilbur Peng, † Vikram Manikonda, and ‡ Soundar Kumara
† Intelligent Automation Incorporated
7519 Standish Place, Suite 200, Rockville, MD 20855
{wpeng, vikram}@i-a-i.com
‡ Industrial and Manufacturing Engineering
310 Leonhard Building, The Pennsylvania State University, University Park, PA 16802
{skumara}@psu.edu
Abstract
In this paper, we describe methodologies for understanding large-scale agent societies using Castellan, a distributed profiling and logging system developed for Cougaar. Castellan enables detailed, efficient logging of blackboard plan activity. We describe the design, functionality, and use of the Castellan tool along with a number of its applications, including a visualization and data mining tool based on a flexible algorithm for finding subgraph isomorphisms. By mapping "equivalent" meaningful graph nodes and edges to representative subgraph elements, the graph reduction approach reduces large plan graphs of hundreds of thousands to millions of nodes to meaningful and understandable clusters and graph nodes. This algorithm is demonstrated through its application to event traces obtained by running Castellan within a military logistics planning society. In addition to providing data for static analysis after planning and execution, the Castellan approach is also useful for on-line analysis of active, running agent systems. We also describe a number of other potential applications of distributed monitoring for modeling, control, load balancing, and analysis.
1. Introduction
Distributed agent systems pose significant challenges for debugging, testing, profiling, and tuning. Agent societies consist of distributed entities that can run concurrently. They are additionally constrained by state encapsulation: an agent has no direct access to the state of other agents, so agents interact solely through message passing. Within an agent, different functions can interact by sharing state.
The Cougaar agent infrastructure supports an approach to distributed planning in which tasks are created and expanded into subtasks by agents, which can in turn forward them to other agents. The planning process creates a plan graph spanning multiple agents that can become very large, growing to hundreds of thousands or even millions of elements. Adding to the complexity of understanding system function, the plan graph generated by the agents can be dynamically modified during the planning and execution phases of the society. As Cougaar agent societies increase in size and scope, understanding the distributed execution of the system becomes increasingly difficult, and being able to trace the time-evolving, event-driven behavior across agents in running societies becomes increasingly important.
In this paper, we discuss methods for understanding, analyzing, and controlling Cougaar agent societies through distributed profiling. Section 1.1 covers background concepts in distributed planning as used by Cougaar. In Section 2, the Castellan profiling and logging system is introduced and its implementation and design are described. Section 3 presents in detail an application of Castellan to data mining and visualization using a plan graph reduction algorithm. Finally, Section 4 discusses potential applications of Castellan.
1.1 Distributed plan graphs in Cougaar societies
In this section, we review some basic concepts of planning in the Cougaar context. Additional details about plan representation can be found in [3].
In Cougaar applications such as logistics planning and execution, agents generate plans by decomposing tasks into subtasks, aggregating tasks, and forwarding tasks to other organizational entities, which are in turn represented by other agents.
In the Cougaar planning model, the basic element is a task. Each task has a unique identifier (UID) and a set of fields, including the task verb (e.g., "Supply", "Project", "Transport") and the direct object (e.g., the UID of the Asset on which the task acts).
Each task must be associated with a plan element during the distributed planning process. Plan elements include:
• Allocations. An allocation is an assignment of a task to a particular asset. The asset can be locally represented (e.g., an inventory) or an organizational asset (e.g., a customer organization allocates a task T to a Supplier asset; here, the Supplier asset represents an actual agent to which T will be forwarded).
• Expansions. Decompositions of tasks into subtasks.
• Aggregations. These collect multiple tasks into a single task.
Each agent can therefore be modeled as taking as input a set of tasks, generating local blackboard tasks and plan elements, and producing as output a set of tasks to be forwarded to the representative agent(s). (In Cougaar, tasks are forwarded to another agent by the logic provider if they are allocated to the local organizational asset which represents the target agent.) All blackboard and task elements are assumed to be persistent, unique objects.
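The task and plan-element model above can be sketched with plain data structures. This is an illustrative sketch only; the field names are assumptions, and the actual Cougaar API is Java and differs in detail.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Task:
    uid: str            # unique identifier (UID)
    verb: str           # e.g. "Supply", "Project", "Transport"
    direct_object: str  # UID of the Asset the task acts on

@dataclass
class Allocation:
    """Assignment of a task to a particular asset (local or organizational)."""
    task: Task
    asset_uid: str

@dataclass
class Expansion:
    """Decomposition of a parent task into subtasks."""
    parent: Task
    subtasks: List[Task] = field(default_factory=list)

@dataclass
class Aggregation:
    """Collection of multiple tasks into a single combined task."""
    subtasks: List[Task]
    combined: Task
```

A task allocated to an organizational asset corresponds to the case where the task is forwarded to the agent that the asset represents.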
The result of a single planning run is logically a connected, distributed plan graph that spans multiple agents and multiple nodes. In addition, the plan graph may evolve and change during replanning as tasks are rescinded, modified and replanned.
2. Castellan System Design and Implementation
The primary distinguishing aspect of Castellan is that it provides the ability to observe the time-evolving state of the distributed agent blackboards rather than only the final state after planning has completed.
The Castellan system has two aspects: the client implementation, which monitors planning at each agent, and the server implementation, which collects the logs accumulated by the client side. Figure 1 shows an example of the concept of operations: a set of monitored agents sends event traces to a server application. In turn, the server application can log them to a database or feed them directly to monitoring and analysis applications. In the current implementation of Castellan, the server application is itself implemented as a plugin which can be embedded in a Cougaar agent.
Castellan has evolved to support the following modes of operation on the client side:
• Plugin-based execution. A monitoring client plugin loaded in each agent subscribes to all modifications to the blackboard.
• Logic-provider-based execution. A Castellan logic provider is attached to each agent and monitors changes to the blackboard through the logic provider interface.
The primary difference between these two approaches is that the latter can observe the source of each change to the blackboard, including which plugins execute and the number of execute cycles associated with each plugin loaded for an agent. These features are useful for debugging and detailed performance analysis of agents.
[Figure 1. Castellan System Concept: monitored agents with sensors send an event protocol stream to the Castellan server, which feeds an event database and plan analysis applications.]
As agent execution proceeds, the client implementation generates a stream of events for each task and plan element added to, changed on, or removed from the system blackboard. The event trace and logging protocol extracts a subset of the data encapsulated by the tasks, assets, and plan elements sufficient to reconstitute the entire plan graph. These include:
• The unique identifier, encoded as a symbol id rather than a string.
• The timestamps associated with the blackboard action (both simulation and wall-clock time).
• For tasks, the verb, encoded as a symbol id.
• For tasks, the UID of the direct object, also encoded as a symbol id.
• For allocations, the observed allocation results.
A design objective of Castellan was to reduce the amount of bandwidth consumed by the event traces while retaining key data and minimizing CPU consumption. This was accomplished through a variety of approaches, including:
• Compression of UIDs and other symbols using a space-efficient symbol-id-based protocol.
• Detecting and transmitting only changes in the allocation results for each task.
• Batching mechanisms, e.g. serializing batches of messages rather than individual messages.
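Two of these bandwidth-reduction techniques can be sketched as follows. The class and function names are illustrative assumptions, not Castellan's actual implementation.

```python
class SymbolTable:
    """Assigns small integer ids to repeated strings (UIDs, verbs) so
    that event traces carry compact ids instead of full strings."""

    def __init__(self):
        self._ids = {}

    def intern(self, symbol: str) -> int:
        # Reuse the existing id for a known symbol; otherwise assign the next id.
        return self._ids.setdefault(symbol, len(self._ids))


def allocation_delta(previous: dict, current: dict) -> dict:
    """Return only the fields of an allocation result that changed,
    so unchanged fields need not be retransmitted."""
    return {k: v for k, v in current.items() if previous.get(k) != v}
```

In a real protocol the symbol table would be synchronized between client and server, with new symbol definitions sent once alongside the events that first use them.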
The generated stream of events can be delivered to the Castellan server implementation during planning. The message transport between the client implementation and the server implementation can be varied depending on the application. Currently, a buffered relay is used to transmit events through batches of serialized events. (A relay is a point-to-point communications mechanism between Cougaar agents.) This form of "in-band" communication shares the communications channel with other Cougaar message traffic and hence is more suitable for distributed control applications in which agents need to be aware of the detailed planning status of other agents within the society. Alternative message transport implementations (e.g. using a separate communications backplane) can provide non-intrusive analysis for debugging and testing.
The scalability of Castellan as a real-time monitoring tool is limited primarily by the following factors:
• Bandwidth of the links between the monitored agents. For example, a 160,000-element event trace from a 30-agent, 60-minute planning run takes approximately 8 MB of bandwidth in total. This is not significant for a LAN test environment but may be an issue for real-world operating environments.
• The impact on CPU consumption for each agent being monitored. In a small Cougaar society of ~30 agents, this was measured to increase total planning time by approximately 5%.
• The processing bottleneck at the server agent. While receiving the data is not very expensive, inserting the results into a SQL database at run time tends to overwhelm typical processors.
In large agent societies, we do not expect that all agents can be monitored by a single server due to the volume of data generated. Instead, Castellan can be configured to monitor any subset of agents as desired, e.g. a community or an enclave. For debugging and profiling purposes, the data can then be merged after the run is complete.
3. Graph Reduction Using Subgraph Isomorphism
This section describes a significant application of the Castellan system that enables real-time and off-line analysis of the planning functions of a distributed agent system. Existing graph "clustering" and reduction approaches have been used in data mining applications to find meaningful repeated subgraphs within a larger graph [1][2].
The plan graphs generated during a single planning run of a Cougaar agent society can be extremely large. For example, the event traces generated from a relatively small test planning society of more than thirty agents engaged in logistics planning comprised over one hundred sixty thousand events and a very large corresponding plan graph with thousands of individual tasks, plan elements and assets. In this section, a general graph reduction algorithm is described that can be used to reduce the size of the plan graph in a manner tailored to specific applications.
We define a plan graph P = {N, E} as a set of nodes N = {T, A} and a set of directed, attributed edges E = {L, X, G}. Here, the nodes N consist of a set of tasks T and a set of assets A. The attributes associated with each task t ∈ T are expressed as a tuple (V, Uid, Agent) and can include other properties depending on the amount of detail collected within the event trace. The plan graph P is a directed acyclic graph (DAG), since no cycles can exist in the current Cougaar task grammar. The output of the algorithm is a reduced graph R = {N', E'}.
The set of edges E consists of a set of allocations L, a set of expansions X, and a set of aggregations G. The nature of the Cougaar task grammar also dictates a number of additional constraints on the plan graph. These include:
• Each task node t1 ∈ T can be connected either to an asset node a ∈ A through an allocation edge l, or to another task node t2 through an expansion or aggregation edge e ∈ {X, G}.
• Each task node t ∈ T has exactly one edge (and hence one parent node), except for those task nodes which are connected by a set of aggregation edges G' ⊂ G.
• The set of aggregation edges G is subdivided into a set of disjoint subsets which share the same destination node.
• The set of expansion edges X is subdivided into a set of disjoint subsets which share the same source node.
• A task t is connected to exactly one task by an edge e unless e is a member of the set of expansion edges X.
The basic principle behind the graph reduction approach is as follows. For each type of reduced graph mapping R, we define equivalence criteria between nodes in the graph to find abstract nodes. An example criterion C1 would be "all tasks at the same agent with parent tasks originating from another agent." In applying the graph reduction algorithm, all nodes which satisfy this criterion are aggregated into a single node.
Also, for every graph reduction mapping R, we define equivalence criteria for abstract edges. Abstract edges aggregate equivalent subgraphs into a single representative edge.
Continuing the example, consider a criterion C2 that states "subgraphs that connect all aggregate nodes satisfying C1 and connect to organizational assets associated with the same agent as C1." Also, we define a second node equivalence criterion C3 as "all tasks at the same agent which are allocated to assets representing external organizations."
We apply the criteria C1, C2 and C3 to a subgraph t1(→x1)t2(→x2)t3(→l2)a1 associated with agent A. This subgraph represents a task t1 expanded to a task t2, which is in turn expanded to a task t3 and allocated to an asset a1. Here, we assume that t1 has a parent external to A. We further assume that the asset a1 is an organizational asset representing agent B. The task node t1 satisfies C1 and hence is aggregated into an abstract node n1. Similarly, the task node t3 satisfies C3 and is aggregated into a node n2. The subgraph (→x1)t2(→x2) therefore matches C2 and is associated with a single edge e1 ∈ E' which connects the two abstract nodes n1, n2 ∈ N'.
Together, the equivalence criteria for abstract nodes and edges lead to the identification of equivalent subgraphs. By changing the equivalence criteria for aggregated nodes and edges, a variety of different graph reduction mappings can be achieved.
The computational complexity of the approach varies depending on the complexity of finding and matching isomorphic subgraphs. Cougaar task graphs are generally well structured, and for most of the equivalence criteria described below, subgraphs can be matched using a (worst-case) O(n) graph traversal, where n is the size of the subgraph. For the equivalence criteria described in the next section, all subgraphs fall within a single agent's plan graph, thus bounding the size of the matched subgraphs. If m is the total number of abstract nodes discovered, the total computational complexity is O(m · n).
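A simplified version of this reduction can be sketched as follows, using a per-node equivalence key in place of full subgraph matching (a deliberate simplification of the approach described above; names are illustrative):

```python
from collections import defaultdict

def reduce_plan_graph(nodes, edges, node_key):
    """Collapse a plan graph using a node equivalence criterion.

    nodes:    iterable of hashable node objects
    edges:    iterable of (src, dst) pairs over those nodes
    node_key: maps a node to its abstract-node key, i.e. the
              equivalence criterion (e.g. lambda n: (n.agent, n.verb))

    Returns the set of abstract nodes and a dict of abstract edges
    with multiplicities (how many concrete edges each represents).
    """
    # Every concrete node maps to exactly one abstract node.
    abstract_of = {n: node_key(n) for n in nodes}
    abstract_nodes = set(abstract_of.values())

    # Concrete edges between equivalent nodes collapse into counted
    # abstract edges; edges within one equivalence class vanish.
    abstract_edges = defaultdict(int)
    for src, dst in edges:
        a, b = abstract_of[src], abstract_of[dst]
        if a != b:
            abstract_edges[(a, b)] += 1
    return abstract_nodes, dict(abstract_edges)
```

With nodes keyed by (agent, verb), this yields the aggregate task graphs of Section 3.1; the multiplicity on each abstract edge corresponds to the task counts shown in the workflow figures.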
3.1 Algorithm implementation and applications
The implementation of the graph reduction algorithm within Castellan allows generation of the task graph from an arbitrary stream of events. Except for asset and organizational information, a complete plan graph is not necessary to derive reduced graphs with this approach.
The following types of reduced graphs were found to be useful for understanding Cougaar societies and are implemented within the Castellan system.
Aggregate task graphs are defined using an equivalence criterion that maps tasks of the same type with identical verbs to single nodes. This equivalence criterion also requires a strict ordering in depth between tasks which are aggregated. Specifically, in order to map a node n to an abstract node n', the node n's parent (and all of its ancestors by implication) must map to an abstract node n2' which is an ancestor of n'. This requirement is imposed to prevent cycles from appearing. Figure 2 shows a conceptual representation of task aggregation in which multiple "similar" subgraphs are collapsed into a single aggregate subgraph. An example of a task aggregate plan graph is shown in Figure 3.
Asset dependency graphs consider the assets (both organizational and physical) as abstract nodes. In this case, no aggregation of assets is performed, as all assets are considered unique. The criteria for abstract nodes and edges are as follows:
• All assets are mapped to abstract nodes. (Optionally, additional asset matching criteria can be introduced to aggregate assets.)
• In addition, all agents that generate tasks are designated as "Source" abstract nodes. (These serve as the roots of the reduced DAG.)
• All tasks and allocations that form plan graph dependencies between different assets are mapped to a single abstract edge.
Asset dependency graphs are useful for finding both organizational and physical dependencies within a distributed plan.
Workflow graphs characterize the input/output relationships between agents and are particularly useful for tracking the dependencies between agents incurred during distributed planning under a particular society configuration. The equivalence criteria for abstract nodes and edges are defined as follows:
• Task nodes that are on the boundary (i.e. have a parent from another agent) and have identical verbs are considered equivalent.
• All task nodes with identical verbs allocated to another agent are equivalent.
• All subgraphs linking boundary nodes of the two types described above are mapped to the same abstract edges.
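As an illustration, the agent-to-agent task counts that appear in the workflow figures (e.g. "Supply(338)") could be tallied from boundary tasks as sketched below. The dictionary field names are assumptions for this sketch, not Castellan's actual event schema.

```python
from collections import Counter

def workflow_edges(tasks):
    """Count tasks forwarded between agents, grouped by verb.

    Each task is a dict with hypothetical fields 'verb', 'agent'
    (where the task was generated) and 'dest_agent' (where it was
    forwarded). Returns counts keyed by (src, dst, verb), matching
    labels like Supply(338) on workflow-graph edges.
    """
    counts = Counter()
    for t in tasks:
        if t["dest_agent"] != t["agent"]:   # boundary task crossing agents
            counts[(t["agent"], t["dest_agent"], t["verb"])] += 1
    return counts
```

Tasks handled entirely within one agent are filtered out, mirroring how workflow graphs hide the internal details of each agent's plan.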
An example of a workflow graph is shown in Figure 4, with a detailed blowup in Figure 5. This particular graph was extracted from an event trace consisting of more than 160,000 events and more than thirty agents; nevertheless, it clearly shows the input/output relationships between agents and the number of each type of task transmitted between agents. Each box contains abstract nodes belonging to a single agent. Here, the hexagonal nodes depict a set of tasks with identical verbs which are generated in a specific agent and subsequently forwarded to other agents for planning, the light-colored boxes depict abstract nodes representing tasks which are inputs to the agent, and the ovals represent tasks allocated to assets within the agent. This type of graph is useful for deriving the dependencies between agents by finding the types of tasks which are inputs to and outputs from agents and filtering out the internal details of the plan graph within each agent. A potential application of this system would be a smart load balancer which anticipates the generation and allocation of tasks and assigns higher priority accordingly.
Figure 2. Example of Task Aggregation<br />
In summary, each of these reduced graphs provides a different logical view of the overall task graph that is useful for different purposes.
Figure 3. Example Task Aggregate Graph Reduction
Figure 4. Example Workflow Graph
Figure 5. Example Workflow Graph (Detail)
4. Discussion
Applications of distributed logging and monitoring within large-scale agent societies include both offline static analysis and on-line sensing and monitoring.
4.1 Profiling and Debugging Applications
Conventional debugging tools are inadequate for large-scale agent-based systems. They cannot easily capture the detailed interactions that span multiple nodes, applications and plugins, nor can they track the dynamic evolution of system state during planning and execution. Moreover, agent systems may be non-deterministic, producing different results on each run. In the absence of suitable tools, understanding and debugging agent systems becomes exceedingly difficult.
Castellan can be used to analyze global agent society behavior. The plan graph reduction algorithms can be used to evaluate the completeness of the plan and to confirm whether or not the patterns of plan generation are correct. The approach can also identify groups of tasks which are not complete, i.e. which have not been associated with any plan element.
Profiling tools are also often useful for increasing and optimizing performance. Although distributed agent systems can benefit from parallelism, serial bottlenecks may be present; e.g. planning/execution may depend on single agents within the system that constrain the rest of the planning process.
The Castellan event traces can provide useful information by capturing the time-dependent evolution of the plan rather than a single snapshot at the end of planning. Moreover, the event traces measure the time taken to perform planning actions, enabling the identification of hotspots and bottlenecks.
4.2 On-Line Control and Monitoring Applications
The current version of Castellan supports applications such as workflow analysis, visualization and data mining that use static event trace databases. However, the concepts and approaches used in Castellan can be applied to on-line analysis as well.
Sensors and control strategies that require prediction of system performance can benefit from distributed monitoring. These include:
• Falling-behind sensors. We have used Castellan to extract data streams to build falling-behind sensors that can predict whether the society as a whole is falling behind due to excessive CPU load. In this case, the data from Castellan was used to train various neural network systems that would inform agents within the system when the society was in danger of falling behind.
• Load-balancing sensors. Based on the workflow analysis, it is possible to determine dynamically at runtime the flow of tasks between multiple agents within the monitored enclave. With such a model present, it becomes possible to identify the processing requirements of tasks as they flow through the agent society and hence allocate resources accordingly as planning/execution progresses.
5. Conclusions
Tools for monitoring agent systems have been noticeably missing from many agent infrastructures. As Cougaar has evolved and been applied to increasingly large societies and complex applications, the need for systems such as Castellan that provide detailed run-time event information will increase, both for analysis of static event traces and for on-line monitoring applications. Castellan also provides a general-purpose graph reduction algorithm that enables a wide variety of approaches to analyzing and understanding large distributed plan graphs.
6. Acknowledgements
This research was performed under the DARPA Ultralog effort and was supported by DARPA grant MDA972-1-1-0038 and Contract 2087-IAI-ARPA-0038. We would like to thank Dr. Mark Greaves, Marshall Brinn and Beth DePass for their support, comments and insightful discussions.
7. References
[1] E. R. Gansner and S. C. North, "An open graph visualization system and its applications to software engineering," Software Practice and Experience, pp. 1–5, 1999.
[2] Jonyer, L. B. Holder, and D. J. Cook, "Graph-Based Hierarchical Conceptual Clustering in Structural Databases," in Proceedings of the Seventeenth National Conference on Artificial Intelligence, 2000.
[3] Cougaar Developers Guide, Version 11.0, http://www.cougaar.org.
Reliable MAS Performance Prediction Using Queueing Models
Nathan Gnanasambandam, Seokcheon Lee, Natarajan Gautam, Soundar R.T. Kumara
Pennsylvania State University
State College, PA 16801
{gsnathan, stonesky, ngautam, skumara}@psu.edu
Wilbur Peng, Vikram Manikonda
Intelligent Automation Inc.
Rockville, MD 20855
{wpeng, vikram}@i-a-i.com
Marshall Brinn
BBN Technologies
10 Moulton Street, Cambridge, MA 02138
mbrinn@bbn.com
Mark Greaves
DARPA IXO
3701 North Fairfax Drive, Arlington, VA 22203-1714
mgreaves@darpa.mil
Abstract
In this paper, we model a multi-agent system (MAS) in military logistics based on the systemic specifications of the capabilities and attributes of individual agents (TechSpecs). Assuring the survivability of the MAS that implements distributed planning and execution is a significant design-time and run-time challenge. Dynamic battlefield stresses in military logistics range from heavy computational loads (information warfare) to destruction of infrastructure. In order to sustain and recover from damage and continuously deliver performance, a mechanism that distributes knowledge about the capabilities and strategies of the system is crucial. Using a queueing model to represent the network of distributed agents, strategies are developed for a prototype military logistics system. The TechSpecs contain the capabilities of the agents, playbooks or rules, quantities to monitor, types of information flow (input/output), measures of performance (Quality of Service) and their computation methods, measurement points, defenses against stresses, and configuration details (reflecting the command and control structure as well as task flow). With these details, models can be dynamically developed and analyzed in real time for fine-tuning the system. Using a Cougaar (DARPA agent framework) based model for initial parameter estimation and analysis, we obtain an analytical and a simulation model and extract generic results. Results indicate strong correlation between experimental and actual events in the agent society.
0-7803-8799-6/04/$20.00 ©2004 IEEE.
Keywords: Multi-agent systems, Survivability, Queueing network models, Technical specifications
1. Introduction
Multi-agent systems that implement distributed planning and execution are highly complex systems to design and model. In this research, we model a survivable multi-agent system (MAS) based on the systemic specifications (TechSpecs) of the capabilities and attributes of individual agents. The MAS under consideration is exposed to significant stresses because it operates in highly unpredictable battlefield-like environments. Even under such hostile conditions, the stated goal of this survivable MAS-based logistics system is to deliver robustness, security and performance. Hence, performance prediction using suitable models is vital to being able to tune the actual performance delivered by the MAS.
Within the research domain of military logistics, we are conducting our studies using a continuous planning and execution (CPE) agent society. The CPE society is constructed using the Cougaar MAS development platform developed under DARPA's leadership [2]. From the modeling perspective, the CPE society (or any other) is simply a collection of distributed agents that lend themselves to representation by a network of queues. With this motivation, we analytically modeled the CPE society using queueing theory. In doing so, we realized that if the TechSpecs were suitably specified, the generation of the queueing model could be accomplished with less human intervention.
Figure 1. Agent Hierarchy in CPE Society
The primary function of the model is to help evaluate the performance of the MAS and provide alternatives to steer the agent society towards optimal regions of operation, boosting performance in a distributed environment. Therefore the main focus of this research lies in specifying the MAS in a systematic fashion so that queueing models can be derived from the specification.
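As a minimal illustration of the kind of prediction such models enable, a single agent treated as an M/M/1 queue has closed-form steady-state estimates. This is a textbook simplification for illustration, not the paper's full queueing-network model.

```python
def mm1_metrics(arrival_rate: float, service_rate: float) -> dict:
    """Steady-state M/M/1 predictions for one agent modeled as a queue.

    arrival_rate: mean task arrival rate (lambda, tasks/sec)
    service_rate: mean task service rate (mu, tasks/sec)
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: utilization would be >= 1")
    rho = arrival_rate / service_rate  # utilization
    return {
        "utilization": rho,
        # L = rho / (1 - rho): mean number of tasks in the system
        "mean_tasks_in_system": rho / (1 - rho),
        # W = 1 / (mu - lambda): mean time a task spends in the system
        "mean_time_in_system": 1.0 / (service_rate - arrival_rate),
    }
```

For example, an agent receiving 2 tasks/sec and serving 4 tasks/sec is predicted to be 50% utilized with a mean sojourn time of 0.5 s per task; in the paper's setting such parameters would be estimated from the TechSpecs and Cougaar measurements.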
1.1 Continuous Planning <strong>and</strong> Execution Society<br />
Overview<br />
The CPE society comprises agents and a world model. Agents in the CPE society assume a combination of command and control, and customer-supplier roles as required in a military logistics scenario. The world model is an artificial
source that provides the agents with external stimuli.<br />
Figure 1 represents the superior-subordinate <strong>and</strong> the<br />
customer-supplier relations between the brigade (BDE),<br />
battalion (BN), company (CPY) <strong>and</strong> supplier (SUPP) agents<br />
as modeled in this research. Each agent in the society constantly performs one or more of the following tasks: 1) evaluating its own perception of the world state through local sensors and remote inputs; 2) performing planning, replanning, plan reconciliation and plan refinement; 3) executing plans, either through local actuators or by sending messages to other agents; 4) adapting to the environment, e.g., centralizing or decentralizing planning as computational resources permit.
1.2 Definitions<br />
The following definitions apply to the system under consideration.
Stresses occur due to the operation of the MAS in battlefield<br />
environments where events such as permanent infrastructure<br />
damage <strong>and</strong> information attacks adversely affect<br />
overall system performance.<br />
Given the planning activity in CPE, we base our measures of performance (MOPs) on the timeliness or freshness of a plan at its point of usage and on the quality of the plan. The requirements of Ultra*Log [3] define a broader series of performance measures categorized according to timeliness, completeness, correctness, accountability and confidentiality, but these are outside the requirements of CPE. Some insights about these MOPs can be gained from [6]. The MOPs are the components of the quality of service (QoS) expected from the system.
Survivability of a distributed agent based system (or otherwise)<br />
is the extent to which the quality of service (QoS)<br />
of the system is maintained under stress [6].<br />
Although we consider a survivable MAS, we only concern<br />
ourselves with performance analysis in this work. We<br />
assume that a global controller exists that coordinates between<br />
threads relating to performance, robustness <strong>and</strong> security.<br />
The contents of this paper are organized as follows. In Section 2, we introduce the concept of TechSpecs based design and some of the benefits associated with this approach. In Section 3, we discuss the components of the CPE society in detail and organize the TechSpecs for CPE into various categories. The discussion on TechSpecs leads us to how they can be utilized to form models; we discuss the models we created in Section 4, where we provide two analytical methods using queueing networks to model a small example in CPE and verify our models using a simulation. Finally, in Section 5 we discuss our conclusions and some possible directions for future research.
2. The Concept of TechSpecs Based Design<br />
Technical Specifications (or TechSpecs) refer to<br />
component-wise, static information relating to agent<br />
input/output behavior, operating requirements, control<br />
actions <strong>and</strong> their consequences for adaptivity [7]. In<br />
addition to outlining a comprehensive set of functionalities,<br />
the TechSpecs are responsible for the definition of domain<br />
MOPs, their respective computational methodologies <strong>and</strong><br />
QoS measurement points. The construction of TechSpecs<br />
helps us proceed in the following direction:<br />
1. Use the specs to ensure a close mapping between MAS<br />
functionality <strong>and</strong> an abstracted model. An apparent<br />
choice here is a queueing model because of similarities<br />
between multi-class traffic in queueing networks <strong>and</strong><br />
the different types of flows in CPE.<br />
2. Establish the parameters of the queueing model, both directly from the TechSpecs (e.g., the update rate at a node) and by collecting empirical data from sample runs (e.g., processing times).
3. As the queueing model provides an indication of system performance for a given configuration, use it to quickly explore options for control (choices resulting from adjusting (queueing) parameters or configurations). Once a suitable candidate is obtained, this choice is translated back into the application level knob settings (for control), resulting in better QoS for the MAS.

Figure 2. TechSpecs based MAS Design

The direction that TechSpecs motivates us to take is illustrated in Figure 2, which indicates that we could use the specs in either an online or an offline fashion. Because the functionality is clearly defined using TechSpecs, offline analysis can be carried out independently to remove instabilities from the MAS design. Assuming automatic conversion from a TechSpec to a model is feasible, TechSpecs have a real-time use as well, i.e., the specs serve as a template from which to derive the model. As noted above, the candidate parameters from the queueing model (parameters that may lead to performance improvement) cannot be used directly. Reconverting these choices to actual control knob settings may be handled by a separate global controller. We allude to this in Section 3.2.

It can be noted that the idea of TechSpecs bears analogy to conventional control problems in the electronic or hardware realms, where a technical specification or rating can be leveraged to effect better design and control. This was one of the motivating factors for TechSpecs based design for MAS.

Benefits of TechSpecs

The advantage of establishing comprehensive TechSpecs is that it leads to the codification of requirements, functionalities, measurements and responses to situations. Further, it enhances the potential to aid the MAS configuration (what nodes to put agents on) both statically and dynamically. An incomplete list of potential benefits of using a TechSpecs based approach to MAS design is provided below:

• Enhancement of the MAS Design: Since TechSpecs impose the requirement of predictability, the MAS components must be built with fidelity.

• Distribution of Knowledge: TechSpecs carry with them the idea of being composable. By using the TechSpecs of smaller components as building blocks, we can build the TechSpecs of larger systems as the system expands.

• Concurrent Analysis: Model building can be concurrent with actual MAS design, providing a look-ahead capability to avoid regions of instability or bottlenecks (especially from queueing analysis).
3. CPE Society TechSpecs<br />
In this section we discuss the formulation of TechSpecs.<br />
In order to build TechSpecs the functionalities of the components<br />
of the CPE society are defined as described in Section<br />
3.1. We then categorize the capabilities of CPE components<br />
in a manner that would lend itself to easy translation<br />
into the queueing models. We then show through examples<br />
how the mapping process between a TechSpec <strong>and</strong> a queueing<br />
model could be interpreted. This would enable us to<br />
analyze the MAS using the models we develop in Section<br />
4.<br />
3.1 Description of CPE Society Components<br />
The World Model: The world model refers to the<br />
conceptual set-up that provides the agents with external<br />
stimuli. It captures a military engagement scenario using a<br />
2-dimensional model of the world. As shown in Figure 3,<br />
CPY agents moving along the x-axis engage an unlimited<br />
supply of targets that move along the y-axis. The targets<br />
move at a fixed rate but engagement slows them down.<br />
While a probabilistic model is chosen for creating targets and engaging them, a deterministic model is chosen for
fuel consumption (which is dependent on the distance<br />
moved). A logistics model for resupplying the units with<br />
fuel or ammunition is based on the dem<strong>and</strong> generation<br />
from maneuver plans. Currently, the world model is also<br />
implemented as an agent.<br />
Figure 3. The World Model<br />
CPY Agent: Each CPY unit is designated a target area for engaging in combat actions. These actions require a superior agent (BN) to supply a maneuver plan to each of the CPY agents. This plan enables the CPY agent to move along the x-axis and engage the enemy by firing. Each of these agents simulates sensors and actuators. The CPY agents consume resources and subsequently forward the demand to SUPP agents. The current status is reported to superior agents to enable replanning.
BN Agent: The BN agent maintains situational awareness<br />
of all the agents under its direct comm<strong>and</strong> <strong>and</strong> performs<br />
(re)planning for them using a consistent set of observations<br />
that is collected continuously. The BN agent has to execute
a branch <strong>and</strong> bound algorithm of a specified planning depth<br />
<strong>and</strong> breadth to generate a maneuver plan for its subordinates.<br />
The BN agent serves as a medium for transferring<br />
orders from superiors to subordinates.<br />
BDE Agent: The BDE agent is responsible for generating<br />
maneuver plans for the BN <strong>and</strong> CPY agents although<br />
this implementation does not empower the BDE with that<br />
functionality.<br />
SUPP Agent: SUPP agents represent an abstracted set of supply, inventory and sustainment services.
These agents take maneuver plans from the CPY agents<br />
<strong>and</strong> supply them with fuel or ammunition. It is currently<br />
assumed that the SUPP units have infinite inventory. Projected<br />
<strong>and</strong> actual consumption depend on the sustainment<br />
plan generated from orders <strong>and</strong> the presence of enemy<br />
targets.<br />
3.1.1 TechSpec Organization<br />
Right at the outset, our goal is to embed enough transparency in the TechSpecs to allow the generation of models (queueing models). Hence, we extract the input/output behavior, state, actions and QoS for each entity within CPE and form the following categories within the TechSpecs:
• Internal State of an Agent: Corresponds to continuously updated variables or data structures corresponding to the actual working of the agent.

• Inputs: Relates to distinct classes of information received by an agent.
• Outputs: Information provided to other agents.<br />
• Actions: Determines the actions that need to be taken<br />
as a result of state changes or the dependencies introduced<br />
by input/output operations.<br />
• Operating Modes: The fidelity or the rate at which outputs<br />
are sent may relate to the operating mode of an<br />
agent. Switching operating modes may be necessary<br />
to alter QoS requirements or as a counter-measure for
stress.<br />
• QoS Measurement (QoS Measurement Points): Indicates<br />
the measure of performance that needs to be<br />
monitored or measured in order to compute the QoS<br />
at the designated measurement point. For example,<br />
when we consider queueing models, we would be interested<br />
in measuring the average waiting times at different<br />
agents to compute a quantity such as the freshness<br />
of the maneuver plan.<br />
• Tradeoffs: While these may not pertain to every agent, some agents have the capability to trade off a certain measure of performance to gain another. These are specified explicitly in the TechSpecs.

Table 1. TechSpec Categories: Application Perspective
This categorization facilitates the delineation of specific flows of jobs between agents. For example, consider the following flow: external stimuli at CPY get converted to update tasks at CPY, delivered to BN as updates, converted to a maneuver plan at BN, delivered to CPY and then forwarded to SUPP for sustainment. From a queueing theory perspective, the update tasks that originate at CPY and end up at BN for the purpose of planning could constitute a class of traffic, with CPY and BN acting as servers that process these tasks. Similarly, consider the flow where external stimuli received at CPY end up as updates at BDE through BN. This could be regarded as another class of traffic. At this point it is important to notice that classes of traffic can be derived from the input/output details embedded within the TechSpecs. We describe how we handle these flows in the queueing network formulation in Section 4.
Another example of how we could describe something in<br />
the application domain (say a QoS metric) with the queueing<br />
model is as follows. If one is interested in how fresh a<br />
maneuver plan is at its usage point (i.e. CPY), the model<br />
could describe it in terms of the queueing delays for a particular<br />
class of traffic. In our application, this very quantity happens to be a QoS metric called maneuver plan freshness.
In the actual MAS, this metric is calculated directly from the<br />
timestamps that are tagged to the tasks.<br />
3.1.2 TechSpec Representation<br />
Although an elaborate discussion of the format of TechSpec<br />
representation is outside the scope of this paper, we present<br />
some aspects of the specification directly relating to the application<br />
<strong>and</strong> some infrastructural requirements that need to<br />
be part of the specification.<br />
Table 1 represents some TechSpecs categories specific<br />
to this application. Simply speaking, this is a tabular representation<br />
of the information contained in Section 3.1 organized<br />
using the aforementioned categories. From Table 1 one can understand that an output called update originates from the CPY agent and travels up to BN because BN is CPY's superior. Similarly, an output called maneuver plan would reach CPY from BN. One assumption being made here is that updates travel up the hierarchy and plans travel down.
These outputs form part of the different classes of<br />
traffic if observed from a queueing perspective. Another example<br />
would be that the plan action in the BN agent relates<br />
to a functionality in the MAS domain <strong>and</strong> would simply be<br />
abstracted by a processing time in the queueing domain.<br />
In addition to the above specification, static requirements of the agents in terms of infrastructure are also embedded in the TechSpecs. Some of these requirements for the BDE, BN, CPY and SUPP agents are shown in Table 2.
3.2 Translating TechSpecs to the Queueing Domain<br />
In order to translate the specs into queueing models we<br />
first use the following rules:<br />
1. Inputs and outputs are regarded as tasks;

2. The rate at which external stimuli are received is captured by the arrival rate (λ);

3. Actions take time to perform, so they are abstracted by processing times (μ_i);

4. QoS metrics such as freshness are expressed in terms of average waiting times at several nodes (Σ W_ij, where i is the node and j is the class of traffic);

5. If tasks follow a particular route (or flow, as described in Section 3.1.1), that route is associated with a class of traffic;

6. If a task enters a node and is converted to another task, we say that class switching has occurred. For example, in our application update tasks go to BN and are converted to plan tasks;

7. If a connection exists between two nodes, it is converted to a transition probability p_ij, where i is the source and j is the target node.

Table 2. TechSpecs: Infrastructure Perspective

Using the above rules as well as the aforementioned representations of TechSpecs, we develop a mapping between the TechSpecs and a queueing model. Although the current procedure is manual, in theory it could be automated. Such an automatic capability for translating TechSpecs would prove very beneficial for predicting the performance of the MAS in real time. Table 3 captures the queueing model abstraction from TechSpecs for the CPY agents; the mapping for the other agents can be established similarly. Some useful guidelines that were followed in order to translate the TechSpecs into models are as follows:

• Identify the flows of traffic: Trace the route followed by each type of packet completely within the system boundary, i.e., from its entry into the system until it exits the system. These routes subsequently form the classes of traffic in the queueing model. Care has to be taken to note any class switching.

• Identify the network type: The network could be closed (fixed number of tasks) or open. The CPE is an open system because tasks constantly enter and exit the system.

• Determine whether any parameter of the model requires empirical data from the actual society.

Although some aspects in this research are currently being resolved, the following observations can be made:

• Who does the TechSpecs translation, and where does the model run? In our case the translation is done manually at present. The model would run at a place visible to the controller (possibly as a separate agent at the highest level). The controller we refer to here is the actual effector of control actions throughout the CPE society and is separate from all we have discussed so far. The role of the controller is also to balance between other threads such as robustness and security.

• The identification of control alternatives is currently centralized. However, we visualize a decentralized, hierarchical controller for effecting the changes.
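To make rules 1-7 concrete, the following Python sketch encodes a hypothetical TechSpec fragment as plain data and reads off queueing-model elements from it. The field names, rates and overall format here are our own illustrative assumptions, not the actual TechSpec representation:

```python
# Hypothetical TechSpec fragments for two agents.  Each entry lists inputs
# (rule 1/2), actions with service rates (rule 3) and outputs with their
# destinations (rules 1, 6, 7).  All names and numbers are illustrative.
techspec = {
    "CPY": {
        "inputs":  [{"name": "external_stimulus", "rate": 0.5}],  # rule 2: arrival rate
        "outputs": [{"name": "update", "to": "BN"}],              # rule 1: tasks
        "actions": [{"name": "assess", "service_rate": 10.0}],    # rule 3: processing
    },
    "BN": {
        "inputs":  [{"name": "update", "from": "CPY"}],
        "outputs": [{"name": "maneuver_plan", "to": "CPY"}],      # rule 6: class switch
        "actions": [{"name": "plan", "service_rate": 2.0}],
    },
}

def to_queueing_model(spec):
    """Derive (external arrival rates, service rates, links) from a TechSpec dict."""
    arrivals, services, links = {}, {}, []
    for agent, entry in spec.items():
        for inp in entry["inputs"]:
            if "rate" in inp:                # external stimulus -> lambda (rule 2)
                arrivals[agent] = inp["rate"]
        for act in entry["actions"]:         # actions -> processing rate mu (rule 3)
            services[agent] = act["service_rate"]
        for out in entry["outputs"]:         # connections -> routing links (rule 7)
            links.append((agent, out["to"], out["name"]))
    return arrivals, services, links

arrivals, services, links = to_queueing_model(techspec)
```

In an automated pipeline of the kind envisioned here, `links` would supply the transition probabilities p_ij, and `arrivals`/`services` the λ and μ_i parameters of the models in Section 4.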
4. Queueing Network Models (QNMs)<br />
A complex logistics system such as the CPE society has<br />
numerous interactions. Yet, if the functionalities are abstracted<br />
to capture some application level specifics in terms<br />
of queueing model elements (example as shown in Table 3),<br />
analytical predictions on the behavior of the MAS can be<br />
made. Analytical models are good candidates for enforcing adaptive control quickly and in real time. Each agent behaves like a server that processes jobs waiting in line. Hence, the mapping between an agent and a server with a queue is easily established. Because of the task flow structure and the superior-subordinate relationships in the TechSpecs, queues can be connected in tandem, with jobs entering and exiting the system. This results in the formation of an open queueing network.
We conducted initial experiments using an actual<br />
Cougaar based MAS, an analytical formulation <strong>and</strong> an<br />
Arena simulation. We used this experiment to bootstrap<br />
our modeling process in terms of parameter estimation <strong>and</strong><br />
calibration. However, working with the MAS was time-consuming, as our goal was to identify modeling alternatives and control ramifications. Hence we continued our experimentation with a scaled-up queueing model and simulation, using the insight gained from working with the actual society. Thus the open queueing network's parameters were carefully chosen, and tasks were subdivided into multiple classes, each denoting a particular task type within the MAS. The TechSpecs clearly delineate the input and output tasks, facilitating the
mapping to arrivals and services in a queueing network. Application level QoS measures of the MAS are calculated in terms of the waiting times (or other equivalent performance measures) at the individual nodes of the QNM.

Table 3. Queuing Model Abstraction from TechSpecs for CPY Agent

Figure 4. Task Flow in the MAS
Figure 4 is a representation of the CPE society from a<br />
queueing perspective. We show two types of tasks flowing<br />
in the network namely the plan (denoting maneuver <strong>and</strong><br />
sustainment) <strong>and</strong> the update tasks. These tasks can be divided<br />
further into three classes of traffic. The first class<br />
refers to update packets entering at the CPY nodes <strong>and</strong> proceeding<br />
further as updates to BDE through BN. Class 2<br />
relates to those update packets that are converted to plan<br />
tasks. There is class-switching at nodes 2 <strong>and</strong> 3 <strong>and</strong> we introduce<br />
approximations to deal with this later in the paper.<br />
The third class relates to the maneuver plan tasks that reach<br />
SUPP nodes through CPY. Although we know multiple task types exist in the MAS, we first analyze the MAS by making the simplifying assumption that all job classes are alike, using Jackson networks [5] in Section 4.1. We then analyze the system taking into account multiple classes of traffic
as discussed in Section 4.2. We compare the two analytical<br />
approaches with a simulation model.<br />
4.1 Jackson Network Model<br />
We apply a single class Jackson network [5] formulation<br />
for open queuing networks to our example by choosing<br />
a weighted average service time for nodes with multiple<br />
classes. The nine agents of the MAS considered here can<br />
then be assumed to be M/M/1 systems. The arrival rates<br />
of the open network can be computed by solving the traffic<br />
equations. Assuming the load is balanced to start with, the
routing probabilities are also known. If each node of the<br />
system is ergodic, we can calculate the steady state probabilities<br />
<strong>and</strong> performance measures of the entire network by<br />
computing these measures for every agent exactly as in an<br />
M/M/1 system.<br />
We consider a simple example. For this queueing<br />
model, we assume all tasks are of a single type <strong>and</strong> do not<br />
distinguish between classes, as shown in Figure 4. Let λ_{0i} and λ_{i0} be the rates of arrival into and exit from the i-th node, respectively. Since the routing probabilities are known, we can calculate the arrival rates λ_i of each of the nodes of the open network by solving the following traffic equations:

\lambda_i = \lambda_{0i} + \sum_{j=1}^{9} \lambda_j p_{ji}, \qquad i = 1, \ldots, 9.
The routing probabilities (p_{ji}: probability from i (column index) to j (row index)) for the balanced case are as follows:
0 1/5 1/5 0 0 0 0 0 0<br />
0 0 0 1/4 1/4 1/4 1/4 0 0<br />
0 0 0 1/4 1/4 1/4 1/4 0 0<br />
0 1/5 1/5 0 0 0 0 0 0<br />
0 1/5 1/5 0 0 0 0 0 0<br />
0 1/5 1/5 0 0 0 0 0 0<br />
0 1/5 1/5 0 0 0 0 0 0<br />
0 0 0 1/4 1/4 1/4 1/4 0 0<br />
0 0 0 1/4 1/4 1/4 1/4 0 0<br />
Note that a customer exits from node i with probability 1 - \sum_j p_{ji}. Once the arrival rates are known, we can calculate the average waiting times at the nodes by using the following formula:

W_i = \frac{1/\mu_i}{1 - \lambda_i/\mu_i}, \qquad i = 1, \ldots, 9.
The QoS metrics, namely maneuver plan freshness (MPF) and sustainment plan freshness (SPF), are calculated in terms of the average waiting times of the nodes at each level (W_{CPY}, W_{BN}, W_{SUPP}) as follows:

MPF = 2W_{CPY} + W_{BN},
SPF = 2W_{CPY} + W_{BN} + W_{SUPP}.
If the load is not balanced <strong>and</strong> the waiting times are different<br />
for the different branches, the QoS measures are accordingly<br />
calculated. It can be observed that two methods of control are immediately obvious: 1) adjust the μ_i so that we process faster where possible; 2) alter the transition probabilities p_{ji} to divert traffic to nodes that are less loaded. Although we allude to some control methods, these are outside the scope of this paper.
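As a sketch of how the Jackson model above can be computed, the following Python fragment solves the traffic equations for the balanced routing matrix and derives the per-node waiting times and the freshness metrics. The external arrival and service rates are illustrative assumptions, not measured values from the society:

```python
import numpy as np

# 9-node Jackson sketch for the CPE example (0-indexed: node 0 = BDE,
# nodes 1-2 = BN, nodes 3-6 = CPY, nodes 7-8 = SUPP).  P[j, i] is the
# probability that a task leaving node i goes to node j, matching the
# balanced routing matrix in Section 4.1.
P = np.zeros((9, 9))
P[0, 1:3] = 1/5        # BN  -> BDE
P[1:3, 3:7] = 1/4      # CPY -> BN
P[3:7, 1:3] = 1/5      # BN  -> CPY
P[7:9, 3:7] = 1/4      # CPY -> SUPP

lam0 = np.zeros(9)
lam0[3:7] = 0.5        # assumed: external stimuli arrive only at the CPY nodes

# Traffic equations: lambda = lambda0 + P @ lambda  =>  (I - P) lambda = lambda0
lam = np.linalg.solve(np.eye(9) - P, lam0)

mu = np.full(9, 10.0)          # assumed service rates; each node is M/M/1
W = (1/mu) / (1 - lam/mu)      # mean time in node: (1/mu) / (1 - rho)

# Freshness metrics in the balanced case (W is equal within each level)
MPF = 2*W[3] + W[1]
SPF = 2*W[3] + W[1] + W[7]
```

With these assumed rates the CPY/BN/SUPP arrival rates come out to 5/6 each and the BDE rate to 1/3, so every node is stable (λ_i < μ_i) and the metrics are finite.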
4.2 BCMP Network Model<br />
We apply the Baskett, Chandy, Muntz and Palacios (BCMP) algorithm [5] with a small modification to the above example. The network considered here consists of nine nodes and three classes of traffic. The first class corresponds to the stream of tasks that enter the CPY nodes and are sent to
BDE through BN as updates. The second class corresponds<br />
to the tasks that enter the CPY nodes <strong>and</strong> get sent to the<br />
BN nodes for planning. The second class is converted to a<br />
plan <strong>and</strong> fed back to the CPY nodes. As class-switching occurs<br />
here we make a first order approximation <strong>and</strong> feed this<br />
as an independent class back at CPY nodes as tasks of the<br />
third class. Since most tasks are of the update type it makes<br />
sense to serve the latest update first <strong>and</strong> hence we follow the<br />
LCFS-PR (last come first served with preemptive resume)<br />
scheme wherever there are multiple classes. This allows us
to assume the service rates to be exponential. Since all tasks<br />
arrive from the environment we assume the arrival process<br />
to be a Poisson Process.<br />
If λ_{ir} is the arrival rate of the r-th class at the i-th node, λ_{0,ir} is the external arrival rate of the r-th class at the i-th node, and p_{js,ir} is the probability that a task of class s at the j-th node is transferred to a task of class r at the i-th node, then the arrival rates for each class at the individual nodes can be calculated using the following traffic equations:

\lambda_{ir} = \lambda_{0,ir} + \sum_{j=1}^{9} \sum_{s=1}^{3} \lambda_{js} p_{js,ir}, \qquad i = 1, \ldots, 9.
The routing probabilities (p_{ji}: probability from i (column index) to j (row index)) for the class 1 tasks (the portion of update tasks that go to BDE) are as follows:
0 1 1 0 0 0 0 0 0<br />
0 0 0 1/2 1/2 1/2 1/2 0 0<br />
0 0 0 1/2 1/2 1/2 1/2 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
The routing probabilities (p_{ji}: probability from i (column index) to j (row index)) for the class 2 tasks (the portion of update tasks that leave the system at node 2 or 3) are as follows:
0 0 0 0 0 0 0 0 0<br />
0 0 0 1/2 1/2 1/2 1/2 0 0<br />
0 0 0 1/2 1/2 1/2 1/2 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
The routing probabilities (p_{ji}: probability from i (column index) to j (row index)) for the class 3 tasks (the portion of tasks that enter node 4, 5, 6 or 7 and proceed to node 8 or 9) are as follows:
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 0 0 0 0 0 0<br />
0 0 0 1/2 1/2 1/2 1/2 0 0<br />
0 0 0 1/2 1/2 1/2 1/2 0 0<br />
Once the arrival rates for the different classes at all nodes are known, the waiting time (W_{ir} or W_{i,r}) at node i for class r was calculated as follows:

W_{ir} = \frac{\lambda_{ir}/\mu_{ir}}{\left(1 - \sum_{r=1}^{3} \lambda_{ir}/\mu_{ir}\right)\mu_{ir}}.

The application level QoS measures were calculated in terms of the node level average waiting times of the different classes of the BCMP network as follows:

MPF = W_{CPY,2} + W_{BN,2} + W_{CPY,3},
SPF = W_{CPY,2} + W_{BN,2} + W_{CPY,3} + W_{SUPP,3}.
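The multi-class calculation can be sketched in the same style. The following Python fragment encodes the three per-class routing matrices and applies the waiting-time formula above. All numeric rates, and the treatment of the fed-back class 3 stream as an external arrival, are illustrative assumptions consistent with the paper's first-order approximation:

```python
import numpy as np

# 9 nodes (0-indexed: 0 = BDE, 1-2 = BN, 3-6 = CPY, 7-8 = SUPP) and 3 classes.
# R[c][j, i] is the probability that a class-c task moves from node i to node j,
# following the three routing matrices in Section 4.2.
R = np.zeros((3, 9, 9))
R[0, 0, 1:3] = 1.0     # class 1: BN  -> BDE (updates continue upward)
R[0, 1:3, 3:7] = 1/2   # class 1: CPY -> BN
R[1, 1:3, 3:7] = 1/2   # class 2: CPY -> BN (leaves after planning)
R[2, 7:9, 3:7] = 1/2   # class 3: CPY -> SUPP (plans forwarded for sustainment)

lam0 = np.zeros((3, 9))
lam0[0, 3:7] = 0.3     # assumed: class 1 updates enter at the CPY nodes
lam0[1, 3:7] = 0.2     # assumed: to-be-planned class 2 updates enter at CPY
lam0[2, 3:7] = 0.2     # class 3 plans fed back at CPY (first-order approximation)

# Per-class traffic equations; the given matrices contain no cross-class routing
lam = np.stack([np.linalg.solve(np.eye(9) - R[c], lam0[c]) for c in range(3)])

mu = np.full((3, 9), 10.0)    # assumed per-class service rates
rho = lam / mu
util = rho.sum(axis=0)        # total utilization per node (must stay below 1)
W = rho / ((1 - util) * mu)   # per-class waiting time, per the formula above

# Freshness metrics from per-class node waiting times (balanced case)
MPF = W[1, 3] + W[1, 1] + W[2, 3]
SPF = MPF + W[2, 7]
```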
Figure 5. Maneuver Plan Freshness using<br />
Jackson Network<br />
Figure 6. Maneuver Plan Freshness using<br />
BCMP Network<br />
4.3 Discussion<br />
We assume that the load is initially balanced. Yet in the<br />
unbalanced case, waiting times for the different branches<br />
can be calculated separately.
We studied the impact of changing the processing rates at<br />
the nodes to illustrate the benefit of deriving an online queueing
model that could form an integral part of a controller.<br />
Three methods were followed: 1) Jackson network model,<br />
2) BCMP network model, 3) A Discrete-Event Simulation<br />
Model in Arena [1]. We compute the maneuver plan <strong>and</strong><br />
sustainment plan freshness from the average waiting times<br />
of the individual nodes. We assume the processing rate for class 1 tasks, μ_update_tasks = 10 Mb/s, at all the nodes. We assume that the overall arrival rate from the environment follows a Poisson process with λ = 2 Mb/s. We
vary the processing rates for the class 2 tasks at BN <strong>and</strong><br />
CPY <strong>and</strong> observe the impact on maneuver plan freshness as<br />
shown in Figure 5 and Figure 6. The low value of the processing rate at the BN agent for class 2 tasks is in line with reality, wherein the BN agent implements a search procedure that is more time-consuming than processing class 1 tasks, which are updates meant for superiors in the chain of command. We found that the Jackson network matched
reasonably well with the simulation results. The multi-class<br />
BCMP method performed better than the Jackson network<br />
because it was able to capture more of the MAS’s characteristics<br />
using different classes of traffic. This can be observed<br />
by comparing Figure 5 <strong>and</strong> Figure 6 with Figure 7.<br />
We consider only two parameters (processing rates for<br />
the class 2 tasks at BN <strong>and</strong> CPY) for variation <strong>and</strong> nine<br />
experiments for each method. We do this to keep the calculations<br />
simple. It can be observed from Figure 5 <strong>and</strong> Figure<br />
6 that adjusting the processing rate at BN impacts the QoS significantly more than altering the processing rates at CPY.
Hence, to increase performance, the controller may have to<br />
adjust the application level knobs to provide a greater processing<br />
rate for the planning tasks. Similarly, other trends<br />
can be observed by adjusting other parameters.<br />
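A parameter sweep of this kind can be sketched with the single-class formulas from Section 4.1. In the Python fragment below, the arrival rates and the CPY service rate are illustrative assumptions; only the BN rate is varied, mirroring the experiment around Figures 5 and 6:

```python
# Sweep the BN processing rate and observe maneuver plan freshness (MPF)
# under the single-class Jackson approximation.  Balanced-case arrival rates
# and the CPY service rate below are illustrative assumptions.
lam_bn, lam_cpy = 5/6, 5/6
mu_cpy = 10.0

def mpf(mu_bn, mu_cpy=mu_cpy):
    # M/M/1 mean time in node: (1/mu) / (1 - lambda/mu), valid while lambda < mu
    w = lambda lam, mu: (1/mu) / (1 - lam/mu)
    return 2*w(lam_cpy, mu_cpy) + w(lam_bn, mu_bn)

for mu_bn in (1.0, 2.0, 5.0, 10.0):
    print(f"mu_BN={mu_bn:5.1f}  MPF={mpf(mu_bn):.4f}")
```

As expected from the discussion above, MPF falls sharply as the BN processing rate increases, which is the trend a controller would exploit when choosing knob settings.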
With these models, we believe it is possible to identify unstable regions and steer the MAS towards regions providing better QoS. The running time of these models in Matlab is less than one second per iteration. If embedded within the system, several alternate, feasible system configurations can be simulated to identify candidate choices for performance improvement.
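As a concrete sketch of the Jackson-network freshness computation used above, the fragment below solves the traffic equations and sums the per-node M/M/1 sojourn times along a route. The three-node topology, routing matrix, and rate values are illustrative assumptions, not the exact configuration of the experiments.

```python
import numpy as np

# Hypothetical 3-node tandem route (illustrative; not the paper's topology).
mu = np.array([10.0, 6.0, 10.0])      # service rates per node (Mb/s)
gamma = np.array([2.0, 0.0, 0.0])     # external Poisson arrival rates (Mb/s)
P = np.array([[0.0, 1.0, 0.0],        # routing probabilities node i -> j
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])       # last node routes out of the network

# Jackson traffic equations: lam = gamma + P^T lam
lam = np.linalg.solve(np.eye(3) - P.T, gamma)

# Mean sojourn time at each M/M/1 node: T_i = 1 / (mu_i - lam_i)
T = 1.0 / (mu - lam)

# Freshness proxy: total mean delay accumulated along the plan's route
freshness_delay = T.sum()
print(lam, T, freshness_delay)
```

For the multi-class BCMP variant, each node would carry a per-class service rate and the traffic equations would be solved per class.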
5. Results and Future Directions

The hierarchy within the MAS, the specification of static attributes, and the similarity between a distributed MAS-based planning procedure and a queueing network with multiple classes together facilitate performance modeling of the MAS using QNMs. TechSpecs are a structured method to encapsulate static data and distribute them; this distribution matters because agent-based planning applications are inherently distributed. From TechSpecs, queueing models (offline and online) can be developed for a cluster of nodes, and the QNM then serves as a performance analysis tool for that cluster.
The main contribution of this work is the identification of TechSpecs as a template that can guide MAS design and model development concurrently. We have codified the static attributes of the MAS so that QNMs can be constituted from distributed information, even in real time. This technique for adaptivity, using a model on demand to predict trends in QoS, may be helpful in building survivable systems.

Currently, work is ongoing to identify a representation of TechSpecs with reasoning and deduction capabilities, such as OWL [4]. A module that automatically converts this representation of TechSpecs into queueing models would be useful in this endeavor. An approach that identifies alternate choices for performance improvement is also necessary. Finally, a controller that actually uses the analysis from the QNMs to optimize the global utility is also being pursued.

Figure 7. Maneuver Plan Freshness using Simulation

Acknowledgements

The work described here was performed under the DARPA UltraLog Grant#: MDA972-1-1-0038. The authors wish to acknowledge DARPA for their generous support.

References

[1] Arena. www.arenasimulation.com. Rockwell Automation.
[2] Cougaar open source site. http://www.cougaar.org. DARPA.
[3] UltraLog program site. http://www.ultralog.net. DARPA.
[4] Web-Ontology (WebOnt) Working Group. http://www.w3.org/2001/sw/WebOnt/.
[5] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley and Sons, Inc., 1998.
[6] M. Brinn and M. Greaves. Leveraging agent properties to assure survivability of distributed multi-agent systems. Proceedings of the Second Joint Conference on Autonomous Agents and Multi-Agent Systems (Poster Session), 2003.
[7] A. Cassandra, D. Wells, M. Nodine, and P. Pazandak. TechSpecs: Content, issues and nomenclature. Technical Report, Telcordia Inc. and OBJS Inc., 2003.
Supply Chain Network: A Complex Adaptive Systems Perspective

AMIT SURANA†, SOUNDAR KUMARA‡*, MARK GREAVES**, USHA NANDINI RAGHAVAN
In an era where information technology is revolutionizing almost every domain of technology and society, a quieter "complexity revolution" is taking place in science. In this paper we look at the impact of the two in the context of supply chain networks. With the advent of information technology, supply chains have acquired a complexity almost equivalent to that of biological systems. However, one of the major challenges in supply chain management is the deployment of coordination strategies that lead to adaptive, flexible and coherent collective behavior in supply chains. The main hurdle has been the lack of principles governing how supply chains with complex organizational structure and function arise and develop, and what organization and functionality are attainable given specific kinds of lower-level constituent entities. The study of Complex Adaptive Systems (CAS) is a research effort attempting to find common characteristics and/or formal distinctions among complex systems arising in diverse domains (such as biology, social systems, ecology and technology) that might lead to a better understanding of how complexity occurs, whether it follows any general scientific laws of nature, and how it might be related to simplicity. In this paper we argue that supply chains should be treated as CAS. With this recognition, we propose how various concepts, tools and techniques used in the study of CAS can be exploited to characterize and model supply chain networks. These tools and techniques are based on the fields of nonlinear dynamics, statistical physics and information theory.
1. Introduction

A supply chain is a complex network with an overwhelming number of interactions and interdependencies among different entities, processes and resources. The network is highly nonlinear, shows complex multi-scale behavior, has structure spanning several scales, and evolves and self-organizes through a complex interplay of its structure and function. This sheer complexity, with its inevitable lack of predictability, makes supply chain networks difficult to manage and control. Furthermore, changing organizational and market trends require supply chains to be highly dynamic, scalable, reconfigurable, agile and adaptive: the network should sense and respond effectively and efficiently to satisfy customer demand. Supply chain management requires that decisions made by business entities take more global factors into consideration. The successful integration of the entire supply chain process now depends heavily on the availability of accurate and timely information that can be shared by all members of the supply chain. Information technology, with its capability of setting up dynamic information exchange networks, has been a key enabling factor in shaping supply chains to meet such requirements. A major obstacle remains, however, in the deployment of coordination and decision technologies to achieve complex, adaptive and flexible collective behavior in the network. This is due to the lack of our understanding of organizational, functional and evolutionary aspects of supply chains. A

† Department of Mechanical Engineering, The Massachusetts Institute of Technology, Cambridge, MA 02139, email: surana@mit.edu
‡ The Harold and Inge Marcus Department of Industrial & Manufacturing Engineering, University Park, PA 16802, email: skumara@psu.edu. * Corresponding Author
** IXO, DARPA, 3701 North Fairfax Drive, Arlington, VA 22203-1714, mgreaves@darpa.mil
key realization in tackling this problem is that supply chain networks should not be treated merely as "systems", but as "Complex Adaptive Systems" (CAS). The study of CAS augments systems theory and provides a rich set of tools and techniques to model and analyze the complexity arising in systems across science and technology. In this paper we take this perspective in dealing with supply chains and show how various advances in the realm of CAS provide novel and effective ways to characterize, understand and manage their emergent dynamics.

A similar viewpoint has been emphasized in (Choi et al. 2001). The focus of Choi et al. was to demonstrate how supply networks should be managed once we recognize them as CAS. The concept of CAS allows one to understand how supply networks, as living systems, co-evolve with the rugged and dynamic environment in which they exist, and to identify patterns that arise in such evolution. The authors conjecture various propositions relating the behavior patterns of individual agents in a supply network to the emergent dynamics of the network. One of the important deductions made is that when managing supply networks, managers must appropriately balance how much to control and how much to let emerge. However, no concrete framework has been suggested under which such conjectures can be verified and generalized. The aim of this paper is to show how the theoretical advances made in the realm of CAS can be used to study such issues systematically and formally in the context of supply chain networks.
This paper is divided into eight sections, including the introduction. In Section 2 we give a brief introduction to complex adaptive systems, discussing the architecture and characteristics of complex systems in diverse areas encompassing biology, social systems, ecology and technology. In Section 3 we discuss the characteristics of supply chain networks and argue that they should be understood as CAS. We also present some emerging trends in supply chains and the increasingly critical role of information technology in supply chain management in the light of these trends. In Section 4 we give a brief overview of the main techniques that have been used for the modeling and analysis of supply chains, and then discuss how the science of complexity provides a genuine extension and reformulation of these approaches. As with any CAS, the study of supply chains should involve a proper balance of simulation and theory. System dynamics based and, more recently, agent based simulation models (inspired by complexity theory) have been used extensively to make theoretical investigations of supply chains feasible and to support decision-making in real-world supply chains. The system dynamics approach often leads to models of supply chains that can be described as dynamical systems. Dynamical systems theory provides a powerful framework for rigorous analysis of such models and can thus supplement the system dynamics simulation approach. We illustrate this in Section 5, using some nonlinear models that consider the effect of priority, heterogeneity, feedback, delays and resource sharing on the performance of a supply chain. Furthermore, the large volumes of data generated from simulations can be used to understand the emergent dynamics of supply chains. Even though an exact understanding of the dynamics is difficult in complex systems, archetypal behavior patterns can often be recognized using techniques from complexity theory such as Nonlinear Time Series Analysis and Computational Mechanics, which are discussed in Section 6. The benefits of integrated supply chain concepts are widely recognized, but analytical tools that can exploit those benefits are scarce. In order to study supply chains as a whole, it is critical to understand the interplay of the organizational structure and functioning of supply chains. Network dynamics, an extension of nonlinear dynamics to networks, provides a systematic framework to deal with such issues and is discussed in Section 7. We conclude in Section 8 with recommendations for future research.
2. Complex Adaptive Systems

Many natural systems, and increasingly many artificial (man-made) systems as well, are characterized by apparently complex behaviors that arise as the result of nonlinear spatiotemporal interactions among a large number of components or subsystems. We use the terms agent and node interchangeably to refer to these components or subsystems. Examples of such natural systems include immune systems, nervous systems, multi-cellular organisms, ecologies, insect societies and social organizations. However, such systems are not confined to biology and society. Engineering theories of control, communications and computing have matured in recent decades, facilitating the creation of various large-scale systems which have turned out to possess bewildering complexity, almost equivalent to that of biological systems. Systems sharing this property include parallel and distributed computing systems, communication networks, artificial neural networks, evolutionary algorithms, large-scale software systems, and economies. Such systems have commonly been referred to as Complex Systems (Baranger, Flake 1998, Adami 1998, Bar-Yam 1997). At present, however, the notion of a complex system is not precisely delineated.
The most remarkable phenomenon exhibited by complex systems is the emergence of highly structured collective behavior over time from the interaction of simple subsystems without any centralized control. Their typical characteristics include: dynamics involving interrelated spatial and temporal effects, correlations over long length and time scales, strongly coupled degrees of freedom, non-interchangeable system elements, existence in quasi-equilibrium, and a combination of regularity and randomness (i.e. an interplay of chaos and non-chaos). Such systems have structures spanning several scales and show emergent behavior. Emergence is generally understood as a process that leads to the appearance of structure not directly described by the defining constraints and instantaneous forces that control a system. The combination of structure and emergence leads to self-organization, which is what happens when an emerging behavior has the effect of changing an existing structure or creating a new one. Complex Adaptive Systems are a special category of complex systems, introduced to accommodate living beings; as the name suggests, they are capable of changing themselves to adapt to a changing environment. In this regard many artificial systems like those stated earlier can be considered CAS, due to their capability of evolving. The coexistence of competition and cooperation is another dichotomy exhibited by CAS.
A CAS can be considered as a network of dynamical elements where the states of both the nodes and the edges can change, and the topology of the network itself often evolves in time in a nonlinear and heterogeneous fashion. A dynamical system can be considered as simply behaving: "obeying the laws of physics". From another perspective, it can be viewed as processing information: how systems get information, how they incorporate it into models of their surroundings, and how they make decisions on the basis of these models determines how they behave (Lloyd and Slotine 1996). This leads to one of the more heuristic definitions of a complex system: one that "stores, processes and transmits information" (Sawhill 1995). From a thermodynamic viewpoint, the total energy (or its analogue) of such systems is unknown, yet something is known about the internal state structure. In these large open systems (which do not possess well-defined boundaries), energy enters at low entropy and is dissipated. Open systems organize largely due to the reduction in the number of active degrees of freedom caused by dissipation. Not all behaviors or spatial configurations can be supported; the result is a limitation on the collective modes, cooperative behaviors, and coherent structures that an open system can express. A central goal of the sciences of complex systems is to understand the laws and mechanisms by which complicated, coherent global behavior can emerge from the collective activities of relatively simple, locally interacting components.
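As a toy illustration of coherent global behavior emerging from purely local interactions, the sketch below (an assumption for illustration, not a model from this paper) lets nodes on a ring repeatedly average their state with their two neighbours; each node sees only local information, yet a single coherent global value emerges.

```python
import numpy as np

# Ring of n locally-coupled nodes: each node only interacts with its two
# neighbours, yet the network converges to one coherent global state.
rng = np.random.default_rng(0)
n = 20
A = np.zeros((n, n))                   # ring adjacency: two neighbours each
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0

x = rng.uniform(0.0, 1.0, n)           # random, incoherent initial states
x0_mean = x.mean()
eps = 0.2                              # coupling strength
for _ in range(2000):
    x = x + eps * (A @ x / 2.0 - x)    # move toward the neighbour average

spread = x.max() - x.min()             # shrinks toward 0 as states cohere
print(spread)
```

The averaging rule conserves the global mean while dissipating all other degrees of freedom, a minimal instance of organization through the reduction of active degrees of freedom.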
Complexity arises in natural systems through evolution, while design plays an analogous role for complex engineering systems. Convergent evolution/design leads to remarkable similarities at higher levels of organization, though at the molecular or device level natural and man-made systems differ significantly. Complexity in both cases is driven far more by the need for robustness to uncertainty in the environment and component parts than by basic functionality. Through design/evolution, such systems develop highly structured, elaborate internal configurations, with layers of feedback and signaling. It is the protocols that organize highly structured and complex modular hierarchies to achieve robustness, but they also create fragilities to rare or ignored perturbations. The evolution of protocols can lead to a robustness/complexity/fragility spiral, where complexity added for robustness also adds new fragilities, which in turn lead to new and thus spiraling complexities (Csete and Doyle 2002). All this complexity remains largely hidden in normal operation, becoming conspicuous acutely when contributing to rare cascading failures, or chronically through fragility/complexity evolutionary spirals. Highly Optimized Tolerance (HOT) (Carlson and Doyle 1999) has recently been introduced to focus on this "robust, yet fragile" nature of complexity. It is also becoming increasingly clear that robustness and complexity in biology, ecology, technology and social systems are so intertwined that they must be treated in a unified way. Given the diversity of systems falling into this broad class, the discovery of any commonalities or "universal" laws underlying such systems requires a very general theoretical framework.
The scientific study of CAS has been attempting to find common characteristics and/or formal distinctions among complex systems that might lead to a better understanding of how complexity develops, whether it follows any general scientific laws of nature, and how it might be related to simplicity. The attractiveness of the methods developed in this research effort for general-purpose modeling, design and analysis lies in their ability to produce complex emergent phenomena out of a small set of relatively simple rules, constraints and relationships, couched in either quantitative or qualitative terms. We believe that the tools and techniques developed in the study of CAS offer rich potential for the design, modeling and analysis of large-scale systems in general and supply chains in particular.
3. Supply Chain Networks as Complex Adaptive Systems

A supply chain network is one in which information, products and finances are transferred between various suppliers, manufacturers, distributors, retailers and customers. A supply chain is characterized by a forward flow of goods and a backward flow of information. Typically a supply chain comprises two main business processes: material management and physical distribution (Min and Zhou 2002). Material management supports the complete cycle of material flow, from the purchase and internal control of production material, to the planning and control of work-in-process, to the warehousing, shipping and distribution of finished products. Physical distribution, on the other hand, encompasses all the outbound logistics activities related to providing customer service. Combining the activities of material management and physical distribution, a supply chain does not merely represent a linear chain of one-on-one business relationships, but a web of multiple business networks and relationships.
A supply chain network is an emergent phenomenon. From the view of each individual entity, the supply chain is self-organizing. Although the totality may be unknown, individual entities partake in the grand establishment of the network by engaging in localized decision-making, i.e. doing their best to select capable suppliers and ensure on-time delivery of products to their buyers. The network is characterized by nonlinear interactions and strong interdependencies between the entities. In most circumstances, order and control in the network are emergent, as opposed to predetermined; control is generated through simple yet nonlinear behavioral rules that operate on local information. We argue that a supply chain network forms a complex adaptive system:
• Structures spanning several scales: The supply chain network is a bi-level hierarchical and heterogeneous network where, at the higher level, each node represents an individual supplier, manufacturer, distributor, retailer or customer, while at the lower level the nodes represent the physical entities that exist inside each upper-level node. The heterogeneity of most networks is a function of the various technologies being provided by whatever vendor could supply them at the time their need was recognized.
• Strongly coupled degrees of freedom and correlations over long length and time scales: Different entities in a supply chain typically operate autonomously, with different objectives and subject to different sets of constraints. However, when it comes to improving due-date performance, increasing quality or reducing costs, they become highly inter-dependent. It is the flow of material, resources, information and finances that provides the binding force. The welfare of any entity in the system directly depends on the performance of the others and on their willingness and ability to coordinate. This leads to correlations between entities over long length and time scales.
Figure 1. Supply Chain Network
• Coexistence of Competition and Cooperation: The entities in a supply chain often have conflicting objectives. Competition abounds in the form of sharing of, and contention for, resources. Global control over nodes is an exception rather than the rule; more likely is a localized cooperation out of which a global order emerges, itself unpredictable.
• Nonlinear dynamics involving interrelated spatial and temporal effects: Supply chains have wide geographic distribution. Customers can initiate transactions at any time with little or no regard for existing load, contributing to a dynamic and noisy network character. The characteristics of a network tend to drift as workloads and configuration change, producing non-stationary behavior. The coordination protocols attempt to arbitrate among entities with resource conflicts. Arbitration is not perfect, however; hence over- and under-corrections contribute to the nonlinear character of the network.
• Quasi-equilibrium and a combination of regularity and randomness (i.e. interplay of chaos and non-chaos): The general tendency of a supply chain is to maintain a stable, prevalent configuration in response to external disturbances. However, supply chains can undergo a radical structural change when stretched far from equilibrium; at such a point a small event can trigger a cascade of changes that eventually leads to system-wide reconfiguration. In some situations unstable phenomena can arise due to feedback structure, inherent adjustment delays and the nonlinear decision-making processes that go on in the nodes. One cause of unstable phenomena is that the information feedback in the system is slow relative to the rate of change in the system. The first mode of unstable behavior to arise in nonlinear systems is usually a simple one-cycle self-sustained oscillation. If the instability drives the system further into the nonlinear regime, more complicated temporal behavior may be generated. The route to chaos through successive period-doubling bifurcations, as certain parameters of the system are varied, is generic to a large class of systems in physics, chemistry, biology, economics and other fields. Functioning in a chaotic regime precludes long-term prediction of the behavior of the system, while short-term predictions may sometimes be possible. As a result, control and stabilization of such a system become very difficult.
• Emergent behavior and Self-Organization: With the individual entities obeying a deterministic selection process, the organization of the overall supply chain emerges through a natural process of order and spontaneity. This emergence of highly structured collective behavior over time from the interaction of simple entities leads to the fulfillment of customer orders. Demand amplification and inventory swings are other, undesirable emergent phenomena that can also arise. For instance, decisions and delays downstream in a supply chain often lead to amplified, undesirable effects upstream, a phenomenon commonly known as the "bullwhip" effect.
• Adaptation and Evolution: A supply chain both reacts to and creates its environment. Generally speaking, a supply chain interacts with almost every other conceivable network. Operationally, the environment depends on the chosen scale of analysis; for example, it can be taken as the customer market. Typically, significant dynamism exists in the environment, which necessitates constant adaptation of the supply network. However, the environment is highly rugged, making this co-evolution difficult. The individual entities constantly observe what emerges from a supply network and make adjustments to organizational goals and supporting infrastructure. Another common mode of adaptation is altering the boundaries of the network. The boundaries can change by including or excluding a particular entity and by adding or eliminating connections among entities, thereby changing the underlying pattern of interaction. As we discuss next, supply chain management plays a critical role in making the network evolve in a coherent manner.
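The period-doubling route to chaos noted above can be made concrete with the logistic map x → rx(1 − x), used here purely as a stand-in for a nonlinear local decision rule (an illustrative assumption, not a supply chain model from this paper):

```python
def attractor_cycle(r, n_transient=2000, n_sample=64):
    """Distinct long-run states of the logistic map x -> r*x*(1-x)."""
    x = 0.5
    for _ in range(n_transient):       # discard transient behaviour
        x = r * x * (1.0 - x)
    seen = set()
    for _ in range(n_sample):          # sample the attractor
        x = r * x * (1.0 - x)
        seen.add(round(x, 6))
    return sorted(seen)

print(len(attractor_cycle(2.8)))       # 1: stable fixed point
print(len(attractor_cycle(3.2)))       # 2: period-2 oscillation
print(len(attractor_cycle(3.5)))       # 4: period-4 after another doubling
print(len(attractor_cycle(3.9)))       # many states: chaotic regime
```

As r grows, the single stable configuration bifurcates into cycles of period 2, 4, 8, ... until behavior becomes chaotic, which is the sense in which long-term prediction is lost while short-term prediction may survive.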
3.1 Supply Chain Management

Supply chain management is defined as the integration of key business processes, from end-users through original suppliers, that provide products, services and information and add value for customers and other stakeholders (Cooper et al. 1997). It involves balancing reliable customer delivery against manufacturing and inventory costs. It evolves around a customer-focused corporate vision, which drives changes throughout a firm's internal and external linkages and then captures the synergy of inter-functional, inter-organizational integration and coordination. Due to the inherent complexity, it is a challenge to coordinate the actions of entities across organizational boundaries so that they perform in a coherent manner.
An important element in managing a supply chain network (SCN) is to control the ripple effect of lead-time so that the variability in the supply chain can be minimized. Demand forecasting is used to estimate demand at each stage, and the inventory between stages is used to protect against fluctuations in supply and demand across the network. Due to the decentralized control properties of the SCN, controlling the ripple effect requires coordination between entities in performing their tasks. The problem of coordination has acquired another dimension due to some other trends in current supply chains.
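The demand-variance amplification that such coordination must contain can be sketched minimally as follows; the moving-average forecast, the order-up-to correction rule, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Each upstream stage forecasts the orders it receives with a moving
# average and places a replenishment order that chases changes in that
# forecast; the correction term amplifies variance stage by stage.
rng = np.random.default_rng(1)
T, stages, window, lead = 2000, 4, 4, 2
orders = 10.0 + rng.normal(0.0, 1.0, T)     # retail demand at the bottom

variances = [orders.var()]
for _ in range(stages):
    # Moving-average forecast of the downstream order stream
    f = np.convolve(orders, np.ones(window) / window, mode="valid")
    # Pass demand through, plus a lead-time-scaled forecast correction
    orders = orders[window:] + lead * np.diff(f)
    variances.append(orders.var())

print(variances)  # order variance grows at every upstream stage
```

Even with i.i.d. retail demand, the forecast-chasing correction makes each stage's orders noisier than the orders it receives, which is the bullwhip mechanism in miniature.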
Two important organizational and market trends now under way are the atomization of markets and of organizational entities (Balakrishnan et al. 1999). In such a scenario the product realization process has continuous customer involvement in all phases, from design to delivery. Customization is not limited to selecting from pre-determined model variants; rather, product design, process plans, and even the supply chain configuration have to be tailored for each customer. The product realization organization has to be formed on the fly, as a consortium of widely dispersed organizations catering to the needs of a single customer. Organizations thus consist of a series of opportunistic alliances among several focused organizational entities formed to address particular market opportunities. For manufacturing organizations to operate effectively in this environment of dynamic, virtual alliances, products must have modular architectures, processes must be well characterized and standardized, documentation must be digitized and widely accessible, and systems must be interoperable. Automation and intelligent information processing are vital for diagnosing problems during product realization and usage, coordinating design and production schedules, and searching for relevant information in multi-media databases. These trends exacerbate the challenges of coordination and collaboration as the number of product realization networks increases, and with it the number of partners in each network.
Inventory is an unwise way of dealing with rapidly changing market demand and short life cycle products; information is the appropriate substitute. Information about material lead-times from different suppliers can be used to plan material arrivals instead of building up inventory, and demand information can be transmitted to manufacturers on a timely basis so that orders can be fulfilled at lower inventory cost. It is in fact widely recognized that successful integration of the entire supply chain process depends heavily on the availability of accurate and timely information that can be shared by all members of the supply chain. Supply chain management therefore relies increasingly on information technology, as discussed below.
3.2 Information Technology in Supply Chain Management
Information technology, with its capability for global reach and wide-ranging connectivity, enterprise integration, micro autonomy and intelligence, object-oriented and networked computing paradigms, and rich media support, has been a key enabler for the management of future manufacturing enterprises. It is vital for eliminating collaboration and coordination costs and for permitting the rapid setup of dynamic information exchange networks. Connectivity permits the involvement of customers and other stakeholders in all aspects of manufacturing. Enterprise integration facilitates seamless interaction among global partners. Micro autonomy and intelligence permit atomic tracking and remote control. New software paradigms enable distributed, intelligent, and autonomous operations. Distributed computing facilitates quick localized decisions without losing vast data gathering potential and powerful computing capabilities. Rich media support, which includes capabilities such as digitization, visualization tools, and virtual reality, facilitates collaboration and immersion.
Many improvements in supply chain management have occurred because IT enables inventory management and production to be changed dynamically. It assists managers in coping with uncertainty and lead-time through improved collection and sharing of information between supply chain nodes. The success of an enterprise now depends largely on how its information resources are designed, operated, and managed, especially as information technology emerges as a critical input to be leveraged for significant organizational productivity. The difficulty, however, arises in designing an information system that can handle the information needs of supply chain nodes so as to allow efficient, flexible, and decentralized supply chain management. The main hurdle to using information technology efficiently is our lack of understanding of the organizational, functional, and evolutionary principles of supply chains.
Recognizing supply chains as CAS can, however, lead to novel and effective ways of understanding their emergent dynamics. Many diverse-looking CAS have been found to share similar characteristics and problems and can thus be tackled through similar approaches. While at present networks are largely controlled by humans, their complexity, diversity, and geographic distribution make it necessary that networks maintain themselves in an evolutionary sense, just as biological organisms do (Maxon 1990). Similarly, the problem of coordination, a central challenge in supply chains, has been routinely solved by biological systems for literally billions of years. We believe that the complexity, flexibility, and adaptability seen in the collective behavior of supply chains can be accomplished only by importing the mechanisms that govern these features in nature. Along with these robust design principles, we require equally sound techniques for modeling and analyzing supply chains; these form the focus of this paper. We first give a brief overview of the main techniques that have been used for modeling and analysis of supply chains and then discuss how the science of complexity provides a genuine extension and reformulation of these approaches.
4. Modeling and Analysis of Supply Chain Networks
As pointed out, the key challenge in designing supply chain networks, or for that matter any large-scale system, is the difficulty of reverse engineering, i.e., determining which individual agent strategies lead to the desired collective behavior. Because of this difficulty in understanding the effect of individual characteristics on the collective behavior of the system, simulation has been the primary tool for designing and optimizing such systems. Simulation makes investigation possible and useful when real-world experimentation would be too costly or ethically infeasible, or where decisions and their consequences are widely separated in space and time. At present, large-scale simulation of future complex processes may be the most logical, and perhaps the most important, vehicle for studying them objectively (Ghosh, 2002). First, simulation helps one detect design errors cost-effectively, prior to developing a prototype. Second, simulation of system operations may identify potential problems that might occur during actual operation. Third, extensive simulation may detect problems that are rare and otherwise elusive. Fourth, hypothetical concepts that do not exist in nature, even those that defy natural laws, may be studied. The increased speed and precision of today's computers promise high-fidelity models of physical and natural processes that yield reasonably accurate results quickly. This in turn permits system architects to study the performance impact of wide variations in key parameters quickly and, in some cases, even in real time, so that a qualitative improvement in system design may be achieved. In many cases, unexpected variations in external stress can be simulated quickly to yield appropriate system parameter values, which are then adopted into the system to enable it to successfully counteract the external stress.
Mathematical analysis, on the other hand, has a critical role to play because it alone enables us to formulate rigorous generalizations or principles. Neither physical experiments nor computer-based experiments on their own can provide such generalizations. Physical experiments are usually limited to supplying inputs and constraints for rigorous models, because the experiments themselves are rarely described in a language that permits deductive exploration. Computer-based experiments, or simulations, have rigorous descriptions, but they deal only in specifics. A well-designed mathematical model, by contrast, generalizes the particulars revealed by physical experiments, computer-based models, and interdisciplinary comparisons. Using mathematical analysis we can study the dynamics, predict long-term behavior, and gain insight into system design: e.g., what parameters determine group behavior, how individual agent characteristics affect the system, and whether a proposed agent strategy leads to the desired group behavior. In addition, mathematical analysis may be used to select parameters that optimize the system's collective behavior, prevent instabilities, and so on.
Successful modeling of large-scale systems such as supply chain networks, large-scale software systems, communication networks, biological ecosystems, food webs, and social organizations requires a solid empirical base; pure abstract mathematical contemplation is unlikely to lead to useful models. The discipline of physics provides an appropriate parallel: advances in theoretical physics are more often than not inspired by experimental findings. The study of supply chain networks should therefore involve an amalgam of both simulation and analytical techniques.
Considering the broad spectrum of a supply chain, no single model can capture all aspects of supply chain processes. Modeling proceeds at three levels:
• Competitive strategic analysis, which includes location-allocation decisions, demand planning, distribution channel planning, strategic alliances, new product development, outsourcing, IT selection, pricing, and network structuring.
• Tactical problems, such as inventory control, production/distribution coordination, material handling, and layout design.
• Operational-level problems, which include routing/scheduling, workforce scheduling, and packaging.
Supply chain models can be categorized into four classes (Min and Zhou 2002):
• Deterministic: single-objective and multiple-objective models.
• Stochastic: optimal control theoretic and dynamic programming models.
• Hybrid: models with elements of both deterministic and stochastic models, including inventory-theoretic and simulation models.
• IT-driven: models that aim to integrate and coordinate the various phases of supply chain planning on a real-time basis using application software such as ERP.
Mathematical programming techniques and simulation have been the two primary approaches for the analysis and study of supply chain models. Mathematical programming mainly considers the static aspects of a supply chain. Simulation, on the other hand, studies supply chain dynamics and generally proceeds from "system dynamics" or "agent-based" methodologies. System dynamics is a continuous simulation methodology that uses concepts from engineering feedback control to model and analyze dynamic socioeconomic systems (Forrester, 1961). The mathematical description is realized with ordinary differential equations. An important advantage of system dynamics is the possibility of deducing the occurrence of a specific behavior mode, because the structure that produces the dynamics is made transparent. We present some nonlinear models in Section 5 which are useful for understanding the complex interdependencies, effects of priority, nonlinearities, delays, uncertainties, and competition/cooperation for resource sharing in supply chains. The drawback of a system dynamics model is that the structure has to be determined before the simulation is started. Agent-based modeling (a technique from complexity theory), by contrast, is a "bottom-up" approach that simulates the underlying processes believed responsible for a global pattern and allows us to evaluate which mechanisms are most influential in producing that emergent pattern. In (Schieritz and Grobler, 2003) a hybrid modeling approach is presented that aims to make the system dynamics approach more flexible by combining it with discrete agent-based modeling. Such large-scale simulations, with their many degrees of freedom, raise serious technical problems about the design of experiments and the sequence in which they should be carried out in order to obtain the maximum relevant information. Furthermore, analyzing data from such large-scale simulations requires systematic analytical and statistical methods. In Section 8, we describe two such techniques: Nonlinear Time Series Analysis and Computational Mechanics.
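The "bottom-up" character of agent-based modeling can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than any model from this report: each retailer agent follows a simple local order-up-to rule with a one-period delivery delay, and the aggregate order stream is an emergent quantity that is not coded anywhere explicitly.

```python
import random

class Retailer:
    """One agent: orders upstream using only its local inventory view."""
    def __init__(self, target=20):
        self.target = target
        self.inventory = target
        self.pipeline = 0                      # order in transit (1-period delay)

    def step(self, demand):
        self.inventory += self.pipeline        # last period's order arrives
        sold = min(self.inventory, demand)     # serve what we can
        self.inventory -= sold
        # Local order-up-to rule: order enough to restore the target level.
        self.pipeline = max(0, self.target - self.inventory)
        return self.pipeline

random.seed(1)
retailers = [Retailer() for _ in range(5)]
total_orders = []
for t in range(50):
    demands = [random.randint(5, 15) for _ in retailers]
    total_orders.append(sum(r.step(d) for r, d in zip(retailers, demands)))

# The aggregate order stream is an emergent property of the local rules;
# no equation for it appears anywhere in the code.
print(min(total_orders), max(total_orders))
```

Varying the local rule (e.g., the target level or the delivery delay) and observing the effect on the aggregate stream is exactly the kind of mechanism evaluation the paragraph above describes.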
A useful paradigm for modeling a supply chain while capturing its detailed pattern of interaction is to view it as a network. A network is essentially anything that can be represented by a graph: a set of points (generically called nodes or vertices) connected by links (edges, ties) representing some relationship. Networks are inherently difficult to understand because of their structural complexity, evolving structure, connection diversity, dynamical complexity of nodes, node diversity, and the meta-complication that all these factors influence each other. Queuing theory has primarily been used to address the steady-state operation of a typical network, while techniques from mathematical programming have been used to solve the problem of resource allocation in networks; the latter is meaningful when dynamic transients can be disregarded. Present day supply chain networks, however, are highly dynamic, reconfigurable, intrinsically nonlinear, and non-stationary. New tools and techniques are required for their analysis, such that the structure, function, and growth of networks can be considered simultaneously. In this regard we discuss "Network Dynamics" in Section 9, which deals with such issues and can be used to study the structure of a supply chain and its implications for functionality. Understanding the behavior of large complex networks is the next logical step for the field of nonlinear dynamics, because such networks are so pervasive in the real world. We begin with a brief introduction to dynamical systems theory, in particular nonlinear dynamics, in the next section.
5 Dynamical Systems Theory
Many physical systems that produce a continuous-time response can be modeled by a set of differential equations of the form:

    dy/dt = f(y, a),                                                  (I)

where y = (y_1(t), y_2(t), ..., y_n(t)) represents the state of the system and may be thought of as a point in a suitably defined space S, known as the phase space, and a = (a_1(t), a_2(t), ..., a_m(t)) is a parameter vector. The dimensionality of S is the number of a priori degrees of freedom in the system. The vector field f(y, a) is in general a nonlinear operator acting on points in S. If f(y, a) is locally Lipschitz, the above equation defines an initial value problem in the sense that a unique solution curve passes through each point y in the phase space. Formally we may write the solution at time t given an initial value y_0 as y(t) = φ_t y_0, where φ_t represents a one-parameter family of maps of the phase space into itself. The solutions to all possible initial value problems for the system may be written collectively as φ_t S, which can be thought of as a flow of points in the phase space. Initially the dimension of the set φ_t S will be that of S itself. As the system evolves, however, for so-called dissipative systems the flow generally contracts onto a set of lower dimension known as an attractor. Attractors can range from simple stationary points, limit cycles, and quasi-periodic orbits to complicated chaotic ones (Strogatz 1994, Ott 1996). The nature of the attractor changes as the parameters (a) are varied, a phenomenon studied in bifurcation analysis. Typically a nonlinear system is chaotic for some range of parameters. Chaotic attractors do not have a simple structure; they are often not smooth manifolds and frequently have a highly fractured structure, popularly referred to as fractal (self-similar geometrical objects having structure at every scale). On such an attractor the dynamics are characterized by stretching and folding: the former causes the divergence of nearby trajectories, while the latter constrains the dynamics to a finite region of the state space. This accounts for the fractal structure of the attractors and for the extreme sensitivity to changes in initial conditions that is the hallmark of chaotic behavior. A system under chaos is unstable everywhere, never settling down, producing irregular and aperiodic behavior that leads to a continuous broadband spectrum. While this feature can be used to distinguish chaotic behavior from stationary, limit cycle, and quasi-periodic motions using standard Fourier analysis, it makes chaos difficult to separate from noise, which also has a broadband spectrum. It is this "deterministic randomness" of chaotic behavior that makes standard linear modeling and prediction techniques unsuitable for its analysis.
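The sensitivity to initial conditions described above is easy to demonstrate numerically. The sketch below is a standard illustration, not a supply chain model; the Lorenz system and all numerical parameters are our own choices. Two trajectories started 10^-8 apart are integrated with a fourth-order Runge-Kutta scheme, and their separation grows by many orders of magnitude.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0/3.0):
    """Vector field f(y, a) of the classic Lorenz system."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, y, h):
    """One fourth-order Runge-Kutta step of size h."""
    k1 = f(y)
    k2 = f(tuple(a + 0.5 * h * b for a, b in zip(y, k1)))
    k3 = f(tuple(a + 0.5 * h * b for a, b in zip(y, k2)))
    k4 = f(tuple(a + h * b for a, b in zip(y, k3)))
    return tuple(a + (h / 6.0) * (b + 2*c + 2*d + e)
                 for a, b, c, d, e in zip(y, k1, k2, k3, k4))

a = (1.0, 1.0, 1.0)
b = (1.0, 1.0, 1.0 + 1e-8)        # perturb one coordinate by 1e-8
for _ in range(2000):             # integrate to t = 20 with h = 0.01
    a = rk4_step(lorenz, a, 0.01)
    b = rk4_step(lorenz, b, 0.01)

gap = sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
# The tiny initial separation has been amplified enormously.
print(gap)
```

Note that despite this divergence both trajectories remain on the same bounded attractor, which is exactly the stretching-and-folding picture described in the text.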
5.1 Nonlinear Models for Supply Chain
Understanding the complex interdependencies, effects of priority, nonlinearities, delays, uncertainties, and competition/cooperation for resource sharing is fundamental for the prediction and control of supply chains. The system dynamics approach often leads to models of supply chains that can be described in the form of equation (I). Dynamical systems theory provides a powerful framework for the rigorous analysis of such models and can thus be used to supplement the system dynamics approach. We next describe some nonlinear models and their detailed analysis. These models can be used either to represent entities in a supply chain or as macroscopic models that capture collective behavior. The models reiterate the fact that simple rules can lead to complex behavior, which in general is difficult to predict and control.
5.1.1 Preemptive Queuing Model with Delays
Priority and heterogeneity are fundamental to any logistic planning and scheduling. Tasks have to be prioritized in order to do the most important things first; this comes naturally as we try to optimize an objective and assign tasks their "importance." Priorities may also arise from the non-homogeneity of the system, where the "knowledge" level of one agent differs from that of another. In addition, in all logistics systems resources are limited, both in time and space. Temporal dependence plays an important role in logistic planning (interdependency); it can also arise physically when different stages of processing are subject to temporal constraints.
The generality of its assumptions and the clear one-to-one correspondence between physical logistics tasks and model parameters described in (Erramilli and Forys 1991) led us to apply the queuing model in the context of supply chains (Kumara et al. 2003). The queuing system considered here has two queues (A and B) and a single server, with the following characteristics:
• Once served, a class A customer returns as a class B customer after a constant interval of time.
• Class B has non-preemptive priority over class A, i.e., the class A queue does not get served until the class B queue is emptied.
• Schedules are organized every T units of time, i.e., if the low priority queue is emptied within time T, the server remains idle for the remainder of the interval.
• Finally, the higher priority class B has a lower service rate than the lower priority class A.
Figure 2. Preemptive Queuing Model
Suppose the system is sampled at the end of every schedule cycle, and the following quantities are observed at the beginning of the kth interval:
A_k: Queue length of the low priority queue
B_k: Queue length of the high priority queue
C_k: Outflow from the low priority queue in the kth interval
D_k: Outflow from the high priority queue in the kth interval
λ_k: Inflow to the low priority queue from the outside in the kth interval
The system is characterized by the following parameters:
μ_a: Rate per unit of the schedule cycle at which the low priority queue can be served
μ_b: Rate per unit of the schedule cycle at which the high priority queue can be served
l: The feedback interval in units of the schedule cycle
The following four equations then completely describe the evolution of the system:

    A_{k+1} = A_k + λ_k − C_k                          (1)
    C_k = min(A_k + λ_k, μ_a (1 − D_k/μ_b))            (2)
    B_{k+1} = B_k + C_{k−l} − D_k                      (3)
    D_k = min(B_k + C_{k−l}, μ_b)                      (4)

Equations (1) and (3) are merely conservation rules, while equations (2) and (4) model the constraints on the outflows and the interaction between the queues. This model, while conceptually simple, exhibits surprisingly complex behavior.
Dynamical Behavior
The analytic approach to solving the flow model under constant arrivals (i.e., λ_k = λ for all k) reveals several classes of solutions. The system is found to batch its workload even for perfectly smooth arrival patterns. The behavior of the system has the following characteristics:
1) Above a threshold arrival rate (λ ≥ μ_b/2), a momentary overload can send the system into any of a number of stable modes of oscillation.
2) Each mode of oscillation is characterized by distinct average queuing delays.
3) The extreme sensitivity to parameters, and the existence of chaos, imply that at a given time the system may be in any one of a number of distinct steady-state modes.
The batching of the workload can cause significant queuing delays even at moderate occupancies, and such oscillatory behavior significantly lowers the real-time capacity of the system. For details of the application of this model in the supply chain context, refer to (Kumara et al. 2003).
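A direct way to explore this behavior is to iterate equations (1)-(4). The sketch below is our own minimal implementation, not the code used in the cited studies; the particular parameter values, the initial burst used to stand in for a momentary overload, and the zero initial conditions for the delayed outflows are illustrative assumptions.

```python
def simulate(lam, mu_a, mu_b, l, steps, burst=5.0):
    """Iterate the two-queue schedule-cycle model, equations (1)-(4)."""
    A, B = burst, 0.0          # 'burst' stands in for a momentary overload at k = 0
    C = [0.0] * steps          # C[k]: outflow of the low priority queue
    history = []
    for k in range(steps):
        C_delayed = C[k - l] if k >= l else 0.0       # C_{k-l}, zero before start
        D = min(B + C_delayed, mu_b)                  # (4) high priority outflow
        C[k] = min(A + lam, mu_a * (1.0 - D / mu_b))  # (2) low priority outflow
        A = A + lam - C[k]                            # (1) conservation, queue A
        B = B + C_delayed - D                         # (3) conservation, queue B
        history.append((A, B))
    return history

# mu_b < mu_a as the model requires, and lam >= mu_b / 2, the oscillation threshold.
hist = simulate(lam=0.6, mu_a=2.0, mu_b=1.0, l=2, steps=200)
tail_A = [a for a, _ in hist[-20:]]
print(min(tail_A), max(tail_A))
```

Sweeping lam below and above mu_b/2 and comparing the tail of the queue-length trajectory is a simple way to observe the threshold behavior described above.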
5.1.2 Managerial Systems
Decision-making is another characteristic activity in which the entities of a supply chain are continuously engaged. Entities make decisions to optimize their self-interest, often based on local, delayed, and imperfect information.
To illustrate the effects of decisions on the dynamics of a supply chain as a whole, we consider a managerial system that allocates resources to its production and marketing departments in accordance with shifts in inventory and/or backlog (Rasmussen and Mosekilde 1988). It has four level variables: resources in production, resources in sales, inventory of finished products, and number of customers. To represent the time required to adjust production, a third-order delay is introduced between production rate and inventory. The sum of the two resource variables is kept constant. The rate of production is determined from resources in production through a nonlinear function, which expresses the decreasing productivity of additional resources as the company approaches maximum capacity. The sales rate, on the other hand, is determined by the number of customers and by the average sales per customer-year. Customers are mainly recruited through visits by the company salesmen. The rate of recruitment depends upon the resources allocated to marketing and sales, and again a diminishing return to increasing sales activity is assumed. Once recruited, customers are assumed to remain with the company for an average period AT, the association time.
A difference between production and sales causes the inventory to change, and the company is assumed to respond to such changes by adjusting its resource allocation: when the inventory is higher than desired, resources are shifted from production to sales; when it is lower than desired, on the other hand, resources are redirected from sales to production. A certain minimum of resources is always maintained in both production and sales; in the model this is secured by means of two limiting factors, which reduce the transfer rate when a resource floor is approached. Finally, the model assumes a feedback from inventory to the customer defection rate. If the inventory of finished products becomes very low, the delivery time is assumed to become unacceptable to many customers; as a consequence, the defection rate is enhanced by a factor 1+H.
Figure 3. Managerial System
Dynamical Behavior
The managerial system described is controlled by two interacting negative feedback loops. Combined with the delays involved in adjusting production and sales, these loops create the potential for oscillatory behavior. If the transfer of resources is fast enough, the equilibrium is destabilized and the system starts to perform self-sustained oscillations. The amplitude of these oscillations is ultimately limited by the various nonlinear restrictions in the model, particularly by the reduction of the resource transfer rate as the lower limits on resources in production or in sales are approached.
A series of abrupt changes in system behavior is observed as the competition between the basic growth tendency and the nonlinear limiting factors shifts. The simple one-cycle attractor corresponding to H=10 becomes unstable at H=13, and a new stable attractor with twice the original period arises. If H is increased to 28, the stable attractor attains a period of 4. As H is increased further, the period-doubling bifurcations continue until at H=30 the threshold to chaos is exceeded; the system then behaves in an aperiodic and apparently random manner. Hence the system reaches chaotic behavior through a series of period-doubling bifurcations.
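The period-doubling route to chaos reported for this model is the classic cascade scenario. It can be reproduced with the simplest standard example, the logistic map x → r·x·(1−x); this map is our illustration and not part of the managerial model itself. As the parameter r grows, a fixed point gives way to cycles of period 2 and 4 and finally to a chaotic band, just as the attractor here does as H is increased.

```python
def logistic_orbit(r, x0=0.2, transient=1000, sample=64):
    """Return the distinct values (rounded) visited after transients die out."""
    x = x0
    for _ in range(transient):       # discard the transient
        x = r * x * (1 - x)
    orbit = []
    for _ in range(sample):          # sample the attractor
        x = r * x * (1 - x)
        orbit.append(round(x, 6))
    return sorted(set(orbit))

print(len(logistic_orbit(2.8)))   # 1: stable fixed point
print(len(logistic_orbit(3.2)))   # 2: period-2 cycle
print(len(logistic_orbit(3.5)))   # 4: period-4 cycle
print(len(logistic_orbit(3.9)))   # many distinct values: chaotic band
```

Counting the distinct values on the attractor as a parameter is swept is a crude but effective numerical bifurcation analysis, directly analogous to varying H in the managerial model.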
5.1.3 Deterministic Queuing Model
In this section we consider an alternative discrete-time deterministic queuing model for studying decision making at the entity level in supply chains. The model consists of one server and two queuing lines (X and Y), each representing some activity (Feichtinger et al. 1994). The input rates of both queues are constant and their sum equals the server capacity. In each time period the server has to decide how much time to spend on each of the two activities.
The following quantities are defined:
α: Constant input rate for activity X
β: Constant input rate for activity Y
Φ_X: Time spent on activity X
Φ_Y: Time spent on activity Y
x_k: Queue length of X
y_k: Queue length of Y
Figure 4. Deterministic Queuing Model
The amounts of time Φ_X and Φ_Y that will be spent on activities X and Y in period k+1 are determined by an adaptive feedback rule depending on the difference of the queue lengths x_k and y_k. The decision rule, or policy function, says that longer queues are served with higher priority. Two possibilities are considered:
1) All-or-nothing decision: the server decides to spend all its time on the activity corresponding to the longer queue. Hence Φ is a Heaviside function given by

    Φ(x − y) = 1 if x ≥ y,
             = 0 if x < y.

2) Mixed solutions: the server decides to spend most of its time on the activity corresponding to the longer queue. For this decision function an S-shaped logistic function is used:

    Φ(x − y) = 1 / (1 + e^{−k(x−y)}).

The parameter k tunes the "steepness" of the S-shape.
With these decision functions the new queue lengths x_{k+1} and y_{k+1} are given by

    x_{k+1} = x_k + α − Φ_X(x_k − y_k),
    y_{k+1} = y_k + β − Φ_Y(x_k − y_k).

Using the constraints α + β = 1 and Φ_X + Φ_Y = 1, it is sufficient to study the dynamics of the one-dimensional map

    f(x) = x + α − Φ(2x − 2)

in order to understand the behavior of the system.
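The reduced map f can be iterated directly. The sketch below uses the S-shaped logistic decision function; the steepness k = 10, the starting point, and the sign convention in the exponent (chosen so that the longer queue receives more of the server's time, as the decision rule requires) are our assumptions, not values fixed by the text.

```python
import math

def phi(d, k=10.0):
    """S-shaped decision rule: share of server time given to activity X."""
    return 1.0 / (1.0 + math.exp(-k * d))

def iterate(alpha, k=10.0, x0=0.9, transient=500, sample=200):
    """Iterate f(x) = x + alpha - phi(2x - 2) and return post-transient values."""
    x = x0
    for _ in range(transient):
        x = x + alpha - phi(2 * x - 2, k)
    orbit = []
    for _ in range(sample):
        x = x + alpha - phi(2 * x - 2, k)
        orbit.append(x)
    return orbit

# With a steep decision rule the fixed point at x = 1 is unstable and the
# queue length keeps oscillating instead of settling down.
orbit = iterate(alpha=0.5)
print(min(orbit), max(orbit))
```

The steepness parameter k interpolates between the two policies: small k gives a nearly even split of server time and a stable queue, while large k approaches the all-or-nothing Heaviside rule and produces persistent oscillations.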
Dynamical Behavior
For 0
sustained in a supply chain. Resources can be of various types: physical resources, manpower,<br />
information and money. With the IT architectures being developed to realize supply chains,<br />
the sharing of computational resources (CPU, memory, bandwidth, databases, etc.) is also<br />
becoming a critical issue. It is through resource sharing that interdependencies arise between<br />
different entities. This leads to a complex web of interactions in supply chains, just as in, e.g.,<br />
a food web or an ecology. As a result, such systems can be referred to as “Computational<br />
Ecosystems” (Hogg <strong>and</strong> Huberman 1988) in analogy with biological ecosystems.<br />
“Computational Ecosystems” is a generic model of the dynamics of resource allocation among<br />
agents trying to solve a problem collectively. The model captures the following features: distributed<br />
control, asynchrony in execution, resource contention and cooperation among agents, and the<br />
concomitant problems of incomplete knowledge and delayed information. The behavior of each<br />
agent is modeled using a payoff function whose nature determines whether an agent is<br />
cooperative or competitive. The agent here can be any entity in a supply chain, like a distributor,<br />
retailer, etc., or a software agent in an e-commerce scenario. The state of the system is represented as<br />
the average number of entities using different resources and follows a delay differential equation<br />
under a mean field approximation. The resources can be physical or computational as discussed<br />
before. For example, in the case of two resources with n identical agents, the law governing the rate of<br />
change of occupation of a resource is given by:<br />
dn_1(t)/dt = α (n ρ̄ − n_1(t))<br />
where<br />
n_1(t) : Expected number of agents using resource 1 at a given instant of time t<br />
α : Expected number of choices made by an agent per unit time<br />
ρ : A random variable denoting that resource 1 will be perceived to have a higher payoff than<br />
resource 2; ρ̄ gives its expected value.<br />
Figure 5. Computational Ecosystems
(τ = Time delay <strong>and</strong> σ : St<strong>and</strong>ard deviation of ρ )<br />
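To see how information delay produces the oscillatory and unstable regimes described below, the delay differential equation above can be integrated with a simple Euler scheme. This is a rough sketch, not the model of Kephart et al.: the sigmoid form chosen for ρ (a resource looks less attractive the more crowded it appeared τ time units ago) and every parameter value are assumptions for illustration.<br />

```python
import math

def simulate_ecosystem(alpha=1.0, tau=10.0, n_total=100, dt=0.01, t_end=50.0):
    """Euler integration of dn1/dt = alpha * (n * rho - n1(t)), where rho,
    the probability that resource 1 is perceived as the better choice, is
    evaluated on the *delayed* occupancy n1(t - tau)."""
    steps = round(t_end / dt)
    delay = round(tau / dt)
    n1 = [n_total / 2.0] * (delay + 1)          # constant initial history
    for _ in range(steps):
        lagged = n1[-1 - delay]
        # assumed payoff rule: resource 1 looks worse the more crowded it was
        rho = 1.0 / (1.0 + math.exp(0.1 * (lagged - n_total / 2.0)))
        n1.append(n1[-1] + dt * alpha * (n_total * rho - n1[-1]))
    return n1
```

With τ = 0 the occupancy simply relaxes toward its equilibrium; increasing τ makes the delayed feedback overshoot, which is the mechanism behind the oscillatory behaviors discussed in the text.<br />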
The global performance of the ecosystems can be obtained from the above equation. Under<br />
different conditions of delay, uncertainty, cooperation/competition the system shows a rich<br />
panoply of behaviors ranging from stable, sustained oscillations to intermittent chaos <strong>and</strong> finally<br />
to fully developed chaos. Furthermore, the following generic deductions can be made from this<br />
model (Kephart et al. 1989): while information delay has an adverse impact on system<br />
performance, uncertainty has a profound effect on the stability of the system. One can<br />
deliberately increase the uncertainty in agents’ evaluation of the merits of choices to make the system stable,<br />
but at the expense of performance degradation. A second possibility is a very slow reevaluation rate<br />
for the agents, which however makes them non-adaptive. Heterogeneity in the nature of the agents<br />
can lead to more stability in the system compared to the homogeneous case, but the system then<br />
loses its ability to cope with unexpected changes, such as new task requirements.<br />
On the other hand, poor performance can be traced to the fact that non-predictive agents do not<br />
take into account the information delay.<br />
If the agents are able to make accurate predictions of the system’s current state, the information delay<br />
could be overcome, and the system would perform well. This results in a “co-evolutionary”<br />
system in which all of the individuals are simultaneously trying to adapt to one another. In such a<br />
situation agents can act like Technical Analysts <strong>and</strong> System Analysts (Kephart et al. 1990). Agents<br />
as technical analysts (like those in market behavior) use either linear extrapolation or cyclic trend<br />
analysis to estimate the current state of the system. On the other h<strong>and</strong>, agents as system analysts<br />
have knowledge about both the individual characteristics of the other agents in the system <strong>and</strong><br />
how those characteristics are related to the overall system dynamics. Technical Analysts are<br />
responsive to the behavior of the system, but suffer from an inability to take into account the<br />
strategies of other agents. Moreover, a good predictive strategy for a single agent may be disastrous<br />
if applied on a global scale. System Analysts perform extremely well when they have very<br />
accurate information about other agents in the system, but can perform very poorly when their<br />
information is even slightly inaccurate. They take into account the strategies of other agents, but<br />
pay no heed to the actual behavior of the system. This suggests combining the strengths of both<br />
methods to form a hybrid, the adaptive system analyst, which modifies its assumptions about other<br />
agents in response to feedback about the success of its own predictions. The resultant hybrid is able<br />
to perform well.<br />
In order to avoid chaos while maintaining high performance and adaptability to unforeseen<br />
changes, more sophisticated techniques are required. One such technique is a reward mechanism<br />
(Hogg and Huberman 1991) whereby the relative number of computational agents following<br />
effective strategies is increased at the expense of the others. This procedure, which generates the<br />
right mix of a diverse population out of an essentially homogeneous one, is able to control chaos through a<br />
series of bifurcations into a stable fixed point.<br />
In the above description each agent chooses amongst different resources according to its<br />
perceived payoff, which depends on the number of agents already using it. Even the agent with<br />
predictive ability is myopic in its view, as it considers only its current estimate of the system<br />
state, without regard to the future. Expectations come into play if agents use past <strong>and</strong> present<br />
global behavior in estimating the expected future payoff for each resource. A dynamical model of<br />
collective action that includes expectations can be found in (Glance 1993).<br />
6. Models from Observed Data<br />
One of the central problems in a supply chain, closely related to modeling, is that of dem<strong>and</strong><br />
forecasting: given the past, how can we predict the future dem<strong>and</strong>? The classic approach to<br />
forecasting is to build an explanatory model from first principles and measure the initial<br />
conditions. Unfortunately, this has not been possible in systems like supply chains, for two reasons:<br />
first, we still lack the general “first principles” for demand variation in supply chains<br />
that are necessary to make good models; second, due to the distributed nature of supply<br />
chains, the initial data or conditions are often difficult to obtain.<br />
Due to these factors, the modern theory of forecasting used in supply chains<br />
views a time series x(t) as a realization of a random process. This is appropriate when effective<br />
r<strong>and</strong>omness arises from complicated motion involving many independent, irreducible degrees of<br />
freedom. An alternative cause of r<strong>and</strong>omness is chaos, which can occur even in very simple<br />
deterministic systems as we discussed in the earlier sections. While chaos places a fundamental<br />
limit on long-term prediction, it suggests possibilities for short-term prediction. R<strong>and</strong>om-looking<br />
data may contain only a few irreducible degrees of freedom. Time traces of the state variables of<br />
such chaotic systems display behavior that is intermediate between regular periodic or<br />
quasiperiodic motions and unpredictable, truly stochastic behavior. It has long been seen as a<br />
form of “noise” because the tools for its analysis were couched in a language tuned to linear<br />
processes. The main such tool is Fourier analysis, which is precisely designed to extract the<br />
composition of sines and cosines found in an observation x(t). Similarly, the standard linear<br />
modeling and prediction techniques, such as autoregressive moving average (ARMA) models, are<br />
not suitable for nonlinear systems.<br />
With advances in IT and the science of complexity, both of these challenges for forecasting can be<br />
addressed. Large-scale simulation and micro autonomy (Section 2) enable tracking of the detailed<br />
interactions between different entities in a supply chain. The large volumes of data so generated<br />
can be used to understand demand patterns in particular and to comprehend the emergence of other<br />
characteristics in general. Even though an exact prediction of future behavior is difficult, often<br />
archetypal behavior patterns can be recognized using this data. Techniques from complexity<br />
theory, like Nonlinear Time Series Analysis and Computational Mechanics, are appropriate for this<br />
purpose.<br />
6.1 Nonlinear Time Series Analysis<br />
The need to extract interesting physical information about the dynamics of observed systems<br />
when they are operating in a chaotic regime has led to the development of nonlinear time series<br />
analysis techniques. Systematically, the study of potentially chaotic systems may be divided into<br />
three areas: identification of chaotic behavior; modeling and prediction; and control. The first area<br />
shows how chaotic systems may be separated from stochastic ones <strong>and</strong>, at the same time,<br />
provides estimates of the degrees of freedom <strong>and</strong> the complexity of the underlying chaotic<br />
system. Based on such results, identification of a state space representation allowing for<br />
subsequent predictions may be carried out. The last stage, if desired, involves control of the<br />
chaotic system.<br />
Given the observed behavior of a dynamical system as a one-dimensional time series x(n), we<br />
want to build models for prediction. The most important task in this process is phase space<br />
reconstruction, which involves building a topologically and geometrically equivalent attractor. In<br />
general, the steps in nonlinear time series analysis can be summarized as (Abarbanel 1996):<br />
• Signal Separation (Finding the signal): Separation of the broadband signal from broadband<br />
“noise” using the deterministic nature of the signal.<br />
• Phase Space Reconstruction (Finding the space): Using the method of delays one can<br />
construct a series of vectors that is diffeomorphically equivalent to the attractor of the<br />
original dynamical system and at the same time distinguish it from a stochastic process.<br />
The basis for this is Takens’ embedding theorem (Takens 1981). Time-lagged variables<br />
are used to construct vectors in a phase space of dimension d_E:<br />
y(n) = [x(n), x(n + T), ..., x(n + (d_E − 1)T)]<br />
The time lag T can be determined using mutual information (Fraser and Swinney 1983)<br />
and d_E using the false nearest neighbors test (Kennel et al. 1992).<br />
• Classification of the signal: System identification in nonlinear chaotic systems means<br />
establishing a set of invariants for each system of interest and then comparing<br />
observations to that library of invariants. The invariants are properties of the attractor and are<br />
independent of any particular trajectory on the attractor. Invariants can be divided into<br />
two classes: fractal dimensions (Farmer et al. 1983) and Lyapunov exponents (Sano and<br />
Sawada 1985). Fractal dimensions characterize the geometrical complexity of the dynamics, i.e.,<br />
how the sample of points along a system orbit is distributed spatially. Lyapunov<br />
exponents, on the other hand, describe the dynamical complexity, i.e., the “stretching and<br />
folding” in the dynamical process.<br />
• Making Models and Prediction: This step involves determination of the parameters of<br />
the assumed model of the dynamics<br />
y(n) → y(n + 1),<br />
y(n + 1) = F(y(n); a_1, a_2, ..., a_p),<br />
which is consistent with the invariant classifiers (Lyapunov exponents, dimensions). The functional<br />
forms F(⋅) often used include polynomials, radial basis functions, etc. The Local False Nearest<br />
Neighbors test (Abarbanel and Kennel 1993) is used to determine how many dimensions are<br />
locally required to describe the dynamics generating the time series, without knowing the<br />
equations of motion, and hence gives the dimension for the assumed model. The methods for<br />
building nonlinear models can be classified as global and local (Farmer and Sidorowich 1987;<br />
Casdagli 1989). By definition, local methods vary from point to point in the phase space, while<br />
global models are constructed once and for all in the whole phase space. Models based on<br />
machine learning techniques such as radial basis functions, neural networks (Powell 1987)<br />
and Support Vector Machines (Mukherjee et al. 1997) carry features of both. They are usually<br />
used as global functional forms, but they clearly demonstrate localized behavior too.<br />
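The embedding and local-modeling steps above can be sketched as follows. This is a minimal illustration on synthetic logistic-map data: the plain k-nearest-neighbor predictor is a simple stand-in for the local methods cited, and the parameter choices (d_E = 2, T = 1, k = 3) are arbitrary rather than chosen by the mutual-information or false-nearest-neighbors criteria described in the text.<br />

```python
def embed(series, dim, lag):
    """Time-delay embedding: y(n) = [x(n), x(n+T), ..., x(n+(dE-1)T)]."""
    span = (dim - 1) * lag
    return [series[i:i + span + 1:lag] for i in range(len(series) - span)]

def predict_next(series, dim=2, lag=1, k=3):
    """Local nearest-neighbor forecast: find the k embedded states closest
    to the latest state and average their observed successors."""
    span = (dim - 1) * lag
    vectors = embed(series, dim, lag)
    query = vectors[-1]
    candidates = []
    for i in range(len(vectors) - 1):          # successor must be known
        dist = sum((a - b) ** 2 for a, b in zip(vectors[i], query))
        candidates.append((dist, series[i + span + 1]))
    candidates.sort()
    return sum(succ for _, succ in candidates[:k]) / k

# chaotic test signal: the fully chaotic logistic map x -> 4x(1 - x)
x = [0.3]
for _ in range(2000):
    x.append(4.0 * x[-1] * (1.0 - x[-1]))
```

Because the data are deterministic and low-dimensional, nearby embedded states have nearby successors, so even this naive local model forecasts one step ahead far better than chance, which is exactly the short-term predictability of chaos noted above.<br />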
The techniques from nonlinear time series analysis are well suited for modeling the<br />
nonlinearities in supply chains. For an application of nonlinear time series analysis in supply<br />
chains, the reader is referred to Lee et al. (2002). Using these techniques one can test whether the time series is<br />
deterministic, so that it should be possible in principle to build predictive models. The invariants<br />
can be used to characterize the complex behavior effectively. For example, the largest Lyapunov<br />
exponent gives an indication of how far into the future reliable predictions can be made, while the<br />
fractal dimensions give an indication of how complex a model should be chosen to represent the<br />
data. These models then provide the basis for systematically developing control strategies. It<br />
should be noted that the functional forms used for modeling in step (4) above are continuous in<br />
their arguments. This approach builds models viewing a dynamical system as obeying laws of<br />
physics. From another perspective a dynamical system can be considered as processing<br />
information. So an alternative class of discrete “computational” models, inspired by the theory<br />
of automata and formal languages, can also be used for modeling the dynamics (Marcus 1996).<br />
“Computational Mechanics” considers this viewpoint and describes the system behavior in terms<br />
of its intrinsic computational architecture, i.e., how it stores and processes information.<br />
6.2 Computational Mechanics<br />
Computational mechanics is a method for inferring the causal structure of stochastic processes<br />
from empirical data or arbitrary probabilistic representations. It combines ideas <strong>and</strong> techniques<br />
from nonlinear dynamics, information theory <strong>and</strong> automata theory, <strong>and</strong> is, as it were, an “inverse”<br />
to statistical mechanics. Instead of starting with a microscopic description of particles <strong>and</strong> their<br />
interactions, <strong>and</strong> deriving macroscopic phenomena, it starts with observed macroscopic data, <strong>and</strong><br />
infers the simplest causal structure: the “ε -machine” capable of generating the observations. The<br />
ε -machine in turn describes the system's intrinsic computation, i.e., how it stores <strong>and</strong> processes<br />
information. This is developed using the statistical mechanics of orbit ensembles, rather than<br />
focusing on the computational complexity of individual orbits. By not requiring a Hamiltonian,<br />
computational mechanics can be applied in a wide range of contexts, including those where an<br />
energy function for the system may not be manifest, as is the case for supply chains. Notions of<br />
Complexity, Emergence <strong>and</strong> Self-Organization have also been formalized <strong>and</strong> quantified in terms<br />
of various information measures (Shalizi 2000).<br />
Given a time series, the (unknowable) exact states of an observed system are translated into a<br />
sequence of symbols via a measurement channel (Crutchfield 1992). Two histories (i.e., two<br />
series of past data) carry equivalent information if they lead to the same (conditional) probability<br />
distribution over the future (i.e., if it makes no difference whether one or the other data series is<br />
observed). Under these circumstances, i.e., the effects of the two series being indistinguishable,<br />
they can be lumped together. This procedure identifies causal states, and also identifies the<br />
structure of connections or succession in causal states, and creates what is known as an “epsilon-machine”.<br />
The ε -machines form a special class of Deterministic Finite State Automata (DFSA)<br />
with transitions labeled with conditional probabilities <strong>and</strong> hence can also be viewed as Markov<br />
chains. However, a process may fail to admit a finite-size ε-machine model,<br />
implying that the number of causal states could turn out to be infinite. In this case, a more<br />
powerful model than a DFSA needs to be used. One proceeds by trying the next most<br />
powerful model in the hierarchy of machines known as the causal hierarchy (Crutchfield 1994),<br />
in analogy with the Chomsky hierarchy of formal languages. While “ε -machine reconstruction”<br />
refers to the process of constructing the machine given an assumed model class, “hierarchical<br />
machine reconstruction” describes a process of innovation to create a new model class. It detects<br />
regularities in a series of increasingly accurate models. The inductive jump to a higher<br />
computational level occurs by taking those regularities as the new representation.<br />
ε -machines reflect a balanced utilization of deterministic <strong>and</strong> r<strong>and</strong>om information processing<br />
<strong>and</strong> this is discovered automatically during ε -machine reconstruction. These machines are<br />
unique and optimal in the sense that they have maximal predictive power and minimum model<br />
size (hence satisfying the principle of Occam’s razor, i.e., causes should not be multiplied beyond<br />
necessity). An ε-machine provides a minimal description of the pattern or regularities in a system in<br />
the sense that the pattern is the algebraic structure determined by the causal states <strong>and</strong> their<br />
transitions. ε -machines are also minimally stochastic. Hence computational mechanics acts as a<br />
method for automatic pattern discovery.<br />
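The history-lumping idea behind this construction can be sketched in a few lines. This is a heavily simplified, hypothetical illustration of the equivalence-class step, not the actual ε-machine reconstruction algorithms of Crutchfield and co-workers; fixing the history length and merging by a naive tolerance comparison are both assumptions made for brevity.<br />

```python
from collections import defaultdict

def causal_state_partition(symbols, history_len=2, tol=0.05):
    """Group fixed-length histories whose empirical next-symbol
    distributions agree within tol: a crude stand-in for the
    causal-state equivalence classes of computational mechanics."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(history_len, len(symbols)):
        hist = tuple(symbols[i - history_len:i])
        counts[hist][symbols[i]] += 1
    dists = {}
    for hist, nxt in counts.items():
        total = sum(nxt.values())
        dists[hist] = {s: c / total for s, c in nxt.items()}
    states = []        # list of (representative distribution, member histories)
    for hist, dist in sorted(dists.items()):
        for rep, members in states:
            if all(abs(dist.get(s, 0.0) - rep.get(s, 0.0)) <= tol
                   for s in set(dist) | set(rep)):
                members.append(hist)   # predictively indistinguishable
                break
        else:
            states.append((dist, [hist]))
    return states
```

For a period-2 symbol stream the partition recovers two causal states, while a constant stream collapses to a single state, mirroring the intuition that histories with the same predictive consequences belong to the same causal state.<br />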
The ε-machine is the organization of the process, or at least of the part of it which is relevant to<br />
our measurements. The ε-machine, being a model of the observed time series from a system, can<br />
be used to define <strong>and</strong> calculate macroscopic or global properties that reflect the characteristic<br />
average information processing capabilities of the system. Some of these include Entropy rate,<br />
Excess entropy <strong>and</strong> Statistical Complexity (Feldman <strong>and</strong> Crutchfield 1998) <strong>and</strong> (Crutchfield <strong>and</strong><br />
Feldman 2001). The entropy rate indicates how predictable the system is. Excess entropy, on the<br />
other hand, provides a measure of the apparent memory stored in a spatial configuration and<br />
represents how difficult prediction is. ε-machine reconstruction leads to a natural measure of<br />
the statistical complexity of a process, namely the amount of information needed to specify the<br />
state of the ε-machine, i.e., its Shannon entropy. Statistical complexity is distinct from, and dual to,<br />
information-theoretic entropies and dimension (Crutchfield and Young 1989). The existence of<br />
chaos shows that there is a rich variety of unpredictability spanning the two extremes of periodic and<br />
random behavior. Behavior between these two extremes, while of intermediate information content,<br />
is more complex in that its most concise description (model) is an amalgam of regular and<br />
stochastic processes. An information-theoretic description of this spectrum in terms of dynamical<br />
entropies measures the raw diversity of temporal patterns. The dynamical entropies, however, do not<br />
directly measure the computational effort required in modeling the complex behavior, which is<br />
what statistical complexity captures.<br />
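For a concrete instance of these quantities, consider the ε-machine of the golden mean process, a standard textbook example not drawn from this report: two causal states, where state A emits 0 or 1 with equal probability (1 keeps it in A, 0 sends it to B) and state B must emit 1 and return to A, so the word "00" never occurs. Treating the machine as a Markov chain gives the entropy rate and statistical complexity directly:<br />

```python
import math

def stationary(P, iters=200):
    """Power-iterate a row-stochastic matrix to its stationary distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_bits(probs):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Golden mean process epsilon-machine as a Markov chain over causal states.
P = [[0.5, 0.5],   # A -> A (emit 1), A -> B (emit 0)
     [1.0, 0.0]]   # B -> A (emit 1); "00" is forbidden
pi = stationary(P)
# each transition here carries a distinct symbol, so per-state transition
# entropy equals the symbol entropy rate for this machine
h_mu = sum(pi[i] * entropy_bits(P[i]) for i in range(len(P)))  # entropy rate
c_mu = entropy_bits(pi)                                        # statistical complexity
```

Here h_mu comes out to 2/3 bit per symbol while C_mu equals log2(3) − 2/3 ≈ 0.918 bits, a small illustration of the point above that statistical complexity is a different quantity from the entropy rate.<br />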
Computational mechanics sets limits on how well processes can be predicted and shows how, at<br />
least in principle, those limits can be attained. ε-machines are what any prediction method would<br />
build, if only it could. Similar to ε-machine reconstruction, techniques exist that can be<br />
used to discover causal architecture in memoryless transducers, transducers with memory, and<br />
spatially extended systems (Shalizi 2000). Computational mechanics can be used for modeling<br />
and prediction in supply chains in the following way:<br />
• In systems like supply chains, it is difficult to define analogs of thermodynamic<br />
quantities like energy, temperature, pressure, etc., as we can for physical systems. Each<br />
component in the network has cognition, which is absent in the components of physical systems,<br />
say the molecules of a gas. Due to such difficulties statistical mechanics cannot be applied directly<br />
to build prediction models for supply chains. As discussed previously, by not requiring a<br />
Hamiltonian (the energy-like function), computational mechanics is still applicable in the<br />
case of supply chains.<br />
• ε-machines can be built to discover patterns in the behavior of various quantities in supply<br />
chains, like inventory levels, demand fluctuations, etc.<br />
• ε -machines can be used for prediction through a process known as “synchronization”<br />
(Crutchfield <strong>and</strong> Feldman 2003).<br />
• ε -machines can be used to calculate various global properties like entropy rate, excess<br />
entropy <strong>and</strong> statistical complexity, that reflect how the system stores <strong>and</strong> processes<br />
information. The significance of these quantities has been discussed earlier.<br />
• We can also quantify notions of Complexity, Emergence and Self-Organization in terms<br />
of various information measures derived from ε-machines. By evaluating such quantities<br />
we can compare the complexity of different supply chains and quantify the extent to which<br />
a network is showing emergence. We can also infer when a supply chain is undergoing<br />
self-organization and to what extent. Such quantification can help us determine<br />
precisely what policies or cognitive capabilities possessed by individual agents lead<br />
to different degrees of emergence and self-organization. Hence we can decide to what<br />
extent we desire to enforce control and to what extent we want to let the network<br />
emerge.<br />
7. Network Dynamics<br />
The ubiquity of networks in the social, biological and physical sciences and in technology leads<br />
naturally to an important set of common problems, which are currently being studied under the<br />
rubric of “Network Dynamics” (Strogatz 2001). Structure always affects function, and it is<br />
important to consider dynamical <strong>and</strong> structural complexity together in the study of networks. For<br />
instance, the topology of social networks affects the spread of information <strong>and</strong> disease, <strong>and</strong> the<br />
topology of the power grid affects the robustness <strong>and</strong> stability of power transmission. The<br />
different problem areas in network dynamics are discussed below.<br />
One area of research in this field has been primarily concerned with the dynamical complexity<br />
in regular networks, without regard to other network topologies. While the collective behavior<br />
depends on the details of the network, some generalizations can still be drawn (Strogatz 2001). For<br />
instance, if the dynamical system at each node has stable fixed points and no other attractors, the<br />
network tends to lock into a static fixed pattern. If the nodes have competing interactions, the<br />
network may become frustrated and display an enormous number of locally stable equilibria. In the<br />
intermediate case where each node has a stable limit cycle, synchronization <strong>and</strong> patterns like<br />
traveling waves can be observed. For non-identical oscillators, a temporal analogue of a phase<br />
transition can be seen, with the coupling coefficient as the control parameter. At the opposite<br />
extreme, if each node has an identical chaotic attractor, the nodes can synchronize their erratic<br />
fluctuations. For a wide range of network topologies, synchronized chaos requires that the<br />
coupling be neither too weak nor too strong; otherwise spatial instabilities are triggered. A related<br />
line of research, dealing with networks of identical chaotic maps, comprises coupled map lattices<br />
(Kaneko and Tsuda 1996) and cellular automata (Wolfram 1994). However, these systems have<br />
been used mainly as test-beds for exploring spatio-temporal chaos and pattern formation in the<br />
simplest mathematical settings, rather than as models of real systems.<br />
The second area in network dynamics is concerned with characterizing the network structure.<br />
Network topologies in general can vary from completely regular (chains, grids,<br />
lattices, fully connected graphs) to completely random. Moreover, the graphs can be directed or<br />
undirected, and cyclic or acyclic. In order to characterize the topological properties of graphs,<br />
various statistical quantities have been defined. The most important of these include the average path<br />
length, clustering coefficient, degree distribution, size of the giant component, and various spectral<br />
properties. A review of the main models and analytical tools, covering regular graphs, random<br />
graphs, generalized r<strong>and</strong>om graphs, small-world <strong>and</strong> scale-free networks, as well as the interplay<br />
between topology <strong>and</strong> the network's robustness against failures <strong>and</strong> attacks can be found in<br />
(Albert <strong>and</strong> Barabasi 2002, Dorogovtsev <strong>and</strong> Mendes 2002).<br />
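The two statistics used most often below, average path length and clustering coefficient, are straightforward to compute. The sketch below does so for a small ring lattice, a hypothetical toy graph chosen for illustration; real analyses would use a graph library, but plain breadth-first search suffices here.<br />

```python
from collections import deque

def avg_path_length(adj):
    """Mean shortest-path length over all connected vertex pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links)."""
    coeffs = []
    for u in adj:
        nbrs = list(adj[u])
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# ring lattice on 10 nodes, each joined to its two nearest neighbors per side
n = 10
ring = {i: {(i + d) % n for d in (-2, -1, 1, 2)} for i in range(n)}
```

On this regular ring the clustering coefficient is high (0.5) while paths are comparatively long, the combination that the small-world rewiring discussed below disrupts by shortcutting distant parts of the lattice.<br />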
The classic r<strong>and</strong>om graphs were introduced by Erdos <strong>and</strong> Renyi (Bollobas 1985) <strong>and</strong> have been<br />
the most thoroughly studied models of networks. Such graphs have Poisson degree distribution<br />
<strong>and</strong> statistically uncorrelated vertices. At large N (total number of nodes in the graph) <strong>and</strong> large<br />
enough p (probability that two arbitrary vertices are connected), a giant connected component<br />
appears in the network, a process known as percolation. Random graphs exhibit a low average<br />
path length and a low clustering coefficient. Regular networks, on the other hand, show a high<br />
clustering coefficient but also a greater average path length compared to random graphs of<br />
similar size. The networks found in the real world, however, are neither completely regular nor<br />
completely random. This has recently been discovered in the form of “small world” and “scale<br />
free” characteristics for many real networks: social networks, the Internet, the WWW, power grids,<br />
collaboration networks, and ecological and metabolic networks, to name a few.<br />
In order to describe the transition from a regular network to a r<strong>and</strong>om network, Watts <strong>and</strong><br />
Strogatz introduced the so-called small-world graphs as models of social networks (Watts <strong>and</strong><br />
Strogatz 1998) <strong>and</strong> (Newman 2000). This model exhibits a high degree of clustering as in the<br />
regular network <strong>and</strong> a small average distance between vertices as in the classic r<strong>and</strong>om graphs. A<br />
common feature that this model shares with the random graph model is that the connectivity distribution of<br />
the network peaks at an average value and decays exponentially. Such an exponential network is<br />
homogeneous in nature: each node has roughly the same number of connections. Due to the high<br />
degree of clustering the models of dynamical systems with small-world coupling display<br />
enhanced signal-propagation speed, rapid disease propagation, <strong>and</strong> synchronizability (Watts <strong>and</strong><br />
Strogatz 1998).<br />
Another significant recent discovery in the field of complex networks is the observation that<br />
the connectivity distributions of a number of large-scale complex networks, including the<br />
WWW, the Internet, and metabolic networks, have the power-law form P(k) ≈ k^(−γ), where P(k) is<br />
the probability that a node in the network is connected to k other nodes, and γ is a positive real<br />
number (Barabasi et al. 2000, Barabasi 2001). Since power laws are free of any characteristic scale,<br />
such networks are called “scale-free network”. A scale-free network is inhomogeneous in nature:<br />
most nodes have few connections but small number (but statistically significant) have many<br />
connections. The average path length is smaller in the scale free network than in a r<strong>and</strong>om graph,<br />
indicating that the heterogeneous scale-free topology is more efficient in bringing the nodes<br />
closer than homogenous topology of the r<strong>and</strong>om graphs. The clustering coefficient of the scalefree<br />
network is about 5 times higher than that of the r<strong>and</strong>om graph, <strong>and</strong> this factor slowly<br />
increases with the number of nodes. It has been shown that it is practically impossible to achieve<br />
synchronization in a nearest-neighbor coupled network (regular connectivity) if the network is<br />
sufficiently large. However, it is quite easy to achieve synchronization in a scale-free dynamical<br />
network no matter how large the network is (Weng <strong>and</strong> Chen, 2002). Moreover, the
synchronizability of a scale-free dynamical network is robust against r<strong>and</strong>om removal of nodes,<br />
but is fragile to specific removal of the most highly connected nodes.<br />
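As a purely illustrative sketch (our addition, not part of the UltraLog work), the inhomogeneous degree distribution described above can be reproduced with a minimal preferential-attachment growth model in the spirit of the Barabasi-Albert construction; all parameter values here are arbitrary choices.

```python
# Minimal preferential-attachment growth model (Barabasi-Albert style).
# Illustrative only: parameters and implementation details are our choices.
import random

def preferential_attachment(n, m=3, seed=42):
    """Grow a network to n nodes; each new node attaches m edges to
    existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Start from a small complete core of m+1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # 'targets' holds one entry per edge endpoint, so uniform sampling
    # from it realizes degree-proportional (preferential) attachment.
    targets = [v for e in edges for v in e]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.extend((new, t))
    return edges

def degree_counts(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

deg = degree_counts(preferential_attachment(3000))
# Inhomogeneity: most nodes keep roughly the minimum degree m, while a
# small but statistically significant number act as highly connected hubs.
print(max(deg.values()), sorted(deg.values())[len(deg) // 2])
```

The heavy tail (hub degree far above the median) is the signature of the k^(−γ) distribution discussed in the text; a homogeneous random graph grown to the same size would show no such hubs.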
For a large number of real networks, however, the scale-free property and a high degree of clustering (the small-world effect) are not mutually exclusive. Yet most models proposed to describe the topology of complex networks have difficulty capturing these two features simultaneously. It has been shown in (Ravasz and Barabasi, 2003) that these two features are the consequence of a hierarchical organization present in the networks. This argument also agrees with that of Herbert Simon (Simon 1997), who argues: "we could expect complex systems to be hierarchies in a world in which complexity has to evolve from simplicity. In their dynamics, hierarchies have a property, near decomposability, that greatly simplifies their behavior. Near decomposability also simplifies the description of complex systems and makes it easier to understand how the information needed for the development of the system can be stored in reasonable compass". Indeed, many networks are fundamentally modular: one can easily identify groups of nodes that are highly interconnected with each other, but have only a few or no links to nodes outside the group to which they belong. This clearly identifiable modular organization is at the origin of the high clustering coefficient. On the other hand, these modules can be organized in a hierarchical fashion into increasingly large groups, giving rise to "hierarchical networks", while still maintaining the scale-free topology. Thus modularity, scale-free character and a high degree of clustering can be achieved under a common roof. Moreover, in hierarchical networks the degree of clustering characterizing the different groups follows a strict scaling law, which can be used to identify the presence of hierarchical structure in real networks.
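A simple way to probe for this hierarchical signature is to measure the clustering coefficient as a function of node degree, C(k), which Ravasz and Barabasi (2003) report to follow a scaling law in hierarchical networks. The sketch below is our own illustration on a toy adjacency structure, not code from the project:

```python
# Clustering coefficient per degree class, C(k): a simple diagnostic for
# hierarchical structure (Ravasz and Barabasi, 2003). Illustrative sketch.
from collections import defaultdict

def clustering_by_degree(adj):
    """adj: dict node -> set of neighbors (undirected).
    Returns {k: average clustering coefficient of degree-k nodes}."""
    per_degree = defaultdict(list)
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # clustering is undefined for degree < 2
        # Count links among the node's neighbors.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        per_degree[k].append(2.0 * links / (k * (k - 1)))
    return {k: sum(cs) / len(cs) for k, cs in per_degree.items()}

# Toy example: a triangle {0, 1, 2} with an extra leaf node 3 on the hub 0.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0},
}
print(clustering_by_degree(adj))
```

In this toy case the degree-2 nodes have C = 1 while the degree-3 hub has C = 1/3; a decay of C(k) with k over many degree classes is the kind of scaling the text refers to.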
The mathematical theory of graphs with arbitrary degree distributions, known as "generalized random graphs", can be found in (Newman et al. 2001) and (Newman 2003). Using the "generating function formulation", the authors have been able to solve the percolation problem (i.e., have found conditions for predicting the appearance of a giant component) and have obtained formulae for calculating the clustering coefficient and average path length of generalized random graphs. The authors have proposed and studied models of the propagation of diseases, failures, fads and synchronization on such graphs, and have extended their results to bipartite and directed graphs.
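The percolation condition mentioned above has a concrete form: for a graph with degree distribution p_k, a giant component is predicted when Σ_k k(k − 2)p_k > 0 (the Molloy-Reed criterion, recovered by Newman et al. 2001 through generating functions). A minimal sketch, with made-up example distributions:

```python
# Molloy-Reed criterion for generalized random graphs: a giant component
# appears when sum_k k*(k-2)*p_k > 0, i.e. <k^2> - 2<k> > 0.
# The example degree distributions below are made up for illustration.

def has_giant_component(pk):
    """pk: dict degree -> probability. Returns True if the criterion
    predicts a giant component."""
    assert abs(sum(pk.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(k * (k - 2) * p for k, p in pk.items()) > 0

# All nodes of degree 1: only isolated edges, so no giant component.
print(has_giant_component({1: 1.0}))            # False: sum = 1*(1-2) = -1
# A mix with enough degree-3 nodes tips the sum positive.
print(has_giant_component({1: 0.3, 2: 0.4, 3: 0.3}))  # True: sum = 0.6
```

Degree-2 nodes contribute nothing to the sum (k − 2 = 0); the criterion is a competition between dead-end degree-1 nodes and branching nodes of degree 3 or more.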
Network dynamics, though in its infancy, promises a formal framework to characterize the organizational and functional aspects of supply chains. With the changing trends in supply chains, many new issues have become critical, such as organizational resistance to change, inter-functional or inter-organizational conflicts, relationship management, and consumer and market behavior. Such problems are ill-structured and behavioral, and cannot commonly be addressed by analytical tools such as mathematical programming. Successful supply chain integration depends on the supply chain partners' ability to synchronize and share real-time information. The establishment of collaborative relationships among supply chain partners is a prerequisite to information sharing. As a result, successful supply chain management relies on systematically studying questions such as: 1) what are the robust architectures for collaboration, and what are the coordination strategies that lead to such architectures; 2) if different entities make decisions on whether or not to cooperate on the basis of imperfect information about the group activity, and incorporate expectations on how their decision will affect other entities, can overall cooperation be sustained for long periods of time; 3) how do expectations, group size, and diversity affect coordination and cooperation; and 4) which kinds of organizations are most able to sustain ongoing collective action, and how might such organizations evolve over time? Network dynamics addresses many of these questions and should be explored in the context of supply chains.
8. Conclusions and Future Work
The idea of managing the whole supply chain and transforming it into a highly autonomous, dynamic, agile, adaptive and reconfigurable network certainly provides an appealing vision for managers. The infrastructure provided by information technology has made this vision partially realizable. But the inherent complexity of supply chains makes the efficient utilization of information technology an elusive endeavor. Tackling this complexity has been beyond the existing tools and techniques, and requires their revival and extension.
As a result, we emphasized in this paper that in order to effectively understand a supply chain network, it should be treated as a CAS. We laid down some initial ideas for extending the modeling and analysis of supply chains using the concepts, tools and techniques arising in the study of CAS. As future work, we need to verify the feasibility and usefulness of the proposed techniques in the context of large-scale supply chains.
Acknowledgements
The authors wish to acknowledge DARPA (Grant #: MDA972-1-1-0038, under the UltraLog Program) for their generous support of this research. In addition, the partial support provided by NSF (Grant #: DMII-0075584) for Professor Kumara is greatly appreciated.
References
Abarbanel, H. D. I., 1996, The Analysis of Observed Chaotic Data, Springer-Verlag, New York.
Abarbanel, H. D. I. and Kennel, M. B., 1993, Local False Nearest Neighbors and Dynamical Dimensions from Observed Chaotic Data, Phys. Rev. E, 47, 3057-3068.
Adami, C., 1998, Introduction to Artificial Life, Springer-Verlag.
Albert, R. and Barabasi, A. L., 2002, Statistical Mechanics of Complex Networks, Reviews of Modern Physics, 74, 47.
Albert, R., Barabasi, A. L., Jeong, H. and Bianconi, G., 2000, Power-law distribution of the World Wide Web, Science, 287, 2115.
Albert, R., Jeong, H. and Barabasi, A. L., 2000, Error and attack tolerance of complex networks, Nature, 406, 378-382.
Balakrishnan, A., Kumara, S. and Sundaresan, S., 1999, Exploiting Information Technologies for Product Realization, Information Systems Frontiers, A Journal of Research and Innovation, 1(1), 25-50.
Barabasi, A. L., July 2000, The Physics of the Web, Physics Web.
Barabasi, A. L., Albert, R. and Jeong, H., 2000, Scale-free characteristics of random networks: The topology of the World Wide Web, Physica A, 281, 69-77.
Baranger, M., Chaos, Complexity, and Entropy: A physics talk for non-physicists, http://necsi.org/projects/baranger/cce.pdf.
Bar-Yam, Y., 1997, Dynamics of Complex Systems, Reading, Mass.: Addison-Wesley.
Bollobas, B., 1985, Random Graphs, Academic Press, London.
Callaway, D. S., Newman, M. E. J., Strogatz, S. H. and Watts, D. J., 2000, Network robustness and fragility: Percolation on random graphs, Phys. Rev. Lett., 85, 5468-5471.
Carlson, J. M. and Doyle, J., 1999, Highly optimized tolerance: a mechanism for power laws in designed systems, Physical Review E, 60(2), 1412-1427.
Casdagli, M., 1989, Nonlinear prediction of chaotic time series, Physica D, 35, 335-356.
Choi, T. Y., Dooley, K. J. and Rungtusanatham, M., 2001, Supply networks and complex adaptive systems: control versus emergence, Journal of Operations Management, 19(3), 351-366.
Cooper, M. C., Lambert, D. M. and Pagh, J. D., 1997, Supply chain management: More than a new name for logistics, The International Journal of Logistics Management, 8(1), 1-13.
Crutchfield, J. P., 1992, Knowledge and Meaning ... Chaos and Complexity, in Modeling Complex Systems, L. Lam and H. C. Morris, editors, Springer-Verlag, Berlin, 66-101.
Crutchfield, J. P., 1994, The Calculi of Emergence: Computation, Dynamics and Induction, Physica D, 75, 11-54.
Crutchfield, J. P. and Young, K., 1989, Inferring Statistical Complexity, Physical Review Letters, 63, 105-108.
Crutchfield, J. P. and Feldman, D. P., 2001, Synchronizing to the Environment: Information Theoretic Constraints on Agent Learning, Advances in Complex Systems, 4, 251-264.
Crutchfield, J. P. and Feldman, D. P., 2003, Regularities Unseen, Randomness Observed: Levels of Entropy Convergence, Chaos (submitted).
Csete, M. E. and Doyle, J., 2002, Reverse Engineering of Biological Complexity, Science, 295, 1664.
Dorogovtsev, S. N. and Mendes, J. F. F., 2002, Evolution of networks, Advances in Physics, 51, 1079-1187.
Erramilli, A. and Forys, L. J., 1991, Oscillations and Chaos in a Flow Model of a Switching System, IEEE Journal on Selected Areas in Communications, 9(2), 171-178.
Farmer, J. D., Ott, E. and Yorke, J. A., 1983, The dimension of chaotic attractors, Physica D, 7, 153-180.
Farmer, J. D. and Sidorowich, J. J., 1987, Predicting chaotic time series, Physical Review Letters, 59(8), 845-848.
Feichtinger, G., Hommes, C. H. and Herold, W., 1994, Chaos in a Simple Deterministic Queuing System, ZOR - Mathematical Methods of Operations Research, 40, 109-119.
Feldman, D. P. and Crutchfield, J. P., Discovering Noncritical Organization: Statistical Mechanical, Information Theoretic, and Computational Views of Patterns in One-Dimensional Spin Systems, Santa Fe Institute Working Paper 98-04-026.
Flake, G. W., 1998, The Computational Beauty of Nature, MIT Press.
Forrester, J. W., 1961, Industrial Dynamics, Cambridge: MIT Press.
Fraser, A. M. and Swinney, H. L., 1986, Independent coordinates for strange attractors from mutual information, Phys. Rev. A, 33(2), 1134-1140.
Ghosh, S., 2002, The Role of Modeling and Asynchronous Distributed Simulation in Analyzing Complex Systems of the Future, Information Systems Frontiers, A Journal of Research and Innovation, 4(2), 166-171.
Glance, N. S., 1993, Dynamics with Expectations, PhD Thesis, Physics Department, Stanford University.
Hogg, T. and Huberman, B. A., 1988, The Behavior of Computational Ecologies, in The Ecology of Computation, North-Holland, 77-116.
Hogg, T. and Huberman, B. A., 1991, Controlling Chaos in Distributed Systems, IEEE Trans. on Systems, Man and Cybernetics, 21, 1325-1332.
Kaneko, K. and Tsuda, I., 1996, Complex Systems: Chaos and Beyond, Springer-Verlag.
Kennel, M., Brown, R. and Abarbanel, H. D. I., 1992, Determining embedding dimension for phase-space reconstruction using a geometrical construction, Phys. Rev. A, 45(6), 3403-3411.
Kephart, J. O., Hogg, T. and Huberman, B. A., 1989, Dynamics of Computational Ecosystems, Physical Review A, 40(1), 404-421.
Kephart, J. O., Hogg, T. and Huberman, B. A., 1990, Collective Behavior of Predictive Agents, Physica D, 42, 48-65.
Kumara, S., Ranjan, P., Surana, A. and Narayanan, V., Decision Making in Logistics: A Chaos Theory Based Analysis, Annals of the International Institution for Production Engineering Research (Annals of CIRP) (accepted, to appear).
Lee, S., Gautam, N., Kumara, S., Hong, Y., Gupta, H., Surana, A., Narayanan, V., Thadakamalla, H., Brinn, M. and Greaves, M., 2002, Situation Identification Using Dynamic Parameters in Complex Agent-Based Planning Systems, Intelligent Engineering Systems Through Artificial Neural Networks, 12, 555-560.
Lloyd, S. and Slotine, J. J. E., 1996, Information theoretic tools for stable adaptation and learning, Int. Journal of Adaptive Control and Signal Processing, 10, 499-530.
Maxion, R. A., Toward Diagnosis as an Emergent Behavior in a Network Ecosystem, Physica D, 42, 66-84.
Min, H. and Zhou, G., 2002, Supply chain modeling: past, present and future, Computers and Industrial Engineering, 43, 231-249.
Mukherjee, S., Osuna, E. and Girosi, F., 1997, Nonlinear Prediction of Chaotic Time Series Using Support Vector Machines, IEEE Workshop on Neural Networks for Signal Processing VII, 511-519.
Newman, M. E. J., 2000, Models of the small world, J. Stat. Phys., 101, 819-841.
Newman, M. E. J., 2002, The spread of epidemic disease on networks, Phys. Rev. E, 66.
Newman, M. E. J., 2003, Random graphs as models of networks, in Handbook of Graphs and Networks, S. Bornholdt and H. G. Schuster (eds.), Wiley-VCH, Berlin.
Newman, M. E. J., Strogatz, S. H. and Watts, D. J., 2001, Random graphs with arbitrary degree distributions and their applications, Phys. Rev. E, 64.
Ott, E., 1996, Chaos in Dynamical Systems, Cambridge University Press.
Powell, M. J. D., 1987, Radial basis function approximation to polynomials, preprint, University of Cambridge.
Rasmussen, D. R. and Mosekilde, E., 1988, Bifurcations and chaos in a generic management model, European Journal of Operational Research, 35, 80-88.
Ravasz, E. and Barabasi, A. L., 2003, Hierarchical organization in complex networks, Physical Review E, 67.
Sano, M. and Sawada, Y., 1985, Measurement of the Lyapunov Spectrum from a Chaotic Time Series, Phys. Rev. Lett., 55, 1082-1084.
Sawhill, B. K., 1993, Self-Organized Criticality and Complexity Theory, in Lectures in Complex Systems, edited by L. Nadel and D. L. Stein, Addison Wesley Longman, 143-170.
Schieritz, N. and Grobler, A., 2003, Emergent Structures in Supply Chains - A Study Integrating Agent-Based and System Dynamics Modeling, paper presented at the 36th Annual Hawaii International Conference on System Sciences, Big Island.
Shalizi, C. R. and Crutchfield, J. P., Computational Mechanics: Pattern and Prediction, Structure and Simplicity, SFI Working Paper 99-07-044.
Shalizi, C. R., 2001, Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata, http://www.santafe.edu/~shalizi/thesis.
Simon, H. A., 1997, The Sciences of the Artificial, 3rd Edition, Cambridge, MA: The MIT Press.
Strogatz, S. H., 1994, Nonlinear Dynamics and Chaos, Addison-Wesley, Reading, MA.
Strogatz, S. H., 2001, Exploring complex networks, Nature, 410, 268-276.
Takens, F., 1981, in Dynamical Systems and Turbulence, Warwick 1980, edited by D. Rand and L. S. Young, Lecture Notes in Mathematics No. 898, Springer, Berlin, 366.
Wang, X. F. and Chen, G., 2002, Synchronization in Scale-Free Dynamical Networks: Robustness and Fragility, IEEE Transactions on Circuits and Systems I - Fundamental Theory and Applications, 49(1), 54-62.
Watts, D. J. and Strogatz, S. H., 1998, Collective dynamics of 'small-world' networks, Nature, 393, 440-442.
Wolfram, S., 1994, Cellular Automata and Complexity: Collected Papers, Reading, Mass.: Addison-Wesley Pub. Co.
Decision Making in Logistics: A Chaos Theory Based Analysis
S. R. T. Kumara, P. Ranjan, A. Surana, V. Narayanan
The Pennsylvania State University
310 Leonhard Building, University Park, PA 16802
Abstract
Logistics systems in general are complex systems. In this paper we investigate the existence of chaos in logistics systems. Such an investigation is necessary in order to use appropriate and correct methods for further analysis, as linear systems techniques will not be useful. If a system exhibits chaos, decision-making should consider the system characterization parameters from a chaos theory perspective. In this paper, we consider a non-preemptive queuing model and its extensions to the logistics domain. A prototypical supply chain example is used and the resulting behavior is characterized. At certain input values, the behavior of the logistics system exhibits chaos. This information is useful for further analysis, prediction and control. The working prototype is implemented in the DARPA Cougaar agent architecture.
Keywords:
Non-linear Dynamics, Production, Distributed
1 INTRODUCTION
In a logistics system, one of the most fundamental questions is the analysis of the system behavior. We define the system as the entities (software and hardware) along with their interconnections (network). A typical logistics system is characterized by a supply chain. Our hypothesis is that these systems are nonlinear, dynamic and, specifically, chaotic. That is, the time evolution of the system behavior (measured by certain behavioral parameters of the system) is chaotic. The questions now are: how do we really characterize this time evolution, and how can we use the insights obtained from such an analysis? This paper deals with these questions. We first give a brief explanation of the notion of nonlinear dynamics and then continue the discussion.
2 NONLINEAR DYNAMICS, CHAOS AND FRACTALS
In this section, we present a concise description of nonlinear dynamics, chaos and fractals. During the past decade, chaos theory has elicited a lot of interest among scientists and researchers. As a result, its ideas are beginning to be applied to many scientific and engineering disciplines, especially where nonlinear models are relevant [1].
Many physical systems that produce a continuous-time response may be modeled by a set of differential equations of the form

dx(t)/dt = F(x(t))     (1)

where F(·) is generally a nonlinear vector field. The solution to this results in a trajectory

x(t) = f(x(0), t)     (2)

where f : M → M represents the flow that determines the evolution of x(t) for a particular initial condition x(0).
If the system is dissipative, as the system evolves from different initial conditions, the solutions usually shrink asymptotically to a compact subset of the whole state space M. This compact subset is called an attracting set. Every attracting disjoint subset of an attracting set is called an attractor [1].
In dissipative systems, the overall volume of the state space shrinks with time. However, there may be some directions along which the state space actually expands. That is, the system trajectories tend to move apart along certain directions and shrink along the others. However, as the attractors usually remain bounded, the flow exhibits a horseshoe-type pattern [2]. Because of this, trajectories starting from nearby points within an attractor may become separated exponentially as the system evolves. This condition is known as sensitive dependence on initial conditions (SDIC), and an attractor exhibiting SDIC is called a strange attractor.
A flow f, for a particular initial condition, is said to be chaotic if the trajectories in an attractor exhibit sensitive dependence on initial conditions; bounded, irregular and aperiodic behavior; and a continuous broadband spectrum.
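SDIC is easy to demonstrate numerically. The sketch below (a generic textbook illustration we add here, unrelated to the logistics model) iterates the logistic map x_{n+1} = r x_n (1 − x_n) in its chaotic regime from two initial conditions that differ by only 10^(−10):

```python
# Illustration of sensitive dependence on initial conditions (SDIC)
# using the logistic map with r = 4 (chaotic regime). This is a generic
# example, not the queuing model analyzed later in this paper.

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

def trajectory(x0, n, r=4.0):
    xs = [x0]
    for _ in range(n):
        xs.append(logistic(xs[-1], r))
    return xs

a = trajectory(0.3, 50)
b = trajectory(0.3 + 1e-10, 50)  # perturb the initial condition by 1e-10

# The separation grows roughly like exp(lambda * n); for r = 4 the
# Lyapunov exponent is lambda = ln 2, so ~35 iterations are enough for
# the 1e-10 perturbation to grow to order 1.
sep0 = abs(a[0] - b[0])
sep50 = abs(a[50] - b[50])
print(sep0, sep50)
```

Both trajectories remain bounded in [0, 1] (the attractor is bounded), yet after a few dozen steps they are macroscopically different, which is exactly the combination of properties in the definition above.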
The irregular and aperiodic response of chaotic systems usually betrays a special property of self-similarity or scale invariance. That is, the response appears similar over multiple scales of observation. Scale-invariant mathematical entities are commonly known as fractals. Analytical techniques used to deduce the characteristics of nonlinear systems collectively constitute fractal analysis. The main objectives of fractal analysis can be broadly categorized, depending on the end purpose, as follows: identification of the presence of chaos from the system response; establishing the invariants of the system dynamics for system identification or indirect state estimation; and chaos modeling, when the end purpose is to capture and later reproduce the system dynamics. A more detailed description of these concepts may be found in [3]. For applications of nonlinear dynamics in the modeling and control of complex production systems, refer to [4][5].
In the rest of this paper, we report a queuing model that is useful in supply chain analysis. We explain our rationale for selecting this model for adaptation to the logistics domain. For some of the other models existing in the literature, refer to [6][7]. We extend the queuing model, apply it to a logistics scenario in the Cougaar architecture, and discuss the results. We raise the fundamental question of how we can use these results for further analysis and control of a logistics system.
3 SUPPLY CHAIN AND NONLINEARITY
The notion of evolution over time falls into the realm of what physicists call dynamics. Logistics systems are dynamic, and their behavior can be nonlinear. Therefore we can model a logistics system using the principles of nonlinear dynamics. A supply chain is an example of a logistics system. A typical supply chain exhibits stable behavior with damped oscillations in response to external disturbances. Unstable phenomena, however, can arise due to the feedback structure, inherent adjustment delays [6], nonlinear decision-making [7] and the interactions that go on in a supply chain. One of the causes of unstable phenomena is that the information feedback in the system is slow relative to the rate of changes that occur in the system. Nonlinearity is inherent in a supply chain. The first mode of unstable behavior to arise in nonlinear systems is usually simple one-cycle self-sustained oscillations. If the instability drives the system further into the nonlinear regime, more complicated temporal behavior may be generated. The route to chaos through subsequent period-doubling bifurcations, as certain parameters of the system are varied, is generic to a large class of systems in physics, chemistry, biology, economics and other fields. Functioning in a chaotic regime deprives us of the ability to make long-term predictions about the behavior of the system, while short-term predictions may sometimes be possible. As a result, control and stabilization of such a system become almost impossible. Here we investigate such dynamical behaviors as they arise in models that represent some of the components of a supply chain.
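The period-doubling route mentioned above can be seen in even a one-dimensional map. The sketch below (our illustration using the standard logistic map, not a supply chain model) detects the period of the attractor for several parameter values, passing from period 1 to 2 to 4 and finally to a regime with no short period (chaos):

```python
# Period-doubling route to chaos, illustrated with the logistic map
# x -> r*x*(1-x). A generic example added for illustration only.

def attractor_period(r, x0=0.5, transient=2000, max_period=64, tol=1e-6):
    """Iterate past the transient, then return the smallest p such that
    the orbit repeats every p steps (within tol); None if no period
    up to max_period is found (e.g. in the chaotic regime)."""
    x = x0
    for _ in range(transient):
        x = r * x * (1 - x)
    orbit = [x]
    for _ in range(2 * max_period):
        orbit.append(r * orbit[-1] * (1 - orbit[-1]))
    for p in range(1, max_period + 1):
        if all(abs(orbit[i + p] - orbit[i]) < tol for i in range(max_period)):
            return p
    return None

for r in (2.9, 3.2, 3.5, 3.9):
    print(r, attractor_period(r))  # periods 1, 2, 4, then None (chaos)
```

Each successive bifurcation doubles the period; beyond the accumulation point the orbit is aperiodic, which is the qualitative behavior the queuing model below also exhibits as its parameters are varied.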
3.1 Non-preemptive Queuing Model with Delays
The queuing system [8] considered here has two queues (A and B) and a single server, with the following characteristics:
• Once served, a class A customer returns as a class B customer after a constant interval of time.
• Class B has non-preemptive priority over class A, i.e., the class A queue does not get served until the class B queue is emptied.
• Schedules are organized every T units of time, i.e., if the low-priority queue is emptied within time T, the server remains idle for the remaining time interval.
• Finally, the higher-priority class B has a lower service rate than the lower-priority class A.
Suppose the system is sampled at the end of every schedule cycle, and the following quantities are observed at the beginning of the kth interval: A_k, the queue length of the low-priority queue; B_k, the queue length of the high-priority queue; C_k, the outflow from the low-priority queue in the kth interval; and D_k, the outflow from the high-priority queue in the kth interval. In the model, λ_k denotes the arrival rate, µ_a is the service rate for the lower-priority queue, µ_b is the service rate for the higher-priority queue, and l is the feedback interval in units of the schedule cycle.
The following four equations then completely describe the evolution of the system:

A_{k+1} = A_k + λ_k − C_k     (3)

C_k = min(A_k + λ_k, µ_a(1 − D_k/µ_b))     (4)

B_{k+1} = B_k + C_{k−l} − D_k     (5)

D_k = min(B_k + C_{k−l}, µ_b)     (6)

Equations (3) and (5) are merely conservation rules, while equations (4) and (6) model the constraints on the outflows and the interaction between the queues. This model, while conceptually simple, exhibits surprisingly complex behaviors. The dynamical behavior reported in [8] is summarized in the following.
Figure 1: Non-preemptive Queuing Model
Dynamical Behavior: The analytic approach to solving the flow model under constant arrivals (i.e., λ_k = λ for all k) shows several classes of solutions. The system is found to batch its workload even for such perfectly smooth arrival patterns. The following are the characteristics of the behavior of the system:
• Above a threshold arrival rate (λ ≥ µ_b/2), a momentary overload can send the system into any of a number of stable modes of oscillation.
• Each mode of oscillation is characterized by distinct average queuing delays.
• Extreme sensitivity to parameters, and the existence of chaos, imply that the system at a given time may be in any one of a number of distinct steady-state modes.
The batching of the workload can cause significant queuing delays even at moderate occupancies. Such oscillatory behavior also significantly lowers the real-time capacity of the system.
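Equations (3)-(6) are straightforward to simulate directly. The sketch below is our illustration; the parameter values are hypothetical choices satisfying the overload condition λ ≥ µ_b/2, not values taken from [8]:

```python
# Direct simulation of the two-queue flow model, equations (3)-(6).
# Parameter values below are illustrative choices, not those of [8].
from collections import deque

def simulate(lam=0.6, mu_a=2.0, mu_b=1.0, lag=5, steps=200, a0=10.0):
    """Iterate the map (lag plays the role of the feedback interval l):
       A[k+1] = A[k] + lam - C[k]
       C[k]   = min(A[k] + lam, mu_a * (1 - D[k]/mu_b))
       B[k+1] = B[k] + C[k-l] - D[k]
       D[k]   = min(B[k] + C[k-l], mu_b)
    """
    A, B = a0, 0.0                         # a0 > 0 models a momentary overload
    c_hist = deque([0.0] * lag, maxlen=lag)  # delayed low-priority outflows
    out = []
    for _ in range(steps):
        c_delayed = c_hist[0]              # C[k - l]
        D = min(B + c_delayed, mu_b)       # eq. (6)
        C = min(A + lam, mu_a * (1.0 - D / mu_b))  # eq. (4)
        A = A + lam - C                    # eq. (3)
        B = B + c_delayed - D              # eq. (5)
        c_hist.append(C)
        out.append((A, B, C, D))
    return out

hist = simulate()
# With lam >= mu_b/2 and an initial overload, the low-priority outflow C
# tends to oscillate (workload batching) rather than settle at the
# arrival rate.
c_tail = [c for (_, _, c, _) in hist[-50:]]
print(min(c_tail), max(c_tail))
```

Note the update order inside the loop: D_k must be computed before C_k (equation (4) uses D_k), and both queue lengths are updated afterwards; the deque holds the last l values of C so that C_{k−l} is available.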
4 APPLICATION OF QUEUING MODEL TO LOGISTICS SCENARIO
The assumptions in the model proposed by [8] are generic, in the sense that priorities are widely observed in large systems due to economic and administrative compulsions. Sometimes they can also arise from physical facts, when two different stages of processing have a certain temporal constraint. Priorities may also arise due to the non-homogeneity of the system, where the "knowledge" level of one agent is different from that of another.
Varying service times again follow from physical constraints on the tasks. For example, in a simple logistics scenario, tasks like unpacking, shipping, logging and dispatching may take different times. These time scales can vary widely depending on the nature and physical characteristics of the tasks.
The considerations regarding the generality of the assumptions, and the clear one-to-one correspondence between the physical logistics tasks and the model parameters described in [8], made us apply the queuing model to a simple, yet realistic, logistics scenario.
4.1 Example Logistics Scenario
The example scenario consists of two stages modeled by the non-preemptive queuing formalism. We take a simple battlefront scenario (this can be any context of supply of materials, not necessarily a battlefront). During the first stage, supplies are processed by the node (agent). This involves two tasks: Unpacking (Task A) and Shipping (Task B). Our assumptions are that shipping takes more resources than unpacking, shipping gets non-preemptive priority, and resources are common to both tasks.
The second stage consists of the disbursement of supplies. The output of the first stage feeds into the second stage (as arrivals). The two associated tasks are: maintaining an inventory (Task A) and disbursing the supply to the troops (Task B). The assumptions at stage two are that disbursing takes more resources than maintaining inventory, disbursing has non-preemptive priority, and resources are common to both tasks.
Figure 1 shows the queuing model; this figure is reproduced from [8]. It must be noted that the rules are very simple and generic. Priority and heterogeneity are fundamental to any logistics planning and scheduling. Tasks have to be prioritized in order to do the most important thing first. This comes naturally as we try to optimize an objective and assign the tasks their "importance." In addition, in all logistics systems, resources are limited, both in time and space. The temporal constraints considered in the example are realistic, in the sense that you cannot disburse supplies without unpacking them. Temporal dependence plays an important role in logistics planning (interdependency). This simple example also simulates the effect of arbitrary but bounded initial conditions.
Cougaar (Cognitive Agent Architecture) was developed under the DARPA Advanced Logistics Program (ALP); the survivability of Cougaar is addressed in the UltraLog program of DARPA. In the above example each stage is modeled as an agent, and the activities are modeled as agent processes. We do not discuss the Cougaar architecture in this paper; details can be found at http://www.cougaar.org.
4.2 Analysis
One of the hallmarks of chaos is sensitive dependence on initial conditions (SDIC). The external environment (the world in which the logistics scenario resides) changes, and thereby changes the initial conditions and the parameters. The following affect the initial conditions and parameters of the agents (and thereby the initial conditions of the queuing model): changes in the arrival rate of supplies (inputs to the agents), changes in the resources (assets) available to each agent, and delays in the processing of tasks.
The internal states of the two agents are characterized by: supplies waiting to be shipped (X1), supplies waiting to be unpacked (X2), supplies actually shipped, supplies waiting to be inventoried (X3), supplies waiting to be disbursed to the troops (X4), and supplies actually disbursed. We have considered these variables and observed their behavior; characterization of these behaviors leads to some interesting inferences.
We simulated the queuing models in each agent with the following model parameters. There are 162 personnel in each of the agents, who can be allocated to either task. We assume that it takes one unit of time and one person to do Task A, and one unit of time with two people to do Task B. This defines the capacity as 54 items/unit time, so the arrival rate can range from 0 to 54 per unit time. We assume that the initial conditions are given by: X1(0)=131, X2(0)=201, X3(0)=151 and X4(0)=29.
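The bookkeeping implied by these parameters can be sketched as a short discrete-time simulation. The sketch below is in Python/NumPy rather than the Matlab actually used, and the greedy allocation rule (the priority task is served first each time unit) is our illustrative assumption, not the exact model of [8]:

```python
import numpy as np

def simulate_stage(arrivals, x_a0, x_b0, personnel=162, cost_a=1, cost_b=2):
    """One stage of the two-task queue with non-preemptive priority for Task B.

    Task A (e.g., unpacking) needs cost_a people for one time unit per item;
    Task B (e.g., shipping) needs cost_b people and has priority. Completed
    Task A items join the Task B queue; Task B completions leave the stage.
    """
    x_a, x_b = x_a0, x_b0
    hist_a, hist_b, output = [], [], []
    for arr in arrivals:
        done_b = min(x_b, personnel // cost_b)   # priority task served first
        free = personnel - done_b * cost_b
        done_a = min(x_a, free // cost_a)        # leftover staff go to Task A
        x_a = x_a - done_a + arr                 # new supplies join queue A
        x_b = x_b - done_b + done_a              # Task A output joins queue B
        hist_a.append(x_a)
        hist_b.append(x_b)
        output.append(done_b)
    return np.array(hist_a), np.array(hist_b), np.array(output)

# Stage 1 with the parameters above: arrival rate 53, X1(0)=131, X2(0)=201
# (X2 is the unpacking queue, X1 the shipping queue).
x2, x1, out = simulate_stage(np.full(200, 53), x_a0=201, x_b0=131)
```

Each item consumes 1 + 2 = 3 person-time-units in total, so 162 personnel sustain at most 162/3 = 54 items per unit time, which is where the 0-54 arrival-rate range comes from. The stage 1 output `out` would feed a second `simulate_stage` call as its `arrivals`.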
[Figure 2: Plots for arrival rate = 53. (a) Time evolution of the system states X1 and X2; (b) power spectrum of state X1 (magnitude in dB vs. frequency); (c) state-space trajectories (x1, x3 vs. x2, x4) for center 1 (red) and center 2 (blue); (d) multi-stability: bifurcation diagram for the system.]
We have used Matlab for the computations and have experimented with several arrival rates and delays. We observe the state-space structure (time evolution) of the following: arrival rates at all the queues, time series of the various parameters, and the power spectrum.
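The power spectrum used here (Figure 2b) can be computed directly from a queue-length time series via the FFT. A minimal sketch (Python/NumPy rather than Matlab, with a synthetic period-2 series standing in for the actual X1 trace):

```python
import numpy as np

def power_spectrum_db(x):
    """Magnitude spectrum in dB of a zero-meaned time series."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x - x.mean())      # remove the DC component first
    power_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    freqs = np.fft.rfftfreq(len(x))           # cycles per time step, 0..0.5
    return freqs, power_db

# A period-2 series (like the arrival-rate-50 regime) concentrates its power
# at the Nyquist frequency 0.5; a chaotic series spreads it over many peaks.
freqs, p_db = power_spectrum_db(np.tile([40.0, 60.0], 128))
```

A broad, multi-peaked spectrum is a first hint of chaotic rather than simply periodic dynamics, though it cannot by itself distinguish chaos from noise.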
At an arrival rate of 40 the system has a period of 1, at an arrival rate of 50 a period of 2, at 52 a period of 4, and at 53 the system shows seemingly random behavior, with relatively irregular dynamics and several different peaks in the power spectrum. The bifurcation diagram shows that at the arrival rate of 53 the system is chaotic. We show illustrative plots in Figure 2. The time evolution (2a) clearly shows the presence of many periods, indicating the possible existence of chaotic behavior.
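This period-1 → 2 → 4 → aperiodic progression is the classic period-doubling route to chaos. Since the queue equations are not reproduced here, the sketch below demonstrates the same diagnostic on the logistic map, a standard stand-in where the route is well known: iterate past the transient, then find the smallest repeating period of the steady-state orbit.

```python
import numpy as np

def steady_orbit(f, x0=0.4, transient=2000, n=64):
    """Iterate map f past its transient, then record n steady-state points."""
    x = x0
    for _ in range(transient):
        x = f(x)
    orbit = np.empty(n)
    for i in range(n):
        x = f(x)
        orbit[i] = x
    return orbit

def detect_period(orbit, max_period=16, tol=1e-6):
    """Smallest p with orbit[t+p] == orbit[t]; None if aperiodic up to max_period."""
    for p in range(1, max_period + 1):
        if np.allclose(orbit[p:], orbit[:-p], atol=tol):
            return p
    return None

# Period-doubling route for the logistic map x -> r*x*(1-x):
periods = {r: detect_period(steady_orbit(lambda x, r=r: r * x * (1 - x)))
           for r in (2.9, 3.2, 3.5, 3.9)}
# periods -> {2.9: 1, 3.2: 2, 3.5: 4, 3.9: None}
```

The same test applied to the simulated queue lengths at arrival rates 40, 50, 52 and 53 is what underlies the periods reported above.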
4.3 Discussion
We could successfully show, for certain initial conditions, the existence of chaos in this simple yet realistic logistics system. At an arrival rate of 53 the underlying queuing model leads to chaotic behavior of the number of jobs waiting to be processed. The bifurcation diagram points to the fact that the Xj (for j=1,2,3,4) exhibit aperiodic behavior. The physical implication is that the resources needed vary from time to time, and the logistics system will exhibit nervousness, which is an undesirable property. We have also observed a cascading effect: when one agent enters chaotic behavior, the connected agent also tends to exhibit chaos. As a result, the planning of later stages faces much more uncertainty than that of the first stage, even for simple fixed deterministic arrivals. We have also observed an increased average delay: delay increases by 25% if the system starts batching the load. From our analysis we can conclude that if the two agents start load batching, the inventory requirement may go to 200%, as evident from the plots.
In this case it is necessary to keep the arrival rate below 53, thereby enforcing control policies that keep the system stable or quasi-stable. If the system ends up being chaotic, we could perform further analysis to study its characteristics and use them to control the behavior in the short term. We also compute the average mutual information, global dimension, local dimension, correlation dimension and largest Lyapunov exponent. These computed values also indicate the existence of chaos in this logistics system.
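Among these invariants, the largest Lyapunov exponent is the most direct chaos indicator: a positive value means nearby trajectories diverge exponentially. For a one-dimensional map with a known derivative it is simply the orbit average of log |f'(x)|; the sketch below again uses the logistic map as a stand-in, since the queuing model itself is not reproduced here.

```python
import numpy as np

def logistic_lyapunov(r, x0=0.4, transient=1000, n=20000):
    """Largest Lyapunov exponent of x -> r*x*(1-x), computed as the
    orbit average of log|f'(x)| = log|r*(1 - 2x)| along the attractor."""
    x = x0
    for _ in range(transient):                  # discard the transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n):
        total += np.log(abs(r * (1 - 2 * x)))   # local stretching rate
        x = r * x * (1 - x)
    return total / n

# Negative on a periodic attractor, positive in the chaotic regime:
lam_periodic = logistic_lyapunov(2.9)   # converges to a fixed point
lam_chaotic = logistic_lyapunov(3.9)    # chaotic band
```

For observed data where the derivative is unknown, estimators such as Wolf's or Rosenstein's compute the exponent from the divergence of nearest neighbors in the reconstructed phase space instead.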
5 SUMMARY
Chaotic behavior in deterministic dynamical systems is an intrinsically non-linear phenomenon. We could successfully show that a simple example logistics system is chaotic. A characteristic feature of a chaotic system is extreme sensitivity to changes in initial conditions, while the dynamics, at least for so-called dissipative systems, is still constrained to a finite region of the state space called an attractor. In such instances, Fourier analysis and ARMA models may not be useful to study the time traces of supply chain systems. The need to extract interesting physical information about the dynamics of observed systems when they are operating in a chaotic regime has led to the development of nonlinear time series analysis techniques. Systematically, the study of potentially chaotic systems may be divided into three areas: identification of chaotic behavior; modeling and prediction; and control. The first area shows how chaotic systems may be separated from stochastic ones and, at the same time, provides estimates of the degrees of freedom and the complexity of the underlying chaotic system. Based on such results, identification of a state space representation allowing for subsequent predictions may be carried out. The last stage, if desirable, involves control of the chaotic system. In this short paper we have concentrated on the first area, i.e., identification of chaotic behavior. In general, if we consider this step in the spatio-temporal regime, the following tasks need to be accomplished [9]:
1. Signal separation (finding the signal): separation of a broadband signal from broadband "noise" using the deterministic nature of the signal.
2. Phase space reconstruction (finding the space): time-lagged variables are used to form coordinates for a phase space in the embedding dimension. The embedding dimension can be determined using the false nearest neighbors test, and the time lag using mutual information.
3. Classification of the signal: determination of invariants of the system, such as Lyapunov exponents and various fractal dimensions.
4. Making models and prediction: determination of the parameters of the assumed model which are consistent with the invariant classifiers (such as Lyapunov exponents and dimensions).
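Steps 2 and 3 operate on a scalar time series. A minimal sketch of two basic ingredients, the time-delay embedding and a histogram estimate of the average mutual information used to choose the lag, follows (an illustrative Python version; the function names are ours, not from [9]):

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Map a scalar series into dim-dimensional time-lagged vectors
    [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

def average_mutual_information(x, lag, bins=16):
    """Histogram estimate of I(x(t); x(t+lag)) in nats.  The first
    minimum over lag is a common choice for the embedding delay tau."""
    a, b = x[:-lag], x[lag:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px * py)[nz])))

x = np.sin(0.3 * np.arange(2000))        # toy stand-in for a queue-length trace
vectors = delay_embed(x, dim=3, tau=5)   # shape (1990, 3)
ami = average_mutual_information(x, lag=5)
```

The false nearest neighbors test then increases `dim` until the fraction of neighbors that separate when one more coordinate is added drops to (near) zero.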
In this paper the non-preemptive queuing model is applied in detail to a part of a supply chain: two agents interacting in a military logistics scenario. The queuing model forms the processing component of the logistics agents implemented in the Cougaar architecture. One of the manifestations of complexity is the onset of chaos, and our analysis shows the cascading effect of chaos. This points to the conjecture that the supply chain may exhibit chaotic behavior. The underlying motivation of our study is to build control models. Our next step in this research is to build adaptive predictive and control models for larger supply chain networks from the insights we have derived from the current analysis.
6 ACKNOWLEDGMENTS
The authors acknowledge DARPA for its support of this research (Grant #: MDA 972-01-1-0563) under the UltraLog program. The help of Seokcheon Lee, Yunho Hong and Hariprasad, T. is greatly appreciated.
7 REFERENCES
[1] Isham, V., 1993, Statistical Aspects of Chaos: A Review, in Networks and Chaos - Statistical and Probabilistic Aspects, Barndorff-Nielsen et al. (editors).
[2] Wiggins, S., 1990, Introduction to Applied Nonlinear Dynamical Systems and Chaos, Springer-Verlag New York, Inc.
[3] Bukkapatnam, S. T. S., Kumara, S. and Lakhtakia, A., 2000, Fractal Estimation of Flank Wear in Turning, ASME Journal of Dynamic Systems, Measurement, and Control, 122:89-94.
[4] Reiter, S. R., Freitag, M. and Schmieder, A., 2002, Modeling and Control of Production Systems Based on Nonlinear Dynamics Theory, Annals of the CIRP, 51/1:375-378.
[5] Wiendahl, H.-P. and Scheffczyk, H., 1999, Simulation Based Analysis of Complex Production Systems with Methods of Nonlinear Dynamics, Annals of the CIRP, 48/1:357-360.
[6] Rasmussen, R. D. and Mosekilde, E., 1988, Bifurcations and Chaos in a Generic Management Model, European Journal of Operational Research, 35:80-88.
[7] Feichtinger, G., Hommes, C. H. and Herold, W., 1994, Chaos in a Simple Deterministic Queuing System, ZOR - Mathematical Methods of Operations Research, 40:109-119.
[8] Erramilli, A. and Forys, L. J., 1991, Oscillations and Chaos in a Flow Model of a Switching System, IEEE Journal on Selected Areas in Communications, 9/2:171-178.
[9] Abarbanel, H. D. I., 1996, Analysis of Observed Chaotic Data, Springer-Verlag New York, Inc.