Monday, 17 November 2014

Second Edition of the International Workshop on Computational Antifragility and Antifragile Engineering (ANTIFRAGILE 2015)

Resilience may be described as an intended emergent property resulting from the coupling of a system and its environment(s). Depending on the interactions between these two "ends" and on the quality of the individual behaviours that both the system and its environment(s) may exercise, different strategies may be chosen:

  • elasticity (preserving system identity by masking changes); 
  • entelechism (guaranteeing the identity of the system by tolerating changes);
  • antifragility (adapting both the system and its identity so as to best fit the changing environment; and, while doing so, evolving the "self" and learning how to evolve the adaptation processes). 

The major focus of the ANTIFRAGILE 2015 Workshop is computational and engineering aspects of antifragility, the term recently introduced by Professor N. Taleb in his book "Antifragile: Things that Gain from Disorder". Antifragile computing systems are those resilient systems that are

  • open to their own system-environment fit; 
  • able to exercise complex auto-predictive behaviours; 
  • and that develop wisdom as a result of matches between available strategies and obtained results. 

The engineering of antifragile computer-based systems is a challenge that, once met, would allow systems and ambients to self-evolve and self-improve by learning from accidents and mistakes in a way not dissimilar from that of human beings. Learning how to design and craft antifragile systems is an extraordinary endeavour, and tackling it is likely to reverberate across many fields of computer engineering. New methods, programming languages, even custom platforms will have to be designed. The expected returns are extraordinary as well: antifragile computer engineering promises to enable truly autonomic systems and ambients able to meta-adapt to changing circumstances; to self-adjust to dynamically changing environments and ambients; to self-organize so as to track, dynamically and proactively, optimal strategies to sustain scalability, high performance, and energy efficiency; to personalize their aspects and behaviours to each and every user; and to learn how to get better while doing so.

Building on the very positive response to last year's edition, this second edition of ANTIFRAGILE aims to further raise awareness of the above challenges and to continue the discussion on how computer and software engineering may address them. As a design aspect cutting across all system and communication layers, antifragile engineering calls for multidisciplinary visions and approaches able to bridge the gaps between "distant" research communities so as to

  • propose novel solutions to design and develop antifragile systems and ambients; 
  • devise conceptual models and paradigms for computational antifragility; 
  • provide analytical and simulation models and tools to measure systems' ability to withstand faults, adjust to new environments, and enhance their resilience in the process; 
  • foster the exchange of ideas and lively discussions able to drive future research and development efforts in the area. 

The main topics of the workshop include, but are not limited to:
  • Conceptual frameworks for antifragile systems, ambients, and behaviours; 
  • Dependability, resilience, and antifragile requirements and open issues; 
  • Design principles, models, and techniques for realizing antifragile systems and behaviours; 
  • Frameworks and techniques enabling resilient and antifragile applications;
  • Antifragile human-machine interaction; 
  • End-to-end approaches towards antifragile services; 
  • Autonomic antifragile behaviours; 
  • Middleware architectures and mechanisms for resilience and antifragility; 
  • Theoretical foundation of resilient and antifragile behaviours; 
  • Formal modelling of resilience and antifragility; 
  • Programming language support for resilience and antifragility; 
  • Machine learning as a foundation of resilient and antifragile architectures; 
  • Antifragility and resiliency against malicious attacks; 
  • Antifragility and the Cloud; 
  • Service Level Agreements for Antifragility; 
  • Verification and validation of resilience and antifragility; 
  • Antifragile and resilient services. 

All accepted papers of the previous edition of the workshop are freely available here. A detailed description of two of the papers of the previous edition, together with their presentations, is available at this page. For more information about computational antifragility, please also visit the LinkedIn group "Computational Antifragility" and the G+ Community at this page.

In this second edition, Professor Taleb kindly agreed to give his keynote speech through teleconferencing.

ANTIFRAGILE is co-located with the 6th International Conference on Ambient Systems, Networks and Technologies (ANT-2015), June 2-5, 2015, London, UK. ANTIFRAGILE is likely to take place on the second day of the Conference, June 3 (though this has not been confirmed yet). All ANT-2015 accepted papers (thus including the ANTIFRAGILE 2015 papers) will be published by Elsevier Science in the open-access Procedia Computer Science series on-line. Procedia Computer Science is hosted on the Elsevier content platform ScienceDirect and will be freely available worldwide. All papers in Procedia will be indexed by Scopus and by Thomson Reuters' Conference Proceedings Citation Index. The papers will contain linked references, XML versions, and citable DOI numbers. Authors will be able to provide a hyperlink to their papers to all delegates and conference website visitors. All accepted papers will also be indexed in DBLP. Selected papers will be invited for publication in special issues of international journals.

For more information about ANTIFRAGILE 2015 please visit the ANTIFRAGILE web site. A number of resources and reflections about computational antifragility may be found at the following page and through this presentation.

Monday, 10 November 2014

Some recent papers on elasticity, resilience, and computational antifragility

Some of my most recent papers on elasticity, resilience, and computational antifragility:
  • "Antifragility = Elasticity + Resilience + Machine Learning. Models and Algorithms for Open System Fidelity". In Proc. of the 1st International Workshop "From Dependable to Resilient, from Resilient to Antifragile Ambients and Systems" (ANTIFRAGILE 2014), Hasselt, Belgium, 2-5 June, 2014. Elsevier Science, Procedia Computer Science.
    We introduce a model of the fidelity of open systems—fidelity being interpreted here as the compliance between corresponding figures of interest in two separate but communicating domains. A special case of fidelity is given by real-timeliness and synchrony, in which the figure of interest is the physical and the system’s notion of time. Our model covers two orthogonal aspects of fidelity, the first one focusing on a system’s steady state and the second one capturing that system’s dynamic and behavioural characteristics. We discuss how the two aspects correspond respectively to elasticity and resilience and we highlight each aspect’s qualities and limitations. We then sketch the elements of a new model coupling both of the first model’s aspects and complementing them with machine learning. Finally, a conjecture is put forward that the new model may represent a first step towards compositional criteria for antifragile systems.
  • "On the Behavioral Interpretation of System-Environment Fit and Auto-Resilience". In Proc. of the IEEE 2014 Conference on Norbert Wiener in the 21st Century, Boston, MA, 24-26 June, 2014. IEEE.
    Already 71 years ago Rosenblueth, Wiener, and Bigelow introduced the concept of the “behavioristic study of natural events” and proposed a classification of systems according to the quality of the behaviors they are able to exercise. In this paper we consider the problem of the resilience of a system when deployed in a changing environment, which we tackle by considering the behaviors both the system organs and the environment mutually exercise. We then introduce a partial order and a metric space for those behaviors, and we use them to define a behavioral interpretation of the concept of system-environment fit. Moreover we suggest that behaviors based on the extrapolation of future environmental requirements would allow systems to proactively improve their own system-environment fit and optimally evolve their resilience. Finally we describe how we plan to express a complex optimization strategy in terms of the concepts introduced in this paper.
  • "Preliminary Contributions Towards Auto-Resilience". In A. Gorbenko, A. Romanovsky, V. Kharchenko (Eds). Software Engineering for Resilient Systems - 5th International Workshop, SERENE 2013, Kiev, Ukraine, October 3-4, 2013. Proceedings. LNCS 8166. Springer 2013.
    The variability in the conditions of deployment environments introduces new challenges for the resilience of our computer systems. As a response to said challenges, novel approaches must be devised so that identity robustness be guaranteed autonomously and with minimal overhead. This paper provides the elements of one such approach. First, building on top of previous results, we formulate a metric framework to compare specific aspects of the resilience of systems and environments. Such framework is then put to use by sketching the elements of a handshake mechanism between systems declaring their resilience figures and environments stating their minimal resilience requirements. Despite its simple formulation it is shown how said mechanism enables scenarios in which resilience can be autonomously enhanced, e.g., through forms of social collaboration. This paves the way to future “auto-resilient” systems, namely systems able to reason and revise their own architectures and organisations so as to optimally guarantee identity persistence.
  • "Quality indicators for collective systems resilience", Emergence: Complexity & Organization, ISSN: 1521-3250, Vol. 16, No. 3, September 2014, pp. 65-104.
    Resilience is widely recognized as an important design goal though it is one that seems to escape a general and consensual understanding. Often mixed up with other system attributes; traditionally used with different meanings in as many different disciplines; sought or applied through diverse approaches in various application domains, resilience in fact is a multi-attribute property that implies a number of constitutive abilities. To further complicate the matter, resilience is not an absolute property but rather it is the result of the match between a system, its current condition, and the environment it is set to operate in. In this paper we discuss this problem and provide a definition of resilience as a property measurable as a system-environment fit. This brings to the foreground the dynamic nature of resilience as well as its hard dependence on the context. A major problem becomes then that, being a dynamic figure, resilience cannot be assessed in absolute terms. As a way to partially overcome this obstacle, in this paper we provide a number of indicators of the quality of resilience. Our focus here is that of collective systems, namely those systems resulting from the union of multiple individual parts, sub-systems, or organs. Through several examples of such systems we observe how our indicators provide insight, at least in the cases at hand, on design flaws potentially affecting the efficiency of the resilience strategies. A number of conjectures are finally put forward to associate our indicators with factors affecting the quality of resilience.
  • "On the Constituent Attributes of Software and Organizational Resilience", Interdisciplinary Science Reviews, vol. 38, no. 2, Maney Publishing, June 2013.
    Our societies are increasingly dependent on the services supplied by our computers and their software. Forthcoming new technology is only exacerbating this dependence by increasing the number, the performance, and the degree of autonomy and inter-connectivity of software-empowered computers and cyber-physical “things”, which translates into unprecedented scenarios of interdependence. As a consequence, guaranteeing the persistence-of-identity of individual and collective software systems and software-backed organisations becomes an increasingly important prerequisite towards sustaining the safety, security, and quality of the computer services supporting human societies. Resilience is the term used to refer to the ability of a system to retain its functional and non-functional identity. In the present article we conjecture that a better understanding of resilience may be reached by decomposing it into a number of ancillary constituent properties, the same way as a better insight in system dependability was obtained by breaking it down into safety, availability, reliability, and other sub-properties. Three of the main sub-properties of resilience proposed here refer respectively to the ability to perceive environmental changes; to understand the implications introduced by those changes; and to plan and enact adjustments intended to improve the system-environment fit. A fourth property characterises the way the above abilities manifest themselves in computer systems. The four properties are then analyzed in three families of case studies, each consisting of three software systems that embed different resilience methods. Our major conclusion is that reasoning in terms of our resilience sub-properties may help revealing the characteristics—and in particular the limitations—of classic methods and tools meant to achieve system and organisational resilience. We conclude by suggesting that our method may be a prelude to meta-resilient systems—systems, that is, able to adjust optimally their own resilience with respect to changing environmental conditions.
  • "Community Resilience Engineering: Reflections and Preliminary Contributions". In I. Majzik and M. Vieira (Eds.), Proceedings of SERENE 2014, LNCS 8785, pp. 1-8, 2014
    An important challenge for human societies is that of mastering the complexity of Community Resilience, namely “the sustained ability of a community to utilize available resources to respond to, withstand, and recover from adverse situations”. The above concise definition puts the accent on an important requirement: a community’s ability to make use in an intelligent way of the available resources, both institutional and spontaneous, in order to match the complex evolution of the “significant multi-hazard threats characterizing a crisis”. Failing to address such requirement exposes a community to extensive failures that are known to exacerbate the consequences of natural and human-induced crises. As a consequence, we experience today an urgent need to respond to the challenges of community resilience engineering. This problem, some reflections, and preliminary prototypical contributions constitute the topics of the present article.
    A presentation of this paper is available here.
  • "Systems, Resilience, and Organization: Analogies and Points of Contact with Hierarchy Theory".
    Aim of this paper is to provide preliminary elements for discussion about the implications of the Hierarchy Theory of Evolution on the design and evolution of artificial systems and socio-technical organizations. In order to achieve this goal, a number of analogies are drawn between the System of Leibniz; the socio-technical architecture known as Fractal Social Organization; resilience and related disciplines; and Hierarchy Theory. In so doing we hope to provide elements for reflection and, hopefully, enrich the discussion on the above topics with considerations pertaining to related fields and disciplines, including computer science, management science, cybernetics, social systems, and general systems theory.
  • "Behavior, Organization, Substance: Three Gestalts of General Systems Theory". In Proc. of the IEEE 2014 Conference on Norbert Wiener in the 21st Century, Boston, MA, 24-26 June, 2014. IEEE.
    The term gestalt, when used in the context of general systems theory, assumes the value of “systemic touchstone”, namely a figure of reference useful to categorize the properties or qualities of a set of systems. Typical gestalts used, e.g., in biology, are those based on anatomical or physiological characteristics, which correspond respectively to architectural and organizational design choices in natural and artificial systems. In this paper we discuss three gestalts of general systems theory: behavior, organization, and substance, which refer respectively to the works of Wiener, Boulding, and Leibniz. Our major focus here is the system introduced by the latter. Through a discussion of some of the elements of the Leibnitian System, and by means of several novel interpretations of those elements in terms of today’s computer science, we highlight the debt that contemporary research still has with this Giant among the giant scholars of the past.

Monday, 3 November 2014

A few thoughts on Computational Antifragility

I consider antifragility as one end of a spectrum of behaviors of a system interacting with an "environment". I use quotes because the "environment" is in fact just another system or system-of-systems, also expressing a behavior. The need for antifragility, in my opinion, comes from the "opportunities" that may appear throughout the mutual interaction of these behaviors. A few examples may clarify what I mean.

If the environment E exercises a random behavior (or one that appears to the system S as random, or unintelligible), then S can't use any advanced behavior and must resort to worst-case analysis and a predefined use of redundancy to mask out the negative effects of E's behavior. This is elasticity.

On the other hand, if E exercises purposeful behavior, viz. intelligible behaviors such that a goal may be identified and pursued, then S can match E's behavior with something "more clever". A first thing that S may do is enact a strategy aimed simply at protecting its identity. This is entelechism ("being-at-work" so as to "stay-the-same"). This is a teleological / extrapolatory behavior that ranges from reactivity to proactivity. Once more, what to use depends on E's behaviors: if the behavior of E may be anticipated, then proactivity is a good option, while if the behavior has a reduced "extrapolation horizon" then a better option could be reactivity. An important aspect to highlight is, in my opinion, that entelechism leaves no trace in S. "Genetically" speaking, the interaction with E leaves no trace; the impact on the identity of the system is nought. In other words, if you run S a second time and deploy it in E, S will start from scratch. One could say that entelechism is memoryless.

A different approach is what I call "computational antifragility". Here S makes use of learning techniques that leave a trace in S's identity. Computational antifragility is proactivity with machine learning; it is "being-at-work while improving-the-self".
This corresponds to Professor Taleb's concept of antifragility, applied to the context of computing systems.

The currently missing link is, in my opinion, the ability to reconfigure the system so as to select the resilience strategy best matching the current behavior of E. In other words, a self-resilient (or, as I call it, auto-resilient) approach is required, with an environment-behavior classifier able to tell which options are made viable by E's behavior, a planner selecting the corresponding strategy and its parameters, and a reconfigurator able to re-weave the system according to the selected strategy.
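To make the idea concrete, here is a minimal Python sketch of the auto-resilient loop just described: a classifier labels E's behavior, a planner maps the label onto a resilience strategy, and a reconfigurator applies it. All names, thresholds, and the variance-based classification rule are illustrative assumptions of mine, not an existing implementation.

```python
from statistics import pstdev

def classify_environment(samples):
    """Label the environment's behavior from a window of observations.
    High variance is treated here as 'random', low variance as 'purposeful'
    (an illustrative rule, not a real classifier)."""
    return "random" if pstdev(samples) > 1.0 else "purposeful"

def plan_strategy(behavior):
    """Map the classified behavior onto a resilience strategy."""
    return {
        "random": "elasticity",         # mask changes via static redundancy
        "purposeful": "antifragility",  # adapt, learn, improve
    }[behavior]

def reconfigure(system, strategy):
    """Re-weave the (toy) system according to the selected strategy."""
    system["strategy"] = strategy
    return system

# One iteration of the auto-resilient loop:
system = {"strategy": None}
observed = [0.1, 5.2, -3.8, 7.4]  # erratic readings of E's behavior
system = reconfigure(system, plan_strategy(classify_environment(observed)))
print(system["strategy"])  # -> elasticity
```

In a real system the classifier would of course be a learned model and the reconfigurator an actual re-weaving mechanism; the point of the sketch is only the classify-plan-reconfigure pipeline.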

Creative Commons License
A few thoughts on Computational Antifragility by Vincenzo De Florio is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Permissions beyond the scope of this license may be available at

Monday, 9 June 2014

ANTIFRAGILE 2014! Part 1 (Keynote speech and first presentation)

As you may know already, the first edition of the ANTIFRAGILE workshop took place on June 3 in Hasselt, Belgium. The workshop was a satellite event of the ANT'14 Conference, hosted at the University of Hasselt.

Being a workshop on computational antifragility, it was only natural that the workshop itself had to be... put to the test! In fact we had to tolerate, and learn from, a number of problems, both technical and logistical in nature, including a missing remote controller for operating the LCD projector; no computers being available with the projector; and people being dispatched to the wrong campus as a result of wrong information on the Conference website. In fact, we can proudly say that we managed to compensate for all those inconveniences at the minimal cost of a 30' delay! (Yes, we had considered the possibility of such a delay and used an elasticity strategy to reduce its effects...)

Presentation summaries:
  • Dr. Kenny H. Jones: Presentation, Article
  • Vincenzo De Florio: Presentation, Article

Glad to have passed our ordeal and happy to have earned the right to call ours a truly "antifragile workshop," we began our meeting with the insightful keynote speech of Dr. Kenny H. Jones, from the NASA Langley Research Center (LaRC) in Hampton. Dr. Jones' presentation and paper are freely available for download. Among the many important contributions and lessons learned that Dr. Jones shared with us, I found several of the statements in his abstract particularly fitting for the occasion¹:
"NASA is working to infuse concepts from Complexity Science in to the engineering process. Some [...] problems may be solved by a change in design philosophy. Instead of designing systems to meet known requirements that will always lead to fragile systems at some degree, systems should be designed wherever possible to be antifragile: designing cognitive cyber-physical systems that can learn from their experience, adapt to unforeseen events they face in their environment, and grow stronger in the face of adversity."
Dr. Jones in particular identifies a first "deadly sin" of traditional engineering practice in reductionism, namely the assumption that "Any system, no matter how complicated, can be completely understood if reduced to elemental components". This leads to the fallacy that "By fully understanding the elements, system behavior can be predicted and therefore controlled". While this may be true in some cases, more and more we are confronted with systems that are more than the sum of their parts (which, incidentally, is the theme of a presentation that I recently gave at the 2014 SuperMinds Event!). System behavior in this case is more difficult to capture, predict, and control, as it is the result of complex interactions among the parts and the environment they are set to operate in. We say that in these complex systems the behavior emerges from those interactions. Dr. Jones observed how, despite considerable effort and funding, a non-negligible gap exists between theoretical results and practical solutions. This led to partnerships such as that between NSF and LaRC and to initiatives such as the Inter Agency Working Group -- both of which were actions specifically addressing the above gap. Apart from partnerships, NASA also initiated internal actions specifically meant to address the engineering practice of complex systems. The Complex Aeronautics Systems Team at LaRC is one such activity. The ultimate aim of those initiatives is being able to engineer large-scale complex systems able to deal more effectively with uncertainty; optimally self-manage their action; be less costly and characterized by reduced development times; and be applicable to general and augmented contexts such as the social, the political, and the economic.

It is at this point that Dr. Jones introduces his main observation: a second "deadly sin" of traditional engineering practice, he states, is that currently systems are designed to be fragile in the first place! In fact, traditional systems are the result of design requirements, and those design requirements systematically introduce Achilles' heels into the system: strict dependences on a reference environment that in practice prevent the system from addressing the unexpected. Any violation of the design requirements inherently translates into an assumption failure. In other words, "If the system is stressed beyond the design requirements, it will fail", and systems "are designed to be fragile at some degree"! Antifragile systems engineering is in fact quite the opposite: a novel practice such that the system becomes stronger when stressed; after all, as the famous Latin quote says, it is per aspera (through difficulties) that we get ad astra (to the stars — a primary objective of NASA, by the way!)

In the words of Dr. Jones, "what is needed are new methods producing systems that can adapt functionality and performance to meet the unknown".
Dr. Jones then introduced a non-exhaustive list of very interesting exemplary applications and concluded his speech with a number of statements. His final one constitutes, in my opinion, the major lesson learned and the starting point of our work in computational antifragility:
A change in design philosophy is needed that will produce antifragile systems: systems able to learn to perform in the face of the unexpected and improve performance beyond what was anticipated.
The speech was intertwined with rapid questions and answers and was also attended by some of the organizers of the main Conference, ANT'14.

I had the pleasure and honor to give the second presentation, entitled "Antifragility = Elasticity + Resilience + Machine Learning — Models and Algorithms for Open System Fidelity". Presentation and paper are freely available for download.

The starting point of my discussion is a pair of questions: what is computational antifragility, and why is it different from established disciplines such as dependability, resilience, elasticity, robustness, and safety? My answer is constructed through a number of "moves". Making use of the classic Aristotelian definition, I first focus my attention on resilience, a system's ability to preserve its identity through an active behavior. Again Aristotle is quoted as the Giant who first introduced resilience under the name of entelechy (ἐντελέχεια). But what is identity, and what is behavior? We tackle identity first.

We do this via an example: we consider a Voice-over-IP application and a call between two endpoints, and we observe that the identity of this application is not merely the fact that communication between the two endpoints is possible; the identity is preserved only if the quality of experience throughout the call matches the expectations of the two endpoints! This brings the endpoints "into the resilience loop", so to say. A system is resilient only so long as it is able to adjust its operation to what the two external parties — the users of the system — consider "acceptable"; for instance, if the endpoints are two human beings, this means that the expected quality is that of a conversation between two people talking and listening to each other without any problem.

In practice the experienced quality is a dynamic system, namely one that varies its characteristics with time; and the challenge of resilience is that of being able to compensate for disturbances and keep the experienced quality "not too far away" from the minimal quality expected by the endpoints. We conclude that resilience calls for fidelity, namely quality of representation-and-control between a reference domain and an execution domain. This is in fact an argument brought forward by another great Giant scholar, Leibniz. As anticipated by Leibniz, systems operate in a resource-constrained world and are characterized by different "powers of representation", namely different fidelities. The higher a system's fidelity — the greater its power of representation — the stronger is that system's claim for existence: its resilience! Thus fidelity (both reflective fidelity and control fidelity) between a reference domain and an execution domain represents one of the factors that play a significant role in the emergence of quality and resilience.

A typical example is fidelity in cyberphysical systems. As indicated by their very name, cyberphysical systems base their action on the fidelity between properties in the physical world and corresponding properties in the "cyberworld". This fidelity is, in mathematical terms, an isomorphism, namely a bijective function that preserves concepts and operations. Thus, in the case of the Voice-over-IP example, fidelity should preserve concepts such as delay, jitter, echo, and latency: physical phenomena should correspond to cyberphenomena, and vice versa. In fact a better approach is to talk of fidelities and consider a fidelity isomorphism for each of the n figures that an open system either senses or controls. I use the terms n-open systems and n-open system fidelities to refer to such open systems and their fidelity.
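As a toy illustration of what such a per-figure fidelity check could look like in code (the figure names and the tolerance are assumptions of this sketch, not part of any established API):

```python
def fidelity_ok(physical, cyber, tolerance=0.05):
    """Check one fidelity per sensed figure: each cyberworld value must
    track its physical counterpart within a relative tolerance."""
    return all(
        abs(cyber[name] - value) <= tolerance * abs(value)
        for name, value in physical.items()
    )

# Physical measurements vs. their cyberworld representations:
physical = {"delay_ms": 120.0, "jitter_ms": 8.0}
cyber = {"delay_ms": 118.0, "jitter_ms": 8.2}
print(fidelity_ok(physical, cyber))  # -> True (within 5% on each figure)
```

A true fidelity isomorphism would also require the mapping to preserve operations, not just values; the sketch only captures the per-figure correspondence.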

Fidelity allows us to reason about a system's identity. In order to exemplify this I use the case of systems that are open to the physical dimension of time. Fidelity in this case is an isomorphism between cybertime and physical time. Several fidelity classes are possible, including for instance the following ones:

[RT]0: Perfect fidelity
In this case we have perfect correspondence between wall-clock time and computer-clock time. No drift is possible and the two concepts can always be reliably related to one another.
[RT]1: Strong fidelity
This corresponds to hard real-time systems. Drifts are possible, but they are typically known and bound. The system typically enacts simple forms of behavior (see further on).
[RT]2: Statistically strong fidelity
This corresponds to soft real-time systems. Drift bounds are not fixed but are expressed as averages and standard deviations.
[RT]3: Best-effort fidelity
As a result of quality-vs-costs trade-offs, the quality drifts experienced by the user should be acceptable most of the time and not discourage the user from using the system.
[RT]4: No fidelity
No guarantee is foreseen; drifts are possible, unbound, unchecked, and uncontrolled.
The above classes (or others, defined for instance by differentiating among reference bounds and statistical figures) allow us to provide an operational definition of resilience: Resilience is
Being able to perform one's function
("Being at work")
Staying in the same class!
Identity is violated as soon as the system changes its class and is no longer able to "stay the same".
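A toy encoding of these classes, and of resilience as "staying in the same class", might look as follows; the class names follow the [RT] notation above, while the drift thresholds are purely illustrative assumptions:

```python
from enum import IntEnum

class Fidelity(IntEnum):
    RT0 = 0  # perfect fidelity: no drift
    RT1 = 1  # strong fidelity: known, bounded drift (hard real-time)
    RT2 = 2  # statistically strong fidelity (soft real-time)
    RT3 = 3  # best-effort fidelity
    RT4 = 4  # no fidelity: unbounded, unchecked drift

def classify(drift_ms):
    """Assign a fidelity class from an observed clock drift (toy thresholds)."""
    if drift_ms == 0:
        return Fidelity.RT0
    if drift_ms <= 10:
        return Fidelity.RT1
    if drift_ms <= 50:
        return Fidelity.RT2
    if drift_ms <= 200:
        return Fidelity.RT3
    return Fidelity.RT4

def is_resilient(required, observed_drifts):
    """Operational resilience: every observation stays in the required
    class (or a stronger one) -- the system 'stays the same'."""
    return all(classify(d) <= required for d in observed_drifts)

print(is_resilient(Fidelity.RT2, [3, 12, 40]))  # -> True
print(is_resilient(Fidelity.RT1, [3, 12, 40]))  # -> False: 12 and 40 exceed RT1
```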
This brings the discussion to a second coordinate of resilience, namely behavior. Behavior is interpreted here as any change an entity enacts in order not to lose its system identity, namely to "stay in the same class". As suggested by Rosenblueth, Wiener, and Bigelow, we can distinguish different cases of behavior, including the following ones:
Passive behavior
corresponding to inert systems.
Purposeful behavior
this is the simplest behavior having a purpose, as is the case with, e.g., servo-mechanisms. This is the domain of Elasticity: faults, attacks, and disturbances are masked out by making use of redundancy. Said redundancy is predefined and static, the result of worst-case analyses. So long as the analyses are correct the system is resilient; as soon as this is not the case, the system fails. The resulting systems are inherently fragile: sitting ducks for change!
Teleological and extrapolatory behaviors
are more complex purposeful behaviors of systems whose action is governed by a feedback loop from the goal or from its extrapolated future state. This is the domain of Resilience: here systems are able to "be at work" and respond to changes — to some degree — making use of perception, awareness, and planning.
And finally, auto-predictive behaviors.
This class of behaviors extends the set proposed by Rosenblueth, Wiener and Bigelow and corresponds to systems that plan their resilience by evaluating strategy-environment fits and learning which option best matched which scenario. Evolutionary Game Theory and machine learning are likely to play a significant role in this context.
The final move of my treatise is then made by stating a conjecture: That the domain of auto-predictive behaviors is that of antifragile computing systems. Antifragile systems are thus resilient systems that are open to their own system-environment fit and that are able to develop wisdom as a result of matches between available strategies and obtained results. A general structure to achieve antifragility is also conjectured and introduced: an antifragile computer system should operate as follows:
  • Monitor fidelities;
  • Whenever system identity is not jeopardized:
    • Use computational elasticity strategies;
  • Whenever system identity is jeopardized:
    • Use computational resilience strategies, auto-predictive behaviors, and machine learning to compensate reactively or proactively for the drift; assess strategy-environment fits; and persist lessons learned.
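The loop just outlined can be rendered in a few lines of C. This is only a schematic sketch: the threshold, the stub strategy functions, and the idea of summarizing "identity" as a single fidelity number are my own simplifications, not part of the conjectured structure.

```c
/* Hypothetical stubs standing in for real subsystems; names are mine. */
static void elastic_strategy(void)   { /* mask the drift via redundancy */ }
static void resilient_strategy(void) { /* compensate for the drift, assess
                                          strategy-environment fits, and
                                          persist lessons learned */ }

/* Assumption: identity is jeopardized when fidelity drops below a threshold. */
#define IDENTITY_THRESHOLD 0.5

/* One step of the conjectured loop. Returns the strategy family used:
   0 = computational elasticity (identity not jeopardized),
   1 = computational resilience + learning (identity jeopardized). */
static int antifragile_step(double fidelity) {
    if (fidelity >= IDENTITY_THRESHOLD) {
        elastic_strategy();
        return 0;
    }
    resilient_strategy();
    return 1;
}
```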
Our conclusions are finally stated: by differentiating and characterizing antifragile behaviors from elastic and resilient behaviors, we concluded that computational antifragility is indeed different from other systemic abilities such as elasticity or resilience. A great deal of work is needed to move from ideas and theoretical concepts to an actual antifragile engineering practice for computers and their software; on the other hand, the expected returns are also exceptional, and are mandated by the ever-growing complexity of our systems, services, and societies!


1: Text in blue denotes original contributions by Vincenzo De Florio.

Tuesday, 29 April 2014

Preconditions to Resilience: 1.2 Perception

Frank Zappa once said:
“A mind is like a parachute. It doesn't work if it is not open.”
Paraphrasing Zappa we could say that the same applies to a resilient system: it must be open to "be at work and stay the same". Therefore in my previous post I focused on openness and perception as prerequisites to resilience. There I introduced the three basic services perception is based upon: sensors, qualia, and memory. (As discussed elsewhere, antifragility extends resilience with (machine-) learning capability, therefore what we mentioned in that post also applies to computational antifragility.) In this post I continue the discussion by providing a practical example: a perception service for a well-known and quite widespread programming language, the so-called "C" language.

My discussion will not be a very technical one, and I will do my best to remember that the reader may not be a programmer or an expert in computers at all! A number of computer-specific concepts will be required though, which will now be introduced as gently and as non-technically as possible (at least, as possible to me!)
The reader accustomed to terms such as "programming language", "computer program", or "programming language variables" may skip this part and go immediately to the next one.

To better enjoy this part the reader is suggested to listen to Frank Zappa's "Call Any Vegetable", kindly provided here.

If you want your computer to do things for you, you need to formulate the intended actions in a way that the computer may understand. Though very fast, computers "speak" a very simple language; that language is so simple that it would be impractical and in most cases unreasonable to expect a human being to "speak" the same language as a computer.

People who do are often called nerds or, in some cases, engineers, this second term possibly meaning "persons who talk to engines". (Have you ever seen one such person while s/he calls an engine? Quite moving. Or, at least, the engine often does afterwards.)

As computers are very good at doing very simple things very fast, a first sensible thing to do was to let computers understand more complex actions. Engineers talked to computers and created "interpreters". As a result of this magic, instead of speaking directly to the machine, people now formulate their commands in some special language. Commands are called "programs" and those special languages are called "programming languages".

(By the way, those "interpreters" were programs too. And yes, I'm using the term "interpreter" for the sake of simplicity.)

Once the first programming languages were created, people could translate their commands — for instance, mathematical formulae — into the simple-and-fast "native language" of computers. Not very surprisingly, that language is often called "machine language". Among the first programming languages to be created there was FORTRAN. FORTRAN in fact stands for FORmula TRANslator. Once the trick was found and its positive returns assessed, other nerds/engineers decided to apply it again and again: as a result, we now have programs written in complex programming languages that are automagically translated into programs in other, simpler programming languages. IF each program is correctly translated and ultimately performs the actions that were intended by the user, then the scheme works nicely. Yes, it's a big "IF" there.

In mathematical terms: if each "stage" of the above translation process is an isomorphism (namely a function that preserves in the output the validity of the input operations), and if the composition of the whole chain of stages is also an isomorphism, then the chances are good that the computer will respond to you as you expect it to.

(In fact even vegetables sometimes are known to respond to you.
By the way, computers are not vegetables. Cabbage is a vegetable. Dig? Need Zappa for that.)

Okay, so now we know more or less what a program is and what a programming language is. We just need a few more little ingredients and then we are ready to go with the main course for today — our perception layer. We still need to explain two "little things". One is memory. You might have heard that computers have memories (you know, "my computer has four gigabytes of that!" — "Oh, mine is better, it's got eight" — that sort of stuff). Memory is where data is stored. If you store things somewhere, it's good to be able to remember where you stored 'em, otherwise you'd end up like me and the stuff on my desk. But that's another story.

When people want to remember where things are stored, they use names. "Where did you put all your pencils?" "Oh those ones? They are in the desk drawer". "Desk drawer" should ideally identify in a clear way where I put those pencils. If there are several drawers in my desk I should be more specific: "they are in the third drawer", for instance. The same applies to computers. Computer memories consist of a long array of "drawers", called "words". If we want to specify where something is stored in memory, we must tell its position in the array. That position is called the address of the word. An action for my computer could then be "let me have a look at the content of the memory word at address 123456"; another one could be "write number 10 in the memory word at address 123456".

One of the first things that were introduced in programming languages was a better way to refer to those memory words and their content. The engineerds had an idea: let us create names to label certain areas of memory corresponding to memory words. Better, let us allow the program writers to choose their own names. As an example, if I write
int CALEDONIA_MAHOGANIES_ELBOWS;
what I'm actually telling the computer is:

"Hello mr. computer, please reserve a memory word for me; from now on I will refer to said memory word through the name CALEDONIA_MAHOGANIES_ELBOWS; mind that said memory word will be used to store and retrieve integer numbers (or better, computer representations thereof)."

CALEDONIA_MAHOGANIES_ELBOWS is the name of a variable in a programming language. In this case the programming language is called C and the variable is an integer. The latter means that the variable can be used in any arithmetic (or Boolean) expression that accepts an integer number as an argument; one such expression is for instance CALEDONIA_MAHOGANIES_ELBOWS = 7; which stores in the memory word reserved for CALEDONIA_MAHOGANIES_ELBOWS the representation of integer number "7". Another such expression is, for instance, CALEDONIA_MAHOGANIES_ELBOWS / 2. If the two expressions follow each other in the order of their appearance here, then the second expression returns the representation of integer number "3" (integer division discards the remainder).

We can now proceed to our perception service for the C programming language.

As already mentioned, in my previous post I observed how resilience requires some form of reactive or proactive behavior. In turn, those behaviors call for the system to be "open" — in the sense discussed in my previous post and here: the system must be able to continuously communicate and "interact with other systems outside of" itself. In what follows the system at hand will be a program written in C using a special tool. This tool allows a number of sensors to be interfaced and corresponding qualia to be associated with programming language variables. No special memory services are needed, in that variables are automatically preserved by the hardware.

How does a programming language such as C cope with writing an open system? Not that well, actually. No standard tool in the language and its supporting system provides support for this. How do we best manage this, then? Through what I call reflective variables.

What is a reflective variable? Well, it's a special type of programming language variable. What makes it special is the fact that the value of a reflective variable is not "stable"; rather, it changes dynamically and abruptly. Why? Because a reflective variable is associated with a hardware sensor and stores the values representing the "raw facts" registered by that sensor and converted into the corresponding qualia. Thus if we assume that a reflective variable, called int temperature, is associated with a thermostat, then temperature would automatically change its value so as to reflect the figures measured by the thermostat. As an example, if the thermostat is turned on and measures a temperature of 20°C, a little later reflective variable temperature would be set to integer value "20"; and if at some point the thermostat realizes the temperature has dropped from 20°C to 19°C, then somewhat later temperature would change its value from "20" to "19".

Sensors, reflective variables, and memory provide a C program with a perception service as defined here. This allows a system programmed in C to be "open" — to a certain degree. As an example take a look at the following picture:

The picture shows a program that prints the content of reflective variable int cpu every two seconds. cpu is an integer that varies between 0 and 100. Said number is in fact the quale that represents the percentage of utilization of the CPU. The Windows task manager is also shown to visualize the actual CPU usage over time.
The actual code of this program and some explanations are given in here and here. The code for the system supporting reflective variable cpu is available on demand.
A more complex example is shown in the following picture:
Here we have two reflective variables, int cpu and int mplayer. By using these two reflective variables a program becomes "open" to two context figures: the amount of CPU used (as in the previous example) and the state of an instance of the mplayer video player. As we have already described cpu, we now focus on mplayer: the latter is an integer variable whose qualia identify, e.g., whether an mplayer instance has been launched (code: 4); whether it is currently being slowed down (code: 2); whether the user requested to abort processing (code: 5); and whether the mplayer instance exited (code: 1). The left-hand window shows the mplayer instance while the right-hand window shows our exemplary program. The first highlighted area in the left-hand window shows the text produced by mplayer when it detects that "the system is too slow to play" the current video. The second highlighted area in the left-hand window shows the text produced by mplayer when the user types "^C" and aborts the rendering. In the right-hand window we see cpu growing from 24% to 99 or 100% due to the CPU-intensive rendering task of mplayer. The "Mplayer server:" messages tell when reflective variable mplayer changes its state, as well as its new state value and an explanation of the meaning of the state transition.

Further explanations are given here and here. The code for the system supporting reflective variables cpu and mplayer is available on demand.

In this post and the previous one we discussed perception as a first "ingredient" towards resilient systems. Next, we are going to define and exemplify awareness.

As a final message I'd like to express my gratitude to The Resentment Listener, who is kindly initiating me to the Art, System, and Life of Frank Vincent Zappa. (He's my Zappa guru — though not in the sense of Cosmik Debris, mind! "Now what kind of a guru are you anyway?" 😉)

Creative Commons License
Preconditions to Resilience: 1.2 Perception by Vincenzo De Florio is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Permissions beyond the scope of this license may be available at

Monday, 14 April 2014

Preconditions to Resilience: 1.1 Perception

Three important preconditions to resilience are perception, awareness, and planning. Perception is key because "What we cannot perceive, we cannot react from—hence we cannot adapt to". Awareness (also called apperception) is key in that it "defines how [the perception data] are accrued, put in relation with past perception, and used to create dynamic models of the “self” and of the “world”." Planning is also fundamental for the purpose of guaranteeing resilience, as it means being able to make effective use of the accrued knowledge to plan a reactive or a proactive response to the onset of change.

This post is the first of a few in which we shall discuss the above-mentioned preconditions. We begin here with perception.

We begin by defining the main term of our discussion. What, then, is perception? In what follows we shall refer to perception as an open system's ability to become timely aware of some portion of the context. Underlined words are those that most likely require some explanation:

Open systems
are systems that continuously communicate and “interact with other systems outside of themselves”. Modern electronic devices and cyber-physical systems are typical examples of open systems that more and more are being deployed around us in different shapes and “things”!
Context
is defined by Dey and Abowd as “any information that can be used to characterize the situation of an entity, where an entity can be a person, place, or object. [...] These entities are anything relevant to the interaction between the user and application, including the user and the application.”
Timely aware
puts the accent on the fact that perception of a context change requires performance guarantees. If I become aware of something only when the consequences of the event are beyond my sphere of reaction, then it is too late: if a goalkeeper becomes aware of the ball only when it has already entered the goal, he or she is not doing their job well.

In order to understand perception and related problems I think it is wise to break perception down into three distinct aspects, which I call sensors, qualia, and memory.

Sensors
may be considered as the primary interface with the "physical world". Sensors register certain “raw facts” (for instance luminosity, heat, sounds...) and transmit information to the system’s processing and control units—its “brains”. The amount and quality of the sensors and of the sensory processes have a direct link with the "openness" of a system and ultimately with its resilience. Note also that the sensing processes imply a change of representation and thus an encoding. The overall quality of perception strongly depends also on the quality of this encoding process.
Qualia
(singular: quale) are the system-dependent internal representations of the raw facts registered by the sensors. Also in this case the quality of reactive control -- and thus also the quality of resilience -- strictly depends on the qualia processes. In particular we need to consider the following quality attributes:
  • The fidelity of the representation process. This may be considered as the robustness of an isomorphism between the physical and the cybernetic domain as explained in this paper;
  • The time elapsed between the physical appearance of a raw fact and the production of the corresponding quale (I call this the qualia manifestation latency);
  • The amount of raw facts that may be reliably encoded as qualia per time unit (which I call reflective throughput).
Memory
is the service that persists the qualia. Whatever the quality of the sensors and qualia services, if the system does not retain information there's no chance that it will make good use of it! Thus the quality of the memory services of perception is another important precondition to overall quality and resilience. We may consider, among others, the following two quality attributes:
  • The average probability that a quale q will be available in memory after time t from its last retrieval (retention probability);
  • How quickly the "control layers" can access the qualia (qualia access time).
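For the record-keeping inclined, the quality attributes listed above can be collected in a plain C struct; the field names and units below are my own shorthand for the attributes just discussed, not an established interface.

```c
/* One record summarizing the quality of a perception service. */
struct perception_quality {
    double fidelity;              /* robustness of the physical->cyber mapping */
    double qualia_latency_ms;     /* qualia manifestation latency              */
    double reflective_throughput; /* qualia reliably encoded per time unit     */
    double retention_prob;        /* P(quale still in memory after time t)     */
    double access_time_ms;        /* qualia access time for the control layers */
};
```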
        As a digression — don't you find it "magic", so to say, how sometimes you can find a modern truth hidden in an old, old book? I do! And if you want an example of this, have a look at Dante's Divine Comedy, third book, Canto V:
        Apri la mente a quel ch’io ti paleso
        e fermalvi entro; ché non fa scïenza,
        sanza lo ritenere, avere inteso

        (“Open thy mind to that which I reveal,
        And fix it there within; for 'tis not knowledge,
        The having heard without retaining it.”)

        Ain't it amazing how the above three lines closely correspond to sensors, quale, and memory? Magic, isn't it? ;-)
Okay, so if we want to talk about resilience we need to discuss perception first; and if we want to discuss perception we need to consider in turn the above three aspects. Kind of fractal, if you ask me. Good! What now? Well, now we can build models of perception and try to use them to answer questions such as how good (better, how open) a system is, or which of any two systems is "better" in terms of perception.

As mentioned in another post, resilience is not an absolute figure; you can't tell whether a system is better than another in terms of resilience without considering a reference environment! Well, the same applies to perception. Also in the case of perception, quality is the result of a match with a reference environment.

Let me illustrate this through the following example: suppose we have a system, S, that can perceive four context figures — figures 1, 2, 3, and 4. We shall assume that the perception subservices of S are practically perfect, meaning that none of the above-mentioned quality attributes (qualia manifestation latency, reflective throughput, retention probability, qualia access time, etc.) translate into limiting factors during a given observation period.

Now we take S and place it in a certain environment, say environment E. Let us suppose that five context figures can change in E: the four that are detected by S plus another one — figure 5.

As a result of this deployment step, several changes take place as time goes on. Let us suppose that during a given observation period the following changes occur:

Time segment s1:
Context figures 1 to 4 change their state.
Time segment s2:
Context figure 1 and context figure 4 change their state.
Time segment s3:
Context figure 4 changes its state.
Time segment s4:
Context figures 1 to 4 change their state.
Time segment s5:
All context figures, namely context figures 1 to 5, change their state.
What is depicted above and was just described is clearly the behavior of a dynamic system, thus it is wise to point this out explicitly by writing "E(t)" instead of just "E".

So what happens to S while we move on from s1 to s5? Well, during s1 and s4 we are in a perfect situation: the system perception and the changes enacted by the environment are perfectly matched. In s2 and s3 the situation is still favorable, though no longer optimal: system S is ready to perceive any of the four context figure changes, but changes only affect a subset of those figures. Thus "energy", or attention, is wasted. (Think of an eye that constantly watches something; if we knew that that something would not change its state in the next 5 minutes, we could close the eye and relax for that amount of time! 😄)

But the real problem occurs during s5: then, the environment produces a change that is not detectable by system S. A dreadful example that comes to mind is that of a man in the middle of a minefield. Short of minesweeping sensors, the man would have no way to detect the presence of a land mine, often with devastating consequences.

What can we learn from even so simplistic a model as the one we've just shown?

A couple of things in particular:

  1. First, that the design of the perception system already defines the "shape" of the design for resilience. In fact, if S is static, then it can only be the result of design trade-offs carried out considering a generic environment. A worst-case analysis needs to be carried out to evaluate what worst-possible "range" of environmental conditions system S will be prepared to match. This is clearly an elasticity strategy rather than a resilience one. Apart from a limited and bounded quality, said strategies imply non-negligible development and operating costs and strongly limit the design freedom of the other resilience subsystems — the awareness and planning systems in particular. A better design is therefore that of an S(t) perception system, namely one that is prepared to reconfigure itself so as to "widen" and "narrow" its perception depending on the observed environmental conditions. In the future scenarios of cyber-physical societies depicted, e.g., in our post here, a collective cyber-physical thing S(t) could be dynamically built by selecting cyber-physical sensors and qualia services matching the current requirements.
  2. Secondly, by considering how near or how far the system perception gets to the optimal match with the current environmental conditions, it could be possible to provide the "upper layers" of resilience (namely the awareness and planning subsystems) with an indication of the risk of failures. As an example, if we consider again the above example and the five time segments s1, ..., s5, we could observe that s1 and s4 are those that represent the highest risk of an environment "outwitting" the system design; s2 and especially s3 represent more "relaxed" conditions; while s5 is a condition of perception failure. In this paper I have shown how this may be used to define a quantitative measure of the risk of failures.
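The toy scenario above is easy to mechanize. In the C sketch below, the figures a system can perceive and the figures that change in a time segment are encoded as bitmasks; a tiny function then classifies each segment as a perfect match, a "relaxed" match with wasted attention, or a perception failure. The encoding and the three-way classification are my own illustration of the idea, not the quantitative risk measure of the paper.

```c
/* Figures 1..5 are bits 0..4; system S perceives figures 1-4 only. */
#define S_PERCEIVES 0x0Fu

/* 0 = perfect match (s1, s4): every watched figure changed;
   1 = slack, attention wasted (s2, s3): only a subset changed;
   2 = perception failure (s5): some change is invisible to S.  */
static int match(unsigned perceived, unsigned changed) {
    if (changed & ~perceived) return 2;  /* a change S cannot see */
    if (changed == perceived) return 0;  /* all watched figures changed */
    return 1;                            /* only a subset changed */
}
```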
Next post will be devoted to a particular example: a perception layer for the C programming language.

Creative Commons License
Preconditions to Resilience: 1.1 Perception by Vincenzo De Florio is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Permissions beyond the scope of this license may be available at

Monday, 3 March 2014

Italian translation of my post "Lessons from the Past"

No new post this time, and not for everybody out there ;-) This is in fact to share with you the Italian translation that Giovanni Fanfoni made of my post Lessons from the Past. I was delighted by his interest in my ideas, and even more by his wonderful translation. So here it is, courtesy of Giovanni, to all the Italian-speaking visitors of this blog!


Devo confessare che fino a poco tempo fa non sapevo che l'estinzione dei dinosauri non fosse l'unica né la più importante delle estinzioni di massa alle quali il pianeta Terra ha assistito.

"L'estinzione del Cretaceo-Paleocene (K-Pg) fu un'estinzione di massa che coinvolse i tre quarti delle piante e delle specie animali della Terra (inclusi tutti i dinosauri non aviari) e accadde in un'epoca geologicamente vicina, 66 milioni di anni fa" [da Wikipedia].

Certamente fu una catastrofe ma in effetti non quanto la cosiddetta Grande Moria: "l'estinzione del Permiano-Triassico (P-Tr) che accadde 252,58 milioni di anni fa. Si tratta della più grave estinzione mai capitata sulla Terra, in cui scomparvero il 96% delle specie marine e il 70% di quelle terrestri. Inoltre è l'unica estinzione di massa di insetti che ci sia dato sapere. Perirono circa il 57% di tutte le famiglie e l'83% di tutti i generi" [da Wikipedia].

Insomma, circa 252 milioni di anni fa una catena di eventi provocò una catastrofe che colpì così profondamente l'ecosistema terrestre da far supporre che "occorsero circa 10 milioni di anni perché la Terra riuscisse a porvi rimedio". Tuttavia, alla fine vi riuscì, determinando un cambiamento così importante nella storia naturale che gli scienziati sono stati costretti a separare nettamente un prima e un dopo: l'era Paleozoica (la "vecchia vita") e l'era Mesozoica (la "vita di mezzo").

Tra le numerose e importanti domande sollevate da un evento così catastrofico, alcune mi sembrano particolarmente rilevanti, ovvero:

  • Q1: ci furono delle cause generali dietro all'evento dell'estinzione P-Tr? o, detto altrimenti, ci furono comuni fattori scatenanti tali da provocare un disastro così ampio?
  • Q2: quale fu l'elemento decisivo, cioè la strategia difensiva determinante, che permise alla Terra di sopravvivere nonostante un colpo così duro?
Per cercare di rispondere a queste domande, occorre tenere presente i seguenti fatti:
  • F1: gli scheletri formati con processi di mineralizzazione conferiscono protezione contro i predatori [da Knoll]
  • F2: la formazione di uno scheletro non richiede solo la capacità di fissare i minerali nella matrice ossea, tale operazione deve essere svolta secondo un preciso modello in specifici ambienti biologici [da Knoll]
  • F3: "l'estinzione colpì anzitutto gli organismi che avevano depositato calcio carbonato negli scheletri, e in particolar modo quelli che dipendevano dai livelli di CO2 nell'ambiente per poter produrre lo scheletro" [da Wikipedia]
In altre parole, uno dei numerosi e indipendenti percorsi evolutivi ebbe particolarmente successo (F1) e quindi si diffuse ampiamente; purtroppo, l'adozione di una medesima soluzione sviluppò una forte dipendenza a condizioni ambientali predefinite e costanti (F2); infine, è stata rilevata una correlazione tra i gruppi di specie che adottarono questa soluzione e i gruppi di specie che furono maggiormente colpiti dall'estinzione P-Tr (F3).

Se leggessimo quanto scritto fin qui col gergo informatico della resilienza e della affidabilità nei computer, potremmo dire che:

  • una data soluzione si era ampiamente diffusa (ad esempio, una tecnologia per la memorizzazione, una libreria di oggetti, un linguaggio di programmazione, un sistema operativo o un motore di ricerca)
  • la soluzione aveva introdotto un punto debole: ad esempio, la dipendenza da un presupposto implicito, o un baco connesso a particolari condizioni ambientali, delicate e molto rare
  • tutto ciò precipitò per opera di un comune fattore scatenante, una singolarità di molteplici guasti: uno o più eventi resero evidente il punto debole e colpirono duramente tutti i sistemi che avevano adottato la medesima soluzione.
Un buon esempio in proposito è fornito probabilmente dal cosiddetto Millennium bug.

Cosa si può concludere da questi fatti e da queste analogie?
Le soluzioni che funzionano nelle situazioni più ricorrenti sono quelle che si diffondono più ampiamente.
Purtroppo, questo fatto diminuisce la disparità, ossia la diversità tra le specie. Specie che esternamente appaiano molto diverse l'una dall'altra, ma di fatto condividono una caratteristica comune, un comune schema progettuale.
Ciò significa che non appena le situazioni comuni vengono sostituite da un evento raro quanto dannoso, il Cigno Nero, un'ampia porzione dell'ecosistema viene compromessa. Di fatto, quanto più è rara ed eccezionale la nuova situazione e quanto più è diffusa la caratteristica comune, tanto più ampio è il numero di specie che saranno colpite.

Ora possiamo tentare di rispondere alla domanda Q1: ebbene sì, ci furono fattori scatenanti comuni che in definitiva produssero l'evento dell'estinzione P-Tr avendo aumentato la diffusione delle stesse "ricette evolutive" e avendo preparato così la strada a una grande quantità di insuccessi correlati tra loro.

D'altra parte, la Terra riuscì a sopravvivere alla Grande Moria e altre estinzioni. Perché? La mia ipotesi di risposta alla questione Q2 è che la Natura introduce sistematicamente delle soglie per accertarsi che la disparità tra le forme viventi non sorpassi mai un certo minimo. L'ingrediente fondamentale per riuscirvi è la diversità: non è un caso che la mutazione è un meccanismo intrinseco all'evoluzione genetica. La mutazione (e forse altri meccanismi) fanno sì che, in qualsiasi momento, non tutte le specie condividano gli stessi modelli. A sua volta, ciò garantisce che, in ogni momento, non tutte le specie subiscano lo stesso destino.

È interessante notare che soluzioni simili sono ricercate anche per progettare sistemi informatici. Al fine di diminuire la possibilità di guasti correlati, vengono eseguite molteplici repliche in parallelo o in sequenza. Si chiama progettazione della diversità e spesso si basa su modelli di progettazione come la N-version programming o i Recovery Blocks.

È altresì degno di nota che l'adozione di modelli basati sulla progettazione della diversità porta ad una diminuzione della disparità dei metodi di progettazione (ebbene sì, è una storia infinita).

La principale lezione che dobbiamo imparare da tutto ciò è che la diversità è un componente essenziale della resilienza. Riducete la diversità e ridurrete la possibilità dell'ecosistema di resistere al Cigno Nero quando apparirà (e, dato un tempo sufficiente, siate certi che prima o poi apparirà). Un'elevata diversità significa che un gran numero di sistemi sarà messo alla prova con nuove caratteristiche quando il Grande Evento accadrà. Anche se la maggior parte degli adattamenti tra il sistema e l'ambiente dovesse estinguersi (o guastarsi) comunque alcuni sistemi (per caso, per così dire) avranno le caratteristiche necessarie per sopravvivere al Cigno Nero con danni limitati. Saranno loro ad ereditare la Terra.

Friday, 31 January 2014

A Comedy of Errors!

Shakespeare is a constant source of inspiration. As Aristotle touched and influenced all sides of what is now modern science, so Shakespeare navigated so masterfully through all the meanders and rivulets of the human soul that he has become a touchstone we can hardly fail to refer to in any work of art.

And not just art, actually!

Some years ago I was writing my doctoral thesis on the resilience of distributed software systems. The main "character in the play", so to say, was a programming language for the specification of error recovery tasks. The idea was that, when an error is detected in a software program, this would trigger the execution of a second program meant specifically to deal with the errors found in the first one. The second program was written in the programming language I conceived, designed, and implemented during my doctoral studies, which I called ARIEL. ARIEL code was a series of "guarded actions", namely commands that would only be executed when a condition (the "guard") was verified as being true. The hidden side of ARIEL was the so-called Backbone (BB), a system of agents watching the user program and its own components and gathering information about what the user program was having problems with: exceptions, missed deadlines, crashes — pretty much the whole lot. The general scheme was as follows:

As I said already, programs in ARIEL are given by one or more "guarded actions". An example of such an action is given below:

IF [ FAULTY (TASK{MYTASK}) ]
       RESTART TASK{MYTASK}
FI .
I wrote a translator, called "art", to convert lists of guarded actions like the one above into a new program, equivalent in meaning but better suited for being quickly interpreted by a machine ("art" stands for ARiel Translator, natcherly). The new program was called RCODE ("recovery code"). If something went wrong in the main program (for instance a distributed application including a task called "MYTASK"), and if the BB got wind of the bad news, then the BB would start a new special task: the ARIEL interpreter. This task would execute the ARIEL RCODE and eventually bump into the (RCODE translation of the) above IF statement. Condition "FAULTY TASK {MYTASK}" would be found true and as a result TASK {MYTASK} would be restarted. Simple strategy, innit?
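To make the scheme concrete, here is a minimal sketch, in Python, of how a guarded-action interpreter of this kind might operate. All names here (Backbone, run_recovery, and so on) are illustrative stand-ins of mine, not ARIEL's or the BB's actual interfaces:

```python
# Hypothetical sketch of a guarded-action interpreter: the BB records faults,
# then each guard is evaluated and, when true, its action is executed.

class Backbone:
    """Toy stand-in for the BB: records which tasks were detected as faulty."""
    def __init__(self):
        self.faulty = set()

    def report_fault(self, task):
        self.faulty.add(task)

    def is_faulty(self, task):
        return task in self.faulty

def run_recovery(bb, guarded_actions, log):
    """Scan the guarded actions; execute each action whose guard holds."""
    for guard, action in guarded_actions:
        if guard(bb):
            action(bb, log)

log = []
bb = Backbone()
bb.report_fault("MYTASK")  # the BB got wind of the bad news

# One guarded action, mirroring: IF [ FAULTY (TASK{MYTASK}) ] RESTART TASK{MYTASK} FI.
program = [
    (lambda bb: bb.is_faulty("MYTASK"),
     lambda bb, log: (log.append("RESTART MYTASK"), bb.faulty.discard("MYTASK"))),
]

run_recovery(bb, program, log)
print(log)  # ['RESTART MYTASK']
```

The point of the sketch is merely the separation of concerns: error detection (the BB) and error recovery (the guarded actions) live in two distinct programs.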

ARIEL could do more complex things than that, of course. For instance it could manage groups of tasks and make use of special characters to represent subsets of a group of tasks. As an example, in the following guarded action "STOP TASK@" means "stop all faulty tasks in the current group", in this case group "My_Group", while "SEND {DEGRADE} TASK~" means "send the DEGRADE message to all the non-faulty tasks in the current group":

IF [ FAULTY (GROUP{My_Group}) ]
       STOP TASK@
       SEND {DEGRADE} TASK~
FI .
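The semantics of the two wildcards can be illustrated with a toy resolver. This is a hypothetical sketch of mine (not ARIEL's actual translation), assuming "@" selects the faulty members of the current group and "~" the non-faulty ones:

```python
# Illustrative resolver for the "@" (faulty) and "~" (non-faulty) wildcards.

def resolve(group_members, faulty, wildcard):
    """Return the subset of the current group that a wildcard stands for."""
    if wildcard == "@":          # the faulty tasks in the group
        return [t for t in group_members if t in faulty]
    if wildcard == "~":          # the non-faulty tasks in the group
        return [t for t in group_members if t not in faulty]
    raise ValueError("unknown wildcard: " + wildcard)

my_group = ["T1", "T2", "T3"]
faulty = {"T2"}

print(resolve(my_group, faulty, "@"))   # ['T2']        -> STOP these
print(resolve(my_group, faulty, "~"))   # ['T1', 'T3']  -> SEND {DEGRADE} to these
```

One guarded action can thus act on a whole group at once, without naming individual tasks.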

Okay, so by now you will be asking: "but what does ARIEL have to do with Shakespeare and his Tempest??" Well, the name originally came from the spelling of the letters "R" "E" "L" (“[a:*]-[i:]-[el]”), for REcovery Language. But then, while reading The Tempest, I found this fragment:

ARIEL My master through his art foresees the danger
            That you, his friend, are in; and sends me forth—
            For else his project dies—to keep them living.
(Shakespeare, The Tempest, Act II, Scene I)

I was amazed: it all seemed to fit together so nicely! In Shakespeare's Tempest the spirit Ariel is a magic creature, the invisible assistant of Prospero, the Duke of Milan. All was going a-okay for Prospero and his daughter Miranda, at least until his brother Antonio decided it was time to replace him as the new Duke. Antonio wastes no time and has Prospero and Miranda abandoned on a raft at sea, where they would certainly have died were it not for supplies and some books of magic, courtesy of good-hearted Gonzalo. Thanks to those books ("...through his art…") Prospero calls forth "his familiar spirit Ariel, his chief magical agent". (Amazing, isn't it??) And Ariel starts serving Prospero along several threads of action. In particular, he conjures up a Tempest that brings all the characters onto the same scene. There Prospero can keep an eye (Ariel's, actually) on everybody, good and bad alike. Many "wonders" take place until all wrongs are righted and all evil is banished. All errors are recovered, we could say. Precisely what we expect to have when the ARIEL program stops processing! Magic, isn't it 😉

For more information about this "Comedy of Errors" (and recovery thereof!), you could also have a look at this book! Or ask me your questions here of course 😃

Creative Commons License
A Comedy of Errors! by Vincenzo De Florio is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at

Monday, 6 January 2014

Fractal Social Organizations

In our previous post we discussed the challenges of Community Resilience, and especially those pertaining to response/recovery/reduction, as enunciated in the report by Colten, Kates, and Laska. Through that report we understood that one of the major problems following the event was coordination. Coordination among institutional responders was difficult and inefficient, and even more so between institutional and “informal responders” (non-institutional responders, originating in “households, friends and family, neighborhoods, non-governmental and voluntary organizations, businesses, and industry”, called by the authors “Shadow Responders”). We saw how a major challenge of Community Resilience is that of being able to solve those coordination problems. A possible way to reformulate it is that of conquering the complexity and engineering practical methods to dynamically create a coherent and effective organization-of-organizations as a response to disasters. The major goal of such an organization-of-organizations (OoO) would be that of enabling mutualistic relationships between all the involved social layers and producing mutually satisfactory, self-serving, controllable “social behaviors” enhancing the efficiency and the speed of intervention. In what follows I discuss one such possible OoO that I call the “fractal social organization.”

In what follows I will make use of Professor Knuth’s characters to label paragraphs that contain information intended for nerds like me. All sane people willing to preserve their sanity may skip those paragraphs with no repercussion on the already compromised readability of this text.

Fractal Social Organizations (FSO) are an application of fractal organizations to socio-technical systems. In a nutshell, they are a “special” organization-of-organizations whose building blocks are “special” socio-technical systems called Service-oriented Communities (SoC). I’ll now try and explain what is so special in SoC and FSO.

An SoC is a service-oriented architecture that creates a community of peer-level entities – for instance human beings, cyber-physical things, and organizations thereof. These entities are called members. No predefined classification exists among members. No specific role is defined; for instance there are no predefined clients or servers, service requesters or service providers, care-givers or care-takers, nor responders or assisted ones. Depending on the situation at hand a member may be on the receiving end or on the providing end of a service. Members (may) react to changes. If something occurs in an SoC, some of its members may become active. Being active means being willing to play some role. Service availabilities and service requests, together with events, are semantically annotated and published into a service registry. The service registry reacts to publications by checking whether the active members may play roles that enable some action.

This is in fact like dataflow architectures, in which it is the availability of the inputs that triggers the execution of a function. Members become like the virtual registers in Tomasulo’s algorithm…

This check is done semantically, by discovering whether the published services are compatible with the sought roles. “Costs” may be associated with a member being enrolled. Enrolments may be done in several ways, each aiming at some preferred goal – for instance speed of response, safety, or cost-effectiveness. Furthermore, the optimization goals may focus on the individual member, or have a social dimension, or take both aspects into account.
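The reactive, dataflow-like behaviour of the registry can be sketched in a few lines of Python. This is a minimal toy of my own devising, not the actual SoC implementation: real SoC matching is semantic (ontology-based), while here it is reduced to exact role names, and all member and activity names are hypothetical:

```python
# Toy SoC registry: each publication may complete the "inputs" of a pending
# activity; when every required role can be filled, the activity fires.

class ServiceRegistry:
    def __init__(self):
        self.offers = {}      # role -> list of members offering that role
        self.launched = []    # activities that have fired, with their enrolments

    def publish(self, member, role, pending_activities):
        """A member publishes a service availability; react dataflow-style."""
        self.offers.setdefault(role, []).append(member)
        for name, roles in list(pending_activities.items()):
            enrolment = self.try_enrol(roles)
            if enrolment is not None:
                self.launched.append((name, enrolment))
                del pending_activities[name]

    def try_enrol(self, roles):
        """Return a role->member assignment if all roles can be filled, else None."""
        assignment = {}
        for role in roles:
            candidates = [m for m in self.offers.get(role, [])
                          if m not in assignment.values()]
            if not candidates:
                return None          # a required "input" is still missing
            assignment[role] = candidates[0]
        return assignment

registry = ServiceRegistry()
pending = {"assist-fall": ["first-aid", "transport"]}

registry.publish("neighbour-anna", "first-aid", pending)  # not enough roles yet
registry.publish("volunteer-bart", "transport", pending)  # now the activity fires
print(registry.launched)
```

A real enrolment policy would of course rank the candidates by cost, speed, or social benefit rather than taking the first match, as discussed above.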

A nice thing with the SoC and the above assumptions is that they enable mutualistic relationships. In this paper Sun et al. suggested that two elderly persons requiring assistance could find an adequate and satisfactory response by helping each other – thus without the intervention of carers. An interesting thing not yet done is to experiment with mutualistic relationships among more than two members and with different roles – for instance mutualistic relationships between two service providers (say, an institutional responder and a shadow responder in the response phases of some crisis). (Obviously this would call for agreeing on collaboration protocols, establishing common ontologies to reason on possible triggering events, discussing policies and modes of intervention, and several other aspects; we consider this to be outside the scope of the present discussion.)

In fact one of my future goals is to try and simulate the SoC behaviours to see whether new mutualistic relationships would emerge!

As mentioned before, no predefined role exists in an SoC, though the creation of a new SoC calls for appointing a member with the special role of service coordinator. It is this member that hosts the service registry and performs the semantic processing. The coordinator may be elected, and there could be hot backups also maintaining copies of the service registry, as described here or elsewhere. An SoC may be specialized for different purposes – for instance crisis management or ambient assistance of the elderly and the impaired. More information on an SoC devoted to the latter, called “Mutual Assistance Community”, may be found, e.g., here.

A major aspect of the SoC is given by the assumption of a “flat” society: a cloud of social resources are organized and orchestrated under the control of a central “hub” – the service coordinator. Of course this flat organization introduces several shortcomings.

In particular, scalability and resilience: if the size of the community becomes “too big” the coordinator may be slowed down; and of course, with a single and non-redundant coordinator, one failure may bring the whole community to a halt!

The Fractal Social Organization was created to overcome the above-mentioned shortcomings. The nerdy definition of FSO is as follows:

A Fractal Social Organization is a fractal organization of Service-oriented Communities. A Service-oriented Community is a trivial case of a Fractal Social Organization consisting of a single node.

In practice, if an SoC is allowed to include other SoC as its members, we end up with a distributed hierarchy of SoC, one nested into the other. This is a little like nested directories in a file system, or “matryoshka dolls” (but such that each doll may contain more than a single smaller doll).

This is nothing new of course. Society includes many examples of such “fractal organizations”; “the tri-level system (city, state, federal) of emergency response” in use in the States and mentioned in CARRI report no. 3 is one such example. The added value of the FSO is that it implements a sort of cybernetic sociocracy. Sociocracy teaches us that it is possible to lay a secondary organizational structure over an existing one. The idea is that the elements of a layer (in sociocracy, a “circle”) may decide that a certain matter deserves system-wide attention; if so, they appoint a member as representative of the whole circle. The appointed member then becomes (temporarily) part of an upper circle and can discuss the matter (e.g., propose an alternative way to organize a process or deal with a threat) with the members of that circle. This allows information to flow beyond the boundaries of strict hierarchies; real-life experimentation proved that this considerably enhances an organization’s resilience. Through the sociocratic rules, an organization may tap into the full well of its social energy and create collective forms of intelligence such as those discussed by George Pór here.

The FSO proposes a similar concept. Whenever an event calls for certain roles, the coordinator looks for them by semantically matching the services in its registry. If all roles can be found within the community, the corresponding activity is launched. When certain roles are missing, the coordinator raises an exception to the next upper layer – the next matryoshka doll up, that is to say. The role-shortcoming event is thus propagated to the next level up the hierarchy. This goes on until a suitable candidate member for playing the required role is found or until some threshold is met.
The resulting “responding team” is what I called a social overlay network: a network of collaborating members that is not restricted to a single layer but can span dynamically across multiple layers of the FSO. This new “responding team” is in fact a new ad hoc Service-oriented Community whose objective and lifespan are determined by the originating event.
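The escalation rule lends itself to a compact sketch. The following Python toy is a hypothetical illustration of mine, not the (as yet unformalized) FSO protocols: each community tries to fill the requested roles locally and escalates only the missing ones to the enclosing SoC, one matryoshka doll up, until a team forms or the top is reached. All community and member names are invented:

```python
# Toy FSO escalation: missing roles bubble up the nested-SoC hierarchy.

class SoC:
    def __init__(self, name, offers, parent=None):
        self.name = name
        self.offers = offers          # role -> member, local to this community
        self.parent = parent          # the enclosing SoC, if any

    def form_team(self, roles):
        """Fill as many roles as possible here; escalate the rest upward."""
        team = {r: self.offers[r] for r in roles if r in self.offers}
        missing = [r for r in roles if r not in team]
        if not missing:
            return team               # the ad hoc "responding team" is complete
        if self.parent is None:
            return None               # top reached: no suitable candidates found
        upper = self.parent.form_team(missing)
        if upper is None:
            return None
        team.update(upper)            # members drawn from several FSO layers
        return team

city = SoC("city", {"heavy-rescue": "fire-brigade"})
district = SoC("district", {"shelter": "school-gym"}, parent=city)
block = SoC("block", {"first-aid": "nurse-carla"}, parent=district)

# A team spanning three layers: a social overlay network in miniature.
print(block.form_team(["first-aid", "shelter", "heavy-rescue"]))
```

Note how the resulting team mixes members from the block, district, and city levels: that cross-layer span is exactly what distinguishes the social overlay network from a response confined to one circle.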

Much is yet to be done. The FSO protocols have not even been formalized, and only a partial and static version of the system is currently being implemented. Some results are already available, though: my mathematical model of the activity of a flat service-oriented community shows the emergence of self-similarity, modularity, and a structured addition of complexity, which we conjectured in our previous post to be among the most important “ingredients” towards community resilience. The idea is being used, albeit in a limited form, in the framework of a national project in Flanders. IBM selected FSO as one of the winning projects for their Faculty Awards for 2013. Only the future will tell whether all this will lead to a practical definition of an FSO for community resilience.

Fractal Social Organization for society {0,1,1,1,1,2,2,3,3,3,3,4}. More information and other pictures available here.
Creative Commons License
Fractal Social Organizations by Vincenzo De Florio is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Permissions beyond the scope of this license may be available at mailto:vincenzo.deflorio