In my role as a Software Systems Architect,
in positions as both employee and consultant,
I have critiqued and contributed to existing and planned systems.
When doing so I consider a number of different issues,
most of which I was not taught in school.
Although the title of this post refers to software,
I also like to consider the environment in which the software
is developed and used,
so some of the issues discussed below are purely software
while some are related to hardware or other aspects of the
development or operating environments.
There are relationships between some of these issues,
but for the most part they are orthogonal,
so each must be separately considered to ensure sufficient quality
along that dimension.
Strip out the comments after the colons
and you could use this as a checklist to see how your system measures up.
Basics
Most developers know about these issues.
They apply to almost every software system, no matter how small.
- Functional: This is the area most people initially think of when
considering software quality, and is the area most visible to
end users.
- Correct: the system must perform the right operations,
preferably in a way that matches the user's expectations.
The "principle of least surprise" is applicable here.
- Usable: the system should be as easy to use as is practical.
In colloquial terms, it should be user friendly.
- Consistent: the different parts of the system should appear similar
to the user.
There should be a model of the system presented to
the user that is easily understandable and as simple as possible.
Consistency among different parts of the system allows the user
to understand those different parts with less overall to learn.
- Performant: Performance improvements are usually thought of
as quantitative changes, but
a big enough change in performance becomes a qualitative
difference in the system.
As one of my friends says, performance is a feature.
- Fast: the system must be fast enough to satisfy any real-time or
near real-time constraints.
Worst case scenarios must be considered: can the system keep up
under the biggest possible load, and if not, what happens?
- Small: the system must be able to operate within its resource
constraints.
If you can make the system smaller, that usually means cheaper,
faster, and easier to maintain.
- Maintainable: Systems can last for a long time, sometimes far longer
than the original designers might have thought (as happened with
the Y2K problem).
For such systems, the cost of maintenance over the life of the system
can be far more than the original cost of development.
- Modular: the system should be subdivided into separable parts
with well-defined integration points between the parts.
This allows separate replacement or upgrading of just some parts
rather than the whole system.
- Documented: significant aspects of the system design should be
captured in written form to reduce the loss of critical
knowledge when employees leave the project.
- Commented: a specific form of documentation.
Code is easier to write than to read, but is typically
read far more often than it is written.
Appropriate comments simplify the task of reading
and understanding the code.
- Standardized: minimal and standard languages/tools/subsystems
should be required.
Relying on fewer languages, tools, and subsystems (such as databases,
operating systems, and hardware platforms)
simplifies finding the resources needed to maintain the system.
- Versioned (change-controlled): all source code and other system
configuration information should be stored in a
version-controlled repository so
that new versions can easily be created when fixes are required,
and old versions can be recovered when necessary to audit problems
or restore functionality that was unintentionally lost in an upgrade.
- Testable: the system should include automated
unit and system tests to
provide confidence that changes to the system have not broken
anything.
The system should be designed to support effective tests.
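As a sketch of this last point, here is a minimal automated test using
Python's standard unittest module (the add function is a hypothetical
stand-in for real system logic):

    import unittest

    def add(a, b):
        """The code under test (a stand-in for real system logic)."""
        return a + b

    class AddTest(unittest.TestCase):
        def test_add_positive(self):
            self.assertEqual(add(2, 3), 5)

        def test_add_negative(self):
            self.assertEqual(add(-2, -3), -5)

    if __name__ == "__main__":
        unittest.main()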
Hardening
For larger systems or systems with more valuable data,
these capabilities have particular importance.
They are more expensive to implement than the Basics,
so are often omitted from smaller systems.
They can also be very difficult to retrofit to a system that was
not originally designed with them in mind.
- Secure: Systems must be protected against unauthorized use.
- Access control: the first line of defense.
- Physical: anyone who has physical access to the system can
cause problems. Console access is often not secured as
well as remote access, and even if the person can't get into
the software, he can still disconnect or physically
damage the system.
Depending on the size of the system and the value of its
data and services, this could mean
physical locks on machines, locked rooms, or
secure data centers with controlled access, including
photo ID cards, signatures, and biometrics such as
handprints or retina scans.
- Network: the system should include firewalls to limit
access to only the services the system is designed to provide.
For more secure systems, IP address filtering, VPN connections,
or TLS certificate requirements can limit who can get to
the system.
- Authentication: users must be identified. This is typically done
by requiring a username and password.
More secure systems can require biometrics, such as a thumb scan,
or a physical device such as an RFID security card.
Every user should have a separate account to allow tracking
exactly who is logging in to the system.
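A sketch of one piece of this, using only Python's standard library:
store a salted PBKDF2 hash of each password rather than the password
itself, and compare in constant time (parameter values are illustrative):

    import hashlib
    import hmac
    import os

    def hash_password(password, salt=None):
        """Return (salt, digest); store both, never the plaintext."""
        if salt is None:
            salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), salt, 200_000)
        return salt, digest

    def verify_password(password, salt, expected):
        """Constant-time comparison to avoid timing attacks."""
        _, digest = hash_password(password, salt)
        return hmac.compare_digest(digest, expected)

    salt, stored = hash_password("correct horse battery staple")
    assert verify_password("correct horse battery staple", salt, stored)
    assert not verify_password("wrong guess", salt, stored)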
- Authorization: once a user is identified through the authentication
process, the system can load up the privileges assigned to that user.
Each user should have only the privileges necessary to do his job.
Giving everyone complete access to the system increases the
probability that someone will do something he should not do,
whether intentionally or by accident.
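A minimal sketch of deny-by-default privilege checking (the user and
privilege names are hypothetical):

    # Map each user to the privileges needed for his job, nothing more.
    PRIVILEGES = {
        "alice": {"read_reports", "write_reports"},
        "bob": {"read_reports"},
    }

    def authorize(user, privilege):
        """Deny by default: unknown users get an empty privilege set."""
        return privilege in PRIVILEGES.get(user, set())

    assert authorize("bob", "read_reports")
    assert not authorize("bob", "write_reports")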
- Auditing: the system should record all pertinent information to
allow a definitive determination of history for all significant
security events.
Details should include at a minimum
what was done, who did it, and when it happened.
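A sketch of a structured audit record capturing exactly those three
things, using Python's standard logging module (field names are
illustrative):

    import json
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    audit_log = logging.getLogger("audit")

    def audit(action, user, detail=""):
        """Record what was done, who did it, and when it happened."""
        record = {
            "when": datetime.now(timezone.utc).isoformat(),
            "who": user,
            "what": action,
            "detail": detail,
        }
        audit_log.info(json.dumps(record))

    audit("login", "alice")
    audit("delete_account", "admin", detail="account=bob")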
- Internal checks (against bugs and attacks):
for improved security, systems
should run self-checks to ensure that critical parts of the
system continue to run correctly.
Ideally the self-check applications should be able to monitor
themselves (or each other if more than one) to guard against
problems within the self-check programs (including malicious
users attempting to circumvent the self-checks).
- Resource management/limits:
the system should include mechanisms to allow limiting the resources
consumed by each user, including disk space and CPU usage.
In addition to allowing for a more fair use of the system by
preventing one user from hogging resources,
these mechanisms help prevent DOS (denial of service) attacks.
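On Unix-like systems, a process can impose such limits on itself with
Python's standard resource module; a minimal sketch (the limit values
are arbitrary):

    import resource

    # Cap CPU time at 60 seconds and file size at 100 MB for this
    # process and its children; exceeding a hard limit is fatal.
    resource.setrlimit(resource.RLIMIT_CPU, (60, 60))
    resource.setrlimit(resource.RLIMIT_FSIZE,
                       (100_000_000, 100_000_000))

    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    print(f"CPU limit: soft={soft}s hard={hard}s")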
- Robust: Nothing is perfect. A well designed system takes into account
many different kinds of possible failure.
The probability of failure cannot be completely eliminated, but can
be made very small given sufficient resources.
The cost of failure must be compared to the cost of building a system
with a sufficiently low probability of failure.
- Redundant: the system should have no (or minimal)
single points of failure.
Redundancy can be applied at many levels: a single box can have
multiple power supplies, multiple CPUs, multiple disks, and
multiple network connections.
Within one data center there can be multiple boxes and multiple
network routers, with battery backup power.
For maximum redundancy there can be multiple geographically
separated data centers with multiple network routes connecting them.
- Diverse: monocultures are more susceptible to damage.
Just as with biological systems, diversity provides protection
against problems that affect a "species".
Sharing responsibilities among
different operating systems and different applications provides
defense against viruses that attack specific operating systems
and applications, and against bugs in those components.
This aspect of robustness can be very expensive, so is not
often considered.
- Forgiving (fault-tolerant):
the larger the system, the higher the probability that
it will have some problems, including bugs in the software.
The system should tolerate small problems in the parts;
a small problem should remain a small problem,
not be amplified by cascading failures into a larger problem.
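One common containment technique is retrying a transient failure with
backoff instead of letting the error propagate upward; a minimal sketch
(flaky_operation is hypothetical):

    import random
    import time

    def with_retries(operation, attempts=3, base_delay=0.5):
        """Retry transient failures; re-raise after the last attempt."""
        for attempt in range(attempts):
            try:
                return operation()
            except OSError:
                if attempt == attempts - 1:
                    raise
                # Exponential backoff between attempts.
                time.sleep(base_delay * (2 ** attempt))

    def flaky_operation():
        if random.random() < 0.5:
            raise OSError("transient network error")
        return "ok"

    print(with_retries(flaky_operation))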
- Self-correcting:
Self-monitoring can be done at multiple levels, from
memory parity up to application-level system checks.
More sophisticated techniques allow for automatic correction
of errors, such as
Reed-Solomon coding instead of simple memory parity.
Care must be taken to ensure that the probability of failure
or error in the error-correcting step is lower than that of
the parts it is monitoring.
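A sketch of an application-level self-check: verify a stored checksum
before trusting data. This detects corruption but does not repair it;
schemes such as Reed-Solomon add the correction step:

    import hashlib

    def checksum(data):
        return hashlib.sha256(data).hexdigest()

    # When writing, store the data together with its checksum.
    payload = b"critical configuration data"
    stored = (payload, checksum(payload))

    # On each read, verify before use; a mismatch signals corruption.
    data, expected = stored
    if checksum(data) != expected:
        raise RuntimeError("data corruption detected by self-check")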
- Scalable: If you expect usage of the system to grow over time,
your design should allow incremental expansion of the system
and should continue to perform well at the new usage levels.
Scalability should be considered at multiple levels, depending on
how far you need to scale.
When scaling up by adding more identical units,
you also get the benefits of redundancy, because
with more units, the portion that you must set aside purely for
redundancy can be reduced.
Stated another way, effective redundancy can be added to a system
much less expensively when that system is already using
a collection of identical units for scaling purposes.
- Scalable algorithms: algorithms should have appropriate big-O
performance. Attention should be paid to word size limits
to prevent overflow on large data sets.
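For example, replacing a linear scan inside a loop with a hash-based
lookup turns O(n*m) into roughly O(n+m); a Python sketch (Python
integers do not overflow, but in fixed-width languages sums over large
data sets should use a wider type):

    # Slow: membership test on a list is O(n), so the loop is O(n*m).
    def common_slow(a, b):
        return [x for x in a if x in b]

    # Fast: membership test on a set is O(1) on average,
    # so the whole function is roughly O(n+m).
    def common_fast(a, b):
        b_set = set(b)
        return [x for x in a if x in b_set]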
- Resource-tight: even a small memory leak can bring down an
application when there is enough usage.
The system should be tested under load and monitored to ensure
it does not run out of memory, file descriptors, database connections,
or other resources that could leak from poor coding.
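In Python, context managers guarantee release of a resource even on
error paths; a minimal sketch:

    # Leaky: the descriptor stays open until the garbage collector
    # happens to reclaim it, which may be too late under load.
    def leaky(path):
        f = open(path)
        return f.read().upper()

    # Tight: the with-statement closes the file even if an
    # exception is raised while reading.
    def tight(path):
        with open(path) as f:
            return f.read().upper()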
- Parallelizable to multiple threads or processes:
the first step in spreading out to multiple units is to use
multiple threads or processes on one machine.
Areas of concern are shared memory and concurrency: deadlock or
livelock, excessive blocking due to excessive synchronization, and
stale or corrupt data due to insufficient synchronization.
On a multiprocessor machine, be sure you understand the memory model,
and be aware of possible cache coherence problems that can arise
if code is not properly synchronized.
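A sketch of the synchronization concern in Python: unprotected updates
to shared state can be lost, so guard them with a lock:

    import threading

    counter = 0
    lock = threading.Lock()

    def worker(n):
        global counter
        for _ in range(n):
            with lock:  # without the lock, increments can be lost
                counter += 1

    threads = [threading.Thread(target=worker, args=(100_000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # 400000; without the lock the result would vary run to run.
    print(counter)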
- Parallelizable to multiple machines:
when one machine is not enough, scaling up to multiple machines
clustered in one location is the next step.
For this level of parallelization, typical issues of concern
are network bandwidth limits, how to distribute data for improved
throughput, and what to do when one host in the cluster is down.
- Geo-redundant: for the largest applications, having multiple data
centers in geographically separated locations provides redundancy
and disaster security, as well as sometimes providing improved
performance as compared to a single location because of reduced
network delays.
Typical issues are dealing with network latency, managing routing
between data centers, and data replication between data centers
for improved performance and redundancy.
Business
In theory you could build a great system without considering these issues,
but in practice you had better pay attention to them,
or your business will not last very long.
- Affordable: The art of engineering includes balancing costs against
other qualities.
The following costs should be considered:
- Design and development: the labor costs of building the system.
- Hardware and external software: the purchase costs of the system.
- Maintenance: the ongoing costs of repairing and upgrading the system.
- Operational: the data center costs of operation.
- Timely: When a product hits the market can determine whether or not
it succeeds.
When a product is scheduled to hit the market should be factored into
the design, both in terms of how much time is available for development,
and in terms of what the expected environment will be like then.
- Soon enough: taking too long to get to market is a well-known
concern.
- Late enough: the first to market is not always the most successful.
Sometimes it is better to let someone else bear the costs of
opening up the market.
- Available support and parts: a good design will plan on using parts
that will be cost effective when the product is shipping, which is
not necessarily the same as what is cost effective when the product
is being designed.
This requires predicting the future, so can be tricky to get right.
- Operational: If the system is successful, it may be used for a long
time or by many people, amplifying the value of making the system
easy to operate.
- Visibility: the operator should be able to easily verify that
the system is functioning properly.
This includes being able to determine what level of resources
is being used and whether the system is being attacked.
All of the following pieces of information should be
readily available to the operator:
- What version of what component is running where.
- What is happening now.
- What has happened recently.
- Who is accessing now or recently.
- Performance and other resource statistics.
- Usage statistics (e.g. what function is most popular).
- Error and warning conditions.
- Debugging information (e.g. logging).
- Post-mortem info (for bug fixes).
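As one sketch of visibility, a toy status endpoint using only Python's
standard library (the path and fields are illustrative and would be
wired to real counters in practice):

    import json
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    START = time.time()

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            body = json.dumps({
                "version": "1.2.3",  # what is running where
                "uptime_seconds": int(time.time() - START),
                "requests_served": 0,  # wire to real counters
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), StatusHandler).serve_forever()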
- Control: the operator should be able to adjust aspects of
the system as appropriate to maintain its proper operation and
to protect its security.
This includes being able to do the following:
- Start and stop processes, applications or hosts.
- Manage access control, including
adding, removing or modifying
accounts, privileges and resource limits for users and groups.
- Enable and disable features.
- Perform maintenance.
- High availability: for systems that require high availability once
put into production,
in addition to the robustness items listed above
the following operations should be possible:
- Piecemeal upgrades or replacement of components while running.
- Ability to run with mixed versions of components (soft rollout),
particularly on a system with multiple copies
of the same component.
- Granular control: for optimal resource management and flexibility
in managing the business model of usage and payment,
the system should support these capabilities:
- Per-user restrictions, e.g. by time of day, or stopping
service when a prepaid balance runs out.
- Ability to charge customers per transaction or by
other metrics (CPU usage, disk usage, number of calls, etc.).
- Image: you might not be able to judge a book by its cover,
and beauty is only skin deep, but people respond to how things
look, so it is important to remember to maintain consistency
and quality in these areas:
- Branding: consistent look and feel concerning logos,
company colors, slogans, etc.
- Beauty: too often not included in software products.
- Backups: any system with data of any value should be backed up.
This is valid even for small systems, such as your
laptop or home network.
- You should have a backup schedule based on how much you
would be bothered by losing your data.
For a simple system, a regular schedule of full backups plus an
automated incremental backup should suffice and is
generally relatively easy to set up.
For a larger system, you might want a full backup less often
(due to the amount of data) with a layered schedule of weekly
and daily backups.
- Backup media should be validated.
At a minimum, the backup media should be checked to make sure
it contains the data that was intended to be written to it.
For a more complete test, the backup data should be used
to recreate a system in the same manner as would need to be
done following a disaster.
- Copies of backups should be stored off-site for disaster security.
If your backups are stored on-site and the building is
destroyed, those backups won't do you much good.
- Sensitive data on backup media should be encrypted to prevent
its use in the event that media is stolen.
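As a sketch of backup validation, a checksum manifest can be written
when the backup is made and verified before the backup is trusted
(file layout and names are illustrative; a missing file will surface
as an error during verification):

    import hashlib
    import json
    import pathlib

    def sha256_file(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(backup_dir, manifest_path="manifest.json"):
        """Record a checksum for every file in the backup."""
        files = [p for p in pathlib.Path(backup_dir).rglob("*")
                 if p.is_file()]
        manifest = {str(p): sha256_file(p) for p in files}
        pathlib.Path(manifest_path).write_text(
            json.dumps(manifest, indent=2))

    def verify_manifest(manifest_path="manifest.json"):
        """Re-read every file and compare against the manifest."""
        manifest = json.loads(pathlib.Path(manifest_path).read_text())
        bad = [p for p, digest in manifest.items()
               if sha256_file(p) != digest]
        if bad:
            raise RuntimeError(f"backup validation failed: {bad}")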
I am always interested in continuing to learn, so if you think I have
left out anything from my list, please let me know.