In my experience, as a professional software developer you are far more likely to extend an existing system than you are to build something from scratch. Unless, that is, you're amazing at engineering situations for yourself where you never have to build on top of existing code.
You tricky developer you.
Assuming that you’re not always as tricky as you want to be, when you build on top of an existing system, I have found the following quote to be quite helpful:
First, refactor to make the change easy (warning: this might be hard),
then make the easy change.
The internet says that this particular piece of wisdom originally came from Kent Beck, apparently via a tweet rather than his book on TDD, which seems pretty likely.
Regardless of where it came from though, I’m going to use this post to explore a case study where I applied the quote directly while I was extending a piece of code that I didn’t originally write.
Minimum Viable Product
To set the stage, we have a small application that extracts data from a database and pushes it to an S3 bucket, where it is consumed by some third parties. It runs quietly in the background on a regular schedule thanks to TeamCity.
One of the consuming third parties would prefer that the data was delivered via SFTP though, to allow them to trigger some process automatically when the files arrive, rather than having to poll the S3 bucket for new files on their own schedule.
A completely understandable and relatable desire.
In the same vein, when we run the automated process through TeamCity, it would be nice if the files generated were attached to the execution as Build Artifacts, so that we could easily look at them later without having to go into the S3 bucket.
Obviously, neither of these two requirements existed when the application was originally written, but now there need to be multiple ways to export the data for each run. Perhaps we don't always want to export to all of the destinations either; maybe we want to do a test run that only exports to TeamCity, for example.
In my current role I don’t typically write as much code as I used to, but everyone else was engaged in their own priorities, so it was time for me to step up, extend a system I did not write and flex my quickly atrophying technical muscles.
Structural Integrity Lacking
The application in question is not particularly complicated.
There are fewer than 20 classes in total, and they are generally pretty focused. For example, there is a class that offers the capability to write a file to S3 (abstracting away some of the complexity inherent in the AWS supplied interfaces), another for building queries, another for executing a query against a PostgreSQL database and streaming the results to a CSV file, and finally a class that orchestrates the whole "get data, upload data" process for all of the queries that we're interested in.
That last class (the aptly named CsvS3Export) is where we need to start refactoring.
public class CsvS3Export : IS3Export
{
    private readonly IClock _clock;
    private readonly IFileNamer _fileNamer;
    private readonly ITableToCsvExporter _tableExporter;
    private readonly IS3FileUploader _s3Uploader;
    private readonly ITableExportSpecificationResolver _tableExportSpecificationResolver;
    private readonly AmazonS3Client _s3Client;
    private readonly IReporter _reporter;

    public CsvS3Export(
        IClock clock,
        IFileNamer fileNamer,
        ITableToCsvExporter tableExporter,
        IS3FileUploader s3Uploader,
        ITableExportSpecificationResolver tableExportSpecificationResolver,
        AmazonS3Client s3Client,
        IReporter reporter
    )
    {
        _clock = clock;
        _fileNamer = fileNamer;
        _tableExporter = tableExporter;
        _s3Uploader = s3Uploader;
        _tableExportSpecificationResolver = tableExportSpecificationResolver;
        _s3Client = s3Client;
        _reporter = reporter;
    }

    public S3ExportsSummary ExportTables(IList<string> tableNames, string workingDirectory, string s3BucketName)
    {
        try
        {
            var runTimestamp = _clock.UtcNow;
            var summaries = tableNames
                .AsParallel()
                .Select((tn, i) => ExportTable(workingDirectory, s3BucketName, tn, runTimestamp));
            return new S3ExportsSummary(summaries);
        }
        catch (Exception e)
        {
            _reporter.ReportInfo($"Error running export \n\n {e}");
            return new S3ExportsSummary(e, tableNames);
        }
    }

    private S3TableExportSummary ExportTable(string workingDirectory, string s3BucketName, string tn, DateTimeOffset runTimestamp)
    {
        try
        {
            var fileName = _fileNamer.GetFileName(runTimestamp, tn);
            var spec = _tableExportSpecificationResolver.Resolve(tn);
            var localOutputFile = new FileInfo(workingDirectory + "\\" + fileName);
            var exportResult = _tableExporter.Export(spec, localOutputFile);
            var s3Location = new S3FileLocation(s3BucketName, fileName.Replace("\\", "/"));
            var uploadResult = _s3Uploader.Upload(exportResult.FileLocation, s3Location);
            return new S3TableExportSummary(spec.TableName, localOutputFile, uploadResult);
        }
        catch (Exception e)
        {
            _reporter.ReportInfo($"Error running export for table {tn}\n\n {e}");
            return new S3TableExportSummary(tn, e);
        }
    }
}
As you can tell from the source above, the code is very focused around a specific type of file (a CSV) and a specific export destination (S3).
We don’t really care about the file generation part (we have no interest in generating non-CSV files for now), but if we want to start adding new destinations, we really should break this class apart into its constituent parts.
Remember, first make the change easy (refactor the existing code to allow for new destinations), then make the easy change (create the new destinations and slot them in).
Export Tariffs
Rather than coupling the class directly with the S3 functionality, all we have to do is extract a simple IExporter interface and then take a list of exporters as a constructor dependency.
The CsvS3Export then becomes more of an orchestrator, calling out to the class that does the actual data extraction with a target file and then iterating (safely) through the exporters, with the intent to report on the success or failure of any of the components involved.
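The interface itself is about as simple as it gets. Inferring from how the orchestrator below consumes it, a minimal sketch looks something like this (Description and Export are the only members the orchestrator actually touches, so anything beyond that is an assumption on my part):

public interface IExporter
{
    // A human readable name for the destination (S3, SFTP, TeamCity, etc),
    // used when reporting on the outcome of each export.
    string Description { get; }

    // Pushes an already-generated file to the destination, returning a
    // summary of what happened.
    ExportResult Export(string tableName, FileInfo file);
}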
public class DefaultExportOrchestrator : IExportOrchestrator
{
    public DefaultExportOrchestrator(IClock clock, ITableExportSpecificationResolver resolver, IFileNamer namer, ITableToCsvExtractor extractor, List<IExporter> exporters)
    {
        _clock = clock;
        _resolver = resolver;
        _namer = namer;
        _extractor = extractor;
        _exporters = exporters;
    }

    private readonly IClock _clock;
    private readonly ITableExportSpecificationResolver _resolver;
    private readonly IFileNamer _namer;
    private readonly ITableToCsvExtractor _extractor;
    private readonly List<IExporter> _exporters;

    public ExportOrchestrationResult Export(IList<string> tableNames, string workingDirectory)
    {
        if (!Directory.Exists(workingDirectory))
        {
            return new ExportOrchestrationResult(new DirectoryNotFoundException($"The supplied working directory [{workingDirectory}] did not exist"));
        }

        var now = _clock.UtcNow;
        var extracted = tableNames
            .AsParallel()
            .Select(a => ExtractTable(a, workingDirectory, now));
        return new ExportOrchestrationResult(extracted.ToList());
    }

    private TableExportOrchestrationResult ExtractTable(string table, string workingDirectory, DateTimeOffset runTimestamp)
    {
        var fileName = _namer.GetFileName(runTimestamp, table, "csv");
        var spec = _resolver.Resolve(table);
        var localOutputFile = new FileInfo(Path.Combine(workingDirectory, fileName.Combined));
        var result = _extractor.Extract(spec, localOutputFile);
        if (!result.Successful)
        {
            return new TableExportOrchestrationResult(table, result);
        }

        var exported = _exporters.AsParallel().Select(a => SafeExport(table, result.FileLocation, a)).ToList();
        return new TableExportOrchestrationResult(table, result, exported);
    }

    private ExportResult SafeExport(string tableName, FileInfo file, IExporter exporter)
    {
        try
        {
            return exporter.Export(tableName, file);
        }
        catch (Exception ex)
        {
            return new ExportResult(exporter.Description, tableName, file, false, string.Empty, ex);
        }
    }
}
The important part of this process is that I did not add any new exporters until I was absolutely sure that the current functionality of the application was maintained (via tests and whatnot).
Only once the refactor was “complete” did I add the new functionality, and test it independently and in isolation, safe in the knowledge that I was extending a system designed to be extended.
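To give a taste of just how easy the easy change was, here is a minimal sketch of what the TeamCity destination might look like. The class name is illustrative rather than what actually shipped, and the ExportResult constructor is assumed to mirror the failure path in SafeExport above:

public class TeamCityArtifactExporter : IExporter
{
    public string Description => "TeamCity build artifacts";

    public ExportResult Export(string tableName, FileInfo file)
    {
        // TeamCity watches standard output for service messages; this one
        // asks the agent to attach the file to the current execution as a
        // build artifact. Paths containing special characters would need
        // TeamCity's escaping rules applied first.
        Console.WriteLine($"##teamcity[publishArtifacts '{file.FullName}']");

        // Constructor shape assumed from the failure case in SafeExport above.
        return new ExportResult(Description, tableName, file, true, string.Empty, null);
    }
}

The SFTP destination follows exactly the same shape, and slotting either one in is just a matter of adding it to the list of exporters handed to the DefaultExportOrchestrator. A test run that only exports to TeamCity then becomes nothing more than a shorter list.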
From a Git point of view, the refactor itself was a single commit, and the addition of each new exporter was another commit.
Because a clean Git history is a beautiful thing.
Conclusion
Looking back, you might think that the initial structure of the code was wrong, and it should have been built this way in the first place.
I disagree with that, because it did exactly what it needed to do before we went and changed the requirements. It would be a fool's errand to try and anticipate all of the future changes that might need to be made and accommodate them ahead of time, not to mention incredibly expensive and wasteful, so I really can't recommend going down that path.
The most important thing is to ensure that we consistently follow good engineering practices, so that later on when we need to make some changes, we don’t have to struggle through terribly structured code that doesn’t have any tests.
To tie it all back to the quote that started this post, spending the time to refactor the code to be more accepting of the pattern that I wanted to use made the addition of the new export destinations much easier, while still following good software engineering practices.
I could probably have jammed some naïve code that exported to the other destinations right into the original CsvS3Export class, but that would have just made the code much more confusing for anyone coming by at a later date.
There’s already enough terrible code in the world, why add to it?