For a top 10 American fast food restaurant, the ServiceNow application plays a critical role in capturing service requests across its corporate offices, restaurant locations, and customer platforms. Data collected from ServiceNow is used to remedy issues, optimize operations, and improve customer service company wide. Over time, the restaurant experienced latency and resource limitations with its legacy data pipeline and engaged CapTech to develop a more efficient and scalable solution to manage its growing data load.
From managing supply chain issues to providing customer support and issuing new appliances, a top 10 American fast food restaurant relies on the ServiceNow application to capture service requests across its corporate offices and franchise locations. Data from the ServiceNow API was being loaded into a centralized data lake using an older pipeline that was unable to handle the restaurant’s increasing data load.
The existing pipeline also had siloed processes for running the daily incremental jobs, which loaded any new or updated records into the data sets, and the weekly full data reload jobs, which led to lengthy runtimes that slowed down the application.
With vast amounts of data and an influx of users attempting to use the app concurrently, a more efficient process was needed to minimize the load on the ServiceNow API during business hours, control data access and accuracy, and provide a new scalable solution that could grow alongside the restaurant.
A team of CapTech’s data engineers was engaged to collaborate with the restaurant to redesign its ServiceNow Extract, Transform, Load (ETL) pipeline. The team began by decoupling the extraction steps from the transformation and load steps, which reduced the runtime of the daily incremental job by more than two hours for most tables.
Next, the pipeline was standardized and refactored. Previously, the incremental job extraction happened entirely within Python in the Airflow environment, while the transformation and load steps occurred in an EMR job. Conversely, for the weekly full refreshes, the entire ETL process was happening in batch jobs, one for each table. CapTech’s team standardized the process and refactored the code so both processes used batch jobs for extraction and AWS Glue for transformations and writing to the data lake. The incremental job was also separated to initiate individual jobs at the table level to allow for greater flexibility when handling errors and scheduling, since some of the smaller tables needed to be updated more frequently than once a day.
In the current stage of the project, the team has been migrating the data sets to take advantage of the system’s new “upsert” functionality, which allows records to be overwritten at a primary key level. Previously, the entire table or partition had to be updated at the same time. Once implemented, there will no longer be a need to regularly run the full data refreshes weekly, since deleted records can now be identified and removed.
The refreshed pipeline provided the restaurant with a scalable system that allows data sets to be added more efficiently, and with less impact to application speed.
The new pipeline:
Reduced the job refresh process from more than eight hours to under four hours by utilizing AWS Glue for transformations. For most of the datasets, some with multi-millions of rows, the Glue portion of the pipeline takes less than five minutes.
Eliminated duplicate transformation logic that had previously needed to be translated between Python and Pyspark any time an update was made.
Decreased the amount of memory needed to read and transform the high volume of records by utilizing AWS Glue, resulting in both runtime and cost decreases.
Created a standard form used to add new ServiceNow data to the data lake, streamlining the process and controlling data access and accuracy.
Minimized the load on the ServiceNow API and created a single source of record by conducting an audit of data processes that were regularly accessing the ServiceNow API during business hours and helped them transition to using the data lake tables sourced from the pipeline.