So what is a robot? It's not likely what you would think and with a name like Kapow, it better be cool! Think workflow, Integrated Development Environment (IDE), a class, and browser all wrapped up in one. They are the repeatable workhorses for extracting and transforming the source system's content with an eye towards shaping what it should become. As a developer or analyst, you have complete control in shaping the data information architecture.
During our content migration, each source system content type generally meant a robot needed to be created. When I say generally, of course, not all source system content types would become a SharePoint content type. After all, some content should live in lists or libraries and were singleton in nature. If they were not reusable or inheritable, perhaps a robot was not required. In addition, if the content had a small corpus volume (e.g. 100 items) it was easier to manually migrate than invoke the robot creation process. However, in most cases the content required robot creation. That was where Kapow Design Studio (the IDE) came into play. A wonderful client application that included a blank canvass to create steps, iterations, loops, database queries, exception handling, tag finders, tag extractions, even regular expressions, to name a few. This post discusses our process for creating robots and using the Kapow Katalyst framework that often put the cool in the migration process.
Robot Creation - Design Studio
When creating a robot we first would create "Types" to support the data we were going to extract. Think of Types as classes with various properties to hold extracted data. For our purposes we would create an Article Type (Title, Body, Date Published etc.). In addition, we created a ContentLink and Resource Types respectively. These would serve as the containers for links, images, and documents processed from each item of content.
When I mentioned IDE, this is where the construction of each robot began and ended. It was also where you can show the client how their data was being changed in real time. Again very cool. In the IDE we created "Steps" or workflow components that extract data. We began with a Database Step that loaded all of the article page content types that were previously staged from our MasterInventory table. The next step was Load Page which rendered each item of content into the IDE's Browser window. This was actually a full browser representation of the content that allowed us to target specific pieces of content to extract. From there, we createdTag Finders and Tag Extractors that would query the DOM for particular classes or structures that denoted a region or zone. Those values were then put into the Article Page Type (our class) as the workflow went along. We could iterate particular zones of pages such as related links (for each loops), capturing each value, and eventually storing in our ContentLink table. We could also grab all content links from a page's body, iterate them, extract and download resources from file shares or external databases and again saving the values to our local Resource table in our Inventory database.
Our Article Page robot at a high level performed the following events:
- Database Step for selecting all article pages to extract
- Load Page into the browser window
- Tag Finder and Tag Extraction for Article Page pieces such as Title, Breadcrumb, Body, Summary, Contact etc.
- Looping through all Article Page Body content links
- Parallel branches to extract resources or hyperlinks
- Database INSERTS for resources or content links
- Looping through other regions or zones in a page that contained Related Links or page contacts
- Image extractions
- Database Step for saving main article page content into our Inventory database
The IDE was split into five major components:
- Browser View - One of the best parts of the IDE. The content in full view. We worked from header to footer for extractions. It also worked great with one-on-one sessions with the client to display in real time page-per-page extraction.
- Robot View - Think Visio workflow diagram on steroids. This is where the extraction process were built. As stated, it all started with the database, we would then load each page and extract the following information:
- Additional database metadata with each record not found in the browser
- Our workflow was quite large utilizing multiple branches of logic to extract portions of the page. Specifically, a branch could be extracting content links in a page that referenced an external file share or database that we would be bringing into SharePoint document libraries.
- The workflow would include additional database inserts or updates to lookup or reference tables:
- Insert or update a content link
- Insert or update a resource
- Insert or update the content type being extracted (e.g. Article, News)
- Update the status of the item of content after workflow has been completed
- You can step through each step, view the contents extracted, jump forward "N" steps, and even jump back all while creating new rules or tasks.
- Configuration View - This was where we would create the logic behind each step. When clicking a step, we were presented with multiple options. This is the section where you can create rules for each step such as tag finders, regular expressions, set variable data for the types, set actions, and exception handling (to name a few).
- Data View - Robots are under the hood classes that hold properties and data. We created and then associated "Types" with a robot. During each step, this pane gives detailed view into the variables and their respective values. This was especially valuable during debugging sessions.
- Projects View - Think Visual Studio solution explorer, just with robots! Allowed for grouping of extraction, transform, and load robots along with types.
The last component of Design Studio was the Debug View. This worked very much like Visual Studio's debugger where you can hit F5 to debug, step into the robot to test tag finders, branches, set breakpoints, exception handling, go to particular steps, and view variable values. The debug view also allowed for viewing of logs such as exception handling messages you may have put in place or unhandled exceptions that we could analyze further.
Like with any debugging we would start with a small corpus volume, run the robot and build from there. Specifically, we started extracting 10 pieces of content, fix our steps, and repeat. We would then incrementally grow the extracted corpus volume until our logic handled all author permutations. This process is easily adjustable because we controlled the T-SQL in the first database step. Later, we got more elegant and extracted larger and more complicated items per channel.
At the last step of the extraction, our values were inserted or updated to the holding table for the target system content types. For the article content types, our article table comprised the following DDL:
- BodyFinal (Transformed value)
- SourceUrl (Used for joins on the MasterInventory table)
- ObjectKey (Kapow Primary Key for inserts and updates)
- RobotName (Name of the last robot to update or insert the row's value)
- ExecutionId (Kapow run ID)
Need Speed? Enter The Management Console
Running each content type through the IDE was perfectly acceptable but time consuming. For example, extracting 5,000 article pages through the IDE could take 18 hours to complete when run through the entire corpus volume. This was fine while implementing, letting it run overnight, and checking the results the next day. However, we needed a better alternative when operating during content freezes with tight deadlines on content extraction. Thankfully, Kapow had a Management Console that allowed us to multi-thread robots into parallel processes. In our case, we created an input parameter to the robot. This parameter was given a value from the Management Console. We decided to use the five channels we were extracting from Vignette as our input values. So for example if a top level channel was called "Human Resources", we would set that input parameter for the robot and it would only extract items within that channel for a given run.
The Management Console is loaded in your local browser running on (what else?) RoboServer that was installed previously. Below is a depiction of the Management Console that not only speeds up the process, but gives logging, and insight into each robot run. This console cut our processing time for extractions by 70%!
Transformation robots were not nearly as complicated as extractions. But they served an important cause...to clean the content! In our case, deprecated tags were updated using regular expressions in our robot steps. We could even check for malformed HTML and update the content. Because the content had already been extracted and stored in the database (in our case the article table), these robots would load a content type dataset, iterate through and run through our steps to transform the content we are targeting to clean. Lastly, the updated content, was inserted back into their appropriate tables.
For our automated processes, Kapow Katalyst's Design Studio allowed us not only to convert article page, news, calendar events, and other content types from the source system content types to the target system, it allowed us to perform some elegant procedures such as Content Type Cloaking. What is Content Type Cloaking you ask? Good question. Say you have an article page content type from Vignette. It has the extact same structure and is even stored in the article page Vignette table. But the author is using that content type as a news item in the legacy system. Therefore, from a developer perspective it is stored as an article, looks like an article, but its intended use was inferred as a news item when published. Robots allowed for branching and additional logic to catch these interpretations of content types and convert to real-world instances for the target system. In addition, robots allowed the new portal to reduce dependency on external systems and loosely structured file shares by extracting linked resource data and capturing the binary and metadata information into our local inventory database. That data would then be uploaded into SharePoint lists or libraries for information discovery, retention, governance, and compliance purposes. Therefore, proper managed metadata and keywords were then supplied for information discovery purposes.
Of course no migration would be complete without a target information architecture in place. In our agile project we were implementing SharePoint content types, site columns, global and local navigation, web templates, page layouts, workflows, retention policies, web parts, keywords, term sets, and other items that would serve as the containers/governors for our staged content. By no means a trivial task, is often iterative, and must be baked in early during the content analysis step. That is beyond the scope of this post but needless to say, the testing component for the target structures should be completed early, often, and with a clear plan in place.
Sounds Good, But Where's the "L" of the ETL?
Part four of this series will focus on the load process with a ton of PowerShell goodness. Kapow Katalyst is an awesome tool for migration and robots can serve as the "L" in load. I found the extractions and transformations capabilities to be enterprise level and exceeded my expectations. For loads, robots can be created and mimic recorded request and response actions (think Visual Studio Web Tests). But this process can take a very long time, especially with SharePoint modals and multiple document libraries. Therefore, during our early conversations and modeling efforts with Kapow consultants, we found that we could call batch files from robots. Anyone who has developed in SharePoint 2010 knows that PowerShell is the great equalizer, rescuer (or eraser) depending upon your situation. In our case, we decided to combine the two and create PowerShell scripts that can use the native SharePoint API while calling our extraction databases to create the target portal. Therefore, our robots mined, extracted and transformed the data, while PowerShell sculpted and uploaded it into SharePoint. Whether the legacy content type became a page, an item in a document library, needed to be sprinkled with metadata, or a calendar event, we had it covered.
Stay tuned for how we achieved this one-two punch combination!