Author Archives: David Teich

TDWI Webinar Review: David Loshin & Liaison on Data Integration

The most recent TDWI webinar had a guest analyst, David Loshin of Knowledge Integrity. The presentation was sponsored by Liaison, and that company’s speaker was Manish Gupta. Given that Liaison is a cloud provider of data integration, it’s no surprise that data integration was the topic.

David Loshin gave a good overview of the basics of data integration as he talked about the growth of data volumes and the time required to manage that flow. He described three main areas to focus upon to get a handle on modern integration issues:

  • Data Curation
  • Data Orchestration
  • Data Monitoring

Data curation is the organization and management of data. While David accurately described the necessity of organizing information for presentation, the one thing in curation that wasn’t touched upon was archiving: the ability to preserve a history of information and make it available for later needs. That’s something the rush to manage data streams is forgetting. Both are important, and the latter isn’t replacing the former.

The most important part of the orchestration Mr. Loshin described was in aligning information with business requirements. How do you ensure the disparate data sources are gathered appropriately to gain actionable insight? That was also addressed in Q&A, when an attendee asked why there was a need to bother merging the two distinct domains of data integration and data management. David quickly pointed out that there was no way not to handle both, as they weren’t really separate domains. Managing data streams, he pointed out, was the great example of how the two concepts must overlap.

Data monitoring has to do with both data in motion, as in identifying real-time exceptions that need handling, and data for compliance, information that’s often more static for regulatory reporting.

The presentation then switched to Manish Gupta, who proceeded to give the standard vendor introduction. It’s necessary, but I felt his was a little too high level for a broader TDWI audience. It’s a good introduction to Liaison, but following Mr. Loshin there should have been more detail on how Liaison addresses the points brought up in the first half of the presentation. Just as in a sales presentation, a team would lead with Mr. Gupta’s information, then the salesperson would discuss the products in more detail.

Both presenters had good things to say, but they didn’t mesh enough, in my view, and you can find out far more talking to each individually or reading their available materials.

Why things have been quiet here

Before I catch up on some interesting presentations, I’m going to go off topic to discuss why things have been so quiet. In a word: Sasquan! The 73rd World Science Fiction Convention (Worldcon) was held here, in beautiful Spokane, WA, USA, Earth. Wednesday through Sunday saw multiple presentations, speeches, panels, autograph sessions and other wonderful events. As a local, I volunteered from setup last Monday to move-out today. I’m tired but overjoyed.

The art show was wonderful, the vendor booth and exhibits packed, and many of the panels were standing room only. We set a record for the most attended Worldcon and had far more first time Worldcon attendees than even the most optimistic planners expected.

Along with move in and move out, I volunteered at the information desk, to keep the autograph lines moving and in many other areas. I’m exhausted but happy.

My favorite big author in the autograph sessions: Vonda McIntyre. She had a long line and stayed past her time to finish signing for all the folks who waited.

I’m not much of an autograph person myself, but as long as I was handling the lines, Joe Haldeman signed my first edition paperback of Forever War, which I bought in a used bookstore the year it came out and which has followed me around. He and his wife were very gracious and it was nice to meet them.

The Hugo Awards had a lot of controversy this year, with a very conservative group of people putting forward a slate they hoped would stop progress. What it ended up doing was causing the largest number of no awards ever in a year. However, the ceremony will more importantly leave the great image of Robert Silverberg telling the story of the 1968 event in Berkeley and then leading everyone in the Hare Krishna chant. That hilariously relieved some of the tension.

The worst note had nothing to do with the convention. Eastern Washington is on fire. Three firefighters have died (as of this writing, and hopefully the total for this season) from the many fires in a very dry summer. There’s been a haze of ash most of the time, but Thursday was terrible, with many folks needing surgical masks to go outside. I hope we get rain soon; my best wishes to the brave firefighters and sympathy to the families of those who died.

Now it’s time to get back to the business blogging, but that was my week.


Webinar review: TDWI on Streaming Data in Real Time, in Memory

The Internet of Things (IOT) is something more and more people are considering. Wednesday’s TDWI webinar topic was “Stream Processing: Streaming Data in Real Time, in Memory,” and the event was sponsored by both SAP and Intel. Nobody from Intel took part in the presentation. Given my other recent post about too many cooks, that’s probably a good thing, but there was never a clear reason expressed for Intel’s sponsorship.

Fern Halper began with an overview of how TDWI sees data streaming progressing. She briefly described streaming as dealing with data while it’s still in motion, as opposed to data in warehouses and other static structures. Ms. Halper then proceeded to discuss the overlap between event processing, complex event processing and stream mining. The issue I had is that she should have spent a bit more time discussing those three terms, as they’re fuzzy to many. Most importantly, what’s the difference between the first two?

The primary difference is that complex event processing deals with data coming from multiple sources. Some of the same steps familiar from ETL are necessary, which is why the in-memory message was important in the presentation: you have to quickly identify, select and merge data from multiple streams, and in-memory processing is the most efficient way to accomplish that.
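As a rough illustration of that merge step, here is a minimal Python sketch of joining events from two sources in memory; the event shapes, names and time window are all invented for illustration, not any vendor’s actual API:

```python
from collections import defaultdict

def merge_streams(events, window=5):
    """Toy complex-event merge: pair events from different sources
    that share a key and arrive within `window` time units."""
    buffer = defaultdict(list)  # key -> [(source, timestamp, payload)]
    merged = []
    for source, ts, key, payload in events:
        # Look for a matching event from a *different* source inside the window
        match = next((e for e in buffer[key]
                      if e[0] != source and abs(e[1] - ts) <= window), None)
        if match:
            merged.append((key, match[2], payload))
            buffer[key].remove(match)
        else:
            buffer[key].append((source, ts, payload))
    return merged

events = [
    ("sensor", 1, "pump-7", {"temp": 98}),
    ("billing", 2, "pump-7", {"cost": 40}),   # pairs with the sensor event
    ("sensor", 3, "pump-9", {"temp": 71}),    # no partner within the window
]
print(merge_streams(events))  # → [('pump-7', {'temp': 98}, {'cost': 40})]
```

The buffer-and-window pattern is why memory matters: unmatched events must be held somewhere fast until their partners arrive or the window expires.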

Ms. Halper presented survey results about the growth of streaming sources. As expected, they show strong growth should continue. I was a bit amused that the survey asked about three categories: real-time event streams, IoT and machine data. While it might make sense to ask about the different terms, as people are using multiple words, they’re really the same thing. The IoT is about connecting things, which in practice means machines. In addition, the main complex events discussed were medical and oil industry monitoring, with data coming from machines.

Jaan Leemet, Sr. VP, Technology, at Tangoe then took over. Tangoe is an SAP customer providing software and services to improve its clients’ IT expense management. Part of that is the ability to track and control network usage of computers, phones and other devices, link that usage to carrier billing and provide better cost control.

A key component of their needs isn’t just stream processing, but stream processing that also works with other, less dynamic data to provide a full solution. That’s why they picked SAP’s Event Stream Processor – not only for the independent functionality but because it also fits in with their SAP ecosystem.

One other decision factor is important to point out, given the message Hadoop and other NoSQL folks like to give. SAP’s solution works in a SQL-like language. SQL is what IT and business analysts know; the smart bet for rapid adoption is to understand that and do what SAP did. Understand the customer, and sales becomes easier. That shouldn’t be a shock, but technologists are often too enamored of themselves to notice.

Neil McGovern, Sr. Director, Marketing, at SAP gave the expected pitch. It was smart of them to have Jaan Leemet go first and it would have been better if Mr. McGovern’s presentation was even shorter so there would have been more time for questions.

Because of the three presenters, there wasn’t time for many questions. One of the few questions for the panel asked whether there was such a thing as too much data. Neil McGovern and Jaan Leemet spent time talking about the technology of handling lots of streaming data, but only in generalities.

Fern Halper turned it around and talked about the business concept of too much data. What data needs to be seen in what timeframe? What’s real-time? Those have different answers depending on the business need. Even with the large volume of real-time data that can be streamed and accessed, we’re talking about clustered servers, often from a cloud partner, and there’s no need to spend more money on infrastructure than necessary.

I would have liked to have heard a far more in-depth discussion about how to look at a business and decide which information truly requires streaming analysis and which doesn’t. For instance, think about a manufacturing floor. You want to quickly analyze any data that might indicate failures that would shut down the process, but the volumes of information that allow analysis of potential process improvements don’t need to be analyzed in the stream. That can be done through analysis of a resultant data store. Yet all the information can be coming across the same IoT feed because it’s a complex process. Firms need to understand their information priorities and not waste time and money analyzing information in a stream for no reason other than that they can.

Email Etiquette 101: Wait for an intro if you don’t want to be reported as a spammer

Last week, I received an email from some unknown company asking me to go to their site and enter my SSN. I’ve never heard of the company. I responded by letting my security software know the email was spam.

Today I received an email from a client. It seems the firm is changing their accounting, payroll, finance firm. While the client sent contact information for their suppliers to the new company, they didn’t think to tell us about it and the new company didn’t think to check before emailing. The email I just received apologized and asked everyone to please go out to the new firm’s site. I clearly wasn’t the only one who properly responded to an unknown company asking for personal information.

People need to think more clearly about their actions in the internet age. All contact with people outside your organization needs to be considered in the same way as marketing — messages that set an image for your company.

Semantics and big data: Thought leadership done right

Dataversity hosted a webinar by Matt Allen, Product Marketing Manager at MarkLogic. Mr. Allen’s purpose was to explain to the audience the basic challenges involved in big data which can be addressed by semantic analysis. He did a good job. Too many people attempting the same thing spend too much time on their own products. Matt didn’t do so. Sure, when he did he had some of the same issues that many in our industry have, of overselling change; but the corporate references were minimal and the first half of the presentation was almost all basic theory and practice.

Semantics and Complexity

On a related tangent, one of the books I’m reading now is Stanley McChrystal’s “Team of Teams.” In it, he and his co-authors point to a distinction between complicated and complex. A manufacturing process can be complicated, it can have lots of steps but have a clearly delineated flow. Complex problems have many-to-many relations which aren’t set and can be very difficult to understand.

That ties clearly into the message put forward by MarkLogic. The massive amount of unstructured data is complex, with free text rather than fields, and it requires ways of understanding potential meaning. The problems in free text include:

  • Different words can define the same thing.
  • The same word can mean different things in different contexts.
  • Depending on the person, different aspects of information are needed for the same object.

One great example that can contain all of those issues was given when Matt talked about the development process. At different steps in the process, from discovery through the development stages to product launch, there’s a lot of complexity in the meanings of terms, not only within development organizations but between them and all the groups in the organization with whom they have to work.

Mr. Allen then moved from discussing that complexity to talking about semantic engines. MarkLogic’s NoSQL engine has a clear market focus on semantic logic, but during this section he did well to minimize the corporate pitch and only talked about triples.

No, not baseball. Triples are a syntactical tool to link subject (person), predicate (operates) and object (machine). By building those relationships, objects can be linked in a less formal and more dynamic manner. MarkLogic’s data organization is based on triples. Matt showed examples of JSON, Turtle and XML representations of triples, very neatly sliding his company’s abilities into the theory presentation – a great example of how to mention your company while giving a thought leadership presentation without being heavy-handed.
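To make the idea concrete, here is a toy sketch of storing and pattern-matching triples, using plain Python tuples rather than MarkLogic’s engine or any real RDF serialization; the subjects and predicates are invented for illustration:

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
triples = {
    ("alice", "operates", "press-4"),
    ("bob", "operates", "press-4"),
    ("press-4", "located_in", "plant-2"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# Who operates press-4?
print(sorted(t[0] for t in query(p="operates", o="press-4")))  # ['alice', 'bob']
```

The point of the pattern-with-wildcards query is exactly the flexibility Matt described: relationships can be traversed in any direction without a predefined schema.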

Semantics, Databases and the Company

The final part of the presentation was about the database structure needed to handle semantic analytics. This is where he overlapped the theory with a stronger corporate pitch.

Without referring to a source, Mr. Allen stated that relational databases (RDBMSs) can only handle 20% of today’s data. While it’s clear that a lot of the new information is better handled in Hadoop and less structured data stores, it’s really a question of performance, and I’d prefer to see the focus there.

Another error often made by folks adopting new technologies was the statement that “Relational databases aren’t solving a lot of today’s problems. That’s why people are moving to other technologies.” No, they’re extending today’s technologies with less structured databases. The RDBMS isn’t going away, as it does have its purpose. The all or nothing message creates a barrier to enterprise adoption.

The final issue is the absolutist view of companies that think they have no competitors. Matt Allen mentioned that MarkLogic is the only enterprise database using triples. That might be literally true, I’m not sure, but so what? First, triples aren’t a new concept, and object-oriented databases have been managing triples for decades to do semantic analysis. Second, I recently blogged about Teradata Aster and that company’s semantic analytics. While they might not use the exact same technology, they’re certainly a competitor.

Summary

Matt Allen did a very good job exposing people to why semantic analysis matters for business and then covered some of the key concepts in the arena. For folks interested in the basics to understand how the concept can help them, watch the replay or talk with folks at MarkLogic.

The only hole in the presentation is that though the high level position setting was done well, the end where MarkLogic was discussed in detail had some of the same problems I’ve seen in other smaller, still technology driven companies.

If Mr. Allen simplifies the corporate message, the necessary addition at the end of the presentation will flow better. However, that doesn’t take away from the fact that the high level overview of semantic analysis was done very well, discussing not only the concepts but also a number of real world examples from different industries to bring those concepts alive for the audience. Well done.

Marketing lesson: How to cram too many vendors into too short a timeframe

I’ll start by being very clear: This is a slam on bad marketing. Do not take this column as a statement that the products have problems, as we didn’t see the products.

Database Trends and Applications magazine/website held a webinar. The first clue that there was something wrong was that an hour-long seminar had three sponsors. In a roundtable forum that could work, and the email mentioned it was a roundtable, but it wasn’t. Three companies, three sequential presentations. No roundtable.

It was titled “The Future of Big Data: Hybrid Architectures and Best-of-Breed”. The presenters were Reiner Kappenberger, Global Product Manager, HP Security Voltage, Emma McGrattan, SVP Engineering, Actian, and Ron Huizenga, ER/Studio Product Manager, Embarcadero. They are three interesting companies, but how would the presentations fit together?

They didn’t.

Each presenter had a few minutes to slam through a pitch, which they did with varying speeds and content. There was nothing tying them into a unified vision or strategy. That they all mentioned big data wasn’t enough and neither was the time allotted to hear significant value from any of them.

I’ll burn through each as the stand-alone presentations they were.

HP Security Voltage

Reiner Kappenberger talked about his company’s acquisition by HP earlier this year and the major renaming from Voltage Security to HP Security Voltage (yes, “major” was used tongue-in-cheek). Humor aside, this is an important acquisition for HP to fill out its portfolio.

Data security is a critical issue. Mr. Kappenberger gave a quick overview of the many levels of security needed, from disk encryption up to authentication management. The main feature focus of Reiner’s allotted time was partial tokenization: being able to encrypt parts of a full data field, for instance disguising the first five digits of a US Social Security number while leaving the last four visible. While he also mentioned tying into Hadoop to track and encrypt data across clusters, time didn’t permit any details. For those using Hadoop for critical data, you need to find out more.
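As a rough sketch of the idea only (HP Security Voltage’s actual approach uses format-preserving encryption and tokenization, not simple character masking), partially hiding an SSN while preserving its format looks like this:

```python
def mask_ssn(ssn, visible=4):
    """Illustrative partial masking: hide all but the last `visible`
    digits of a US SSN, preserving the 3-2-4 punctuation. A real
    partial-tokenization system would encrypt, not just mask."""
    digits = [c for c in ssn if c.isdigit()]
    hidden = len(digits) - visible
    out, i = [], 0
    for c in ssn:
        if c.isdigit():
            out.append("X" if i < hidden else c)
            i += 1
        else:
            out.append(c)  # keep dashes so the field format survives
    return "".join(out)

print(mask_ssn("123-45-6789"))  # XXX-XX-6789
```

Keeping the field format intact is the whole point: downstream systems that validate or display the value keep working even though most of it is protected.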

The case studies presented included a car company’s use of both live, Internet of Things feeds and recall tracking but, again, there just wasn’t enough time.

Actian

The next vendor was Actian, an analytics and business intelligence (BI) player based on Hadoop. Emma McGrattan felt rushed by the time limit and her presentation showed that. It would have been better to slow down and cover a little less. Or, well, more.

For all the verbiage, it was almost all fluff. “Disruption” was in the first couple of sentences. “The best,” “the fastest,” “the most,” and similar unsubstantiated phrases flowed like water. She showed an Actian-built graph with product maturity and Hadoop strength on the two axes and, as if by magic, the only company in the upper right was Actian.

Unlike the presentations before and after hers, Ms. McGrattan’s was a pure sales pitch and did nothing to set a context. My understanding, from other places, is that Actian has a good product that people interested in Hadoop should evaluate, but seeing this presentation was too little said in too little time with too many words.

In Q&A, Emma McGrattan also made what I think is a mistake, one that I’ve heard many BI companies move away from in the last few years. An attendee asked about the biggest concern when transitioning from an EDW to Hadoop. The real response should be that Hadoop doesn’t replace the EDW. Hadoop extends the information architecture; it can even be used to put an EDW on open source, but EDWs and big data analytics typically have two different purposes. EDWs are for clean, trusted data that’s not as volatile, while big data is typically transaction-oriented information that needs to be cleaned, analyzed and aggregated before it’s useful in an EDW. They are two tools in the BI toolbox. Unfortunately, Ms. McGrattan accepted the premise.

Embarcadero

Mr. Huizenga, from Embarcadero, referred to evidence that the amount of data captured in business is doubling every 1.2 years and how the number of related jobs is also exploding. However, where most big data and Hadoop vendors would then talk about their technologies manipulating and analyzing the data, he started with a bigger issue: How do you begin to understand and model the information? After all, schema-on-write still means you need to understand the information enough to create schemas.

That led to a very smooth shift to a discussion of Embarcadero’s approach to modeling. They’ve added native support for Hive and MongoDB, they can detect embedded objects in those schemas, and they can visually translate the Hadoop information into forms that enterprise IT folks are used to seeing, can understand and can add to their overall architecture models.

Big data doesn’t exist in a void; to be successful it must be integrated fully into the enterprise information architecture. For those folks already using ERwin, and those who understand the need to document models, Embarcadero is a tool that should be investigated for the world of Hadoop.

Summary

Three good companies were crammed into a tiny time slot with differing success. The title of the seminar suggested a tie that was stronger than was there. The makings existed for three good webinars, and I wish DBTA had done that. The three firms and the host could have communicated to create an overall message that integrated the three solutions, but they didn’t.

If you didn’t see the presentation, don’t bother. Whichever company interests you, check it out directly. All three are interesting, though it might have been hard to tell from this webinar.

Teradata Aster: NLP for Business Intelligence

Teradata’s recent presentation at the BBBT was very interesting. The focus, no surprise, was on Teradata Aster, but Chris Twogood, VP Products and Services Marketing, and John Thuma, Director of Aster Strategy and Analytics, took a very different approach than was taken a year earlier.

Chris Twogood started the talk with the usual business overview. Specific time was spent on four recent product announcements. The most interesting announcement was about their support for Presto, a SQL-on-Hadoop project. They are the first company to provide commercial support for the open source technology. As Chris pointed out, he counted “13 different SQL-on-Hadoop variants.” Because of the importance of SQL access and the perceived power of Presto, Teradata has committed to strengthening its presence with that offering. SQL is still the language for data access and integrating Hadoop into the rest of the information ecosystem is a necessary move for any company serving any business information market. This helps Teradata present a leadership image.

Discussion then turned to the evolution of data volumes and analytics capabilities. Mr. Twogood has a great vision of that history, but the graphic needs serious work. I won’t copy it because the slide was far too busy. The main point, however, was the link between data volumes and sources and the added capabilities to look at business in a more holistic way. It’s something many people are discussing, but he seems to have a much better handle on it than most others who speak to the point; he just needs to fine-tune the presentation.

Customers and On-Site Search

As most people have seen, much of the new data coming in under the big data rubric is customer data from sources such as the web, call logs and more. Being able to create a more unified view of the customer matters. Chris Twogood wrapped up his presentation by referring to a McKinsey & Co. survey that pointed out, among other things, that studying customer journeys can increase the predictive accuracy of customer satisfaction and churn by 30-40%. Though it also points out that 56% of customer interactions come through multi-channel means, one of the key areas of focus today is the journey through a web site.

With that lead-in, John Thuma took over to talk about Aster and how it can help with on-site search. He began by stating that 25-30% of web site visitors using search leave the site if the wanted result isn’t in the first three items returned, while 75% abandon if the result isn’t on the first page. Therefore it’s important to have searches that understand not only the terms that the prospective customer enters but possible meanings and alternatives. John picked a very simple and clear example: depending on the part of the country, somebody might search on crock pot, slow cooker or pressure cooker, but all should return the same result.
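A minimal sketch of the idea behind that kind of term normalization, with a hypothetical hand-built synonym table; an NLP engine like Aster’s would learn such mappings from data rather than hard-code them:

```python
# Hypothetical synonym table mapping variant terms to one canonical term.
SYNONYMS = {
    "crock pot": "slow cooker",
    "crockpot": "slow cooker",
}

def normalize_query(query):
    """Map regional or variant search terms onto a canonical term
    before the search index is consulted."""
    q = query.lower().strip()
    return SYNONYMS.get(q, q)

for q in ("Crock Pot", "crockpot", "slow cooker"):
    print(normalize_query(q))  # all three print: slow cooker
```

Real semantic search goes much further (word embeddings, context, misspellings), but even this canonicalization step is why two shoppers using different regional words can land on the same product page.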

While Mr. Thuma’s presentation talked about machine learning in general, and did cover some of the other issues, the main focus of that example is Natural Language Processing (NLP). We need to understand more than the syntax of the sentence, but also improve our ability to comprehend semantic meaning. The demonstration showed some wonderful capabilities of Aster in the area of NLP to improve search capabilities.

One feature is what Teradata is calling “apps,” a term that confuses them with mobile apps, a problematic marketing decision. They are full-blown applications that include powerful capabilities, application customization and very nice analytics. Most importantly, John clearly pointed out that Aster is complex and that professional services are almost always required to take full advantage of its capabilities. I think that “app” does a disservice to the capabilities of both Aster and Teradata.

One sidebar about technical folks not really understanding business came from one analyst attending the presentation, who suggested that “in some ways it would be nice to teach the searchers what words are better than others.” No, that’s not customer service. It’s up to the company to understand which words searchers mean and to use NLP to come up with a real result.

A final nit was that the term “self-service” was used while also talking about the requirement for both professional services from Teradata and a need for a mythical data scientist. You can’t, as they claimed, use Aster to avoid the standard delays from IT for new reports when the application process is very complex. Yes, afterwards you can use some of the apps as you would a visualization tool, which allows the business user to do basic investigation on her own, but that’s a very limited view of self-service.

I’m sure that Teradata Aster will evolve more towards self-service as it advances, but right now it’s a powerful tool that does a very interesting job while still requiring heavy IT involvement. That doesn’t make it bad, it just means that the technology still needs to evolve.

Summary

I studied NLP almost 30 years ago, when working with expert systems. Both hardware and software have, thankfully, moved a great distance forward since those days. The ability to leverage NLP to more quickly and accurately understand the market, improve customer acquisition and retention ROI, and better run the business is a wonderful thing.

The presentation was powerful and clear, and Teradata Aster provides some great benefits. It is still early in its lifecycle and, if the company continues on the current course, will only get better. They have only a few customers for the on-site optimization use, none referenceable in the demo, but there is a clear ROI message building. Mid- to large-size enterprises looking to optimize their customer understanding, whether for on-site search or other modern business intelligence uses, should talk to Teradata and see if Aster fits their needs.

Review: Looker Webinar on Embedding BI

Looker held a webinar today. I recently blogged about their presentation to the BBBT community, but it’s an interesting company, so it was worth another visit. The company is a business intelligence (BI) firm. With Colin Zima and Zach Taylor presenting, the webinar stayed at a much higher level than the previous presentation and was aimed at a business audience rather than analysts. It is always good to see a different view of things.

The focus of their presentation was why it’s good to embed BI in other applications as opposed to using standalone BI tools. It’s a good message but needs to be strengthened. Colin and Zach quickly mentioned embedding as if everyone understood it, then dove into the issues in evaluating the build v buy decision. They should have spent a couple of minutes explaining what they mean by embedding and which kinds of applications they focus on embedding into.

Their build v buy discussion was standard and hit all the right points about letting companies focus on their own competencies and leverage the BI industry’s competencies for analysis. Where embedding and build v buy really blend, and where they could have hit harder, is the difference in ROI between embedding and having a separate BI visualization tool.

They did have a couple of case studies that were interesting. Ibotta is a company providing analytics to its consumer packaged goods clients. That’s a great application and a powerful use of BI in a business network, but I didn’t see much on what it was embedded into or how. That meant it didn’t fit into the overall scheme of the presentation.

The other key one was HubSpot using Looker to provide analytics on sales performance. That’s done by embedding the analytics directly into the normal Salesforce.com windows the sales team sees every day. That’s a powerful message and one that I felt deserved a bit more time.

The only questionable message I heard was during Q&A, when somebody asked about performance issues. As in the previous presentation, they talked about using the source data and not replicating it for BI. They therefore said they didn’t have performance issues when scaling users; any such issue was one for the databases. Well, that’s not quite true.

It’s not likely that all a company’s various data sources have been built to scale to lots of users. Companies will still use ODSs, data warehouses and other methods that replicate data and create multiple versions of the truth, which require strong compliance to control. Companies will still have to spend time analyzing and preparing appropriate data sources that can handle large numbers of concurrent users. The advantage of Looker is not that it removes that work; whatever is done to get good performance for Looker isn’t unique to it and can serve other applications as well.

Looker is that rare young company that seems to not only have a good early generation product, but understands how to market their product to multiple audiences. As someone focused on software marketing, I think that’s great.