Data Analytics by Learning and Exploration

Status

To date, the DALE framework has focused on workflows for text analytics tasks such as document classification, document clustering, and topic modeling. These workflows are composed of workflow fragments that pre-process text, prepare the data, and set up the learning task.

Workflows and Workflow Fragments for Text Analytics

DALE contains workflow fragments that are common across text analytics tasks and are reused across workflows. They are composed of common workflow components, each of which can have several implementations. For example, term weighting can use a Chi Squared, a Mutual Information, or an Information Gain method. The user can choose one of these methods for that step; otherwise the system makes the selection automatically. We have also defined workflow fragments for viewing certain types of data.
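As an illustration of one interchangeable implementation of the term weighting step, here is a minimal sketch of the standard chi-squared statistic for a term and a class, computed from a 2x2 contingency table of document counts. The function names, the method registry, and the choice of chi-squared as the fallback are assumptions made for this example; they are not DALE's components or its automatic selection logic.

```python
def chi_squared(term, label, docs):
    """Chi-squared score of `term` for class `label`.

    `docs` is a list of (set_of_terms, class_label) pairs.
    """
    a = b = c = d = 0
    for terms, doc_label in docs:
        has_term = term in terms
        in_class = doc_label == label
        if has_term and in_class:
            a += 1      # in class, contains term
        elif has_term:
            b += 1      # other class, contains term
        elif in_class:
            c += 1      # in class, lacks term
        else:
            d += 1      # other class, lacks term
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0      # term or class never (or always) occurs
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical registry: other methods (mutual information, information
# gain) would be added here under their own keys.
TERM_WEIGHTING_METHODS = {"chi_squared": chi_squared}
DEFAULT_METHOD = "chi_squared"  # assumed fallback when the user picks none


def weight_term(term, label, docs, method=None):
    """Use the user's chosen method, or fall back to the default."""
    chosen = method or DEFAULT_METHOD
    return TERM_WEIGHTING_METHODS[chosen](term, label, docs)


docs = [
    ({"ball", "game"}, "sports"),
    ({"ball", "team"}, "sports"),
    ({"vote", "law"}, "politics"),
    ({"vote", "game"}, "politics"),
]
print(weight_term("ball", "sports", docs))  # 4.0: "ball" separates the classes
print(weight_term("game", "sports", docs))  # 0.0: "game" is evenly spread
```

A term that appears only in one class gets a high score, while a term spread evenly across classes scores zero, which is why such scores are useful for selecting informative terms.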

These predefined workflow fragments make text analytics expertise readily available to new users. Workflow fragments can be executed independently of each other, so users can run them to improve their understanding of the behavior of those steps. A good starting point for novices, however, is to use larger end-to-end workflows that are defined using the workflow fragments. DALE has several predefined workflows for document classification, document clustering, and topic modeling.

Workflow Components and Datasets

The workflows are currently composed of more than 50 workflow components that we built using popular machine learning and text processing packages, including Weka, Cluto, and Mallet, among others. These packages have very heterogeneous implementations, but the components encapsulate the software behind interfaces described with the workflow system's data types, which makes them reusable in different workflows.

The system also includes several datasets that are widely used in the text analytics community (WebKB, Reuters, 20 Newsgroups). These datasets allow end users to experiment with the workflows and learn how to use them with their own data.

Beyond these 50-plus components, DALE includes components based on widely used MATLAB and R libraries (for example, for sampling and visualizing datasets), as well as social network analysis algorithms and visualizations.

Assisting End Users

DALE assists the end user in setting up and executing workflows. DALE is built on top of the Wings workflow system, which provides different kinds of assistance and automation during workflow creation. It has a graphical workflow template editor that assists the user by enforcing the constraints specified for the workflow components. It also has facilities for tracking execution progress, viewing execution results, and generating provenance. As users select and configure workflows to be executed, Wings ensures that workflows are correctly composed by checking that the data types and other constraints on the inputs and outputs are consistent with the workflow. For example, multi-labeled data cannot be used for correlation scoring (only single-labeled data can be used), so a user who tries to run that workflow on multi-labeled data is alerted. All the intermediate and final data products of workflow execution can be viewed, allowing the user to explore and understand how the methods work.
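The kind of check described above can be sketched as follows. This is only an illustration of validating a dataset against a component's declared input type and constraints before execution; the class names, property names, and messages are hypothetical and do not reflect the actual Wings API or its constraint representation.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    data_type: str
    properties: dict = field(default_factory=dict)


@dataclass
class Component:
    name: str
    input_type: str
    # Hypothetical constraint format: property name -> required value.
    required: dict = field(default_factory=dict)


def check_input(component, dataset):
    """Return a list of problems; an empty list means the input is valid."""
    problems = []
    if dataset.data_type != component.input_type:
        problems.append(
            f"{component.name} expects {component.input_type}, "
            f"but {dataset.name} is {dataset.data_type}"
        )
    for key, expected in component.required.items():
        actual = dataset.properties.get(key)
        if actual != expected:
            problems.append(
                f"{component.name} requires {key}={expected}, "
                f"but {dataset.name} has {key}={actual}"
            )
    return problems


# Correlation scoring accepts only single-labeled data.
correlation_scoring = Component(
    "CorrelationScoring", "LabeledDocuments", {"multi_labeled": False}
)
single = Dataset("single_label_set", "LabeledDocuments", {"multi_labeled": False})
multi = Dataset("multi_label_set", "LabeledDocuments", {"multi_labeled": True})

for dataset in (single, multi):
    for problem in check_input(correlation_scoring, dataset):
        print("Alert:", problem)
```

Returning a list of problems instead of raising on the first one lets an editor show every violated constraint to the user at once.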

Usability of DALE for Non-Experts

DALE has been reused by non-experts in two cases. The first is a case of reuse by researchers who are not experts in machine learning or text analytics. Their goal was to improve a question answering site by automating some of its manual processes, for example suggesting the best matches from the archives for an incoming question and finding the best-suited scientist to answer it. Using workflows simplified the analysis significantly by allowing calculation of standard statistics, visualization of document topics, and extension of standard algorithms.

The second is a case of reuse by high-school students for an internship project to analyze Twitter data. Over a period of a week, they were given tutorials and datasets. They had taken two semesters of introduction to programming in the eighth and ninth grades, and were entering tenth grade in the coming year. During the five days, the students: 1) Became familiar with workflows as a software paradigm; 2) Learned to use the system and run simple workflows to analyze data (e.g., compare sets of HTML files to see how they would be classified); 3) Learned to use pre-existing workflows for advanced text analytics (e.g., run workflows for document clustering and topic detection and compare their performance for different threshold parameters); 4) Extended existing workflows with new workflow components that they developed; and 5) Analyzed Twitter data to detect topic trends by applying pre-existing advanced text analytic workflows. For example, they decided to run the same workflow with different amounts of training data to see how it affected accuracy. They also analyzed Twitter data from the timeframe of the Haiti earthquake and detected popular topics in the dataset. A report describing these activities and their findings is available: Usability Report.
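The students' experiment of rerunning one workflow with different amounts of training data can be sketched as a simple loop. This is a hypothetical illustration, not the workflow they used: `train_and_score` stands in for any workflow that trains on one set and returns accuracy on another, and the majority-class baseline below is a toy stand-in for a real classifier.

```python
from collections import Counter


def learning_curve(train_and_score, train, test, fractions=(0.25, 0.5, 1.0)):
    """Run the same train-and-evaluate step on growing slices of `train`.

    Returns a list of (training_set_size, accuracy) pairs.
    """
    results = []
    for fraction in fractions:
        size = max(1, int(len(train) * fraction))
        results.append((size, train_and_score(train[:size], test)))
    return results


def majority_baseline(train, test):
    """Toy stand-in for a workflow: always predict the most common label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    correct = sum(1 for _, label in test if label == majority)
    return correct / len(test)


# Tiny made-up dataset of (document_id, label) pairs.
train = [("d1", "a"), ("d2", "b"), ("d3", "b"), ("d4", "b")]
test = [("t1", "b"), ("t2", "b"), ("t3", "a"), ("t4", "b")]

for size, accuracy in learning_curve(majority_baseline, train, test, (0.25, 1.0)):
    print(f"{size} training documents: accuracy {accuracy:.2f}")
```

Keeping the test set fixed while only the training slice grows makes the accuracy values comparable across runs.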
