Design Patterns - Types of Design Patterns
Composite - provide uniform interface on part-whole hierarchies
Command - encapsulate the execution of functionality; enable undo
Visitor - represent operations on object structure as objects
Observer - provide change notifications to objects depending on state
MVC (Model View Controller) - decouple model, view, and controller in interactive applications
Proxy - refine or replace behavior of a given object
Object Adapter - provide a different interface for an existing object
Template Method - capture general structure of an algorithm
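As an illustration of one entry in this list, the Observer pattern can be sketched in a few lines of Python. This is a minimal sketch, not a canonical implementation; the class name `Subject` and its methods `attach` and `set_state` are made up for the example.

```python
class Subject:
    """Holds state and notifies registered observers when it changes."""

    def __init__(self):
        self._observers = []
        self._state = None

    def attach(self, observer):
        # An observer is any callable that accepts the new state.
        self._observers.append(observer)

    def set_state(self, state):
        self._state = state
        for observer in self._observers:
            observer(state)  # change notification


received = []
subject = Subject()
subject.attach(received.append)
subject.attach(lambda s: received.append(f"log: {s}"))
subject.set_state("ready")
print(received)
```

Both observers are notified in the order in which they were attached; the subject does not need to know what they do with the notification.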
Design Patterns - Essential Elements of a Design Pattern
- Name and intent
- the problem, which describes when to apply the pattern
- the solution, which describes the elements, their relationships, interactions, and responsibilities
- the consequences, which have to be accepted when the pattern is applied
Design Pattern - Abstract Factory
Problem:
Object models often give rise to variation. For instance, there may be multiple GUI libraries with different widget hierarchies. Whenever components want to abstract from the specific choice, substantial effort is required. For instance, the construction of objects must be tunneled through a factory.
Solution:
Distribution - Types of Programming Concepts
- Distributed programming - components are network-based; their actions are coordinated by message passing
- Concurrent programming - a programming-language concept; program parts are split up and ordered
- Parallel programming - hardware-based parallelization of, for example, tasks; the tasks run in parallel
Parallel Programming
Parallel programming is a form of computation in which calculations are carried out simultaneously.
- There are multiple (possibly virtual) processors
- There are several forms of parallelization:
  - Task parallel - execution is distributed across processes
  - Data parallel - data is distributed across multiple data nodes
Multithreading in Java
Java threads
- ... are objects
- ... are threads of execution
- ... can be started, suspended, put to sleep, and made to wait for notifications.
- Can be assigned an activity by passing a 'Runnable' implementation to Thread's constructor
- Behavior can also be added by creating a subclass and overriding run(), which does not implement an activity by default.
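The two ways of giving a thread its activity can be sketched as follows. Note the assumption: the sketch uses Python's `threading.Thread`, which offers the same two options as the Java class described above (a callable passed to the constructor, or a subclass that overrides `run()`). It is an analogue in a different language, not Java code, and the names `greet` and `GreeterThread` are made up.

```python
import threading

results = []


def greet():
    # Activity passed to the constructor (the role a Runnable plays in Java).
    results.append("hello from target")


class GreeterThread(threading.Thread):
    # Activity supplied by overriding run() in a subclass.
    def run(self):
        results.append("hello from subclass")


t1 = threading.Thread(target=greet)
t2 = GreeterThread()
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()  # wait until both threads have finished

print(sorted(results))
```

The order in which the two threads append their result is not guaranteed, which is why the output is sorted before printing.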
Multithreading in Java
Thread pools
- Single Thread Executor : Uses a single thread
- Cached Thread Pool : Creates as many threads as a task requires. Idle threads are reused, and threads that stay inactive are removed.
- Fixed Thread Pool : Fixed number of threads
- Scheduled Thread Pool : Task scheduling capabilities
- Single Thread Scheduled Pool : One thread and scheduling capabilities
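The idea of a fixed thread pool can be sketched as follows. Java creates such pools through the factory methods of its `Executors` class; the sketch below instead uses Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in, so it only illustrates submitting tasks to a fixed number of worker threads. The task function `square` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    return n * n


# Two worker threads process five tasks; map() returns results in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    squares = list(pool.map(square, range(5)))

print(squares)
```

Leaving the `with` block shuts the pool down after all submitted tasks have completed.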
Concurrent Programming
* As in "divide and conquer", a problem can be split into several parts.
* Dependent parts : Parts of the program that must be processed in a specific order.
* Independent parts : The order does not matter. It is irrelevant whether one part is executed before or after another.
* Concurrency provides a way of structuring the solution to a problem so that it can be parallelized.
* Concurrency is about dealing with lots of things at once, while parallelism focuses on doing lots of things at once.
* Concurrent programming deals with some well defined interaction (communication) between the independent parts.
Parallel vs Concurrent
● "Concurrency should not be confused with parallelism. Concurrency is a language concept and parallelism is a hardware concept."
● "Concurrency and parallelism are orthogonal: it is possible to run concurrent programs on a single processor (using preemptive scheduling and time slices)"
Three Levels of Concurrency
● Distributed system : Computing nodes are connected to each other over the network.
● Operating system : Manages one computing node. One concurrent activity is called a process and has independent memory.
● Activities inside one process : Threads are concurrent activities that execute independently but share the same memory space.
Distributed Programming - Messaging-based Concurrency
● Message delivery instead of data exchange
● Tony Hoare formulated a formal language for describing patterns of interaction in concurrent systems, called "Communicating Sequential Processes" (CSP).
● Languages such as Occam or Go were influenced by CSP.
Messaging Service
* Message-Oriented-Middleware
* Create, edit, read and send messages
* Send messages to destinations
* Publish messages to all subscribers
* An application server provides resources to support messaging capabilities
* Clients may not have any knowledge of each other's existence
Messaging
Messaging is a concept for realizing communication between activities.
Concrete implementations include JMS and Akka.
Publish-Subscribe Messaging
● Many activities need to receive the same message.
Publisher I   ->          -> Subscriber I
Publisher II  -> Server   -> Subscriber II
Publisher III -> (Topic)  -> Subscriber III
Point-To-Point Messaging
● One activity needs to send a message to one specific other activity.
Source I   ->
Source II  -> Queue -> Target
Source III ->
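The difference between the two messaging styles can be sketched with two small in-memory classes. This is only an illustration of the delivery semantics, not the JMS or Akka API; the class names `Topic` and `PointToPointQueue` and their methods are hypothetical.

```python
from collections import deque


class Topic:
    """Publish-subscribe: every subscriber receives every message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


class PointToPointQueue:
    """Point-to-point: each message is consumed by exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


inbox_a, inbox_b = [], []
topic = Topic()
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
topic.publish("price update")

queue = PointToPointQueue()
queue.send("job 1")
queue.send("job 2")
first = queue.receive()
second = queue.receive()
print(inbox_a, inbox_b, first, second)
```

With the topic, both inboxes end up with a copy of the message; with the queue, a message is gone once one receiver has taken it.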
Java Message Service (JMS) API
● Supports publish-subscribe and point-to-point messaging as well as asynchronous and synchronous communication
● Message types: Empty message, JavaPrimitive, StreamMessage, MapMessage, TextMessage, ObjectMessage, BytesMessage.
● A message carries information such as a timestamp or user-assigned data.
● One JMS implementation is Apache ActiveMQ.
● This approach provides high robustness and guaranteed delivery.
● JMS is frequently used in Java EE applications.
Akka
● Uses the actor model to define independent activities.
● Communication between actors is defined in terms of messages.
● Supports synchronous communication at the local level.
● Supports point-to-point messaging through so-called mailboxes.
● Supports publish-subscribe messaging through routing.
Actor Model
● Makes extensive use of "divide and conquer"
● An actor can be envisioned as a human worker.
● Communication based on messages.
● Messages can be placed in an actor's mailbox.
● A hierarchy of supervision has to be set up.
● If an actor does not know how to handle a certain situation, it might send a message to a supervising actor.
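The mailbox and supervision ideas above can be sketched as follows. This is a single-threaded toy model in Python, not Akka's API; the classes `Actor` and `Worker` and the methods `tell`, `can_handle`, and `process_mailbox` are hypothetical names chosen for the illustration.

```python
from collections import deque


class Actor:
    """An actor with a mailbox; messages it cannot handle go to its supervisor."""

    def __init__(self, name, supervisor=None):
        self.name = name
        self.supervisor = supervisor
        self.mailbox = deque()
        self.handled = []

    def tell(self, message):
        self.mailbox.append(message)  # place the message in the mailbox

    def can_handle(self, message):
        return True

    def process_mailbox(self):
        while self.mailbox:
            message = self.mailbox.popleft()
            if self.can_handle(message):
                self.handled.append(message)
            elif self.supervisor is not None:
                # Escalate: send the problem to the supervising actor.
                self.supervisor.tell((self.name, message))


class Worker(Actor):
    def can_handle(self, message):
        return isinstance(message, int)  # this worker only understands ints


manager = Actor("manager")
worker = Worker("worker", supervisor=manager)
worker.tell(1)
worker.tell("unexpected")
worker.process_mailbox()
manager.process_mailbox()
print(worker.handled, manager.handled)
```

A real actor system processes mailboxes concurrently; the explicit `process_mailbox()` calls here only keep the example deterministic.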
Actor Model Guidelines
● A manager supervises the workers to which it has assigned subtasks.
● If an actor has "critical" data, it should assign subtasks to children to enable appropriate recovery from a failure.
● An actor may simply monitor another actor's liveness if it depends on that actor's work.
Akka – beyond local communication
● Akka Cluster is a fault-tolerant, decentralized, peer-to-peer-based cluster membership service.
● Based on gossip protocols (randomly communicating the cluster's state).
● Cluster membership used in Akka is based on Amazon's Dynamo system.
● A node is identified by a (hostname:port:uid) tuple and is part of a cluster, in which a single node acts as the "team leader".
Hadoop
● Allows for distributed processing of large datasets across computing node clusters
● Scales from a single node to thousands of nodes
● Akka and Hadoop enable the creation of large computing node clusters to deal with huge processing loads.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
What is Software Language Processing?
A program that performs language processing:
- Acceptor
- Parser
- Analysis
- Transformation
- Unparser
Forms of language processors, presented as simplified patterns:
- Chopper Pattern
- Lexer Pattern
- Copy/Paste Pattern
- Acceptor Pattern
- Parser Pattern
- Lexer Generation Pattern
- Acceptor Generation Pattern
- Parser Generation Pattern
- Text-to-object Pattern
- Text-to-tree Pattern
- Parser Listener Pattern
The Chopper Pattern
Intent:
Analyze text at the lexical level.
(optional steps:)
- Chop input into “pieces”.
- Classify each piece.
- Process classified pieces in a stream.
Process:
1. Chop the input into pieces with java.util.Scanner
    scanner = new Scanner(new File(...));
2. Tokens = classifiers of pieces of input
    public enum Token {
        COMPANY,
        DEPARTMENT,
        MANAGER,
        CLOSE,
        STRING,
        FLOAT,
        ...
    }
3. Classify chopped pieces into keywords, floats, etc.
    public static Token classify(String s) {
        if (keywords.containsKey(s))
            return keywords.get(s);
        else if (s.matches("\"[^\"]*\""))
            return STRING;
        else if (s.matches("\\d+(\\.\\d*)?"))
            return FLOAT;
        else
            throw new RecognitionException(...);
    }
4. Process the token stream to compute the salary total
Summary:
* Declare an enum type for tokens.
* Set up an instance of java.util.Scanner.
* Iterate over the pieces (strings) returned by the scanner.
* Classify pieces as tokens.
  - Use regular expression matching.
* Implement operations by iterating over pieces.
  - For example:
    - Total: aggregates floats
    - Cut: copies tokens, modifies floats
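The "Total" operation from the summary can be sketched as follows. The notes use Java with java.util.Scanner; this sketch is written in Python for brevity and only mirrors the idea (chop on whitespace, classify with regular expressions, aggregate floats). The keyword table, the token names `OPEN` and `SALARY`, and the sample input are assumptions made for the example, since the notes elide them.

```python
import re

# Hypothetical keyword table; the notes list only some of the token kinds.
KEYWORDS = {
    "company": "COMPANY",
    "department": "DEPARTMENT",
    "manager": "MANAGER",
    "salary": "SALARY",
    "{": "OPEN",
    "}": "CLOSE",
}


def classify(piece):
    """Map one chopped piece to a token kind, as in step 3 of the process."""
    if piece in KEYWORDS:
        return KEYWORDS[piece]
    if re.fullmatch(r'"[^"]*"', piece):
        return "STRING"
    if re.fullmatch(r"\d+(\.\d*)?", piece):
        return "FLOAT"
    raise ValueError(f"unrecognized piece: {piece!r}")


def total(text):
    # Chop on whitespace, classify each piece, and add up the floats.
    return sum(float(p) for p in text.split() if classify(p) == "FLOAT")


sample = 'manager "Ann" { salary 100.0 } manager "Bob" { salary 50.5 }'
print(total(sample))
```

Every piece is classified, so an unexpected piece raises an error instead of being silently skipped.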
A problem with the Chopper Pattern
Input:
company "FooBar Inc." { ...
Pieces:
'company', '"FooBar', 'Inc."', '{', ...
Chopping on whitespace splits the string "FooBar Inc." into two pieces. There is no general rule for chopping the input into pieces.
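The problem can be reproduced in a few lines. The sketch below is in Python rather than Java and contrasts plain whitespace chopping with a regular expression that treats a quoted string as one unit; the regular expression is one possible workaround chosen for illustration, not the approach prescribed by the notes.

```python
import re

text = 'company "FooBar Inc." {'

# Whitespace chopping breaks the quoted string apart.
chopped = text.split()
print(chopped)  # ['company', '"FooBar', 'Inc."', '{']

# Matching a quoted string first keeps it together as a single piece.
pieces = re.findall(r'"[^"]*"|\S+', text)
print(pieces)  # ['company', '"FooBar Inc."', '{']
```

Deciding what counts as one piece already requires knowledge of the language's lexical rules, which is why chopping alone does not generalize.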
Software models vs. megamodels
Software models
- Structure and behavior of software systems
Megamodels
- Languages, technologies, and artifacts in a system
- Relationships between these entities
Different kinds of software models:
• Data models (implemented in a database)
• Structural models (implemented in software)
  • Class diagrams (model state and relationships)
  • Package diagrams (to group classes)
• Behavioral models (implemented in software)
  • Sequence diagrams (define a specific scenario)
  • Activity diagrams (define a general workflow)
  • State diagrams (define states and transitions)
Summary of megamodeling
• Entities in software development
  • e.g.: Java, Python, J2EE, Django, Testing, Inheritance
• Entity types in software development
  • e.g.: Language, Technology, Artifact, Concept
• Relationships in software development
  • e.g.:
    • HelloWorld.java ∈ Java
    • Django uses Python
• Relationship types in software development
  • e.g., "∈" or "uses"
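One way to make these four notions concrete is to store a megamodel as typed entities plus relationship triples. The sketch below is an assumption about a possible representation, not a format defined by the notes; the names `Entity`, `Relationship`, `related`, and the predicate spelling `elementOf` (standing in for ∈) are hypothetical.

```python
from collections import namedtuple

Entity = namedtuple("Entity", ["name", "type"])
Relationship = namedtuple("Relationship", ["subject", "predicate", "target"])

# Entities with their entity types.
entities = {
    e.name: e
    for e in [
        Entity("Java", "Language"),
        Entity("Python", "Language"),
        Entity("Django", "Technology"),
        Entity("HelloWorld.java", "Artifact"),
    ]
}

# Relationships; the predicate is the relationship type.
relationships = [
    Relationship("HelloWorld.java", "elementOf", "Java"),  # HelloWorld.java ∈ Java
    Relationship("Django", "uses", "Python"),
]


def related(predicate):
    """Return all (subject, target) pairs for one relationship type."""
    return [(r.subject, r.target) for r in relationships if r.predicate == predicate]


print(related("uses"))
print(entities["Django"].type)
```

Keeping entity types and relationship types as plain data makes it easy to check, for example, that every relationship refers only to declared entities.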
Entity types
Basic types:
• Language — conceptual entities (possibly thought of as sets) for languages
• Technology — conceptual entities for technologies
• Artifact — "manifested" / "physical" entities, e.g., a file
• System — a conglomeration of artifacts making up a system
• Function — mathematical functions on languages or actions
• Concept — programming techniques or other concepts in software development
entity type: Language
Definition:
• A language that is used in software development.
Subtypes of Language:
• Programming language: Java, Python, Ruby, …
• Query language: XPath, SQL, XQuery, …
• Transformation language: XSLT, SQL, ATL, …
• Modeling language: UML, SDL, BPMN, …
entity type: Technology
Definition:
• A tool that is used in software development
Subtypes of Technology:
• API and library: JDOM, JQuery, Swing, Tkinter, Twitter API, …
• Framework: JPA, Hibernate, Spring, Django, …
• IDE: Visual Studio, Eclipse, NetBeans, …
• Platform: .NET, Android, J2EE, Java (platform), JRE, …
• Language processor: javac, python, gcc, …
entity type: Artifact
Definition:
• A real entity in a software system
Subtypes of Artifact (they all concern "representation"):
• File: files in the common sense of an operating system
• Folder: folders as nested collections of files and folders
• Resource: artifacts addressable / retrievable by URI/URL
• Transient: artifacts arising "temporarily" through the execution of software
• Fragment: artifacts being part of an artifact
Relationship symbols
• ∈ — membership relationship for languages
• defines — something defining a language or a function
• implements — something implementing a language or a function
• ↦ — function application (data flow)
• ⊆ — subset relationship on languages
• partOf — part-of relationship (composition)
• uses — use of a language, technology, or concept
• facilitates — making something easier to use
• refersTo — reference to entities
• conformsTo — conformance in the sense of schema-based validation
• correspondsTo — systematic similarity
Relationship types for composition
• Artifact partOf Artifact — an artifact being part of another artifact
• Artifact partOf System — an artifact being part of a system
• Technology partOf Technology — a technology being part of another technology
• Language partOf Technology — a language being part of a technology
Relationship types for languages
• Artifact ∈ Language — the language of an artifact
• Artifact defines (Language | Function) — languages or functions defined by artifacts
• (Artifact | Technology) implements (Language | Function) — … implemented by technologies
• Function(Artifact) ↦ Artifact — map an artifact to another artifact (“data flow”)
• Language ⊆ Language — subset relationship on languages
Information Retrieval (IR)
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
• ‘Documents’ in 101project
  • wiki text (pages or sections) and
  • source-code units with
    • program identifiers and
    • comments
IR scenario
• Objective: find source-code units that implement a specific feature, e.g., ‘Total’.
• Method: search source code for characteristic terms, e.g., ‘total’.
• Challenges:
  • Distinguish feature implementation and testing.
  • Dealing with variation in natural language usage.
Performance and correctness measures in IR
Precision is the fraction of the retrieved documents that are relevant to the user's information need.
precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
Recall is the fraction of the documents relevant to the query that are successfully retrieved.
recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}|
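Both measures follow directly from set intersection, as the following minimal sketch shows. The function name `precision_recall`, the file names, and the convention of returning 0.0 for an empty denominator are assumptions made for the example.

```python
def precision_recall(relevant, retrieved):
    """Compute precision and recall from two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were retrieved
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search for the feature 'Total'.
relevant = {"Total.java", "Cut.java", "Company.java"}
retrieved = {"Total.java", "TotalTest.java", "Cut.java", "README"}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).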
Machine Learning
• Supervised learning
Supervised learning means training a machine with labeled data. The computer is presented with example inputs together with the desired outputs. The goal is to learn a general rule that maps inputs to outputs.
• Unsupervised learning
Unsupervised learning is the machine learning task of inferring a function that describes hidden structure in unlabeled data. No labels are given to the learning algorithm, leaving it on its own to find structure in its input.
• Reinforcement learning
  • Inspired by behaviorist psychology
  • The algorithm learns by reward and punishment, like a human
  • Example: A computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to its goal.
  • Another example is learning to play a game by playing against an opponent.
Natural Language Processing
Definitions:
Natural language processing is a branch of artificial intelligence that deals with analyzing, understanding and generating the languages that humans use naturally in order to interface with computers in both written and spoken contexts using natural human languages instead of computer languages.
Natural language processing is a method to translate between computer and human languages. It is a method of getting a computer to understandably read a line of text without the computer being fed some sort of clue or calculation. In other words, NLP automates the translation process between computers and humans.
Computer understanding, analysis, manipulation, and/or generation of natural language. This can refer to anything from fairly simple string-manipulation tasks like stemming, or building concordances of natural language texts, to higher-level AI-like tasks like processing user queries in natural language.
Mining Software Repositories
• The Mining Software Repositories (MSR) field analyzes the rich data available in software repositories.
• Analysis of
  • version control repositories
  • mailing list archives
  • bug tracking systems
  • issue tracking systems, etc.
• to uncover information about software systems, projects, and software engineering.
Natural Language in 101 projects
• Program identifiers (I)
• Comments (C)
• Wiki text (T)
• Commit messages (G1)
• GitHub issues (G2)
• GitHub revisions (G3)
Program identifiers
- Class, type, and interface names
- Method names
- Parameter and variable names
Comments:
- Single-line comments
    // a comment goes here
- Block comments
    /* a
     * comment goes
     * here
     */
Wiki text
- Source code
- Natural-language text
- Additional information
Tokenizing
• Split a text (string) into its tokens
• Example — Input:
  • "Natural language processing makes fun."
• Result:
  • "Natural", "language", "processing", "makes", "fun", "."
• Best practice is to work with lowercased tokens and without punctuation (normalization of tokens).
• Normalized result:
  • "natural", "language", "processing", "makes", "fun"
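The two steps, tokenizing and normalizing, can be sketched with Python's standard `re` module. The function names `tokenize` and `normalize` and the chosen regular expression are assumptions; NLP libraries typically ship more careful tokenizers.

```python
import re


def tokenize(text):
    # Runs of word characters and single punctuation marks become tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(tokens):
    # Drop tokens without any letter or digit and lowercase the rest.
    return [t.lower() for t in tokens if any(c.isalnum() for c in t)]


tokens = tokenize("Natural language processing makes fun.")
print(tokens)             # ['Natural', 'language', 'processing', 'makes', 'fun', '.']
print(normalize(tokens))  # ['natural', 'language', 'processing', 'makes', 'fun']
```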
Stemming
• Stemming is the process of reducing a word to its stem.
• The stem or root form is not necessarily a word by itself, but it
can be used to generate words by concatenating the right suffix.
• Example:
• fish, fishes and fishing stem to fish, which is a correct word.
• study, studies and studying stem to studi, which is not an English word.
• Most commonly, stemming algorithms (a.k.a. stemmers) are
based on rules for suffix stripping.
• The most famous algorithm is the Porter stemmer, introduced in 1979.
• A more aggressive stemming algorithm is the Lancaster stemmer, introduced in 1990.
• Several Python libraries provide stemmers, e.g., NLTK and PyStemmer.
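To make "rules for suffix stripping" concrete, here is a toy stemmer. It is deliberately not the Porter or Lancaster algorithm: the rule list, the `min_stem` threshold and the name `toy_stem` are assumptions chosen only so that the fish/study examples above work.

```python
# Ordered (suffix, replacement) rules; a toy rule set, NOT the Porter rules
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("es", ""), ("s", "")]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, then turn a final 'y' into 'i'."""
    # Step 1: apply the first rule that matches and leaves a long enough stem
    for suffix, replacement in SUFFIX_RULES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_stem:
            word = word[:stem_len] + replacement
            break
    # Step 2: final y -> i, so "study" and "studies" share one stem
    if word.endswith("y") and len(word) > min_stem:
        word = word[:-1] + "i"
    return word

for w in ["fish", "fishes", "fishing", "study", "studies", "studying"]:
    print(w, "->", toy_stem(w))
```

Real stemmers use many more rules and conditions on the remaining stem, so this sketch will mis-stem plenty of ordinary words.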
Stemming in Python
• Stemming with NLTK
from nltk.stem.porter import PorterStemmer

def stem(tokens):
    # Create the stemmer once and reuse it for every token
    stemmer = PorterStemmer()
    stems = []
    for item in tokens:
        stems.append(stemmer.stem(item))
    return stems
• Stemming with PyStemmer
import Stemmer

def stem(tokens):
    # 'english' selects the English Snowball stemmer
    stemmer = Stemmer.Stemmer('english')
    # stemWords stems a whole list of tokens at once
    stems = stemmer.stemWords(tokens)
    return stems
Stop words
Stop words are usually extremely common words in a language
which are filtered out before natural language data is processed.
There is no single universal list of stop words used by all
NLP tools, but here are some common English stop words:
a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on,
that, the, to, was, were, will, with, …
• Some stop word lists (a.k.a. stop lists):
• http://snowball.tartarus.org/algorithms/english/stop.txt
• http://xpo6.com/list-of-english-stop-words/
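Filtering with a stop list can be sketched as a set lookup. The set below contains only the example words listed on this card, not a complete stop list, and the name `remove_stop_words` is an assumption for this illustration.

```python
# Only the example stop words from the card; real stop lists are longer
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
}

def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "stem", "is", "a", "word"]))
```

A set is used so that each membership check takes constant time on average.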
Mining Software Data
We apply selected techniques from NLP and data mining (IR, machine learning) to scenarios in which program identifiers, program comments, or documentation (e.g., on 101wiki) serve as data.
Selection of techniques:
• Sentiment analysis
• IDF
• Clustering
• Prediction model
• Cosine similarity
• Correlation
• “Plotting”
Sentiment analysis
We want to compare the sentiment of comments across different programming languages to answer questions like this:
Are Haskell programmers more positive than Java programmers?
Basic task
• Classifying the polarity of a given text at the document, sentence, or feature/aspect level.
The expressed opinion can be positive, negative, or neutral.
Advanced task
• Classifying beyond polarity, for instance into emotional states such as "angry", "sad", and "happy".
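The basic task can be illustrated with a very small lexicon-based classifier. This is only a sketch: the two word sets are hypothetical examples invented for this illustration, and real sentiment analysis relies on much larger lexicons or trained models.

```python
import re

# Hypothetical mini lexicons, invented for this illustration
POSITIVE = {"good", "great", "nice", "clean", "elegant", "works"}
NEGATIVE = {"bad", "ugly", "hack", "broken", "wrong", "fixme"}

def polarity(text):
    """Classify a text as positive, negative or neutral by word counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for comment in ["Nice and clean solution", "This is an ugly hack", "Iterate over the list"]:
    print(comment, "->", polarity(comment))
```

Applied to every comment of a contribution, such labels could then be aggregated per programming language to approach the Haskell versus Java question.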
Inverse Document Frequency
• Extract the vocabulary from wiki text (including stemming and
stop list application).
• Compute TF-IDF over the wiki pages ("documents").
• This yields a ranked list of terms for every document.
• Do the top-ranked terms 'characterize' the document?
• Are the top-ranked terms important for all documents?
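The ranking step can be sketched as follows. TF-IDF has several variants; this sketch assumes relative term frequency and an unsmoothed logarithmic IDF, and the name `tf_idf` and the two sample documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Map each document name to its terms ranked by descending TF-IDF.

    documents: dict of name -> list of preprocessed tokens.
    """
    n = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))
    ranked = {}
    for name, tokens in documents.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), ties broken alphabetically
        ranked[name] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked

pages = {
    "a": ["haskell", "monad", "code"],
    "b": ["java", "class", "code"],
}
for name, terms in tf_idf(pages).items():
    print(name, terms)
```

A term that occurs in every document, such as "code" here, gets an IDF of zero and therefore ends up at the bottom of each ranked list.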
Cluster analysis
Categories based on cluster model
• Hierarchical clustering
• Centroid-based clustering
• Distribution-based clustering
• Density-based clustering
The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally.
Hierarchical clustering
Based on the idea that objects are more related to nearby objects than to objects farther away.
Centroid-based clustering (k-means clustering)
Find k cluster centers and assign each object to the nearest
cluster center, such that the squared distances between the objects
and their cluster centers are minimized.
Distribution-based clustering
Objects of a cluster most likely belong to the same distribution (e.g., a Gaussian distribution).
Density-based clustering
Clusters are defined as areas of higher density than the remainder of the data set.
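Centroid-based clustering is commonly computed with Lloyd's iteration, which alternates between assigning points and recomputing centers. The sketch below assumes that initial centers are passed in and that a fixed number of iterations is enough; the names `kmeans`, `assign`, `sq_dist` and `mean` are chosen for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centers, iterations=10):
    """Lloyd's iteration: assign points, then move centers to cluster means."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center
        centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
    return centers, assign(points, centers)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(clusters)
```

The result depends on the initial centers, which is one reason why the choice of algorithm and parameters often has to be made experimentally.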
Cosine scenario
For each method or class scope of each contribution, find the most similar scope in another contribution. Each scope is represented by a vector of term frequencies (after preprocessing) of the program identifiers (or comments) in that scope. The terms to be included in the vector could be selected in different ways.
For example, we could consider the top-n terms from a (TF-)IDF analysis.
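The comparison of scopes can be sketched with term-frequency vectors stored as counters. This sketch assumes that each scope is already available as a list of preprocessed terms; the names `cosine_similarity` and `most_similar` and the sample scopes are made up for this example.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms that occur in both scopes contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(scope, candidates):
    """Name of the candidate scope with the highest cosine similarity.

    candidates: dict of scope name -> list of terms.
    """
    return max(candidates, key=lambda name: cosine_similarity(scope, candidates[name]))

other_contribution = {
    "cut": ["cut", "salary", "employee"],
    "parse": ["parse", "input"],
}
print(most_similar(["total", "salary", "employee"], other_contribution))
```

Restricting the vectors to the top-n (TF-)IDF terms would amount to filtering the term lists before they are passed to `cosine_similarity`.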
Flashcard set info:
Author: CoboCards-User
Main topic: PTT
Topic: PTT
School / Univ.: Uni Koblenz
City: Koblenz
Published: 08.07.2016
Tags: Lämmel