
Technical Projects
Wenye's projects related to Music Generation and Audio Recognition.

1. Paper (in publication) : A Hybrid Architecture Combining CNN, LSTM, and Attention Mechanisms for Automatic Speech Recognition
Abstract:
This research proposes a robust sequence-to-sequence model architecture for speech recognition tasks, incorporating convolutional and recurrent neural networks with attention mechanisms. The encoder utilizes a hybrid approach of residual convolutional blocks (RCNN) and bidirectional Long Short-Term Memory (BLSTM) networks to extract hierarchical temporal and spectral features from input sequences efficiently. The decoder employs an attention-augmented LSTM with Luong attention, dynamically focusing on relevant parts of the encoded representation during sequence generation. To enhance generalization and robustness, we integrate data augmentation techniques such as phase randomization and time-frequency masking into the preprocessing pipeline. Features are further refined using Mel-Frequency Cepstral Coefficients (MFCC) and Filter Banks (FBANK) combined with delta and delta-delta coefficients. Beam search decoding with a width of 10 is used during inference to improve the accuracy of predictions. We evaluate the model on the TIMIT dataset and achieve a Phoneme Error Rate (PER) of 16.7%, demonstrating the effectiveness of the proposed architecture and preprocessing strategies in capturing complex phoneme patterns and improving recognition performance.
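As a rough, non-authoritative illustration of the feature pipeline described in the abstract (not code from the paper; file names and parameter values are placeholders), the sketch below computes MFCC and FBANK features with delta and delta-delta coefficients and applies a simple time-frequency mask for augmentation:

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13, n_mels=40):
    """Compute MFCC and FBANK features with delta and delta-delta coefficients."""
    y, sr = librosa.load(wav_path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    fbank = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))

    feats = np.vstack([mfcc, fbank])
    # Append first-order (delta) and second-order (delta-delta) coefficients.
    feats = np.vstack([feats, librosa.feature.delta(feats), librosa.feature.delta(feats, order=2)])
    return feats  # shape: (features, frames)

def time_frequency_mask(feats, max_f=8, max_t=20):
    """Simple time-frequency masking for data augmentation (illustrative only)."""
    feats = feats.copy()
    f0 = np.random.randint(0, feats.shape[0] - max_f)
    t0 = np.random.randint(0, max(1, feats.shape[1] - max_t))
    feats[f0:f0 + np.random.randint(1, max_f), :] = 0.0   # frequency mask
    feats[:, t0:t0 + np.random.randint(1, max_t)] = 0.0   # time mask
    return feats
```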

Note: Since the paper is still in the publication process, please contact Wenye Song by clicking the button above if you would like access to it.
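The full architecture is only summarized above; a minimal PyTorch sketch of the Luong (dot-product) attention step used by an LSTM decoder over encoder outputs might look like the following. The class names, dimensions, and single-step structure are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongAttention(nn.Module):
    """Dot-product (Luong) attention: score encoder states against the decoder state."""
    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, hidden); enc_outputs: (batch, time, hidden)
        scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2)).squeeze(2)   # (batch, time)
        weights = F.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)    # (batch, hidden)
        return context, weights

class AttentionDecoderStep(nn.Module):
    """One decoding step: LSTM cell followed by attention over encoder outputs."""
    def __init__(self, n_phonemes, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.attn = LuongAttention()
        self.out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, prev_token, state, enc_outputs):
        h, c = self.cell(self.embed(prev_token), state)
        context, _ = self.attn(h, enc_outputs)
        logits = self.out(torch.cat([h, context], dim=1))
        return logits, (h, c)
```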
2. Patent Proposal : Automatic Melody Completion Method Based on Harmony Model and Emotion-Driven Music Generation
Brief Introduction to Methodology:
Data Collection and Preparation:
Collect harmonic samples with different emotional characteristics. Extract triads, augmented chords, diminished chords, suspended chords, and other chord types from existing published music. Arrange the chords into sequences such as I-IV-V, descending fifths, or parallel sixths. Label each chord sequence with emotional tags and harmonic progression information, and link each sequence to its corresponding timbre features.
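One possible way to organize such labeled samples (purely illustrative; the proposal does not specify a data format, and the field names and example values below are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChordSample:
    chords: List[str]      # e.g. ["C", "F", "G"] for a I-IV-V progression in C major
    progression: str       # harmonic progression label, e.g. "I-IV-V" or "descending fifths"
    emotion: str           # emotion tag, e.g. "joyful", "melancholic", "tense"
    timbre: List[float] = field(default_factory=list)   # linked timbre feature vector

dataset = [
    ChordSample(["C", "F", "G"], "I-IV-V", "joyful"),
    ChordSample(["Am", "Dm", "G", "C"], "descending fifths", "melancholic"),
]
```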
Timbre Feature Extraction:
Extract features such as harmonic structure and spectral features.
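A minimal sketch of how this extraction could be done with librosa (the proposal does not name a specific toolkit): harmonic/percussive separation plus chroma approximates the harmonic structure, while spectral centroid and contrast serve as spectral features.

```python
import numpy as np
import librosa

def timbre_features(wav_path, sr=22050):
    """Extract a compact timbre descriptor: harmonic content plus spectral statistics."""
    y, sr = librosa.load(wav_path, sr=sr)
    harmonic, _ = librosa.effects.hpss(y)                           # isolate the harmonic component

    chroma = librosa.feature.chroma_stft(y=harmonic, sr=sr)         # harmonic (pitch-class) structure
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)        # spectral brightness
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)        # peak/valley spectral contrast

    # Summarize each feature over time with its mean to obtain a fixed-length vector.
    return np.concatenate([chroma.mean(axis=1), centroid.mean(axis=1), contrast.mean(axis=1)])
```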
Training the Emotion-Harmony Model:
Construct an emotion-harmony model using a combined Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) structure to learn the relationships between harmonic characteristics and emotions.
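A hypothetical PyTorch layout for such a model; the proposal only specifies the CNN + RNN combination, so the layer sizes and the chord/emotion vocabularies below are assumptions:

```python
import torch
import torch.nn as nn

class EmotionHarmonyModel(nn.Module):
    """CNN over chord embeddings followed by a GRU; predicts an emotion class per sequence."""
    def __init__(self, n_chords, n_emotions, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chords, emb)
        self.conv = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)   # local harmonic patterns
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)            # longer-range progression context
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, chord_ids):                  # chord_ids: (batch, seq_len) of chord indices
        x = self.embed(chord_ids).transpose(1, 2)  # (batch, emb, seq_len) for Conv1d
        x = torch.relu(self.conv(x)).transpose(1, 2)
        _, h = self.rnn(x)                         # h: (1, batch, hidden)
        return self.head(h.squeeze(0))             # emotion logits
```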
Training the Emotion-Melody Sequence Model:
Use sequence learning methods to associate emotion tags with melody sequences, enabling the model to predict and generate melodies that match the desired emotional content.
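One way to read "sequence learning" here (an assumption, not the proposal's exact design) is an LSTM that conditions each step on an emotion embedding and predicts the next melody token:

```python
import torch
import torch.nn as nn

class EmotionMelodyModel(nn.Module):
    """Predicts the next melody token conditioned on an emotion tag."""
    def __init__(self, n_notes, n_emotions, emb=64, hidden=128):
        super().__init__()
        self.note_embed = nn.Embedding(n_notes, emb)
        self.emotion_embed = nn.Embedding(n_emotions, emb)
        self.lstm = nn.LSTM(2 * emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_notes)

    def forward(self, notes, emotion):             # notes: (batch, seq), emotion: (batch,)
        e = self.emotion_embed(emotion).unsqueeze(1).expand(-1, notes.size(1), -1)
        x = torch.cat([self.note_embed(notes), e], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)                      # next-note logits at every position
```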
Fusion of Two Models:
Use the emotion model and the harmonic sequence model together to predict the next logical chord progression. Adjust the chord progression based on the emotion model's predictions, and fine-tune the generated harmonic sequence using established musical techniques.
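The proposal leaves the fusion mechanism open; one simple interpretation is to blend the harmonic model's next-chord distribution with a per-chord emotional-fit score, as in this hypothetical sketch (the weighting scheme and `alpha` value are assumptions):

```python
import torch
import torch.nn.functional as F

def fuse_next_chord(harmony_logits, emotion_fit, alpha=0.7):
    """Blend next-chord probabilities from the harmonic model with per-chord emotion fitness.

    harmony_logits: (n_chords,) scores from the harmonic sequence model
    emotion_fit:    (n_chords,) scores of how well each chord matches the target emotion
    alpha:          weight on harmonic plausibility vs. emotional fit (assumed value)
    """
    p_harmony = F.softmax(harmony_logits, dim=0)
    p_emotion = F.softmax(emotion_fit, dim=0)
    fused = alpha * p_harmony + (1.0 - alpha) * p_emotion
    return torch.argmax(fused)   # index of the chosen next chord
```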
Interactive Algorithm for Melody Evolution:
Implement an interactive approach in which users provide feedback: they can adjust harmonies, rate emotional accuracy, or manually tweak chord progressions. The system learns from this feedback and refines the melody generation model.
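The interactive step can be read as a collect-feedback-then-fine-tune loop. The sketch below is one assumed wiring; `generate`, `rate_emotion`, `edit_chords`, and `fine_tune` are hypothetical interface hooks, not components named in the proposal:

```python
def interactive_refinement(model, generate, rate_emotion, edit_chords, fine_tune, rounds=3):
    """Generate a progression, gather user feedback, and refine the model each round.

    generate(model)        -> proposed chord progression
    rate_emotion(chords)   -> user's emotional-accuracy rating for the proposal (0.0 to 1.0)
    edit_chords(chords)    -> user's manually adjusted progression
    fine_tune(model, x, y) -> model updated on the (proposal, correction) pair
    All four callbacks are hypothetical user-interface hooks.
    """
    for _ in range(rounds):
        proposal = generate(model)
        rating = rate_emotion(proposal)
        corrected = edit_chords(proposal)
        # Learn from the user's correction whenever the emotional accuracy was rated low.
        model = fine_tune(model, proposal, corrected if rating < 0.8 else proposal)
    return model
```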

3. Patent Proposal : AudioGPT with Audio Intonation Analysis
This project proposes to enhance AudioGPT by introducing an intonation analysis module, improving its ability to handle speech recognition tasks by incorporating intonation, prosody, and emotional features. Instead of directly converting audio into text with standard ASR (Automatic Speech Recognition) models such as Whisper, this approach transforms speech into a vector representation that captures intonation, emotional tone, and rhythm.
Brief Introduction to Methodology:
Data Preprocessing:
Collect and augment audio datasets, and apply denoising, resampling, and normalization techniques.
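A rough preprocessing sketch using librosa (the proposal does not name specific tools; silence trimming stands in for denoising here):

```python
import librosa

def preprocess(wav_path, target_sr=16000):
    """Load, resample, trim silence, and peak-normalize an audio clip."""
    y, sr = librosa.load(wav_path, sr=None)
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)   # resampling
    y, _ = librosa.effects.trim(y, top_db=30)                  # drop leading/trailing silence
    y = librosa.util.normalize(y)                              # peak normalization
    return y, target_sr
```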
Speech-to-Text Conversion:
Convert spoken language into text sequences using standard ASR models.
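If Whisper is used as the baseline ASR model, this step could be as simple as the following sketch with the open-source openai-whisper package (the model size and file name are placeholders):

```python
import whisper

# Load a pretrained Whisper checkpoint and transcribe one clip to text.
model = whisper.load_model("base")
result = model.transcribe("speech_sample.wav")
print(result["text"])
```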
Feature Extraction from the Audio Signal:
Extract intonation-related acoustic features such as pitch, energy, and spectral features.
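A small librosa sketch of the intonation-related features named above: fundamental frequency (pitch) via pYIN, frame energy via RMS, and spectral centroid as a spectral feature (the frame alignment and parameter choices are assumptions):

```python
import numpy as np
import librosa

def intonation_features(y, sr):
    """Frame-level pitch, energy, and spectral-centroid tracks for one utterance."""
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = np.nan_to_num(f0)                                    # unvoiced frames -> 0 Hz
    energy = librosa.feature.rms(y=y)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]

    n = min(len(f0), len(energy), len(centroid))              # align frame counts
    return np.stack([f0[:n], energy[:n], centroid[:n]], axis=0)   # (3, frames)
```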
Vector Representation of Intonation:
Convert the intonation information into a vector representation using word-vector techniques, encoding intonation characteristics as high-dimensional numerical values; for example, tone variations can be expressed as numerical changes in a matrix.
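One reading of "word-vector techniques for intonation" (my assumption, not the proposal's exact construction) is to quantize the pitch contour into discrete tone tokens and embed each token the way a word would be embedded:

```python
import numpy as np
import torch
import torch.nn as nn

def tone_tokens(f0, threshold=2.0):
    """Quantize a pitch track into discrete tokens: 0 = flat, 1 = rising, 2 = falling."""
    diffs = np.diff(f0)
    tokens = np.zeros(len(diffs), dtype=np.int64)
    tokens[diffs > threshold] = 1
    tokens[diffs < -threshold] = 2
    return torch.from_numpy(tokens)

# Embed each tone token into a high-dimensional vector, word-embedding style.
tone_embedding = nn.Embedding(num_embeddings=3, embedding_dim=64)
intonation_matrix = tone_embedding(tone_tokens(np.array([100.0, 105.0, 110.0, 108.0, 108.0])))
print(intonation_matrix.shape)   # (frames - 1, 64): a numerical matrix of tone variations
```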
Fusion of Text and Acoustic Features:
Combine the text-based vector with the intonation acoustic feature vector using concatenation and weighted addition to create an integrated frequency feature vector.
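A minimal sketch of this fusion step, assuming the two vectors have already been projected to the same size (the weights are placeholder values):

```python
import torch

def fuse_text_and_intonation(text_vec, intonation_vec, w_text=0.6, w_tone=0.4):
    """Fuse a text embedding with an intonation vector by concatenation and weighted addition.

    text_vec, intonation_vec: (dim,) tensors of the same size (projection omitted for brevity)
    """
    weighted = w_text * text_vec + w_tone * intonation_vec          # weighted addition
    return torch.cat([text_vec, intonation_vec, weighted], dim=0)   # concatenation -> (3 * dim,)
```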
Model Training & Optimization:
Label each frequency segment with its corresponding semantic and intonation information, defining relationships such as question vs. statement, soft vs. firm speech, and strict vs. relaxed intonation. Use a Recurrent Neural Network (RNN) to model the relationship between the frequency vectors and the language, and train with Mean Squared Error (MSE) as the loss function.
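A minimal training-loop sketch consistent with the description above: a GRU (an RNN variant) over fused frequency vectors, optimized with MSE loss. The output format, dimensions, and stand-in data are assumptions, since the proposal does not fix them:

```python
import torch
import torch.nn as nn

class IntonationRNN(nn.Module):
    """GRU over fused frequency-feature vectors, regressing a label vector per segment."""
    def __init__(self, in_dim, hidden=128, out_dim=8):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x):                 # x: (batch, frames, in_dim)
        _, h = self.rnn(x)
        return self.head(h.squeeze(0))    # (batch, out_dim)

model = IntonationRNN(in_dim=192)         # in_dim is an assumed fused-feature size
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One illustrative optimization step on random stand-in data.
x = torch.randn(4, 100, 192)              # batch of fused feature sequences
y = torch.randn(4, 8)                     # continuous intonation/semantic targets (assumed format)
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```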

Wenye's other CS projects, which involve machine learning and software development rather than music, can be found on her GitHub.