publications
Publications in reversed chronological order, generated by jekyll-scholar. The SATP Consortium is an international collaboration of 50+ researchers.
2024
- Conference: A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems. Karn N. Watcharasupat and Alexander Lerch. In Proceedings of the 25th Conference of the International Society for Music Information Retrieval (to appear), Nov 2024
Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs.
@inproceedings{Watcharasupat2024StemAgnosticSingleDecoderSystem, title = {A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems}, author = {Watcharasupat, Karn N. and Lerch, Alexander}, year = {2024}, month = nov, booktitle = {To appear in the Proceedings of the 25th Conference of the International Society for Music Information Retrieval}, publisher = {ISMIR}, address = {San Francisco, CA, USA}, google_scholar_id = {-f6ydRqryjwC} }
- Conference: Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support. Karn N. Watcharasupat, Chih-Wei Wu, and Iroro Orife. In Proceedings of the 5th IEEE International Symposium on the Internet of Sounds (to appear), Sep 2024
Cinematic audio source separation (CASS), as a problem of extracting the dialogue, music, and effects stems from their mixture, is a relatively new subtask of audio source separation. To date, only one publicly available dataset exists for CASS, that is, the Divide and Remaster (DnR) dataset, which is currently at version 2. While DnR v2 has been an incredibly useful resource for CASS, several areas of improvement have been identified, particularly through its use in the 2023 Sound Demixing Challenge. In this work, we develop version 3 of the DnR dataset, addressing issues relating to vocal content in non-dialogue stems, loudness distributions, mastering process, and linguistic diversity. In particular, the dialogue stem of DnR v3 includes speech content from more than 30 languages from multiple families including but not limited to the Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu families. Benchmark results using the Bandit model indicated that training on multilingual data yields significant generalizability to the model even in languages with low data availability. Even in languages with high data availability, the multilingual model often performs on par or better than dedicated models trained on monolingual CASS datasets.
@inproceedings{Watcharasupat2024RemasteringDivideRemaster, title = {Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support}, author = {Watcharasupat, Karn N. and Wu, Chih-Wei and Orife, Iroro}, year = {2024}, month = sep, booktitle = {To appear in the Proceedings of the 5th IEEE International Symposium on the Internet of Sounds}, publisher = {IEEE}, address = {Erlangen, Germany}, google_scholar_id = {mB3voiENLucC} }
- Journal: Soundscape Descriptors in Eighteen Languages: Translation and Validation through Listening Experiments. Francesco Aletta, Andrew Mitchell, Tin Oberman, Jian Kang, Sara Khelil, Tallal Abdel Karim Bouzir, Djihed Berkouk, Hui Xie, Yuan Zhang, Ruining Zhang, Xinhao Yang, Min Li, Kristian Jambrošić, Tamara Zaninović, Kirsten van den Bosch, Tamara Lühr, Nicolas Orlik, Darragh Fitzpatrick, Anastasios Sarampalis, Pierre Aumond, Catherine Lavandier, Cleopatra Christina Moshona, Steffen Lepa, André Fiebig, Nikolaos M. Papadakis, Georgios E. Stavroulakis, Anugrah Sabdono Sudarsono, Sugeng Joko Sarwono, Giuseppina Emma Puglisi, Farid Jafari, Arianna Astolfi, Louena Shtrepi, Koji Nagahata, Hyun In Jo, Jin Yong Jeon, Bhan Lam, Julia Chieng, Kenneth Ooi, Joo Young Hong, Sónia Monteiro Antunes, Sonia Alves, Maria Luiza de Ulhoa Carvalho, Ranny Loureiro Xavier Nascimento Michalski, Pablo Kogan, Jerónimo Vida Manzano, Rafael García Quesada, Enrique Suárez Silva, José Antonio Almagro Pastor, Mats E. Nilsson, Östen Axelsson, Woon-Seng Gan, Karn N. Watcharasupat, Sureenate Jaratjarungkiat, Zhen-Ting Ong, Papatya Nur Dökmeci Yörükoğlu, Uğur Beyza Erçakmak Osma, and Thu Lan Nguyen. Applied Acoustics, Sep 2024
This paper presents the outcomes of the “Soundscape Attributes Translation Project” (SATP), an international initiative addressing the critical research gap in soundscape descriptor translations for cross-cultural studies. Focusing on eighteen languages – namely: Arabic, Chinese, Croatian, Dutch, English, French, German, Greek, Indonesian, Italian, Japanese, Korean, Malay, Portuguese, Spanish, Swedish, Turkish, and Vietnamese – the study employs a four-step procedure to evaluate the reliability and cross-cultural validity of translated soundscape descriptors. The study introduces a three-tier confidence level system (Low, Medium, High) based on “adjusted angles”, a measure proposed to correct the soundscape circumplex model (i.e., the pleasant-eventful space proposed in the ISO 12913 series) for a given language. Results reveal that most languages successfully maintain the quasi-circumplex structure of the original soundscape model, ensuring robust cross-cultural validity. English, Arabic, Chinese (Mandarin), Croatian, Dutch, German, Greek, Indonesian, Italian, Spanish, Swedish, and Turkish achieve a “High” confidence level. French, Japanese, Korean, Malay, Portuguese, and Vietnamese demonstrate varying confidence levels, highlighting the importance of the preliminary translation. This research significantly contributes to standardized cross-cultural methodologies in soundscape perception research, emphasizing the pivotal role of adjusted angles within the soundscape circumplex model in ensuring the accuracy of the locations of its dimensions (i.e., attributes). The SATP initiative offers insights into the complex interplay of language and meaning in the perception of environmental sounds, opening avenues for further cross-cultural soundscape research.
@article{Aletta2024SoundscapeDescriptorsEighteena, title = {Soundscape Descriptors in Eighteen Languages: {{Translation}} and Validation through Listening Experiments}, shorttitle = {Soundscape Descriptors in Eighteen Languages}, author = {Aletta, Francesco and Mitchell, Andrew and Oberman, Tin and Kang, Jian and Khelil, Sara and Bouzir, Tallal Abdel Karim and Berkouk, Djihed and Xie, Hui and Zhang, Yuan and Zhang, Ruining and Yang, Xinhao and Li, Min and Jambro{\v s}i{\'c}, Kristian and Zaninovi{\'c}, Tamara and {van den Bosch}, Kirsten and L{\"u}hr, Tamara and Orlik, Nicolas and Fitzpatrick, Darragh and Sarampalis, Anastasios and Aumond, Pierre and Lavandier, Catherine and Moshona, Cleopatra Christina and Lepa, Steffen and Fiebig, Andr{\'e} and Papadakis, Nikolaos M. and Stavroulakis, Georgios E. and Sudarsono, Anugrah Sabdono and Sarwono, Sugeng Joko and Puglisi, Giuseppina Emma and Jafari, Farid and Astolfi, Arianna and Shtrepi, Louena and Nagahata, Koji and Jo, Hyun In and Jeon, Jin Yong and Lam, Bhan and Chieng, Julia and Ooi, Kenneth and Hong, Joo Young and Monteiro Antunes, S{\'o}nia and Alves, Sonia and {de Ulhoa Carvalho}, Maria Luiza and Michalski, Ranny Loureiro Xavier Nascimento and Kogan, Pablo and Vida Manzano, Jer{\'o}nimo and Garc{\'i}a Quesada, Rafael and Su{\'a}rez Silva, Enrique and Almagro Pastor, Jos{\'e} Antonio and Nilsson, Mats E. and Axelsson, {\"O}sten and Gan, Woon-Seng and Watcharasupat, Karn N. and Jaratjarungkiat, Sureenate and Ong, Zhen-Ting and D{\"o}kmeci Y{\"o}r{\"u}ko{\u g}lu, Papatya Nur and Er{\c c}akmak Osma, U{\u g}ur Beyza and Nguyen, Thu Lan}, year = {2024}, month = sep, journal = {Applied Acoustics}, volume = {224}, pages = {110109}, issn = {0003-682X}, doi = {10.1016/j.apacoust.2024.110109}, google_scholar_id = {hC7cP41nSMkC}, }
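The “adjusted angles” central to this study generalize the equal-angle assumption of ISO/TS 12913-3, which places the eight attributes at 45° intervals on the circumplex, to language-specific angular locations. A schematic rendering in our own notation (the exact normalization used in the paper may differ):

```latex
% ISO/TS 12913-3 equal-angle projection onto Pleasantness (P) and Eventfulness (E),
% from scores for pleasant (p), annoying (a), calm (ca), chaotic (ch),
% vibrant (v), monotonous (m), eventful (e), and uneventful (u):
P = (p - a) + \cos 45^\circ \left[ (ca - ch) + (v - m) \right]
E = (e - u) + \cos 45^\circ \left[ (ch - ca) + (v - m) \right]

% Generalization with language-specific adjusted angles \theta_i
% for the eight attribute scores \sigma_i:
P = \sum_{i=1}^{8} \sigma_i \cos\theta_i , \qquad
E = \sum_{i=1}^{8} \sigma_i \sin\theta_i
```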
- Conference: Advancing Cross-Cultural Soundscape Research: Updates from the Soundscape Attributes Translation Project (SATP). Francesco Aletta, Tin Oberman, Andrew Mitchell, Jian Kang, and the SATP Consortium. In Proceedings of the 53rd International Congress and Exposition on Noise Control Engineering, Aug 2024
The Soundscape Attributes Translation Project (SATP) addresses linguistic barriers in soundscape research, translating ISO/TS 12913-2:2018 descriptors into various languages for global use. This paper presents recent updates from national working groups, including translations in Turkish, French, Indonesian, Albanian, Chinese, Japanese, Vietnamese, Dutch, and Spanish. A preliminary validation of all the current 18 SATP translations assesses cross-cultural robustness. SATP employs diverse methods for translation, fostering international collaboration. The preliminary validation evaluates the reliability and validity of translated descriptors across languages. Most languages maintain the quasi-circumplex structure of the original soundscape model. English, Arabic, Chinese (Mandarin), Croatian, Dutch, German, Greek, Indonesian, Italian, Spanish, Swedish, and Turkish achieve a “High” confidence level. French, Japanese, Korean, Malay, Portuguese, and Vietnamese show varying confidence levels, emphasizing the need for rigorous validation criteria. SATP advances global soundscape research, with updates from national working groups contributing to cross-cultural relevance. Preliminary validation results affirm the quasi-circumplex structure’s maintenance in most languages, emphasizing the project’s commitment to comprehensive and globally applicable soundscape research instruments.
@inproceedings{Aletta2024AdvancingCrossculturalSoundscape, title = { Advancing Cross-Cultural Soundscape Research: Updates from the Soundscape Attributes Translation Project (SATP) }, author = {Aletta, Francesco and Oberman, Tin and Mitchell, Andrew and Kang, Jian and {the SATP Consortium}}, year = {2024}, month = aug, booktitle = { Proceedings of the 53rd International Congress and Exposition on Noise Control Engineering }, google_scholar_id = {7PzlFSSx8tAC} }
- Journal: Validating Thai Translations of Perceptual Soundscape Attributes: A Non-Procrustean Approach with a Procrustes Projection. Karn N. Watcharasupat, Kenneth Ooi, Bhan Lam, Zhen-Ting Ong, Sureenate Jaratjarungkiat, and Woon-Seng Gan. Applied Acoustics, May 2024
Measurement of a psychological construct across populations without a common linguistic medium often necessitates the development of multiple translations of the psychometric tool across multiple languages, dialects, or other population-specific variations. In this follow-up (Stage 2) study, a listening test using a shared set of 27 stimuli from the Soundscape Attribute Translation Project (SATP) was conducted with Thai-speaking participants using the set of Thai translations of the eight perceptual affective quality (PAQ) descriptors selected in earlier (Stage 1) work through a structured evaluation questionnaire. Principal component analysis was performed on the listening test data to obtain a rank-two reduction of the responses with maximal explained variance. In order to align the principal component space to the two-dimensional circumplex space, this work presents a simple and numerically stable method, based on the orthogonal Procrustes projection, to find the optimal two-dimensional orthogonal transform that aligns the first two principal components with the axes corresponding to Pleasantness and Eventfulness as defined in ISO/TS 12913-3:2019. Analysis of the listening test responses indicated good to excellent interrater reliability, reflecting the general comprehensibility of the translations to laypersons. Subsequent analyses yielded a two-dimensional projection with 94.4% explained variance and near-perfect alignment of the composite Pleasantness and Eventfulness axes. The angular locations of the individual translated PAQs fall within 16° of their theoretically ideal locations and preserve angular ordering, albeit with imperfect equiangularity. Cross-analysis against the results from Stage 1 showed that the structured evaluation may be partially useful in anticipating potential imperfections of the PAQ translations and their angular locations in Stage 2.
@article{Watcharasupat2024ValidatingThaiTranslations, title = { Validating Thai Translations of Perceptual Soundscape Attributes: A Non-Procrustean Approach with a Procrustes Projection }, author = {Watcharasupat, Karn N. and Ooi, Kenneth and Lam, Bhan and Ong, Zhen-Ting and Jaratjarungkiat, Sureenate and Gan, Woon-Seng}, year = {2024}, month = may, journal = {Applied Acoustics}, doi = {10.1016/j.apacoust.2024.109999}, google_scholar_id = {L8Ckcad2t8MC} }
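The alignment step described above, finding the optimal two-dimensional orthogonal transform that maps the first two principal components onto the Pleasantness and Eventfulness axes, is an instance of the orthogonal Procrustes problem, which has a closed-form SVD solution. A minimal sketch with synthetic data (all names and values here are illustrative, not the paper's):

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal R minimizing ||A @ R - B||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Synthetic stand-in: ideal circumplex coordinates B of 27 stimuli, and
# "principal component" scores A that are a rotated, noisy copy of them.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 27)
B = np.column_stack([np.cos(theta), np.sin(theta)])   # (Pleasantness, Eventfulness)
rot = 0.4                                             # unknown rotation to recover
R_true = np.array([[np.cos(rot), -np.sin(rot)],
                   [np.sin(rot),  np.cos(rot)]])
A = B @ R_true.T + 0.01 * rng.normal(size=B.shape)

R = orthogonal_procrustes(A, B)
aligned = A @ R
print(np.allclose(aligned, B, atol=0.1))              # True: axes recovered
```

For real data, `scipy.linalg.orthogonal_procrustes` implements the same closed form.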
- Conference: Quantifying Spatial Audio Quality Impairment. Karn N. Watcharasupat and Alexander Lerch. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr 2024
Spatial audio quality is a highly multifaceted concept, with many interactions between environmental, geometrical, anatomical, psychological, and contextual factors. Methods for characterization or evaluation of the geometrical components of spatial audio quality, however, remain scarce, despite being perhaps the least subjective aspect of spatial audio quality to quantify. By considering interchannel time and level differences relative to a reference signal, it is possible to construct a signal model to isolate some of the spatial distortion. By using a combination of least-square optimization and heuristics, we propose a signal decomposition method to isolate the spatial error, in terms of interchannel gain leakages and changes in relative delays, from a processed signal. This allows the computation of simple energy-ratio metrics, providing objective measures of spatial and non-spatial signal qualities, with minimal assumptions and no dataset dependency. Experiments demonstrate the robustness of the method against common spatial signal degradation introduced by, e.g., audio compression and music source separation.
@inproceedings{Watcharasupat2024QuantifyingSpatialAudio, title = {Quantifying Spatial Audio Quality Impairment}, author = {Watcharasupat, Karn N. and Lerch, Alexander}, year = {2024}, month = apr, booktitle = { Proceedings of the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing }, publisher = {IEEE}, address = {Seoul, Korea, Republic of}, pages = {746--750}, doi = {10.1109/ICASSP48485.2024.10447947}, isbn = {9798350344851}, google_scholar_id = {_kc_bZDykSQC} }
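The interchannel time and level differences the method above builds on can be illustrated with a toy estimator: peak-picking the cross-correlation for the relative delay, then a least-squares fit for the relative gain. This is a sketch of the underlying quantities only (the paper combines least-square optimization with heuristics over a full multichannel signal model, which this single channel pair does not reproduce):

```python
import numpy as np

def estimate_delay_and_gain(ref, proc):
    """Estimate the integer-sample delay and scalar gain of `proc` relative to
    `ref`: delay from the cross-correlation peak, gain from a least-squares fit."""
    corr = np.correlate(proc, ref, mode="full")
    delay = int(np.argmax(np.abs(corr))) - (len(ref) - 1)
    shifted = np.roll(proc, -delay)               # undo the estimated delay
    gain = float(shifted @ ref / (ref @ ref))     # least-squares scalar gain
    return delay, gain

rng = np.random.default_rng(0)
x = rng.normal(size=4096)       # reference channel, stand-in signal
y = 0.5 * np.roll(x, 3)         # processed channel: delayed, attenuated copy
print(estimate_delay_and_gain(x, y))   # delay 3 samples, gain 0.5
```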
- Journal: Lion City Soundscapes: Modified Partitioning around Medoids for a Perceptually Diverse Dataset of Singaporean Soundscapes. Kenneth Ooi, Jessie Goh, Hao-Weng Lin, Zhen-Ting Ong, Trevor Wong, Karn N. Watcharasupat, Bhan Lam, and Woon-Seng Gan. JASA Express Letters, Apr 2024
This study presents a dataset of audio-visual soundscape recordings at 62 different locations in Singapore, initially made as full-length recordings over spans of 9–38 min. For consistency and to reduce listener fatigue in future subjective studies, one-minute excerpts were cropped from the full-length recordings. An automated method, using pre-trained models for Pleasantness and Eventfulness (according to ISO 12913) in a modified partitioning around medoids algorithm, was employed to generate the set of excerpts by balancing the need to encompass the perceptual space with uniformity in distribution. A validation study on the method confirmed its adherence to the intended design.
@article{Ooi2024LionCitySoundscapes, title = {Lion City Soundscapes: Modified Partitioning around Medoids for a Perceptually Diverse Dataset of Singaporean Soundscapes}, shorttitle = {Lion City Soundscapes}, author = {Ooi, Kenneth and Goh, Jessie and Lin, Hao-Weng and Ong, Zhen-Ting and Wong, Trevor and Watcharasupat, Karn N. and Lam, Bhan and Gan, Woon-Seng}, year = {2024}, month = apr, journal = {JASA Express Letters}, volume = {4}, number = {4}, pages = {047402}, doi = {10.1121/10.0025830}, issn = {2691-1191}, url = {https://researchdata.ntu.edu.sg/dataset.xhtml?persistentId=doi:10.21979/N9/AVHSBX}, google_scholar_id = {ZeXyd9-uunAC} }
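The excerpt selection above builds on partitioning around medoids (PAM), which the authors modify to balance perceptual coverage with distributional uniformity. For reference, a plain, unmodified k-medoids pass over synthetic Pleasantness-Eventfulness predictions (illustrative data only; the published modification is not reproduced here):

```python
import numpy as np

def partition_around_medoids(X, k, n_iter=100, seed=0):
    """Plain PAM (k-medoids) on points X, returning sorted medoid indices.
    A simplified stand-in for the paper's modified algorithm."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)         # assign to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # swap in the member minimizing total within-cluster distance
            new[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break                                          # converged
        medoids = new
    return np.sort(medoids)

# Illustrative: predicted (Pleasantness, Eventfulness) for 62 recordings
rng = np.random.default_rng(1)
pe = rng.uniform(-1, 1, size=(62, 2))
chosen = partition_around_medoids(pe, k=8)
print(chosen)  # indices of 8 representative recordings
```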
- Journal: ARAUS: A Large-Scale Dataset and Baseline Models of Affective Responses to Augmented Urban Soundscapes. Kenneth Ooi, Zhen-Ting Ong, Karn N. Watcharasupat, Bhan Lam, Joo Young Hong, and Woon-Seng Gan. IEEE Transactions on Affective Computing, Feb 2024
Choosing optimal maskers for existing soundscapes to effect a desired perceptual change via soundscape augmentation is non-trivial due to extensive varieties of maskers and a dearth of benchmark datasets with which to compare and develop soundscape augmentation models. To address this problem, we make publicly available the ARAUS (Affective Responses to Augmented Urban Soundscapes) dataset, which comprises a five-fold cross-validation set and independent test set totaling 25,440 unique subjective perceptual responses to augmented soundscapes presented as audio-visual stimuli. Each augmented soundscape is made by digitally adding “maskers” (bird, water, wind, traffic, construction, or silence) to urban soundscape recordings at fixed soundscape-to-masker ratios. Responses were then collected by asking participants to rate how pleasant, annoying, eventful, uneventful, vibrant, monotonous, chaotic, calm, and appropriate each augmented soundscape was, in accordance with ISO/TS 12913-2:2018. Participants also provided relevant demographic information and completed standard psychological questionnaires. We perform exploratory and statistical analysis of the responses obtained to verify internal consistency and agreement with known results in the literature. Finally, we demonstrate the benchmarking capability of the dataset by training and comparing four baseline models for urban soundscape pleasantness: a low-parameter regression model, a high-parameter convolutional neural network, and two attention-based networks in the literature.
@article{Ooi2024ARAUSLargeScaleDataset, title = { ARAUS: A Large-Scale Dataset and Baseline Models of Affective Responses to Augmented Urban Soundscapes }, author = {Ooi, Kenneth and Ong, Zhen-Ting and Watcharasupat, Karn N. and Lam, Bhan and Hong, Joo Young and Gan, Woon-Seng}, year = {2024}, month = feb, journal = {IEEE Transactions on Affective Computing}, volume = {15}, number = {1}, pages = {105--120}, doi = {10.1109/TAFFC.2023.3247914}, issn = {1949-3045}, google_scholar_id = {kNdYIx-mwKoC} }
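Mixing at a fixed soundscape-to-masker ratio (SMR), as described above, amounts to scaling each masker so that the energy ratio between the soundscape recording and the masker hits a target value in dB. A generic sketch (the signal stand-ins and function name are ours; the ARAUS calibration pipeline itself works with calibrated playback levels):

```python
import numpy as np

def mix_at_smr(soundscape, masker, smr_db):
    """Scale `masker` so the soundscape-to-masker energy ratio equals
    `smr_db` (in dB), then add it to the soundscape."""
    p_s = np.mean(soundscape ** 2)
    p_m = np.mean(masker ** 2)
    # desired masker power is p_s / 10^(smr_db / 10)
    gain = np.sqrt(p_s / (p_m * 10 ** (smr_db / 10)))
    return soundscape + gain * masker

rng = np.random.default_rng(0)
s = rng.normal(size=48000)         # 1 s urban soundscape at 48 kHz, stand-in
m = 0.1 * rng.normal(size=48000)   # bird masker, stand-in
out = mix_at_smr(s, m, smr_db=6.0)
```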
- Journal: Automating Urban Soundscape Enhancements with AI: In-Situ Assessment of Quality and Restorativeness in Traffic-Exposed Residential Areas. Bhan Lam, Zhen-Ting Ong, Kenneth Ooi, Wen-Hui Ong, Trevor Wong, Karn N. Watcharasupat, Vanessa Boey, Irene Lee, Joo Young Hong, Jian Kang, Kar Fye Alvin Lee, Georgios Christopoulos, and Woon-Seng Gan. Building and Environment, Feb 2024
Formalized in ISO 12913, the “soundscape” approach is a paradigmatic shift towards perception-based urban sound management, aiming to alleviate the substantial socioeconomic costs of noise pollution to advance the United Nations Sustainable Development Goals. Focusing on traffic-exposed outdoor residential sites, we implemented an automatic masker selection system (AMSS) utilizing natural sounds to mask (or augment) traffic soundscapes. We employed a pre-trained AI model to automatically select the optimal masker and adjust its playback level, adapting to changes over time in the ambient environment to maximize “Pleasantness”, a perceptual dimension of soundscape quality in ISO 12913. Our validation study involving N = 68 residents revealed a significant 14.6% enhancement in “Pleasantness” after intervention, correlating with increased restorativeness and positive affect. Perceptual enhancements at the traffic-exposed site matched those at a quieter control site with a 6 dB(A) lower LA,eq and road traffic noise dominance, affirming the efficacy of AMSS as a soundscape intervention, while streamlining the labour-intensive assessment of “Pleasantness” with probabilistic AI prediction.
@article{Lam2024AutomatingUrbanSoundscape, title = {Automating urban soundscape enhancements with AI: In-situ assessment of quality and restorativeness in traffic-exposed residential areas}, journal = {Building and Environment}, volume = {266}, pages = {112106}, year = {2024}, issn = {0360-1323}, doi = {10.1016/j.buildenv.2024.112106}, url = {https://www.sciencedirect.com/science/article/pii/S036013232400948X}, author = {Lam, Bhan and Ong, Zhen-Ting and Ooi, Kenneth and Ong, Wen-Hui and Wong, Trevor and Watcharasupat, Karn N. and Boey, Vanessa and Lee, Irene and Hong, Joo Young and Kang, Jian and Lee, Kar Fye Alvin and Christopoulos, Georgios and Gan, Woon-Seng}, keywords = {Urban soundscape, Natural sounds, Auditory masking, Probabilistic approach, Soundscape augmentation, Artificial intelligence}, google_scholar_id = {hFOr9nPyWt4C} }
2023
- Journal: A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation. Karn N. Watcharasupat, Chih-Wei Wu, Yiwei Ding, Iroro Orife, Aaron J. Hipple, Phillip A. Williams, Scott Kramer, Alexander Lerch, and William Wolcott. IEEE Open Journal of Signal Processing, Dec 2023
Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN to any complete or overcomplete partition of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions, which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
@article{Watcharasupat2023GeneralizedBandsplitNeural, title = {A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation}, author = {Watcharasupat, Karn N. and Wu, Chih-Wei and Ding, Yiwei and Orife, Iroro and Hipple, Aaron J. and Williams, Phillip A. and Kramer, Scott and Lerch, Alexander and Wolcott, William}, year = {2023}, month = dec, journal = {IEEE Open Journal of Signal Processing}, volume = {5}, pages = {73--81}, doi = {10.1109/OJSP.2023.3339428}, issn = {2644-1322}, google_scholar_id = {Wp0gIr-vW9MC} }
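The loss function mentioned in the abstract combines the form of the signal-to-noise ratio with the sparsity-promoting 1-norm. A hedged sketch of that idea, with the squared errors of a conventional SNR replaced by 1-norms (our simplification; see the paper for the exact formulation):

```python
import numpy as np

def l1_snr_loss(target, estimate, eps=1e-8):
    """Negative SNR-style objective with the squared error replaced by the
    1-norm, whose sparsity-promoting property the abstract refers to.
    A sketch of the idea only, not the paper's exact loss."""
    num = np.abs(target).sum() + eps
    den = np.abs(target - estimate).sum() + eps
    return -10.0 * np.log10(num / den)   # lower when the estimate is closer

t = np.sin(np.linspace(0, 10, 1000))     # stand-in target stem
noisy = t + 0.1 * np.random.default_rng(0).normal(size=t.shape)
print(l1_snr_loss(t, noisy) < l1_snr_loss(t, 2 * noisy))  # True: closer is cheaper
```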
- Journal: Crossing the Linguistic Causeway: Ethnonational Differences on Soundscape Attributes in Bahasa Melayu. Bhan Lam, Julia Chieng, Kenneth Ooi, Zhen Ting Ong, Karn N. Watcharasupat, Joo Young Hong, and Woon Seng Gan. Applied Acoustics, Nov 2023
Despite being neighbouring countries that share the language of Bahasa Melayu (ISO 639-3: zsm), differences in cultural and language education policy between Singapore and Malaysia led to differences in the translation of the “annoying” perceived affective quality (PAQ) attribute from English (ISO 639-3: eng) into zsm. This study expands upon the translation of the PAQ attributes from eng into zsm in Stage 1 of the Soundscape Attributes Translation Project (SATP) initiative, and presents the findings of Stage 2 listening tests that investigated ethnonational differences in the translated zsm PAQ attributes and explored their circumplexity. A cross-cultural listening test was conducted with 100 zsm speakers from Malaysia and Singapore using the common SATP protocol. The analysis revealed that Malaysian participants from non-native ethnicities showed PAQ perceptions more similar to those of Singaporean participants than to those of native ethnic Malays in Malaysia. Differences between the Singaporean and Malaysian groups were primarily observed in stimuli related to water features, reflecting cultural and geographical variations. Beyond variations in the perception of water source-dominant stimuli, the remaining disparities between the ethnonational groups were mainly attributable to differences in scores on particular attributes. The findings also suggest that the adoption of region-specific translations of the “annoying” attribute in Singapore and Malaysia adequately addressed its translational differences, since significant differences were observed in one or fewer stimuli across ethnonational groups.
The circumplexity analysis indicated that the quasi-circumplex model fit the data better than the equal-angle quasi-circumplex model assumed in ISO/TS 12913-3, although deviations were observed, possibly due to respondents’ unfamiliarity with the United Kingdom-centric context of the stimulus dataset. Furthermore, the alignment between the Stage 2 listening tests and the quantitative evaluation of attributes in Stage 1 revealed biases in one dimension of the circumplex across ethnonational groups. This study provides insights into the perception of PAQ attributes in cross-cultural and cross-national contexts, facilitating the culturally appropriate adoption of translated PAQ attributes in soundscape evaluation.
@article{Lam2023CrossingLinguisticCauseway, title = { Crossing the Linguistic Causeway: Ethnonational Differences on Soundscape Attributes in Bahasa Melayu }, author = {Lam, Bhan and Chieng, Julia and Ooi, Kenneth and Ong, Zhen Ting and Watcharasupat, Karn N. and Hong, Joo Young and Gan, Woon Seng}, year = {2023}, month = nov, journal = {Applied Acoustics}, publisher = {Elsevier Ltd}, volume = {214}, doi = {10.1016/j.apacoust.2023.109675}, issn = {1872910X}, google_scholar_id = {4TOpqqG69KYC} }
- Conference: Preliminary Results of the Soundscape Attributes Translation Project (SATP): Lessons Learned and Next Steps. Francesco Aletta, Tin Oberman, Andrew Mitchell, Jian Kang, and the SATP Consortium. In Proceedings of the 10th Convention of the European Acoustics Association, Forum Acusticum 2023, Sep 2023
The ISO/TS 12913-2:2018 document for soundscape data collection provides a questionnaire instrument for researchers and practitioners to use worldwide, but its applicability has been questioned, since it is only available in English. To address the lack of research on translations of the soundscape descriptors proposed in Method A of the ISO technical specifications (i.e., vibrant, pleasant, calm, uneventful, monotonous, annoying, chaotic, eventful), an international collaboration, the Soundscape Attributes Translation Project (SATP), was initiated to translate the descriptors into several languages, using different methodological approaches, with the goal of validating the translations using standardized listening experiments. This paper presents the current state of advancement of the project, reporting on preliminary results from selected national working groups within the SATP network, as well as discussing the proposed analysis framework to validate the translations.
@inproceedings{Aletta2023PreliminaryResultsSoundscape, title = { Preliminary Results of the Soundscape Attributes Translation Project (SATP): Lessons Learned and next Steps }, shorttitle = { Preliminary Results of the {{Soundscape Attributes Translation Project}} ({{SATP}}) }, author = {Aletta, Francesco and Oberman, Tin and Mitchell, Andrew and Kang, Jian and {the SATP Consortium}}, year = {2023}, month = sep, booktitle = { Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023 }, publisher = {European Acoustics Association}, address = {Turin, Italy}, pages = {701--705}, doi = {10.61782/fa.2023.0095}, isbn = {978-88-88942-67-4}, bibtex_show = true, dimensions = true, google_scholar_id = {dhFuZR0502QC} }
- Conference: Preliminary Investigation of the Short-Term In Situ Performance of an Automatic Masker Selection System. Bhan Lam, Kenneth Ooi, Zhen-Ting Ong, Trevor Wong, Woon-Seng Gan, and Karn Watcharasupat. In Proceedings of the 52nd International Congress and Exposition on Noise Control Engineering, Aug 2023
@inproceedings{Lam2023PreliminaryInvestigationShortterm, title = { Preliminary Investigation of the Short-Term in Situ Performance of an Automatic Masker Selection System }, author = {Lam, Bhan and Ooi, Kenneth and Ong, Zhen-Ting and Wong, Trevor and Gan, Woon-Seng and Watcharasupat, Karn}, year = {2023}, month = aug, booktitle = { Proceedings of the 52nd International Congress and Exposition on Noise Control Engineering }, doi = {10.3397/in_2023_0805}, google_scholar_id = {4DMP91E08xMC} }
- Conference: Effect of Masker Selection Schemes on the Perceived Affective Quality of Soundscapes: A Pilot Study. Zhen-Ting Ong, Kenneth Ooi, Trevor Wong, Bhan Lam, Woon-Seng Gan, and Karn N. Watcharasupat. In Proceedings of the 52nd International Congress and Exposition on Noise Control Engineering, Aug 2023
@inproceedings{Ong2023EffectMaskerSelection, title = { Effect of Masker Selection Schemes on the Perceived Affective Quality of Soundscapes: A Pilot Study }, author = {Ong, Zhen-Ting and Ooi, Kenneth and Wong, Trevor and Lam, Bhan and Gan, Woon-Seng and Watcharasupat, Karn N.}, year = {2023}, month = aug, booktitle = { Proceedings of the 52nd International Congress and Exposition on Noise Control Engineering }, doi = {10.3397/in_2023_0791}, google_scholar_id = {mVmsd5A6BfQC} }
- Conference: ARAUSv2: An Expanded Dataset and Multimodal Models of Affective Responses to Augmented Urban Soundscapes. Kenneth Ooi, Zhen-Ting Ong, Bhan Lam, Trevor Wong, Woon-Seng Gan, and Karn Watcharasupat. In Proceedings of the 52nd International Congress and Exposition on Noise Control Engineering, Aug 2023
@inproceedings{Ooi2023ARAUSv2ExpandedDataset, title = { ARAUSv2: An Expanded Dataset and Multimodal Models of Affective Responses to Augmented Urban Soundscapes }, author = {Ooi, Kenneth and Ong, Zhen-Ting and Lam, Bhan and Wong, Trevor and Gan, Woon-Seng and Watcharasupat, Karn}, year = {2023}, month = aug, booktitle = { Proceedings of the 52nd International Congress and Exposition on Noise Control Engineering }, doi = {10.3397/in_2023_0459}, google_scholar_id = {9ZlFYXVOiuMC} }
- Patent: Method and audio processing system for blind source separation without sampling rate mismatch estimation (不須計算取樣頻率誤差的盲源分離方法以及音訊處理系統). Hai Trieu Anh Nguyen (阮海潮英), Wai Hoong Khong (鄺偉雄), Karn Watcharasupat (瓦特察拉蘇帕特 甘), and Qing Liu (劉晴). Taiwan Patent TWI809390B, granted Jul 2023
A method for blind source separation for an audio processing system including multiple devices is provided. Each of the devices includes multiple microphones. A measure of dissimilarity between the signal vector sensed by each device and a column of a mixing matrix is computed. The measure of dissimilarity is used to establish an objective function and an optimization algorithm is performed to compute the mixing matrix. Estimates of the original signals are computed according to the mixing matrix and the signal vector without estimating a sampling rate mismatch between the devices. Therefore, compensation of sampling rate mismatch is not required.
@patent{Nguyen2021MethodAudioProcessingTW, author = {Nguyen (阮海潮英), Hai Trieu Anh and Khong (鄺偉雄), Wai Hoong and Watcharasupat (瓦特察拉蘇帕特 甘), Karn and Liu (劉晴), Qing}, title = {Method and audio processing system for blind source separation without sampling rate mismatch estimation (不須計算取樣頻率誤差的盲源分離方法以及音訊處理系統)}, nationality = {Taiwan}, number = {TWI809390B}, dayfiled = {13}, monthfiled = may, yearfiled = {2021}, day = {21}, month = jul, year = {2023}, granted = true, google_scholar_id = {IWHjjKOFINEC} }
- PatentMethod and audio processing system for blind source separation without sampling rate mismatch estimation and compensationHai Trieu Anh Nguyen , Andy W. H. Khong, Karn Watcharasupat, and Qing LiuSingapore Patent SG10202102050T, Granted Jul 2023
A method for blind source separation for an audio processing system including multiple devices is provided. Each of the devices includes multiple microphones. A measure of dissimilarity between the signal vector sensed by each device and a column of a mixing matrix is computed. The measure of dissimilarity is used to establish an objective function and an optimization algorithm is performed to compute the mixing matrix. Estimates of the original signals are computed according to the mixing matrix and the signal vector without estimating a sampling rate mismatch between the devices. Therefore, compensation of sampling rate mismatch is not required.
@patent{Nguyen2021MethodAudioProcessingSG, author = {Nguyen, Hai Trieu Anh and Khong, Andy W. H. and Watcharasupat, Karn and Liu, Qing}, title = {Method and audio processing system for blind source separation without sampling rate mismatch estimation and compensation}, nationality = {Singapore}, number = {SG10202102050T}, dayfiled = {1}, monthfiled = mar, yearfiled = {2021}, day = {24}, month = jul, year = {2023}, granted = true, google_scholar_id = {qxL8FJ1GzNcC} }
- ConferenceAutonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked InputsKenneth Ooi, Karn N. Watcharasupat, Bhan Lam, Zhen-Ting Ong, and Woon-Seng GanIn Proceedings of the 2023 International Conference on Acoustics, Speech, and Signal Processing , Jun 2023
Autonomous soundscape augmentation systems typically use trained models to pick optimal maskers to effect a desired perceptual change. While acoustic information is paramount to such systems, contextual information, including participant demographics and the visual environment, also influences acoustic perception. Hence, we propose modular modifications to an existing attention-based deep neural network, to allow early, mid-level, and late feature fusion of participant-linked, visual, and acoustic features. Ablation studies on module configurations and corresponding fusion methods using the ARAUS dataset show that contextual features improve the model performance in a statistically significant manner on the normalized ISO Pleasantness, to a mean squared error of 0.1194\textpm0.0012 for the best-performing all-modality model, against 0.1217 \textpm 0.0009 for the audio-only model. Soundscape augmentation systems can thereby leverage multimodal inputs for improved performance. We also investigate the impact of individual participant-linked factors using trained models to illustrate improvements in model explainability.
@inproceedings{Ooi2023AutonomousSoundscapeAugmentation, title = { Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked Inputs }, author = {Ooi, Kenneth and Watcharasupat, Karn N. and Lam, Bhan and Ong, Zhen-Ting and Gan, Woon-Seng}, year = {2023}, month = jun, booktitle = { Proceedings of the 2023 International Conference on Acoustics, Speech, and Signal Processing }, doi = {10.1109/ICASSP49357.2023.10094866}, google_scholar_id = {ULOm3_A8WrAC} }
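The fusion idea in the abstract above — combining acoustic, visual, and participant-linked features before a prediction head — can be sketched in NumPy. This is an illustrative late-fusion example only; the feature names, dimensions, and the linear head are assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample embeddings for each modality (dimensions are illustrative).
audio_feat = rng.normal(size=(8, 128))        # acoustic embedding
visual_feat = rng.normal(size=(8, 64))        # visual-scene embedding
participant_feat = rng.normal(size=(8, 16))   # participant-linked features (e.g. demographics)

def late_fusion(*features):
    """Concatenate modality embeddings just before the prediction head."""
    return np.concatenate(features, axis=-1)

fused = late_fusion(audio_feat, visual_feat, participant_feat)

# A linear head mapping the fused vector to one perceptual score
# (e.g. normalized ISO Pleasantness); random weights for illustration.
w = rng.normal(size=(fused.shape[-1], 1))
pred = fused @ w

print(fused.shape, pred.shape)  # (8, 208) (8, 1)
```

Early or mid-level fusion would instead concatenate (or attend over) features closer to the raw inputs; the modular choice of where to fuse is exactly what the ablation studies above compare.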
- PatentSoundscape augmentation system and method of forming the sameWen Rui Kenneth Ooi, Karn Watcharasupat, Bhan Lam, Zhen Ting Ong, Trevor Martens Zhi Ming Wong, and Woon Seng GanWO Patent App. PCT/SG2023/050289, Filed Apr 2023
Various embodiments may provide a soundscape augmentation system. The soundscape augmentation system may include a data acquisition system configured to provide ambient soundscape data. The soundscape augmentation system may also include a database including a plurality of masker configurations. The soundscape augmentation system may further include a perceptual attribute predictor coupled to the data acquisition system and the database, the perceptual attribute predictor configured to generate predictions representing perception on one or more pre-defined perceptual attribute scales for each masker configuration of the plurality of masker configurations based on the ambient soundscape data. The soundscape augmentation system may additionally include a masker configuration ranking system configured to determine one or more optimal masker configurations based on the predictions generated by the perceptual attribute predictor. The soundscape augmentation system may also include a playback system configured to play back or reproduce the one or more optimal masker configurations.
@patent{Ooi2022SoundscapeAugmentationMethodPCT, author = {Ooi, Wen Rui Kenneth and Watcharasupat, Karn and Lam, Bhan and Ong, Zhen Ting and Wong, Trevor Martens Zhi Ming and Gan, Woon Seng}, title = {Soundscape augmentation system and method of forming the same}, nationality = {WO}, number = {PCT/SG2023/050289}, dayfiled = {26}, monthfiled = apr, yearfiled = {2023}, day = {26}, month = apr, year = {2023}, google_scholar_id = {QIV2ME_5wuYC} }
2022
- JournalQuantitative Evaluation Approach for Translation of Perceptual Soundscape Attributes: Initial Application to the Thai LanguageKarn N. Watcharasupat, Sureenate Jaratjarungkiat, Bhan Lam, Sujinat Jitwiriyanont, Kenneth Ooi, Zhen Ting Ong, Nitipong Pichetpan, Kanyanut Akaratham, Titima Suthiwan, Monthita Rojtinnakorn, and Woon Seng GanApplied Acoustics, Nov 2022
Translation of perceptual soundscape attributes from one language to another remains a challenging task that requires a high degree of fidelity in both psychoacoustic and psycholinguistic senses across the target population. Due to the inherently subjective nature of human perception, translating soundscape attributes using only small focus group discussions or expert panels could lead to translations with psycholinguistic meanings that, in a non-expert setting, deviate from or distort those of the source language. In this work, we present a quantitative evaluation method based on the circumplex model of soundscape perception to assess the overall translation quality. By establishing a set of criteria for evaluating the linguistic and psychometric properties of the translation candidates, statistical analyses can be performed to objectively assess specific strengths and weaknesses of the translation candidates before committing to listening tests or more involved validation experiments. As an initial application domain, we demonstrated the use of the quantitative evaluation framework in the context of an English-to-Thai translation of soundscape attributes. A total of 31 participants who are bilingual in English and Thai were recruited to assess the translation candidates. Subsequent statistical analysis of the evaluation scores revealed acoustico-psycholinguistic properties of the translation candidates which were not previously identified by the expert panel and facilitated a more objective selection of the final translations for subsequent usage. Additionally, with specific biases of the final translations determined numerically, mathematical and statistical techniques for corrections of the survey data may be employed in the future to improve cross-lingual compatibility in soundscape evaluation.
@article{Watcharasupat2022QuantitativeEvaluationApproach, title = { Quantitative Evaluation Approach for Translation of Perceptual Soundscape Attributes: Initial Application to the Thai Language }, author = {Watcharasupat, Karn N. and Jaratjarungkiat, Sureenate and Lam, Bhan and Jitwiriyanont, Sujinat and Ooi, Kenneth and Ong, Zhen Ting and Pichetpan, Nitipong and Akaratham, Kanyanut and Suthiwan, Titima and Rojtinnakorn, Monthita and Gan, Woon Seng}, year = {2022}, month = nov, journal = {Applied Acoustics}, publisher = {Elsevier Ltd}, volume = {200}, doi = {10.1016/j.apacoust.2022.108962}, issn = {1872910X}, google_scholar_id = {Se3iqnhoufwC} }
- JournalCrossing the Linguistic Causeway: A Binational Approach for Translating Soundscape Attributes to Bahasa MelayuBhan Lam, Julia Chieng, Karn N. Watcharasupat, Kenneth Ooi, Zhen-Ting Ong, Joo Young Hong, and Woon-Seng GanApplied Acoustics, Oct 2022
Translation of perceptual descriptors such as the perceived affective quality attributes in the soundscape standard (ISO/TS 12913–2:2018) is an inherently intricate task, especially if the target language is used in multiple countries. Despite geographical proximity and a shared language of Bahasa Melayu (Standard Malay), differences in culture and language education policies between Singapore and Malaysia could invoke peculiarities in the affective appraisal of sounds. To generate provisional translations of the eight perceived affective attributes — eventful, vibrant, pleasant, calm, uneventful, monotonous, annoying, and chaotic — into Bahasa Melayu that is applicable in both Singapore and Malaysia, a binational expert-led approach supplemented by a quantitative evaluation framework was adopted. A set of preliminary translation candidates were developed via a four-stage process, firstly by a qualified translator, which was then vetted by linguistics experts, followed by examination via an experiential evaluation, and finally reviewed by the core research team. A total of 66 participants were then recruited cross-nationally to quantitatively evaluate the preliminary translation candidates. Of the eight attributes, cross-national differences were observed only in the translation of annoying. For instance, menjengkelkan was found to be significantly less understood in Singapore than in Malaysia, as well as less understandable than membingitkan within Singapore. Results of the quantitative evaluation also revealed the imperfect nature of foreign language translations for perceptual descriptors, which suggests a possibility for exploring corrective measures.
@article{Lam2022CrossingLinguisticCauseway, title = { Crossing the Linguistic Causeway: A Binational Approach for Translating Soundscape Attributes to Bahasa Melayu }, author = {Lam, Bhan and Chieng, Julia and Watcharasupat, Karn N. and Ooi, Kenneth and Ong, Zhen-Ting and Hong, Joo Young and Gan, Woon-Seng}, year = {2022}, month = oct, journal = {Applied Acoustics}, volume = {199}, pages = {108976}, doi = {10.1016/j.apacoust.2022.108976}, issn = {0003-682X}, google_scholar_id = {MXK_kJrjxJIC} }
- ConferenceDo uHear? Validation of uHear App for Preliminary Screening of Hearing Ability in Soundscape StudiesZhen-Ting Ong, Bhan Lam, Kenneth Ooi, Karn N. Watcharasupat, Trevor Wong, and Woon-Seng GanIn Proceedings of the 24th International Congress on Acoustics , Oct 2022
Studies involving soundscape perception often exclude participants with hearing loss to prevent impaired perception from affecting experimental results. Participants are typically screened with pure tone audiometry, the "gold standard" for identifying and quantifying hearing loss at specific frequencies, and excluded if a study-dependent threshold is not met. However, procuring professional audiometric equipment for soundscape studies may be cost-ineffective, and manually performing audiometric tests is labour-intensive. Moreover, testing requirements for soundscape studies may not require sensitivities and specificities as high as that in a medical diagnosis setting. Hence, in this study, we investigate the effectiveness of the uHear app, an iOS application, as an affordable and automatic alternative to a conventional audiometer in screening participants for hearing loss for the purpose of soundscape studies or listening tests in general. Based on audiometric comparisons with the audiometer of 163 participants, the uHear app was found to have high precision (98.04 %) when using the World Health Organization (WHO) grading scheme for assessing normal hearing. Precision is further improved (98.69 %) when all frequencies assessed with the uHear app are considered in the grading, which lends further support to this cost-effective, automated alternative for screening for normal hearing.
@inproceedings{Ong2022UHearValidationUHear, title = { Do uHear? Validation of uHear App for Preliminary Screening of Hearing Ability in Soundscape Studies }, author = {Ong, Zhen-Ting and Lam, Bhan and Ooi, Kenneth and Watcharasupat, Karn N. and Wong, Trevor and Gan, Woon-Seng}, year = {2022}, month = oct, booktitle = {Proceedings of the 24th International Congress on Acoustics}, doi = {10.21979/n9/jqdi6f}, google_scholar_id = {KlAtU1dfN6UC} }
- ConferenceA Benchmark Comparison of Perceptual Models for Soundscapes on a Large-Scale Augmented Soundscape DatasetKenneth Ooi, Karn N. Watcharasupat, Bhan Lam, Zhen-Ting Ong, and Woon-Seng GanIn Proceedings of the 24th International Congress on Acoustics , Oct 2022
@inproceedings{Ooi2022BenchmarkComparisonPerceptual, title = { A Benchmark Comparison of Perceptual Models for Soundscapes on a Large-Scale Augmented Soundscape Dataset }, author = {Ooi, Kenneth and Watcharasupat, Karn N. and Lam, Bhan and Ong, Zhen-Ting and Gan, Woon-Seng}, year = {2022}, month = oct, booktitle = {Proceedings of the 24th International Congress on Acoustics}, doi = {10.21979/n9/9otevx}, google_scholar_id = {YOwf2qJgpHMC} }
- ConferenceDeployment of an IoT System for Adaptive In-Situ Soundscape AugmentationTrevor Wong*, Karn N. Watcharasupat*, Bhan Lam, Kenneth Ooi, Zhen-Ting Ong, Furi Andi Karnapi, and Woon-Seng GanIn Proceedings of the 51st International Congress and Expo on Noise Control Engineering , Aug 2022
Soundscape augmentation is an emerging approach for noise mitigation by introducing additional sounds known as "maskers" to increase acoustic comfort. Traditionally, the choice of maskers is often predicated on expert guidance or post-hoc analysis which can be time-consuming and sometimes arbitrary. Moreover, this often results in a static set of maskers that are inflexible to the dynamic nature of real-world acoustic environments. Overcoming the inflexibility of traditional soundscape augmentation is twofold. First, given a snapshot of a soundscape, the system must be able to select an optimal masker without human supervision. Second, the system must also be able to react to changes in the acoustic environment with near real-time latency. In this work, we harness the combined prowess of cloud computing and the Internet of Things (IoT) to allow in-situ listening and playback using microcontrollers while delegating computationally expensive inference tasks to the cloud. In particular, a serverless cloud architecture was used for inference, ensuring near real-time latency and scalability without the need to provision computing resources. A working prototype of the system is currently being deployed in a public area experiencing high traffic noise, as well as undergoing public evaluation for future improvements.
@inproceedings{Wong2022DeploymentIoTSystem, title = {Deployment of an IoT System for Adaptive In-Situ Soundscape Augmentation}, author = {Wong*, Trevor and Watcharasupat*, Karn N. and Lam, Bhan and Ooi, Kenneth and Ong, Zhen-Ting and Karnapi, Furi Andi and Gan, Woon-Seng}, year = {2022}, month = aug, booktitle = { Proceedings of the 51st International Congress and Expo on Noise Control Engineering }, doi = {10.3397/IN_2022_0290}, google_scholar_id = {0EnyYjriUFMC} }
- JournalAutonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and GainKarn N. Watcharasupat, Kenneth Ooi, Bhan Lam, Trevor Wong, Zhen Ting Ong, and Woon Seng GanIEEE Signal Processing Letters, Jul 2022
The selection of maskers and playback gain levels in an in-situ soundscape augmentation system is crucial to its effectiveness in improving the overall acoustic comfort of a given environment. Traditionally, the selection of appropriate maskers and gain levels has been informed by expert opinion, which may not be representative of the target population, or by listening tests, which can be time- and labor-intensive. Furthermore, the resulting static choices of masker and gain are often inflexible to dynamic real-world soundscapes. In this work, we utilized a deep learning model to perform joint selection of the optimal masker and its gain level for a given soundscape. The proposed model was designed with highly modular building blocks, allowing for an optimized inference process that can quickly search through a large number of masker-gain combinations. In addition, we introduced the use of feature-domain soundscape augmentation conditioned on the digital gain level, eliminating the computationally expensive waveform-domain mixing process during inference, as well as the tedious gain adjustment process required for new maskers. The proposed system was evaluated on a large-scale dataset of subjective responses to augmented soundscapes with 442 participants, with the best model achieving a mean squared error of 0.122 ± 0.005 on the pleasantness score, validating the ability of the model to predict the combined effect of the masker and its gain level on the perceptual pleasantness level. The proposed system thus allows in-situ or mixed-reality soundscape augmentation to be performed autonomously with near real-time latency while continuously accounting for changes in acoustic environments.
@article{Watcharasupat2022AutonomousInSituSoundscape, title = { Autonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and Gain }, author = {Watcharasupat, Karn N. and Ooi, Kenneth and Lam, Bhan and Wong, Trevor and Ong, Zhen Ting and Gan, Woon Seng}, year = {2022}, month = jul, journal = {IEEE Signal Processing Letters}, volume = {29}, pages = {1749--1753}, doi = {10.1109/lsp.2022.3194419}, issn = {15582361}, google_scholar_id = {5nxA0vEk-isC} }
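The normalized pleasantness score this model regresses is, to our understanding, the ISO Pleasantness projection of the eight circumplex attributes described in ISO/TS 12913-3. A minimal sketch of that computation, assuming attributes are rated on a 1–5 scale (the scale and the normalization constant are our assumptions, not details taken from the paper):

```python
import math

COS45 = math.cos(math.pi / 4)  # projection weight for the diagonal attributes

def iso_pleasantness(pleasant, annoying, calm, chaotic, vibrant, monotonous,
                     scale_range=4.0):
    """Normalized ISO Pleasantness from the eight-attribute circumplex.
    `scale_range` is the span of the rating scale (1-5 Likert -> 4)."""
    raw = (pleasant - annoying) + COS45 * ((calm - chaotic) + (vibrant - monotonous))
    # |raw| is at most scale_range * (1 + sqrt(2)), so this maps to [-1, 1].
    return raw / (scale_range * (1.0 + math.sqrt(2)))

print(iso_pleasantness(5, 1, 5, 1, 5, 1))   # ≈ 1.0 (maximally pleasant ratings)
print(iso_pleasantness(1, 5, 1, 5, 1, 5))   # ≈ -1.0
```

A model predicting this quantity can then be scored with a plain mean squared error against participant-derived values, as in the evaluation above.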
- ConferenceAssessment of a Cost-Effective Headphone Calibration Procedure for Soundscape EvaluationsBhan Lam, Kenneth Ooi, Zhen-Ting Ong, Karn N. Watcharasupat, Trevor Wong, and Woon-Seng GanIn Proceedings of the 24th International Congress on Acoustics , Jul 2022
To increase the availability and adoption of the soundscape standard, a low-cost calibration procedure for reproduction of audio stimuli over headphones was proposed as part of the global “Soundscape Attributes Translation Project” (SATP) for validating ISO/TS 12913-2:2018 perceived affective quality (PAQ) attribute translations. A previous preliminary study revealed significant deviations from the intended equivalent continuous A-weighted sound pressure levels (L_A,eq) using the open-circuit voltage (OCV) calibration procedure. For a more holistic human-centric perspective, the OCV method is further investigated here in terms of psychoacoustic parameters, including relevant exceedance levels to account for temporal effects on the same 27 stimuli from the SATP. Moreover, a within-subjects experiment with 36 participants was conducted to examine the effects of OCV calibration on the PAQ attributes in ISO/TS 12913-2:2018. Bland-Altman analysis of the objective indicators revealed large biases in the OCV method across all weighted sound level and loudness indicators, and in roughness indicators at 5% and 10% exceedance levels. Significant perceptual differences due to the OCV method were observed in about 20% of the stimuli, which did not correspond clearly with the biased acoustic indicators. A cautioned interpretation of the objective and perceptual differences due to small and unpaired samples nevertheless provides grounds for further investigation.
@inproceedings{Lam2022AssessmentCosteffectiveHeadphone, title = { Assessment of a Cost-Effective Headphone Calibration Procedure for Soundscape Evaluations }, author = {Lam, Bhan and Ooi, Kenneth and Ong, Zhen-Ting and Watcharasupat, Karn N. and Wong, Trevor and Gan, Woon-Seng}, year = {2022}, month = jul, booktitle = {Proceedings of the 24th International Congress on Acoustics}, google_scholar_id = {Zph67rFs4hoC} }
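The Bland-Altman analysis used above to compare OCV-calibrated playback against the HATS reference reduces to two statistics: the bias (mean difference between methods) and the 95% limits of agreement. A minimal sketch with synthetic data (the dB values and the +2 dB bias are invented for illustration, not results from the paper):

```python
import numpy as np

def bland_altman(method_a, method_b):
    """Bland-Altman agreement statistics between two measurement methods:
    bias (mean paired difference) and 95% limits of agreement (bias +/- 1.96 sd)."""
    diff = np.asarray(method_a, dtype=float) - np.asarray(method_b, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Illustrative synthetic levels (dB): an OCV-style method vs a HATS-style reference
# for 27 stimuli, with an assumed +2 dB systematic offset.
rng = np.random.default_rng(1)
ref = rng.uniform(55, 75, size=27)
ocv = ref + 2.0 + rng.normal(0, 0.5, size=27)

bias, (lo, hi) = bland_altman(ocv, ref)
print(f"bias = {bias:.2f} dB, 95% LoA = [{lo:.2f}, {hi:.2f}] dB")
```

A "large bias" finding, as reported for the weighted level and loudness indicators, corresponds to `bias` (or the limits of agreement) exceeding a practically meaningful threshold.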
- JournalSingapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-Means ClusteringKenneth Ooi, Bhan Lam, Joo Young Hong, Karn N. Watcharasupat, Zhen Ting Ong, and Woon Seng GanSustainability, Jun 2022
The ecological validity of soundscape studies usually rests on the choice of soundscapes that are representative of the perceptual space under investigation. For example, a soundscape pleasantness study might investigate locations with soundscapes ranging from “pleasant” to “annoying”. The choice of soundscapes is typically researcher led, but a participant-led process can reduce selection bias and improve result reliability. Hence, we propose a robust participant-led method to pinpoint characteristic soundscapes possessing arbitrary perceptual attributes. We validate our method by identifying Singaporean soundscapes spanning the perceptual quadrants generated from the “Pleasantness” and “Eventfulness” axes of the ISO 12913-2 circumplex model of soundscape perception, as perceived by local experts. From memory and experience, 67 participants first selected locations corresponding to each perceptual quadrant in each major planning region of Singapore. We then performed weighted k-means clustering on the selected locations, with weights for each location derived from previous frequencies and durations spent in each location by each participant. Weights hence acted as proxies for participant confidence. In total, 62 locations were thereby identified as suitable locations with characteristic soundscapes for further research utilizing the ISO 12913-2 perceptual quadrants. Audio–visual recordings and acoustic characterization of the soundscapes will be made in a future study.
@article{Ooi2022SingaporeSoundscapeSite, title = { Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-Means Clustering }, author = {Ooi, Kenneth and Lam, Bhan and Hong, Joo Young and Watcharasupat, Karn N. and Ong, Zhen Ting and Gan, Woon Seng}, year = {2022}, month = jun, journal = {Sustainability}, publisher = {MDPI}, volume = {14}, number = {12}, doi = {10.3390/su14127485}, issn = {20711050}, google_scholar_id = {3fE2CSJIrl8C} }
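Weighted k-means, as described above, differs from plain k-means only in the centroid update: each point contributes in proportion to its weight (here, a proxy for participant confidence). A self-contained sketch (the deterministic initialization and synthetic 2-D "locations" are our simplifications):

```python
import numpy as np

def weighted_kmeans(X, w, k, n_iter=50):
    """Weighted k-means: points with larger weights pull centroids harder."""
    # Simple deterministic init: k points evenly spaced through the dataset.
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Weighted centroid update.
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centroids, labels

# Two synthetic "location" clusters; weights mimic visit frequency/duration.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
w = rng.uniform(0.5, 2.0, size=40)
centroids, labels = weighted_kmeans(X, w, k=2)
```

In the study, each cluster centroid corresponds to a candidate characteristic location for one perceptual quadrant, with low-confidence selections down-weighted rather than discarded.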
- ConferenceEnd-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise SuppressionKarn N. Watcharasupat, Thi Ngoc Tho Nguyen, Woon-Seng Gan, Shengkui Zhao, and Bin MaIn Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing , May 2022
Echo and noise suppression is an integral part of a full-duplex communication system. Many recent acoustic echo cancellation (AEC) systems rely on a separate adaptive filtering module for linear echo suppression and a neural module for residual echo suppression. However, not only do adaptive filtering modules require convergence and remain susceptible to changes in acoustic environments, but this two-stage framework also often introduces unnecessary delays to the AEC system when neural modules are already capable of both linear and nonlinear echo suppression. In this paper, we exploit the offset-compensating ability of complex time-frequency masks and propose an end-to-end complex-valued neural network architecture. The building block of the proposed model is a pseudocomplex extension based on the densely-connected multidilated DenseNet (D3Net) building block, resulting in a very small network of only 354K parameters. The architecture utilized the multi-resolution nature of the D3Net building blocks to eliminate the need for pooling, allowing the network to extract features using large receptive fields without any loss of output resolution. We also propose a dual-mask technique for joint echo and noise suppression with simultaneous speech enhancement. Evaluation on both synthetic and real test sets demonstrated promising results across multiple energy-based metrics and perceptual proxies.
@inproceedings{Watcharasupat2022EndtoEndComplexValuedMultidilated, title = { End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression }, author = {Watcharasupat, Karn N. and Nguyen, Thi Ngoc Tho and Gan, Woon-Seng and Zhao, Shengkui and Ma, Bin}, year = {2022}, month = may, booktitle = { Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing }, publisher = {IEEE}, pages = {656--660}, doi = {10.1109/icassp43922.2022.9747034}, isbn = {978-1-66540-540-9}, google_scholar_id = {YsMSGLbcyi4C} }
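The "offset-compensating ability of complex time-frequency masks" mentioned above comes from the fact that a complex-valued mask can rotate phase, not just scale magnitude. A minimal sketch of applying such a mask to a mixture STFT (shapes and the random mask are illustrative; the actual model predicts the mask from the mixture and far-end reference):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical mixture STFT: (freq_bins, frames), complex-valued.
mix_stft = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))

# A complex ratio mask (random here for illustration). Unlike a real-valued
# magnitude mask, multiplying by a complex mask adjusts both magnitude and
# phase of each time-frequency bin.
mask = rng.normal(scale=0.5, size=mix_stft.shape) \
     + 1j * rng.normal(scale=0.5, size=mix_stft.shape)

enhanced_stft = mask * mix_stft   # element-wise complex multiplication
```

The dual-mask technique in the paper would apply two such masks for joint echo/noise suppression and speech enhancement; the sketch above shows only the single-mask mechanics.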
- JournalSALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and DetectionThi Ngoc Tho Nguyen, Karn N. Watcharasupat, Ngoc Khanh Nguyen, Douglas L. Jones, and Woon-Seng GanIEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2022
@article{Nguyen2022SALSASpatialCueAugmented, title = { SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection }, author = {Nguyen, Thi Ngoc Tho and Watcharasupat, Karn N. and Nguyen, Ngoc Khanh and Jones, Douglas L. and Gan, Woon-Seng}, year = {2022}, month = may, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume = {30}, pages = {1749--1762}, doi = {10.1109/taslp.2022.3173054}, issn = {2329-9304}, google_scholar_id = {W7OEmFMy1HYC} }
- ConferencePreliminary Assessment of a Cost-Effective Headphone Calibration Procedure for Soundscape EvaluationsBhan Lam, Kenneth Ooi, Karn N. Watcharasupat, Zhen-Ting Ong, Yun-Ting Lau, Trevor Wong, and Woon-Seng GanIn Proceedings of the 28th International Congress on Sound and Vibration , May 2022
The introduction of ISO 12913-2:2018 has provided a framework for standardized data collection and reporting procedures for soundscape practitioners. A strong emphasis was placed on the use of calibrated head and torso simulators (HATS) for binaural audio capture to obtain an accurate subjective impression and acoustic measure of the soundscape under evaluation. To auralise the binaural recordings as recorded or at set levels, the audio stimuli and the headphone setup are usually calibrated with a HATS. However, calibrated HATS are too financially prohibitive for most research teams, inevitably diminishing the availability of the soundscape standard. With the increasing availability of soundscape binaural recording datasets, and the importance of cross-cultural validation of the soundscape ISO standards, e.g. via the Soundscape Attributes Translation Project (SATP), it is imperative to assess the suitability of cost-effective headphone calibration methods to maximise availability without severely compromising on accuracy. Hence, this study objectively examines an open-circuit voltage (OCV) calibration method in comparison to a calibrated HATS on various soundcard and headphone combinations. Preliminary experiments found that calibration with the OCV method differed significantly from the reference binaural recordings in sound pressure levels, whereas negligible differences in levels were observed with the HATS calibration.
@inproceedings{Lam2022PreliminaryAssessmentCosteffective, title = { Preliminary Assessment of a Cost-Effective Headphone Calibration Procedure for Soundscape Evaluations }, author = {Lam, Bhan and Ooi, Kenneth and Watcharasupat, Karn N. and Ong, Zhen-Ting and Lau, Yun-Ting and Wong, Trevor and Gan, Woon-Seng}, year = {2022}, month = may, booktitle = {Proceedings of the 28th International Congress on Sound and Vibration}, google_scholar_id = {8k81kl-MbHgC} }
- ConferenceSALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone ArraysThi Ngoc Tho Nguyen, Douglas L. Jones, Karn N. Watcharasupat, Huy Phan, and Woon-Seng GanIn Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , May 2022
@inproceedings{Nguyen2022SALSALiteFastEffective, title = { SALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays }, author = {Nguyen, Thi Ngoc Tho and Jones, Douglas L. and Watcharasupat, Karn N. and Phan, Huy and Gan, Woon-Seng}, year = {2022}, month = may, booktitle = { Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) }, pages = {716--720}, doi = {10.1109/icassp43922.2022.9746132}, isbn = {2379-190X}, google_scholar_id = {_FxGoFyzp5QC} }
- ConferenceProbably Pleasant? A Neural-Probabilistic Approach to Automatic Masker Selection for Urban Soundscape AugmentationKenneth Ooi, Karn N. Watcharasupat, Bhan Lam, Zhen-Ting Ong, and Woon-Seng GanIn Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing , May 2022
@inproceedings{Ooi2022ProbablyPleasantNeuralProbabilistic, title = { Probably Pleasant? A Neural-Probabilistic Approach to Automatic Masker Selection for Urban Soundscape Augmentation }, author = {Ooi, Kenneth and Watcharasupat, Karn N. and Lam, Bhan and Ong, Zhen-Ting and Gan, Woon-Seng}, year = {2022}, month = may, booktitle = { Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing }, doi = {10.1109/icassp43922.2022.9746897}, google_scholar_id = {hqOjcs7Dif8C} }
- ConferenceFRCRN: Boosting Feature Representation Using Frequency Recurrence for Monaural Speech EnhancementShengkui Zhao, Bin Ma, Karn N. Watcharasupat, and Woon-Seng GanIn Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing , May 2022
Convolutional recurrent networks (CRN) integrating a convolutional encoder-decoder (CED) structure and a recurrent structure have achieved promising performance for monaural speech enhancement. However, feature representation across frequency context is highly constrained due to limited receptive fields in the convolutions of CED. In this paper, we propose a convolutional recurrent encoder-decoder (CRED) structure to boost feature representation along the frequency axis. The CRED applies frequency recurrence on 3D convolutional feature maps along the frequency axis following each convolution, therefore, it is capable of catching long-range frequency correlations and enhancing feature representations of speech inputs. The proposed frequency recurrence is realized efficiently using a feedforward sequential memory network (FSMN). Besides the CRED, we insert two stacked FSMN layers between the encoder and the decoder to model further temporal dynamics. We name the proposed framework as Frequency Recurrent CRN (FRCRN). We design FRCRN to predict complex Ideal Ratio Mask (cIRM) in complex-valued domain and optimize FRCRN using both time-frequency-domain and time-domain losses. Our proposed approach achieved state-of-the-art performance on wideband benchmark datasets and achieved 2nd place for the real-time fullband track in terms of Mean Opinion Score (MOS) and Word Accuracy (WAcc) in the ICASSP 2022 Deep Noise Suppression (DNS) challenge.
@inproceedings{Zhao2022FRCRNBoostingFeature, title = { FRCRN: Boosting Feature Representation Using Frequency Recurrence for Monaural Speech Enhancement }, author = {Zhao, Shengkui and Ma, Bin and Watcharasupat, Karn N. and Gan, Woon-Seng}, year = {2022}, month = may, booktitle = { Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing }, publisher = {IEEE}, pages = {9281--9285}, doi = {10.1109/icassp43922.2022.9747578}, isbn = {978-1-66540-540-9}, google_scholar_id = {UebtZRa9Y70C} }
- JournalLatte: Cross-framework Python Package for Evaluation of Latent-Based Generative ModelsKarn N. Watcharasupat, Junyoung Lee, and Alexander LerchSoftware Impacts, Feb 2022
Latte (for LATent Tensor Evaluation) is a Python library for evaluation of latent-based generative models in the fields of disentanglement learning and controllable generation. Latte is compatible with both PyTorch and TensorFlow/Keras, and provides both functional and modular APIs that can be easily extended to support other deep learning frameworks. Using NumPy-based and framework-agnostic implementation, Latte ensures reproducible, consistent, and deterministic metric calculations regardless of the deep learning framework of choice.
@article{Watcharasupat2022LatteCrossframeworkPython, title = { Latte: Cross-framework Python Package for Evaluation of Latent-Based Generative Models }, author = {Watcharasupat, Karn N. and Lee, Junyoung and Lerch, Alexander}, year = {2022}, month = feb, journal = {Software Impacts}, volume = {11}, pages = {100222}, doi = {10.1016/j.simpa.2022.100222}, issn = {2665-9638}, google_scholar_id = {roLk4NBRz8UC} }
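The framework-agnostic design described above — funneling PyTorch or TensorFlow tensors into NumPy before any metric computation so results are deterministic regardless of backend — can be sketched as below. This illustrates the design choice only and is not Latte's actual API; `FakeTensor` is a stand-in for a real framework tensor:

```python
import numpy as np

def to_numpy(x):
    """Convert any framework's tensor to a NumPy array so downstream metric
    code runs identically regardless of the deep learning framework."""
    if isinstance(x, np.ndarray):
        return x
    if hasattr(x, "detach"):          # PyTorch-style tensors carry autograd state
        x = x.detach()
    if hasattr(x, "numpy"):           # PyTorch / TensorFlow eager tensors
        return np.asarray(x.numpy())
    return np.asarray(x)              # lists, tuples, scalars

class FakeTensor:                      # stand-in for a framework tensor
    def __init__(self, data):
        self._data = np.asarray(data)
    def numpy(self):
        return self._data

arr = to_numpy(FakeTensor([1.0, 2.0, 3.0]))
```

Once everything is a NumPy array, a single metric implementation serves both the functional and modular APIs, which is how cross-framework consistency is kept.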
2021
- ThesisControllable Music: Supervised Learning of Disentangled Representations for Music GenerationKarn N. WatcharasupatDec 2021
Controllability, despite being a much-desired property of a generative model, remains an ill-defined concept that is difficult to measure. In the context of neural music generation, a controllable system often implies an intuitive interaction between human agents and the neural model, allowing the relatively opaque neural model to be controlled by a human in a semantically understandable manner. In this work, we aim to tackle controllable music generation in the raw audio domain, which is significantly less attempted compared to the symbolic domain. Specifically, we focus on controlling multiple continuous, potentially interdependent timbral attributes of a musical note using a variational autoencoder (VAE) framework, and on the groundwork research needed to support this goal. This work consists of three main parts. The first formulates the concept of controllability and how to evaluate the latent manifold of a deep generative model in the presence of multiple interdependent attributes. The second focuses on the development of a composite latent space architecture for VAEs that allows the encoding of interdependent attributes while retaining an easily sampled, disentangled prior. Proof-of-concept work for the second part was performed on several standard vision disentanglement learning datasets. Finally, the last part applies the composite latent space model to music generation in the raw audio domain and discusses the evaluation of the model against the criteria defined in the first part. All in all, given the relatively uncharted nature of controllable generation in the raw audio domain, this project provides foundational work for the evaluation of controllable generation as a whole, and a promising proof of concept for musical audio generation with timbral control using variational autoencoders.
@thesis{Watcharasupat2021ControllableMusicSupervised, title = { Controllable Music: Supervised Learning of Disentangled Representations for Music Generation }, author = {Watcharasupat, Karn N.}, year = {2021}, month = dec, address = {Singapore}, type = {Final {{Year Project}} ({{FYP}})}, lccn = {CY3001-211}, school = {Nanyang Technological University}, google_scholar_id = {LkGwnXOMwfcC} }
- ConferenceA Strongly-Labelled Polyphonic Dataset of Urban Sounds with Spatiotemporal ContextKenneth Ooi, Karn N. Watcharasupat, Santi Peksi, Furi Andi Karnapi, Zhen-Ting Ong, Danny Chua, Hui-Wen Leow, Li-Long Kwok, Xin-Lei Ng, Zhen-Ann Loh, and Woon-Seng GanIn Proceedings of the 13th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, Dec 2021
@inproceedings{Ooi2021StronglyLabelledPolyphonicDataset, title = { A Strongly-Labelled Polyphonic Dataset of Urban Sounds with Spatiotemporal Context }, author = {Ooi, Kenneth and Watcharasupat, Karn N. and Peksi, Santi and Karnapi, Furi Andi and Ong, Zhen-Ting and Chua, Danny and Leow, Hui-Wen and Kwok, Li-Long and Ng, Xin-Lei and Loh, Zhen-Ann and Gan, Woon-Seng}, year = {2021}, month = dec, booktitle = { Proceedings of the 13th Asia Pacific Signal and Information Processing Association Annual Summit and Conference }, publisher = {Asia Pacific Signal and Information Processing Association}, address = {Tokyo, Japan}, google_scholar_id = {WF5omc3nYNoC} }
- Ext. AbstractEvaluation of Latent Space Disentanglement in the Presence of Interdependent AttributesKarn N. Watcharasupat, and Alexander LerchIn Extended Abstracts of the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference, Nov 2021
Controllable music generation with deep generative models has become increasingly reliant on disentanglement learning techniques. However, current disentanglement metrics, such as the mutual information gap (MIG), are often inadequate and misleading when used to evaluate latent representations in the presence of the interdependent semantic attributes commonly encountered in real-world music datasets. In this work, we propose a dependency-aware information metric as a drop-in replacement for MIG that accounts for the inherent relationship between semantic attributes.
@inproceedings{Watcharasupat2021EvaluationLatentSpace, title = { Evaluation of Latent Space Disentanglement in the Presence of Interdependent Attributes }, author = {Watcharasupat, Karn N. and Lerch, Alexander}, year = {2021}, month = nov, booktitle = { Extended Abstracts of the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference }, google_scholar_id = {eQOLeE2rZwMC} }
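For context, the standard (dependency-unaware) MIG that the abstract critiques is, for each attribute, the gap between the two largest latent–attribute mutual informations, normalized by the attribute entropy. A minimal NumPy sketch for discrete latents and attributes follows; this is the baseline metric, not the proposed dependency-aware replacement:

```python
import numpy as np

def discrete_mi(a, b):
    """Mutual information (in nats) between two discrete label arrays."""
    a = np.asarray(a)
    b = np.asarray(b)
    pa = np.bincount(a) / a.size
    pb = np.bincount(b) / b.size
    joint = np.zeros((pa.size, pb.size))
    np.add.at(joint, (a, b), 1.0)          # joint counts
    joint /= a.size
    mask = joint > 0
    indep = (pa[:, None] * pb[None, :])[mask]
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep)))

def entropy(a):
    """Shannon entropy (in nats) of a discrete label array."""
    p = np.bincount(np.asarray(a)) / len(a)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mig(latents, attributes):
    """Mutual information gap over a list of latent dims and attributes.

    For each attribute, take the gap between the two most informative
    latent dimensions, normalized by the attribute entropy, then average.
    """
    gaps = []
    for v in attributes:
        mis = sorted((discrete_mi(z, v) for z in latents), reverse=True)
        gaps.append((mis[0] - mis[1]) / entropy(v))
    return float(np.mean(gaps))

v = np.array([0, 1, 0, 1])            # ground-truth attribute
z_good = v.copy()                     # latent dim fully informative of v
z_bad = np.zeros(4, dtype=int)        # uninformative latent dim
print(mig([z_good, z_bad], [v]))
```

When attributes are interdependent, a latent dimension can score high MI with an attribute it does not actually encode, which is the failure mode the proposed metric addresses.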
- Ext. AbstractAVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-OccurrenceYun-Ning Hung, Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife, Kelian Li, Pavan Seshadri, and Junyoung LeeIn Extended Abstracts of the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference, Nov 2021
This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on Broadcast news and VAD for speaker identification.
@inproceedings{Hung2021AVASpeechSMADStronglyLabelled, title = { AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence }, author = {Hung, Yun-Ning and Watcharasupat, Karn N. and Wu, Chih-Wei and Orife, Iroro and Li, Kelian and Seshadri, Pavan and Lee, Junyoung}, year = {2021}, month = nov, booktitle = { Extended Abstracts of the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference }, google_scholar_id = {ufrVoPGSRksC} }
- ConferenceWhat Makes Sound Event Localization and Detection Difficult? Insights from Error AnalysisThi Ngoc Tho Nguyen, Karn N. Watcharasupat, Zhen Jian Lee, Ngoc Khanh Nguyen, Douglas L Jones, and Woon Seng GanIn Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events, Nov 2021
@inproceedings{Nguyen2021WhatMakesSound, title = { What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis }, author = {Nguyen, Thi Ngoc Tho and Watcharasupat, Karn N. and Lee, Zhen Jian and Nguyen, Ngoc Khanh and Jones, Douglas L and Gan, Woon Seng}, year = {2021}, month = nov, booktitle = { Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events }, number = {November}, google_scholar_id = {UeHWp8X0CEIC} }
- ConferenceDevelopment of a Feedback Interface for In-Situ Soundscape EvaluationFuri Andi Karnapi, Bhan Lam, Kenneth Ooi, Yun-Ting Lau, Karn Watcharasupat, Trevor Wong, Woon-Seng Gan, Jooyoung Hong, Samuel Yeong, and Irene LeeIn Proceedings of the 50th International Congress and Expo on Noise Control Engineering, Aug 2021
@inproceedings{Karnapi2021DevelopmentFeedbackInterface, title = {Development of a Feedback Interface for In-Situ Soundscape Evaluation}, author = {Karnapi, Furi Andi and Lam, Bhan and Ooi, Kenneth and Lau, Yun-Ting and Watcharasupat, Karn and Wong, Trevor and Gan, Woon-Seng and Hong, Jooyoung and Yeong, Samuel and Lee, Irene}, year = {2021}, month = aug, booktitle = { Proceedings of the 50th International Congress and Expo on Noise Control Engineering }, publisher = {I-INCE}, address = {Washington, D.C., USA}, doi = {10.3397/in-2021-2084}, google_scholar_id = {zYLM7Y9cAGgC} }
- ConferenceAssessment of Inter-IC Sound Microelectromechanical Systems Microphones for Soundscape ReportingTrevor Wong, Bhan Lam, Karn Watcharasupat, Kenneth Ooi, Zhen-Ting Ong, Furi Andi Karnapi, Woon-Seng Gan, Jooyoung Hong, Samuel Yeong, and Irene LeeIn Proceedings of the 50th International Congress and Expo on Noise Control Engineering, Aug 2021
@inproceedings{Wong2021AssessmentInterICSound, title = { Assessment of Inter-IC Sound Microelectromechanical Systems Microphones for Soundscape Reporting }, author = {Wong, Trevor and Lam, Bhan and Watcharasupat, Karn and Ooi, Kenneth and Ong, Zhen-Ting and Karnapi, Furi Andi and Gan, Woon-Seng and Hong, Jooyoung and Yeong, Samuel and Lee, Irene}, year = {2021}, month = aug, booktitle = { Proceedings of the 50th International Congress and Expo on Noise Control Engineering }, publisher = {I-INCE}, address = {Washington, D.C., USA}, doi = {10.3397/in-2021-2086}, google_scholar_id = {IjCSPb-OGe4C} }
- PreprintImproving Polyphonic Sound Event Detection on Multichannel Recordings with the Sørensen-Dice Coefficient Loss and Transfer LearningKarn N. Watcharasupat, Thi Ngoc Tho Nguyen, Ngoc Khanh Nguyen, Zhen Jian Lee, Douglas L Jones, and Woon Seng GanJul 2021
@misc{Watcharasupat2021ImprovingPolyphonicSound, title = { Improving {{Polyphonic Sound Event Detection}} on {{Multichannel Recordings}} with the {{S{\o}rensen-Dice Coefficient Loss}} and {{Transfer Learning}} }, author = {Watcharasupat, Karn N. and Nguyen, Thi Ngoc Tho and Nguyen, Ngoc Khanh and Lee, Zhen Jian and Jones, Douglas L and Gan, Woon Seng}, year = {2021}, month = jul, google_scholar_id = {qjMakFHDy7sC} }
- Tech Rep.DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and DetectionThi Ngoc Tho Nguyen, Karn Watcharasupat, Ngoc Khanh Nguyen, Douglas L Jones, and Woon Seng GanJul 2021
@techreport{Nguyen2021DCASE2021Task, title = { DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection }, author = {Nguyen, Thi Ngoc Tho and Watcharasupat, Karn and Nguyen, Ngoc Khanh and Jones, Douglas L and Gan, Woon Seng}, year = {2021}, month = jul, journal = { IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events }, google_scholar_id = {2osOgNQ5qMEC} }
- PatentBlind source separation method without calculating sampling frequency error and audio processing system (不须计算取样频率误差的盲源分离方法以及音频处理系统)Hai Trieu Anh Nguyen (阮海潮英), Wai Hoong Khong (邝伟雄), Karn Watcharasupat (瓦特察拉苏帕特 甘), and Qing Liu (刘晴)China Patent App. CN202110660272.4A, Filed Jun 2021
The present disclosure provides a blind source separation method and an audio processing system that do not require calculating the sampling frequency error. The method is suitable for an audio processing system comprising a plurality of devices, each of which comprises a plurality of microphones. The difference between the signal vector sensed by each device and a row of the mixing matrix is calculated and used to establish an objective function, and an optimization algorithm is then performed to compute the mixing matrix. The original signals can be computed from the mixing matrix and the signal vectors without calculating the sampling frequency error between the devices; as a result, there is no need to compensate for this error.
@patent{Nguyen2021MethodAudioProcessingCN, author = {Nguyen (阮海潮英), Hai Trieu Anh and Khong (邝伟雄), Wai Hoong and Watcharasupat (瓦特察拉苏帕特 甘), Karn and Liu (刘晴), Qing}, title = {Blind source separation method without calculating sampling frequency error and audio processing system (不须计算取样频率误差的盲源分离方法以及音频处理系统)}, nationality = {China}, number = {CN202110660272.4A}, dayfiled = {15}, monthfiled = jun, yearfiled = {2021}, day = {15}, month = jun, year = {2021}, google_scholar_id = {qUcmZB5y_30C} }
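As a loose illustration of an objective of this general shape (and explicitly not the patented procedure), one can cluster unit-normalized mixture snapshots so that each cluster's principal direction approximates one column of the mixing matrix, minimizing the distance between each snapshot and its closest direction by alternating assignment and update steps:

```python
import numpy as np

def estimate_mixing_directions(X, n_sources, n_iters=50, seed=0):
    """Cluster unit-normalised mixture snapshots; each cluster's principal
    direction approximates one column of the mixing matrix.

    X is an (n_mics, n_frames) array of mixture snapshots (e.g. one
    frequency bin of a multichannel STFT). This is plain alternating
    minimisation of the distance between each snapshot and its closest
    direction -- a sketch of the general shape of such an objective only.
    """
    rng = np.random.default_rng(seed)
    V = X / np.linalg.norm(X, axis=0, keepdims=True)   # unit-norm snapshots
    idx = rng.choice(V.shape[1], size=n_sources, replace=False)
    A = V[:, idx].copy()                               # init from data points
    for _ in range(n_iters):
        # Sign-invariant similarity: a direction and its negation are equivalent.
        labels = np.argmax(np.abs(A.T @ V), axis=0)
        for k in range(n_sources):
            members = V[:, labels == k]
            if members.size:
                # Principal left singular vector of the cluster members.
                A[:, k] = np.linalg.svd(members, full_matrices=False)[0][:, 0]
    return A

# Toy demo: two sources, two microphones, one dominant source per frame.
rng = np.random.default_rng(1)
mix = np.array([[1.0, 0.6], [0.0, 0.8]])               # true mixing columns
active = rng.integers(0, 2, 200)
X = mix[:, active] * rng.standard_normal(200) + 0.01 * rng.standard_normal((2, 200))
A_hat = estimate_mixing_directions(X, n_sources=2)
print(A_hat.shape)  # (2, 2)
```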
- ConferenceDirectional Sparse Filtering Using Weighted Lehmer Mean for Blind Separation of Unbalanced Speech MixturesKarn Watcharasupat, Anh H. T. Nguyen, Ching-Hui Ooi, and Andy W. H. KhongIn Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Jun 2021
In blind source separation of speech signals, the inherent imbalance in the source spectrum poses a challenge for methods that rely on single-source dominance for the estimation of the mixing matrix. We propose an algorithm based on the directional sparse filtering (DSF) framework that utilizes the Lehmer mean with learnable weights to adaptively account for source imbalance. Performance evaluations in multiple real acoustic environments show improvements in source separation compared to the baseline methods.
@inproceedings{Watcharasupat2021DirectionalSparseFiltering, title = { Directional Sparse Filtering Using Weighted Lehmer Mean for Blind Separation of Unbalanced Speech Mixtures }, author = {Watcharasupat, Karn and Nguyen, Anh H. T. and Ooi, Ching-Hui and Khong, Andy W. H.}, year = {2021}, month = jun, booktitle = { Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing }, publisher = {IEEE}, address = {Toronto, Canada}, pages = {4485--4489}, doi = {10.1109/icassp39728.2021.9414336}, isbn = {978-1-72817-605-5}, issn = {23318422}, google_scholar_id = {u-x6o8ySG0sC} }
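For reference, the weighted Lehmer mean at the heart of the proposed contrast is L_p(x; w) = sum_i(w_i * x_i^p) / sum_i(w_i * x_i^(p-1)). Varying p sweeps the mean from min-like (p → −∞) through harmonic (p = 0) and arithmetic (p = 1) behaviour toward max-like (p → +∞), which is what lets learnable weights adapt the contrast to unbalanced source activity. A minimal sketch of the mean itself (its integration into the DSF objective is omitted):

```python
import numpy as np

def weighted_lehmer_mean(x, p, w=None):
    """Weighted Lehmer mean: sum(w * x**p) / sum(w * x**(p - 1)).

    p = 1 gives the weighted arithmetic mean, p = 0 the weighted harmonic
    mean; large positive p approaches the max, large negative p the min.
    """
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return float(np.sum(w * x**p) / np.sum(w * x**(p - 1)))

print(weighted_lehmer_mean([1.0, 2.0, 4.0], p=1))   # 7/3 (arithmetic mean)
print(weighted_lehmer_mean([1.0, 2.0, 4.0], p=10))  # close to the max, 4
```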
2020
- PreprintVisual Attention for Musical Instrument RecognitionKarn Watcharasupat, Siddharth Gururani, and Alexander LerchMay 2020
In the field of music information retrieval, the task of simultaneously identifying the presence or absence of multiple musical instruments in a polyphonic recording remains a hard problem. Previous works have seen some success in improving instrument classification by applying temporal attention in a multi-instance multi-label setting, while another series of works has suggested the role of pitch and timbre in improving instrument recognition performance. In this project, we further explore the use of the attention mechanism in a timbral-temporal sense, à la visual attention, to improve the performance of musical instrument recognition using weakly-labeled data. Two approaches to this task have been explored. The first applies the attention mechanism to the sliding-window paradigm, where a prediction based on each timbral-temporal ‘instance’ is given an attention weight before aggregation to produce the final prediction. The second is based on a recurrent model of visual attention, in which the network attends only to parts of the spectrogram and decides where to attend next, given a limited number of ‘glimpses’.
@misc{Watcharasupat2020VisualAttentionMusical, title = {Visual Attention for Musical Instrument Recognition}, author = {Watcharasupat, Karn and Gururani, Siddharth and Lerch, Alexander}, year = {2020}, month = may, google_scholar_id = {u5HHmVD_uO8C} }
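The first (sliding-window) approach can be sketched generically: each window produces both a class prediction and an attention score, and the clip-level probability per class is the attention-weighted average of the window predictions. This is a generic multi-instance attention aggregation sketch under those assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_aggregate(instance_logits, attention_scores):
    """Aggregate per-window predictions with attention weights.

    instance_logits:  (n_instances, n_classes) raw per-window predictions.
    attention_scores: (n_instances, n_classes) raw per-window relevance.
    Returns the (n_classes,) clip-level probabilities: an attention-weighted
    average of the windows' sigmoid outputs.
    """
    probs = 1.0 / (1.0 + np.exp(-instance_logits))   # per-instance sigmoid
    attn = softmax(attention_scores, axis=0)         # normalise over instances
    return np.sum(attn * probs, axis=0)

# Uniform scores reduce to a plain average of the window predictions.
print(attention_aggregate(np.zeros((3, 2)), np.zeros((3, 2))))  # ~[0.5, 0.5]
```

With non-uniform scores, windows the attention module deems relevant dominate the clip-level decision, which is how weak (clip-level) labels can still train window-level attention.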